Open
Description
Changes made to ConfigureForSnitch.cpp:
Replaced [0,40,100] tiling configuration with [0,88,50]:
if (funcOp.getName() ==
    "main$async_dispatch_1_matmul_transpose_b_1x1200x400_f64") {
  l1Tiles[0] = 0;
  l1Tiles[1] = 88;
  l1Tiles[2] = 50;
}
Changes made to LowerL1Allocations.cpp:
- I created a string stream and sent debugging statements to it.
- This string stream only gets printed out when the kernel does not fit in L1, that is, when the condition offset >= l1MemoryBytes is met.
- I am attaching my modified LowerL1Allocations.cpp as a .txt file, LowerL1Allocations.txt, in case this makes my modifications clearer.
The kernel does not fit in L1:
<eval_with_key>.0 from /home/hoppip/Quidditch/venv/lib/python3.11/site-packages/torch/fx/experimental/proxy_tensor.py:551 in wrapped:19:0: warning: Let's look at ALL the allocOps before doging ANYTHING!
allocOp with memref shape 1 1200
allocOp with memref shape 1 1232
allocOp with memref shape 1 50
allocOp with memref shape 88 50
allocOp with memref shape 88 50
allocOp with memref shape 1 1200
allocOp with memref shape 1 1200
Well, those were all the allocOps... =_=
allocOp with memref shape 1 1200
memref size is 8
allocElements is 1200
NOW memref size is 9600
offset is 9600
allocOp with memref shape 1 1232
memref size is 8
allocElements is 1232
NOW memref size is 9856
offset is 19456
allocOp with memref shape 1 50
memref size is 8
allocElements is 50
NOW memref size is 400
offset is 19856
allocOp with memref shape 88 50
memref size is 8
allocElements is 4400
NOW memref size is 35200
offset is 55104
allocOp with memref shape 88 50
memref size is 8
allocElements is 4400
NOW memref size is 35200
offset is 90304
allocOp with memref shape 1 1200
memref size is 8
allocElements is 1200
NOW memref size is 9600
offset is 99904
allocOp with memref shape 1 1200
memref size is 8
allocElements is 1200
NOW memref size is 9600
offset is 109504
allocElements is 1200
memref size is 9600
offset is 109504
l1MemoryBytes is 100000, so 9504 too much
kernel does not fit into L1 memory and cannot be compiled
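The running offsets in this log can be reproduced by hand. The following is a minimal sketch, not the code in LowerL1Allocations.cpp: it assumes a simple bump allocator that rounds the offset up to a 64-byte boundary before each allocation (an assumption inferred from the jump from 19856 to 55104 in the log and from the `alignment = 64` attribute in the dumped IR) and 8 bytes per f64 element. The names `Alloc`, `alignUp`, `runningOffsets`, and `loggedAllocs` are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical description of one L1 allocation by its 2-D memref shape.
struct Alloc {
  int64_t rows;
  int64_t cols;
};

// Round `offset` up to the next multiple of `alignment`.
int64_t alignUp(int64_t offset, int64_t alignment) {
  return (offset + alignment - 1) / alignment * alignment;
}

// Replay the assumed bump allocation and return the offset after each
// allocation, which is what the "offset is ..." log lines appear to print.
std::vector<int64_t> runningOffsets(const std::vector<Alloc> &allocs,
                                    int64_t elemBytes = 8,
                                    int64_t alignment = 64) {
  std::vector<int64_t> offsets;
  int64_t offset = 0;
  for (const Alloc &a : allocs) {
    offset = alignUp(offset, alignment);    // assumed 64-byte alignment
    offset += a.rows * a.cols * elemBytes;  // f64 elements, 8 bytes each
    offsets.push_back(offset);
  }
  return offsets;
}

// The seven memref shapes, in the order they appear in the log above.
std::vector<Alloc> loggedAllocs() {
  return {{1, 1200}, {1, 1232}, {1, 50}, {88, 50},
          {88, 50},  {1, 1200}, {1, 1200}};
}
```

Under these assumptions the sketch yields 9600, 19456, 19856, 55104, 90304, 99904, 109504, matching the log line by line, and the last value exceeds the 100000-byte limit by 9504 bytes.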
When the kernel does not fit in L1, its IR gets dumped to stderr, and we can see that an unnecessary 1x1200 buffer gets allocated (lines 99-124 of 0-88-56-build-failure-dispatch-1.mlir):
// allocate a buffer of 1x1200 elements
%25 = "arith.constant"() <{value = 0 : index}> : () -> index
%26 = "memref.view"(%0, %25) : (memref<100000xi8>, index) -> memref<1200xf64>
%27 = "memref.reinterpret_cast"(%26) <{operandSegmentSizes = array<i32: 1, 0, 0, 0>, static_offsets = array<i64: 0>, static_sizes = array<i64: 1, 1200>, static_strides = array<i64: 1200, 1>}> : (memref<1200xf64>) -> memref<1x1200xf64>
%28 = "memref.alloca"() <{alignment = 64 : i64, operandSegmentSizes = array<i32: 0, 0>}> : () -> memref<1x1200xf64, #quidditch_snitch.l1_encoding>
// set this buffer to all zeroes
%29 = "quidditch_snitch.compute_core_index"() : () -> index
%30 = "affine.apply"(%29) <{map = affine_map<()[s0] -> (s0 * 150)>}> : (index) -> index
"scf.for"(%30, %1, %1) ({
^bb0(%arg25: index):
%94 = "memref.subview"(%27, %arg25) <{operandSegmentSizes = array<i32: 1, 1, 0, 0>, static_offsets = array<i64: 0, -9223372036854775808>, static_sizes = array<i64: 1, 150>, static_strides = array<i64: 1, 1>}> : (memref<1x1200xf64>, index) -> memref<1x150xf64, strided<[1200, 1], offset: ?>, #quidditch_snitch.l1_encoding>
"quidditch_snitch.memref.microkernel"(%94) ({
^bb0(%arg26: memref<1x150xf64, strided<[1200, 1], offset: ?>, #quidditch_snitch.l1_encoding>):
%95 = "arith.constant"() <{value = 0.000000e+00 : f64}> : () -> f64
"linalg.fill"(%95, %arg26) <{operandSegmentSizes = array<i32: 1, 1>}> ({
^bb0(%arg27: f64, %arg28: f64):
"linalg.yield"(%arg27) : (f64) -> ()
}) : (f64, memref<1x150xf64, strided<[1200, 1], offset: ?>, #quidditch_snitch.l1_encoding>) -> ()
}) : (memref<1x150xf64, strided<[1200, 1], offset: ?>, #quidditch_snitch.l1_encoding>) -> ()
"quidditch_snitch.microkernel_fence"() : () -> ()
"scf.yield"() : () -> ()
}) : (index, index, index) -> ()
// after this point, the 1x1200 buffer allocated to %28 or equivalently %26 never gets used again!
The buffer assigned to %28 and %26 gets set to zero, and then is never used again.
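To gauge how much this one buffer matters, here is a rough what-if sketch that drops a single allocation and recomputes the final offset. It is only an estimate under stated assumptions: the byte sizes and the 100000-byte limit are taken from the debug log, the 64-byte alignment before each allocation is inferred from the log rather than confirmed from the pass, and the names `kLoggedSizes`, `offsetWithout`, and `overflowsL1` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// L1 capacity as reported by the "l1MemoryBytes is 100000" log line.
constexpr int64_t kL1Bytes = 100000;
// Assumed alignment applied before each allocation (inferred from the log).
constexpr int64_t kAlign = 64;

// Byte sizes from the "NOW memref size is ..." log lines, in log order.
const std::vector<int64_t> kLoggedSizes = {9600,  9856, 400, 35200,
                                           35200, 9600, 9600};

// Final offset when the allocation at index `skip` is left out.
// Pass sizes.size() as `skip` to keep every allocation.
int64_t offsetWithout(const std::vector<int64_t> &sizes, std::size_t skip) {
  int64_t offset = 0;
  for (std::size_t i = 0; i < sizes.size(); ++i) {
    if (i == skip)
      continue;
    offset = (offset + kAlign - 1) / kAlign * kAlign;  // align up
    offset += sizes[i];
  }
  return offset;
}

// Mirrors the failure condition quoted in this issue: offset >= l1MemoryBytes.
bool overflowsL1(int64_t offset) { return offset >= kL1Bytes; }
```

With all seven allocations the final offset is 109504, which overflows. Dropping any one of the three 9600-byte 1x1200 buffers (indices 0, 5, or 6) gives 99904, which is below the limit by 96 bytes. If the model above is right, eliminating the dead buffer alone would be enough for this tiling to fit.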
- build output attached as 0-88-56-build-failure.txt
- annotated mlir for the dumped dispatch attached as 0-88-56-build-failure-dispatch-1.txt