Unnecessary buffer allocated when padding in tile's row dimension #142

@EmilySillars

Changes made to ConfigureForSnitch.cpp:
Replaced [0,40,100] tiling configuration with [0,88,50]:

        if (funcOp.getName() ==
            "main$async_dispatch_1_matmul_transpose_b_1x1200x400_f64") {
          l1Tiles[0] = 0;
          l1Tiles[1] = 88;
          l1Tiles[2] = 50;
        }
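For reference, the 1x1232 buffer in the log further down is consistent with the 1200-element dimension being rounded up to a multiple of the new tile size 88. That reading is an assumption drawn from the numbers, not something confirmed by the pass, and `paddedSize` is a hypothetical helper used only to show the arithmetic:

```cpp
#include <cstdint>

// Hypothetical helper: round `dim` up to the next multiple of `tile`.
// A tile size of 0 is treated here as "dimension not tiled", so no padding.
constexpr std::int64_t paddedSize(std::int64_t dim, std::int64_t tile) {
  return tile == 0 ? dim : (dim + tile - 1) / tile * tile;
}

// New configuration [0, 88, 50]: 1200 is not a multiple of 88, so it pads.
static_assert(paddedSize(1200, 88) == 1232, "14 tiles of 88 cover 1200");
static_assert(paddedSize(400, 50) == 400, "400 divides evenly by 50");

// Old configuration [0, 40, 100]: both dimensions divide evenly.
static_assert(paddedSize(1200, 40) == 1200, "no padding with tile 40");
static_assert(paddedSize(400, 100) == 400, "no padding with tile 100");
```

If this reading is right, the old [0,40,100] configuration needed no padding at all, which would explain why the extra buffers only show up with [0,88,50].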

Changes made to LowerL1Allocations.cpp:

  • I created a string stream and wrote debugging statements to it.
  • The string stream is only printed when the kernel does not fit in L1, that is, when the condition offset >= l1MemoryBytes is met.
  • I am attaching my modified LowerL1Allocations.cpp as a .txt file
    (LowerL1Allocations.txt) in case this makes my modifications clearer.
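The buffering pattern described in the list above can be sketched roughly as follows. This is not the code from the attached file: it uses `std::ostringstream` from the standard library rather than whatever stream class the pass actually uses, and `traceIfOverflow` is a hypothetical name.

```cpp
#include <cstdint>
#include <sstream>
#include <string>

// Hypothetical sketch: collect debug output in a string stream and only
// hand it back when the allocation does not fit. Returns an empty string
// when the kernel fits, so nothing is printed in the common case.
std::string traceIfOverflow(std::int64_t offset, std::int64_t l1MemoryBytes) {
  std::ostringstream trace;
  trace << "offset is " << offset << "\n";
  if (offset >= l1MemoryBytes) {
    trace << "l1MemoryBytes is " << l1MemoryBytes << ", so "
          << (offset - l1MemoryBytes) << " too much\n";
    return trace.str();
  }
  return std::string();
}
```

The caller would print the returned string only when it is non-empty, which keeps successful compilations quiet.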

The kernel does not fit in L1:

<eval_with_key>.0 from /home/hoppip/Quidditch/venv/lib/python3.11/site-packages/torch/fx/experimental/proxy_tensor.py:551 in wrapped:19:0: warning: Let's look at ALL the allocOps before doging ANYTHING! 

allocOp with memref shape 1 1200 

allocOp with memref shape 1 1232 

allocOp with memref shape 1 50 

allocOp with memref shape 88 50 

allocOp with memref shape 88 50 

allocOp with memref shape 1 1200 

allocOp with memref shape 1 1200 
Well,  those were all the allocOps... =_=

allocOp with memref shape 1 1200 
memref size is 8
allocElements is 1200
NOW memref size is 9600
offset is 9600

allocOp with memref shape 1 1232 
memref size is 8
allocElements is 1232
NOW memref size is 9856
offset is 19456

allocOp with memref shape 1 50 
memref size is 8
allocElements is 50
NOW memref size is 400
offset is 19856

allocOp with memref shape 88 50 
memref size is 8
allocElements is 4400
NOW memref size is 35200
offset is 55104

allocOp with memref shape 88 50 
memref size is 8
allocElements is 4400
NOW memref size is 35200
offset is 90304

allocOp with memref shape 1 1200 
memref size is 8
allocElements is 1200
NOW memref size is 9600
offset is 99904

allocOp with memref shape 1 1200 
memref size is 8
allocElements is 1200
NOW memref size is 9600
offset is 109504

allocElements is 1200
memref size is 9600
offset is 109504
l1MemoryBytes is 100000, so 9504 too much
kernel does not fit into L1 memory and cannot be compiled
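The offsets in the log above are consistent with each buffer starting at the next 64-byte-aligned offset and occupying rows x cols x 8 bytes (for example, 19856 is rounded up to 19904 before the first 88x50 buffer, giving 55104). The sketch below replays that arithmetic. The alignment rule is inferred from the logged numbers, and `Shape`, `alignUp`, `endOffset`, and `loggedAllocs` are hypothetical names, not code from LowerL1Allocations.cpp.

```cpp
#include <cstdint>
#include <vector>

struct Shape {
  std::int64_t rows;
  std::int64_t cols;
};

// Round `value` up to the next multiple of `alignment`.
constexpr std::int64_t alignUp(std::int64_t value, std::int64_t alignment) {
  return (value + alignment - 1) / alignment * alignment;
}

// Assumed model: every buffer starts at an aligned offset and the
// returned value is the offset just past the last buffer.
std::int64_t endOffset(const std::vector<Shape>& shapes,
                       std::int64_t elemBytes = 8,
                       std::int64_t alignment = 64) {
  std::int64_t offset = 0;
  for (const Shape& s : shapes) {
    offset = alignUp(offset, alignment);
    offset += s.rows * s.cols * elemBytes;
  }
  return offset;
}

// The seven allocOps in the order they appear in the log.
std::vector<Shape> loggedAllocs() {
  return {{1, 1200}, {1, 1232}, {1, 50}, {88, 50},
          {88, 50},  {1, 1200}, {1, 1200}};
}
```

Under this model, `endOffset(loggedAllocs())` is 109504, matching the log. Dropping the first 1x1200 entry (the buffer at offset 0, which is the zero-filled one in the IR dump) gives 99904, which is below an l1MemoryBytes of 100000, so the kernel would fit if that one buffer were not allocated.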

When the kernel does not fit in L1, its IR gets dumped to stderr, and we can see that an unnecessary buffer of 1x1200 elements gets allocated (lines 99-124 of 0-88-56-build-failure-dispatch-1.mlir):

 // allocate a buffer of 1x1200 elements
  %25 = "arith.constant"() <{value = 0 : index}> : () -> index
  %26 = "memref.view"(%0, %25) : (memref<100000xi8>, index) -> memref<1200xf64>
  %27 = "memref.reinterpret_cast"(%26) <{operandSegmentSizes = array<i32: 1, 0, 0, 0>, static_offsets = array<i64: 0>, static_sizes = array<i64: 1, 1200>, static_strides = array<i64: 1200, 1>}> : (memref<1200xf64>) -> memref<1x1200xf64>
  %28 = "memref.alloca"() <{alignment = 64 : i64, operandSegmentSizes = array<i32: 0, 0>}> 
  : () -> memref<1x1200xf64, #quidditch_snitch.l1_encoding>

  // set this buffer to all zeroes
  %29 = "quidditch_snitch.compute_core_index"() : () -> index
  %30 = "affine.apply"(%29) <{map = affine_map<()[s0] -> (s0 * 150)>}> : (index) -> index
  "scf.for"(%30, %1, %1) ({
  ^bb0(%arg25: index):
    %94 = "memref.subview"(%27, %arg25) <{operandSegmentSizes = array<i32: 1, 1, 0, 0>, static_offsets = array<i64: 0, -9223372036854775808>, static_sizes = array<i64: 1, 150>, static_strides = array<i64: 1, 1>}> : (memref<1x1200xf64>, index) -> memref<1x150xf64, strided<[1200, 1], offset: ?>, #quidditch_snitch.l1_encoding>
    "quidditch_snitch.memref.microkernel"(%94) ({
    ^bb0(%arg26: memref<1x150xf64, strided<[1200, 1], offset: ?>, #quidditch_snitch.l1_encoding>):
      %95 = "arith.constant"() <{value = 0.000000e+00 : f64}> : () -> f64
      "linalg.fill"(%95, %arg26) <{operandSegmentSizes = array<i32: 1, 1>}> ({
      ^bb0(%arg27: f64, %arg28: f64):
        "linalg.yield"(%arg27) : (f64) -> ()
      }) : (f64, memref<1x150xf64, strided<[1200, 1], offset: ?>, #quidditch_snitch.l1_encoding>) -> ()
    }) : (memref<1x150xf64, strided<[1200, 1], offset: ?>, #quidditch_snitch.l1_encoding>) -> ()
    "quidditch_snitch.microkernel_fence"() : () -> ()
    "scf.yield"() : () -> ()
  }) : (index, index, index) -> ()

  // after this point, the 1x1200 buffer allocated to %28 or equivalently %26 never gets used again!

The buffer assigned to %28 and %26 gets set to zero, and then is never used again.
