Delay stack size allocation until after core code generated #1104

newling · 2025-02-13T23:21:29Z

remove stack_size attribute from aie.core
rename address to stack_relative_address in amdaie.buffer and aie.buffer
compute stack size from .o / .s file.
propagate changes as needed.

This sort of works, for some reason one of the i32 batch matmul tests doesn't like this change though. Also I've hard-coded all chess and ukernel tests to stack-size = 2K, I think it should be determine the stack sizes in these cases too

jtuyls · 2025-02-14T09:31:50Z

compiler/plugins/target/AMD-AIE/aie/AIEOps.td

@@ -484,7 +484,7 @@ def AIE_BufferOp: AIE_Op<"buffer", [
  let arguments = (
    ins Index:$tile,
    OptionalAttr<StrAttr>:$sym_name,
-    OptionalAttr<I32Attr>:$address,
+    OptionalAttr<I32Attr>:$stack_relative_address,


Is this assuming that the stack is located at address 0 and use offsets relative to that? If so, I don't think we should make that assumption. To avoid bank conflicts, I will soon want to be able to assign buffers strategically across the memory banks at specific addresses and an address relative to the stack makes that impossible.

So previously if we had stackSize = 1024, address = 1040 Now we have stack_relative_address = 16 and the stack size is set later. We can set stack size to be any value larger than what the final code actually needs (largest offset from stack pointer in object dump).

So we stack size to 1024, the program is completely unchanged from before this PR. The stack is at address 0, I don't see any reason to change that. If you mean you'd like some additional memory left as a gap between the stack and the first buffer, that is easy to insert.

These are the lines where the addresses of buffers are set:

https://github.com/nod-ai/iree-amd-aie/pull/1104/files#diff-5a1e140ca056a489d6cf9766431fb5529180ec410b726425dc8a4208f8af477eL57

So before address was assigned incrementally starting at stackSize. Now, stack_relative_address is assigned incrementally starting at 0.

Can you provide more information of what you have in mind? Also, do you see this PR as moving further from your goal than what we currently have?

The problem with assigning buffers relative to the stack address is that we won't be able to control the exact locations of other buffers, for example:

stack size is 4096, buffer A0 (ping) has size 8192, buffer A1 (pong) has size 8192, buffer B0 has size 8192 etc

Addresses assigned will be:

Stack: 0 (bank 1) A0: 4096 -> 12228 (bank 1) A1: 12288 -> 20480 (bank 1 and 2) B0: 20480 -> 28672 (bank 2) B1: 28672 -> 36864 (bank 2 and 3)

Now, this will result in two DMAs operating on the same bank 2 at some points + the vector processor as well and this will result in bank conflicts.

What I would like to achieve is something like:

A0: 0 -> 8192 (bank 1) A1: 8192 -> 16384 (bank 1) B0: 16384 -> 24576 (bank 2) B1: 24576 -> 32768 (bank 2)

This assumes the stack is located somewhere else, but even if you would assume the stack is located at address 0 and we would like to use banks 2/3 for the above buffers, that can't be achieved with an address relative to the stack:

stack: 0 (bank 1) A0: 16384 -> 24576 (bank 2) A1: 24576 -> 32768 (bank 2) B0: 32768 -> 40960 (bank 3) B1: 40960 -> 49152 (bank 3)

I know this is the case as well with the current assignment as the stack is located at address 0 and buffers are appended after that and we will need better assignment strategies to avoid that. But we can't avoid these bank conflicts without granular control of where to locate buffers.

Got it. I think placing the stack somewhere other than at address 0 will be a hard task, it will be easier (for now) to start assigning buffers from the end of the memory (64K?). And the we start having banks too. Let me play around with this

Ideally, we would place all buffers after we know all sizes, but that doesn't seem easy right now as the buffer addresses are provided as an input to peano. Alternatively, we can put an upper bound on the stack size and error out if that would be exceeded, which would be an improvement already. Taking that further, we could place the stack buffer at the largest unused data memory block available and error out if that was not enough. This would work as well with more intelligent buffer placement strategies.

jtuyls reviewed Feb 14, 2025

View reviewed changes

newling added 8 commits February 18, 2025 08:34

first commit

9382355

first commit

dc1aeb2

test updates

e79acff

gate chess

f94c8d1

last attempt

bc9a193

rm

4d44d2a

squish

96a6a3c

tidy

7acf46b

newling force-pushed the jit_stack branch from 5087230 to 7acf46b Compare February 19, 2025 00:31

newling mentioned this pull request Feb 19, 2025

Function to estimate stack size #1116

Closed

generalize padd

e1cd157

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Delay stack size allocation until after core code generated #1104

Delay stack size allocation until after core code generated #1104

newling commented Feb 13, 2025 •

edited

Loading

jtuyls Feb 14, 2025

newling Feb 14, 2025

jtuyls Feb 14, 2025

newling Feb 14, 2025

jtuyls Feb 17, 2025

Delay stack size allocation until after core code generated #1104

Are you sure you want to change the base?

Delay stack size allocation until after core code generated #1104

Conversation

newling commented Feb 13, 2025 • edited Loading

jtuyls Feb 14, 2025

Choose a reason for hiding this comment

newling Feb 14, 2025

Choose a reason for hiding this comment

jtuyls Feb 14, 2025

Choose a reason for hiding this comment

newling Feb 14, 2025

Choose a reason for hiding this comment

jtuyls Feb 17, 2025

Choose a reason for hiding this comment

newling commented Feb 13, 2025 •

edited

Loading