@lenary @erisceleste @eniko yeah, though even for a contemporary x86 uarch i think it's better/not worse to `sub rsp` in the prolog and poke arguments into the frame (which is how gcc/clang codegen frames) unless you really super care about code size over speed
@joe @lenary @erisceleste @eniko there’s a stack address predictor that makes it no better, in general, but yeah, it’s also not worse.
@steve @lenary @erisceleste @eniko i was also curious how much "shadow stack" load/store forwarding makes a difference, if at all
@joe @steve I did get some of the conversation cut off, so I assumed that the discussion was about function calls. A huge limiter of performance I disregarded as irrelevant is deserialisation, which is impossible with back-to-back dependent instructions. The number of instructions that can be executed in parallel is largely irrelevant if the scheduler is constantly waiting for the last instruction to finish.
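
To make the dependent-instruction point concrete, here's a minimal sketch in C. The function names (`sum_one_chain`, `sum_four_chains`, `check_sums_agree`) are made up for illustration and nobody in the thread posted this. The first loop has a single accumulator, so each add has to wait for the one before it; the second splits the work across four accumulators that don't depend on each other. Whether the second is actually faster depends on the compiler (which may unroll or vectorise either loop on its own) and on the uarch, so treat it as a picture of the dependency structure, not a benchmark.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One accumulator: every add depends on the result of the previous add,
   so the adds form a single serial chain regardless of issue width. */
uint64_t sum_one_chain(const uint32_t *v, size_t n) {
    uint64_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += v[i];
    return acc;
}

/* Four accumulators: four independent chains that an out-of-order core
   is free to overlap. The result is identical because unsigned integer
   addition is associative and commutative. */
uint64_t sum_four_chains(const uint32_t *v, size_t n) {
    uint64_t a0 = 0, a1 = 0, a2 = 0, a3 = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        a0 += v[i];
        a1 += v[i + 1];
        a2 += v[i + 2];
        a3 += v[i + 3];
    }
    /* Leftover elements when n is not a multiple of 4. */
    for (; i < n; i++)
        a0 += v[i];
    return (a0 + a1) + (a2 + a3);
}

/* Hypothetical self-check: both versions must agree on the same input. */
void check_sums_agree(void) {
    uint32_t v[10] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
    assert(sum_one_chain(v, 10) == 55);
    assert(sum_four_chains(v, 10) == 55);
}
```

Integers are used so the two versions give exactly the same answer; with floating point the regrouping can change rounding, which is also why compilers won't do it for you without something like `-ffast-math`.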