@lenary @erisceleste @eniko yeah, though even for a contemporary x86 uarch i think it's better/not worse to `sub rsp` in the prolog and poke arguments into the frame (which is how gcc/clang codegen frames) unless you really super care about code size over speed
@joe @lenary @erisceleste @eniko there’s a stack address predictor that makes it no better, in general, but yeah, it’s also not worse.
@joe @steve from a performance perspective the only thing that really should matter is predictability. Modern superscalar CPUs break apart the instruction stream, rename all the registers and memory locations, schedule the micro-ops, and retire them. Assuming no resource is oversubscribed, the one thing that limits throughput is the branch predictor: on most modern CPUs that is one taken branch per cycle, with a misprediction penalty of around 15 cycles.
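To make the predictability point concrete, here is a minimal sketch in C. The function names `count_ge_branchy` and `count_ge_branchless` are made up for illustration and don't come from the thread. The first version takes a data-dependent branch per element, so its cost depends on how well the predictor can guess the data; the second turns the comparison into arithmetic. An optimising compiler may well emit the same machine code for both (or vectorise them), so treat this as an illustration of the idea rather than a benchmark.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: one conditional branch per element. Whether the
 * branch is taken depends on the data, so random input is hard to predict. */
size_t count_ge_branchy(const uint8_t *data, size_t n, uint8_t threshold) {
    assert(data != NULL || n == 0);
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        if (data[i] >= threshold) {
            count++;
        }
    }
    return count;
}

/* Hypothetical helper: the comparison yields 0 or 1 and is added directly,
 * so the only branch left in the source is the (well-predicted) loop branch. */
size_t count_ge_branchless(const uint8_t *data, size_t n, uint8_t threshold) {
    assert(data != NULL || n == 0);
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        count += (size_t)(data[i] >= threshold);
    }
    return count;
}
```

Both return the same count for the same input; any difference would only show up in timing, and only if the compiler keeps the branch in the first version.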
@joe @steve I did get some of the conversation cut off, so I assumed that the discussion was about function calls. A huge limiter of performance that I disregarded as irrelevant is serialisation: back-to-back dependent instructions can't be executed in parallel. The number of instructions that can be executed in parallel is largely irrelevant if the scheduler is constantly waiting for the previous instruction to finish.
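A small C sketch of the dependency-chain point, with hypothetical names (`sum_single_chain`, `sum_four_chains`) chosen for illustration. In the first loop every add needs the result of the previous one, so the adds form one serial chain. The second loop keeps four independent accumulators, giving an out-of-order core four chains it can overlap. Floating point is used on purpose: without flags like `-ffast-math` a compiler is generally not allowed to reassociate the adds itself. For general inputs the two versions can differ by rounding.

```c
#include <assert.h>
#include <stddef.h>

/* One accumulator: each add depends on the previous add's result,
 * so the loop is bounded by the latency of a single add. */
double sum_single_chain(const double *x, size_t n) {
    assert(x != NULL || n == 0);
    double acc = 0.0;
    for (size_t i = 0; i < n; i++) {
        acc += x[i];
    }
    return acc;
}

/* Four accumulators: four independent dependency chains that the
 * scheduler can keep in flight at the same time. */
double sum_four_chains(const double *x, size_t n) {
    assert(x != NULL || n == 0);
    double a0 = 0.0, a1 = 0.0, a2 = 0.0, a3 = 0.0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        a0 += x[i];
        a1 += x[i + 1];
        a2 += x[i + 2];
        a3 += x[i + 3];
    }
    /* Handle the 0..3 leftover elements. */
    for (; i < n; i++) {
        a0 += x[i];
    }
    return (a0 + a1) + (a2 + a3);
}
```

The choice of four accumulators is arbitrary here; the useful number depends on the add latency and throughput of the particular core.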