**Joe Groff** @joe@f.duriansoftware.com · Oct 30, 2024, 20:14

@lenary @erisceleste @eniko yeah, though even for a contemporary x86 uarch i think it's better/not worse to `sub rsp` in the prolog and poke arguments into the frame (which is how gcc/clang codegen frames) unless you really super care about code size over speed

**Steve Canon** @steve@discuss.systems · Oct 30, 2024, 21:33

@joe @lenary @erisceleste @eniko there’s a stack address predictor that makes it no better, in general, but yeah, it’s also not worse.

**Joe Groff** @joe@f.duriansoftware.com · Oct 30, 2024, 21:56

@steve @lenary @erisceleste @eniko i was also curious how much "shadow stack" load/store forwarding makes a difference, if at all

**Zoe** @ekg@librem.one · Oct 30, 2024, 22:08

@joe @steve from a performance perspective the only thing that really should matter is predictability. Modern superscalar CPUs break apart the instruction stream, rename all the registers and memory locations, schedule the micro-ops, and retire them. Assuming no resource is oversubscribed, the one thing that limits throughput is the branch predictor: on most modern CPUs that is one taken branch per cycle, with a misprediction penalty of about 15 cycles.
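
To make the predictability point concrete, here is a minimal sketch in C, not taken from the thread. The function names `count_branchy` and `count_branchless` are hypothetical. The first version has a data-dependent branch that the predictor has to guess on every element; the second computes the same count with no branch in the loop body. Whether the two actually compile to different machine code depends on the compiler and flags, since optimizers often turn the first form into the second on their own.

```c
#include <stddef.h>

/* Data-dependent branch: cheap when the comparison outcome is
   predictable, costly when it looks random to the predictor. */
size_t count_branchy(const int *v, size_t n, int threshold) {
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        if (v[i] >= threshold) {
            count++;
        }
    }
    return count;
}

/* Same result without a branch in the loop body: the comparison
   yields 0 or 1, which is added unconditionally. */
size_t count_branchless(const int *v, size_t n, int threshold) {
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        count += (size_t)(v[i] >= threshold);
    }
    return count;
}
```

On sorted or otherwise regular input both versions should behave about the same; any difference would be expected to show up on input where the comparison result is effectively random.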

**Zoe** @ekg@librem.one · Oct 30, 2024, 22:53

@joe @steve Some of the conversation got cut off for me, so I assumed the discussion was about function calls. A huge performance limiter that I disregarded as irrelevant is serialisation: back-to-back dependent instructions cannot be run in parallel. The number of instructions that can execute in parallel is largely irrelevant if the scheduler is constantly waiting for the previous instruction to finish.
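
The effect of back-to-back dependent instructions can be sketched in C. This is an illustration rather than code from the thread, and the names `sum_chained` and `sum_split` are hypothetical. It uses `double` because a compiler may not reorder floating-point additions without flags such as `-ffast-math`, so the dependency structure written in the source is likely to survive into the generated code.

```c
#include <stddef.h>

/* One accumulator: each add needs the result of the previous add,
   so the loop is a single serial dependency chain. */
double sum_chained(const double *v, size_t n) {
    double acc = 0.0;
    for (size_t i = 0; i < n; i++) {
        acc += v[i];
    }
    return acc;
}

/* Four accumulators: four independent chains that an out-of-order
   core is free to overlap. */
double sum_split(const double *v, size_t n) {
    double a0 = 0.0, a1 = 0.0, a2 = 0.0, a3 = 0.0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        a0 += v[i];
        a1 += v[i + 1];
        a2 += v[i + 2];
        a3 += v[i + 3];
    }
    /* Handle the remaining 0 to 3 elements. */
    for (; i < n; i++) {
        a0 += v[i];
    }
    return (a0 + a1) + (a2 + a3);
}
```

The two functions add the same values in a different order, so for general floating-point input the results can differ in the last bits. Any speedup from the split version depends on the hardware and compiler and is not guaranteed.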