UrbanCyborg wrote: ↑Wed Oct 19, 2022 8:02 pm
Once the processors started doing on-the-fly optimizations, sometime around the x686, as I recall, any calculations you could do for clock cycles and latency went out the window, because you had no idea of what the CPU was going to do to your finely-crafted assembly code.

Very true. Long gone are the days when you could calculate execution time by hand. Even profiling is of little use unless you have a wide range of CPUs to run tests on.
And things are only going to get more disconnected as CPUs add more pipelines and move to deeper speculative execution rather than "mere" branch prediction.
It obviously used to be the case that an addition was always cheaper than a multiplication, or that 32-bit arithmetic was cheaper than 64-bit, but steadily everything is moving towards a uniform "do arithmetic as fast as is theoretically possible" state of affairs. So I now code on the assumption that every operation has the same cost except for memory access. Hence what I said about look-up tables. Older CPUs will still take longer to do a double-precision floating-point op than a single-precision op, but I don't think it's really worth worrying about, because in just a few years' time things will have moved on.
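To make the look-up-table point concrete, here's a minimal sketch (the class and sizes are my own illustration, not from anyone's real code): a precomputed sine table trades arithmetic for a memory access. Under the old cost model the table wins; under the "everything is cheap except memory" model it may well lose to just calling Math.sin, because the table access risks a cache miss.

```java
// Sketch: nearest-entry sine look-up table vs direct computation.
// Whether this beats Math.sin depends entirely on whether TABLE stays
// in cache -- on a modern CPU the arithmetic is often the cheaper path.
public class SinTable {
    static final int N = 4096;              // power of two so we can wrap with a mask
    static final double[] TABLE = new double[N];
    static {
        for (int i = 0; i < N; i++) {
            TABLE[i] = Math.sin(2 * Math.PI * i / N);
        }
    }

    // Approximate sin(x) by the nearest table entry; error is bounded
    // by roughly half the table step (about pi/N here).
    static double sinLookup(double x) {
        int i = (int) Math.round(x / (2 * Math.PI) * N) & (N - 1);
        return TABLE[i];
    }

    public static void main(String[] args) {
        double err = Math.abs(sinLookup(1.0) - Math.sin(1.0));
        System.out.println("error = " + err);
    }
}
```

The accuracy cost is real too: a 4096-entry nearest-neighbour table is only good to about three decimal places, which is another reason tables have fallen out of favour now that the FPU is fast.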
Something I've been thinking about recently is how the Java Virtual Machine might increasingly be able to adapt automatically to varying CPUs. HotSpot is over a million lines of C++, so I wouldn't want to hazard a guess at how rapidly it will improve, but in principle a JIT compiler ought to be able to optimize for whatever the host CPU is. This gives Java (and JVM languages like Kotlin etc.) a potentially huge advantage over native languages like C++, where optimization is frozen in time at the point the compiler runs on the developer's machine and so can't adapt to improvements in users' machines.
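The kind of code where this matters is just an ordinary hot loop. A sketch (my own illustrative example, not anything from HotSpot itself): the method below is plain scalar Java, but once it runs hot enough, HotSpot's optimizing compiler can unroll it and, for loop shapes it recognizes, emit SIMD instructions matched to whatever the host CPU actually supports, a decision a C++ compiler had to make once, at build time.

```java
// Sketch: a hot loop the JVM can optimize for the host CPU at run time.
// The source says nothing about SSE/AVX; the JIT decides after profiling.
public class SumLoop {
    static long sum(int[] a) {
        long s = 0;
        for (int v : a) {
            s += v;   // simple reduction -- a shape JITs know how to unroll
        }
        return s;
    }

    public static void main(String[] args) {
        int[] a = new int[1_000_000];
        java.util.Arrays.fill(a, 3);
        // Call the method repeatedly so it becomes "hot" and gets compiled
        // by the optimizing tier rather than staying interpreted.
        long s = 0;
        for (int i = 0; i < 20; i++) s = sum(a);
        System.out.println(s);
    }
}
```

Running with -XX:+PrintCompilation shows the method being promoted through the compilation tiers, which is the adaptation step that an ahead-of-time compiled binary simply doesn't have.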