The previous commit corrected the RNG behavior for the lightbaker but
also made it significantly slower on high core count systems. Due to the
vector of states being physically close together in RAM we force a cache
synchronization across all cores whenever we call for the next random
number to be generated.
This will create a temporary local copy of the RNG state before entering
the loop and then saving it back to the global state when done. This
will preserve the per-thread RNG state (and random number quality) while
significantly improving performance.
On my 16 thread box it saves 3 minutes baking the Sponza scene, bringing
performance back in line to before the various RNG fixes were
introduced, being slightly faster than the first implementation.
In our previous attempts to fix the lightmapper we may have
inadvertently introduced the same issue we were trying to fix. It
appears that rand() will on some platforms introduce a mutex making it
slower and on others may have a per-thread state that would need to be
initialized with srand() on each thread. This slows down the lightbaking
further.
This sets up a separate rng state for each OpenMP thread by calling
rand() only in the single-threaded part of the code. We then keep a
vector of states.
I believe this solves our problems.
Due to memory contraints in other places in Godot it is unlikely that
anything higher than 1024 will actually work. When/if we improve memory
management for vectors we can increase this limit again
Based off of perf-based prediction misses these seem to be the
lowest-hanging fruit for quick (albeit small) improvements. These are
based on:
* baking a complex lightmap
* running platformer 3d
* running goltorus
On higher threadcount systems this allows for better utilization. On my
16 thread box CPU use goes from 10 - 11 threads to a steady 15 threads
on the Sponza scene.
Baking time goes from ~10:00 to ~07:30 for me. On lower threadcount
systems I expect some improvement also but likely a little less.
This speeds up the lightmapper by about 10% with no visible impact. A
comparison is up here:
https://tmm.cx/nextcloud/s/Log1eAXen1dJzBz
AMD Ryzen 7 1700 Eight-Core Processor
Sponza scene
pcg32
256/256/high 00:10:13
256/256/medium 00:02:50
256/256/low 00:01:11
xorshift
256/256/high 00:09:32
256/256/medium 00:02:34
256/256/low 00:01:05