I was just looking through the codebase and wondering: would any of the thread-locals benefit from being per-CPU caches instead of per-thread? This would of course only apply on Linux, as I don't think any other OS supports it. It may not be worth it, though, since rayon is mostly used, which means you normally have the same number of threads as CPUs, so there is no memory benefit. But I believe it can be faster than thread-locals due to cache line contention.
More info on what I mean here:
https://google.github.io/tcmalloc/rseq.html
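
To make the idea a bit more concrete, here is a rough sketch of the shape I have in mind. It is only an approximation: it picks a slot with glibc's `sched_getcpu()` and still uses an atomic per slot, because a thread can migrate between reading the CPU id and touching the slot. The rseq approach in the link is what would remove that atomic. `PerCpuCounter`, `Slot`, and `current_cpu` are names I made up for illustration, nothing here exists in the codebase, and the `unsafe extern` syntax assumes a reasonably recent Rust toolchain (1.82 or later).

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Linux only: ask libc which CPU the calling thread is currently running on.
#[cfg(target_os = "linux")]
fn current_cpu() -> usize {
    unsafe extern "C" {
        fn sched_getcpu() -> i32;
    }
    // sched_getcpu returns -1 on error; fall back to slot 0 in that case.
    let cpu = unsafe { sched_getcpu() };
    if cpu < 0 { 0 } else { cpu as usize }
}

// Other platforms: no per-CPU lookup in this sketch, so everything shares slot 0.
#[cfg(not(target_os = "linux"))]
fn current_cpu() -> usize {
    0
}

// Align each slot to 64 bytes (an assumed cache line size) to avoid false sharing.
#[repr(align(64))]
struct Slot(AtomicU64);

// Hypothetical per-CPU sharded counter, standing in for a per-CPU cache.
pub struct PerCpuCounter {
    slots: Vec<Slot>,
}

impl PerCpuCounter {
    pub fn new(num_slots: usize) -> Self {
        let n = num_slots.max(1);
        Self {
            slots: (0..n).map(|_| Slot(AtomicU64::new(0))).collect(),
        }
    }

    // Add to the slot of the CPU we are (probably still) running on.
    // The atomic keeps this correct even if the thread migrates mid-call.
    pub fn add(&self, n: u64) {
        let idx = current_cpu() % self.slots.len();
        self.slots[idx].0.fetch_add(n, Ordering::Relaxed);
    }

    // Combine all slots to get the total.
    pub fn sum(&self) -> u64 {
        self.slots.iter().map(|s| s.0.load(Ordering::Relaxed)).sum()
    }
}
```

You would create it with something like `PerCpuCounter::new(num_cpus)` and call `add` from any thread. With rayon's default pool the number of slots ends up about the same as the number of thread-locals, which is why I don't expect a memory win in that case.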