Something like the eve library could work well here, since it would let the same code run on different architectures. I'm specifically interested in an SSE4.1 or AVX implementation.
I could probably look into this sometime next month if I have time.