"Some benchmarks I have seen shows 2.8 times improvement
"
Yep, very misleading; I picked the best case.
But for neural networks and many other ML/CV workloads, 8-bit data can be used in NEON.
16 x 8-bit SIMD lanes = 128 bits; the trick is to keep as much as possible in the AArch64 registers.
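To make the 16 x 8-bit point concrete, here is a rough sketch of an int8 dot product, the sort of inner loop a quantized NN layer spends its time in. The function name dot_i8 is made up for illustration, and this is only one way to write it, not a tuned routine. It uses standard NEON intrinsics from arm_neon.h when the compiler defines __ARM_NEON, and falls back to plain C elsewhere so it still builds on a PC.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Hypothetical helper: dot product of two int8 vectors of length n.
 * On NEON, each loop step loads 16 x int8 (one 128-bit register) per input. */
int32_t dot_i8(const int8_t *a, const int8_t *b, size_t n)
{
    size_t i = 0;
    int32_t sum = 0;

#if defined(__ARM_NEON)
    int32x4_t acc = vdupq_n_s32(0);           /* four 32-bit accumulators */
    for (; i + 16 <= n; i += 16) {
        int8x16_t va = vld1q_s8(a + i);       /* 16 lanes of int8 */
        int8x16_t vb = vld1q_s8(b + i);
        /* Widening multiply: 8-bit x 8-bit -> 16-bit, eight lanes at a time. */
        int16x8_t lo = vmull_s8(vget_low_s8(va), vget_low_s8(vb));
        int16x8_t hi = vmull_s8(vget_high_s8(va), vget_high_s8(vb));
        /* Pairwise add the 16-bit products and accumulate into 32-bit lanes. */
        acc = vpadalq_s16(acc, lo);
        acc = vpadalq_s16(acc, hi);
    }
    /* Horizontal sum of the four 32-bit lanes. */
    int32x2_t s = vadd_s32(vget_low_s32(acc), vget_high_s32(acc));
    s = vpadd_s32(s, s);
    sum = vget_lane_s32(s, 0);
#endif

    /* Scalar tail (or the whole job when NEON is not available). */
    for (; i < n; i++)
        sum += (int32_t)a[i] * (int32_t)b[i];
    return sum;
}
```

The intrinsics used here exist on both 32-bit NEON and AArch64, so the same source compiles in either mode (32-bit GCC typically needs -mfpu=neon). How much faster the 64-bit build actually runs is something to measure, not assume.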
Putting stuff into NEON takes twice as many cycles on 32-bit Arm.
Understood, generic applications are not the reason to go to 64-bit.
It may come down to hand-tuned assembly routines for CV/ML/NN to get the best performance from a single-purpose application.
Think single-purpose AI for IoT, not general-purpose AI.
If I have to use assembly, I would rather learn to do it in 64-bit mode.
The simpler 64-bit instruction set should be easier to learn, and there are twice as many registers.
If everything can be kept in cache, external memory access speed is less relevant.
Not sure what 64-bit floating point is needed for.
Plus I have to factor in my age: by the time I actually learn to use it, 90-99% of Pis made will be 64-bit.
Not sure if I have time left to learn the more complex 32-bit; it is just a hobby, sort of.
Hopefully a 64-bit Zero becomes available too; A35s use less power, have a smaller die and should be cheaper.