TC12 x86 gcc 10 lua-5.4.4
vanilla "make linux-readline" with -O2
tc@box:/tmp/lua-5.4.4$ time src/lua -e 'for i=1,1000000 do end'
real    0m 0.05s
user    0m 0.05s
sys     0m 0.00s
tc@box:/tmp/lua-5.4.4$ ls -l src/liblua.a src/lua src/luac
-rw-r--r--    1 tc       staff       406946 Sep 17 10:22 src/liblua.a
-rwxr-xr-x    1 tc       staff       331708 Sep 17 10:22 src/lua
-rwxr-xr-x    1 tc       staff       228748 Sep 17 10:22 src/luac
MYCFLAGS= -fno-unwind-tables -fno-asynchronous-unwind-tables
tc@box:/tmp/lua-5.4.4$ time src/lua -e 'for i=1,1000000 do end'
real    0m 0.07s
user    0m 0.06s
sys     0m 0.00s
tc@box:/tmp/lua-5.4.4$ ls -l src/liblua.a src/lua src/luac
-rw-r--r--    1 tc       staff       319930 Sep 17 10:29 src/liblua.a
-rwxr-xr-x    1 tc       staff       245692 Sep 17 10:29 src/lua
-rwxr-xr-x    1 tc       staff       179596 Sep 17 10:29 src/luac
CFLAGS= -Os -Wall -Wextra -DLUA_COMPAT_5_3 $(SYSCFLAGS) $(MYCFLAGS)
MYCFLAGS= -fno-unwind-tables -fno-asynchronous-unwind-tables
tc@box:/tmp/lua-5.4.4$ time src/lua -e 'for i=1,1000000 do end'
real    0m 0.07s
user    0m 0.07s
sys     0m 0.00s
tc@box:/tmp/lua-5.4.4$ ls -l src/liblua.a src/lua src/luac
-rw-r--r--    1 tc       staff       265170 Sep 17 10:31 src/liblua.a
-rwxr-xr-x    1 tc       staff       203164 Sep 17 10:31 src/lua
-rwxr-xr-x    1 tc       staff       144480 Sep 17 10:31 src/luac
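To put the size difference in numbers, here is a minimal sketch that compares the vanilla -O2 build with the -Os build without unwind tables. The byte counts are copied from the ls -l listings above; the helper name shrink is made up for this illustration.

```shell
# Hypothetical helper: print how much smaller a file became.
# $1 = file name, $2 = size of the vanilla -O2 build, $3 = size of the -Os build
shrink() {
    # integer percentage saved, rounded down by shell arithmetic
    echo "$1: $(( ($2 - $3) * 100 / $2 ))% smaller"
}

# Sizes in bytes, taken from the ls -l output above
shrink liblua.a 406946 265170
shrink lua      331708 203164
shrink luac     228748 144480
```

This prints 34% for liblua.a, 38% for lua and 36% for luac.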
 
As we can see, everything is all right in Lua 5.4: the loop finishes in 0.05 to 0.07 s with every set of flags.
tc@box:/tmp/lua-5.4.4$ src/luac -l -
for i=1,1000000 do end
main <stdin:0,0> (7 instructions at 0x9510490)
0+ params, 4 slots, 1 upvalue, 4 locals, 1 constant, 0 functions
        1       [1]     VARARGPREP      0
        2       [1]     LOADI           0 1
        3       [1]     LOADK           1 0     ; 1000000
        4       [1]     LOADI           2 1
        5       [1]     FORPREP         0 0     ; exit to 7
        6       [1]     FORLOOP         0 1     ; to 6
        7       [1]     RETURN          0 1 1   ; 0 out
In Lua 5.3, the implementation of the FORPREP or FORLOOP bytecodes seems to include something that drives gcc 10's optimizations insane. Maybe this can be explored to find out what exactly causes the trouble. But gcc 11 works better, at least under the circumstances described.
@Rich, what's your opinion: should we dive deeper? Would it be useful?