Tiny Core Linux

Tiny Core Extensions => TCE Tips & Tricks => Topic started by: Vaguiner on July 18, 2025, 01:05:25 AM

Title: Faster packages
Post by: Vaguiner on July 18, 2025, 01:05:25 AM
I've discovered some interesting things recently. First, and certainly most relevant, is why most tinycore programs are noticeably slower than the same programs in other distributions. The villain is:

Code: [Select]
export CFLAGS="-mtune=generic -Os -pipe"
export CXXFLAGS="-mtune=generic -Os -pipe"

-mtune=generic only tunes the code for a generic CPU; on x86_64 the compiler still will not use anything above the SSE2 baseline unless -march says otherwise.
In my tests, though, -mtune=generic turned out to be irrelevant, because with -Os the compiler does not even auto-vectorize with SSE2.
So yes, -Os is extreme overkill.
It prevented auto-vectorization with 16-byte vectors (SSE2, common since 2001), and even with 8-byte vectors (MMX, common since 1997).
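
As a rough sketch of the alternative being argued for here, the same exports with -O2 in place of -Os could look like this. The -O2 value is taken from the test commands posted later in this thread, not from the official TC build scripts, and the gcc -Q query is only one way to see which vectorizer passes a given level turns on with whatever GCC version is installed.

```shell
# Hypothetical replacement for the flags above: -O2 instead of -Os.
export CFLAGS="-mtune=generic -O2 -pipe"
export CXXFLAGS="-mtune=generic -O2 -pipe"

# Ask the installed gcc which vectorizer passes each level enables.
# Skipped quietly when gcc is not installed.
if command -v gcc >/dev/null 2>&1; then
    for level in -Os -O2; do
        echo "== $level =="
        gcc -Q "$level" --help=optimizers | grep -i vectorize || true
    done
fi
```

The output differs between GCC releases, so it is worth running on the same toolchain that builds the extensions.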

At the same time, I discovered how relatively easy it is to get GitHub to compile your tcz. For this reason, I'm gradually recompiling everything relevant to performance that I like, since I use tinycore on a daily basis.
https://github.com/vaguinerg/tinycore-autoupdated-extensions
Title: Re: Faster packages
Post by: nick65go on July 18, 2025, 04:41:02 AM
Thank you! This should be in the TC wiki in bold.
Maybe this info is common knowledge for most people, but I did not know the details.
I understand that TinyCore "must" be compatible with 486 CPUs and be small in size. But I assumed that the compiled binaries contain a forked code path that chooses better CPU instructions (SSE2, MMX, etc.) because of the -march and -mtune parameters.

One solution could be to do what openSUSE did and ship libraries compiled multiple times by GCC (in different sub-folders), or what Clear Linux did, with multiple code paths inside the same executable/library. The programs could then automatically choose which optimized *.so libraries to load for the CPU. --> I would bet it will not happen in TC because of the limited (free of charge) man-hours available to the TC team.

Missing those advanced instructions, if you are right, is a bottleneck for any normal/modern CPU made in the last 5-10 years (assuming you buy a new machine within 10 years, because of capitalism's intentional planning for obsolescence / oblivion).
Title: Re: Faster packages
Post by: Vaguiner on July 18, 2025, 06:45:05 AM
Quote
Missing those advanced instructions, if you are right, is a bottleneck for any normal/modern CPU made in the last 5-10 years (assuming you buy a new machine within 10 years, because of capitalism's intentional planning for obsolescence / oblivion).

This is a bottleneck for processors launched since 2003. I also forgot to mention that I meant the 64-bit version of tinycore.

The first x86-64 processor marketed worldwide, launched in 2003, came with SSE2. With -Os, SSE2 vectorization goes unused.
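
To see what a given machine actually offers, a minimal sketch like the one below reads the flags line from /proc/cpuinfo. The function name cpu_has_flag is made up for this example. Every x86-64 CPU should report sse2; avx2 is included only as an example of a wider extension.

```shell
# cpu_has_flag FLAG [FILE]: succeed if FLAG is listed on the first "flags"
# line of FILE (default /proc/cpuinfo). Hypothetical helper name.
cpu_has_flag() {
    grep -m1 '^flags' "${2:-/proc/cpuinfo}" 2>/dev/null \
        | tr ' ' '\n' \
        | grep -qx "$1"
}

for flag in sse2 avx2; do
    if cpu_has_flag "$flag"; then
        echo "$flag: yes"
    else
        echo "$flag: no"
    fi
done
```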

It's a performance reduction with absolutely no real gain: increasing the tcz's squashfs block size a little will reduce the tcz's size more than -Os does.
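
One way to check the block size claim on a real package is sketched below. It assumes mksquashfs from squashfs-tools is installed and that the extension's files are already unpacked into a directory. The helper name compare_block_sizes, the /tmp output paths and the two block sizes (4096 and 131072 bytes) are illustrative assumptions, not values taken from the TC build scripts.

```shell
# compare_block_sizes DIR: build DIR as squashfs with two block sizes and
# print "<block size> <image size in bytes>" for each. Hypothetical helper.
compare_block_sizes() {
    src=$1
    [ -d "$src" ] || { echo "not a directory: $src" >&2; return 1; }
    command -v mksquashfs >/dev/null 2>&1 \
        || { echo "mksquashfs not found" >&2; return 1; }
    for bs in 4096 131072; do
        out="/tmp/blocksize-test-$bs.tcz"
        mksquashfs "$src" "$out" -b "$bs" -noappend >/dev/null || return 1
        echo "$bs $(wc -c < "$out")"
        rm -f "$out"
    done
}

# Usage (the directory name is a placeholder):
#   compare_block_sizes ./my-extension-root
```

Repeating the comparison on trees built with -Os and with -O2 would show which of the two choices matters more for the final tcz size.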
Title: Re: Faster packages
Post by: nick65go on July 18, 2025, 07:37:07 AM
@Vaguiner: Maybe you are right, but you are barking up the wrong tree.
It depends on what you value most in life. Ex: for me it is time, using the tool/machine instead of tuning, patching and debugging it. In 2025 the PRICE of a mid-range laptop is .. not significant [if you are employed], provided the PC does not crash for 5-8 years. [Measure it as how many downtown dinners or beers you skip to buy a brand new PC today.]

With a better CPU (P+E cores today) you have more L1/L2 CPU cache. Plus a faster SSD/NVMe instead of an HDD, plus faster DDR RAM timings, plus faster USB3, a PCIe 5 bus, etc. And using a Linux kernel like the one from CachyOS you are "in business" in a few minutes. When SPEED is your goal, why would you use TC if you have a "modern" machine? I am not even talking about auditing / security, or "reproducibility" of packages if you use it as a money-making machine.

Vice versa, if you have a slow/old machine, in the 486-686 range, any optimization you make will not give a significant speedup on old hardware (if it still works after decades). TC is very good for LEARNING and play.
Title: Re: Faster packages
Post by: nick65go on July 18, 2025, 11:51:18 AM
When you talk about "faster" packages, do you think about MEASURED speed increase in TC, as small as 0.5%-4%?

or big like 4x-20x:
https://www.phoronix.com/forums/forum/software/general-linux-open-source/1551009-squashfs-tools-4-7-released-20-to-more-than-ten-times-faster 

or something like this:
https://www.phoronix.com/forums/forum/phoronix/latest-phoronix-articles/1561904-new-ffmpeg-avx-512-optimizations-hit-up-to-36x-the-performance-of-plain-c-code/page2
Q.E.D.
Title: Re: Faster packages
Post by: Vaguiner on July 18, 2025, 03:51:02 PM
Quote
When SPEED is your goal, why would you use TC if you have a "modern" machine?

Because:
(https://i.imgur.com/gUXS6o6.png)
Speed is my goal in this thread, whose subject is speed. Speed is not my goal in general for how I use the system.

Quote
...do you think about MEASURED speed increase...

Unfortunately I don't have any synthetic numbers to offer you at the moment. The only thing I know is that the gain, for me, is VISUALLY noticeable in the responsiveness of the programs.
Title: Re: Faster packages
Post by: Vaguiner on July 18, 2025, 09:41:08 PM
It's also worth remembering that

http://tinycorelinux.net/dCore/x86_64/import/src/kernel-4.8.17/README

dCore was compiled correctly, with -O2, which allows the correct use of mtune=generic.

gcc -Os == tcc


While at http://tinycorelinux.net/16.x/x86_64/release/src/toolchain/compile_tc16_x86_64

We see the “-mtune=generic -Os” sequence repeated over and over.
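
To put a number on that repetition, a small sketch like this counts literal occurrences of a flag in a local copy of the build script. It assumes the script has already been downloaded into the current directory under its original name; count_flag is a made-up helper name.

```shell
# count_flag FLAG FILE: count literal occurrences of FLAG in FILE.
# Hypothetical helper name.
count_flag() {
    grep -o -F -e "$1" "$2" | wc -l | tr -d ' '
}

# Assumes the script was saved locally first, for example with:
#   wget http://tinycorelinux.net/16.x/x86_64/release/src/toolchain/compile_tc16_x86_64
script=compile_tc16_x86_64
if [ -f "$script" ]; then
    echo "-Os: $(count_flag -Os "$script")"
    echo "-O2: $(count_flag -O2 "$script")"
fi
```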
Title: Re: Faster packages
Post by: CNK on July 19, 2025, 03:56:32 AM
Quote
Unfortunately I don't have any synthetic numbers to offer you at the moment. The only thing I know is that the gain, for me, is VISUALLY noticeable in the responsiveness of the programs.

I see you've built MicroPython. There are benchmark scripts like this one (https://github.com/shaoziyang/micropython_benchmarks) you could try with it.

It's not a problem for me personally, I like that TCL keeps things as small as possible. But I still run TCL on an Intel Core 2 and nothing nearly as new as Intel Alder Lake.
Title: Re: Faster packages
Post by: Vaguiner on July 19, 2025, 12:34:34 PM
Quote
you could try with it...

For unrealistic synthetic tests I definitely wouldn't use micropython. However, I present you with a more interesting battery of tests.

Code: [Select]
gcc -o test test.c -Os -mtune=generic -fopt-info-vec -lm ; du test ; time ./test
20.0K   test
Test               | Instructions     | Cycles          | IPC
--------------------+------------------+------------------+--------
Vector Add         |         60000073 |         28784599 | 2.08
Vector Multiply    |         60000070 |         29766303 | 2.02
FMA Operation      |         70000072 |         34251292 | 2.04
Multi Operations   |         90000071 |         34479365 | 2.61
Polynomial         |        150000072 |         41241200 | 3.64
Linear Interp      |        100000073 |         34798249 | 2.87
Sqrt Approx        |        180000193 |        112369036 | 1.60
Scale Offset       |         70000070 |         25323119 | 2.76
Horizontal Add     |         60000071 |         31063022 | 1.93
Vector Min         |         70000071 |         31168168 | 2.25
Mask Blend         |        135010074 |         47303963 | 2.85
Memory Bandwidth   |         80000080 |        120310011 | 0.66
Conditional Sum    |         74990073 |         41280248 | 1.82
Loop Fusion        |        120000075 |         54537656 | 2.20
Prefix Sum         |         60000074 |        120051800 | 0.50
Mul Add Const      |         70000070 |         24822415 | 2.82
Linear Blend       |        100000072 |         34598789 | 2.89
--------------------+------------------+------------------+--------
TOTAL              |       1550001354 |        846149235 | 1.83
real    0m 1.42s
user    0m 1.36s
sys     0m 0.02s

Code: [Select]
gcc -o test test.c -O2 -s -mtune=generic -fopt-info-vec -lm ; du test ; time ./test
test.c:73:1: optimized: loop vectorized using 16 byte vectors
test.c:79:1: optimized: loop vectorized using 16 byte vectors
test.c:85:1: optimized: loop vectorized using 16 byte vectors
test.c:91:1: optimized: loop vectorized using 16 byte vectors
test.c:97:1: optimized: loop vectorized using 16 byte vectors
test.c:104:1: optimized: loop vectorized using 16 byte vectors
test.c:116:1: optimized: loop vectorized using 16 byte vectors
test.c:122:1: optimized: loop vectorized using 16 byte vectors
test.c:130:1: optimized: loop vectorized using 16 byte vectors
test.c:136:1: optimized: loop vectorized using 16 byte vectors
test.c:167:1: optimized: loop vectorized using 16 byte vectors
test.c:167:1: optimized: loop vectorized using 16 byte vectors
test.c:185:1: optimized: loop vectorized using 16 byte vectors
test.c:213:23: optimized: loop vectorized using 16 byte vectors
20.0K   test
Test               | Instructions     | Cycles          | IPC
--------------------+------------------+------------------+--------
Vector Add         |         15000074 |         25512282 | 0.59
Vector Multiply    |         15000072 |         26700252 | 0.56
FMA Operation      |         17500072 |         32290420 | 0.54
Multi Operations   |         22500073 |         32629001 | 0.69
Polynomial         |         37500075 |         21941139 | 1.71
Linear Interp      |         25000074 |         32907018 | 0.76
Sqrt Approx        |        120000078 |        110427209 | 1.09
Scale Offset       |         17500075 |         21367514 | 0.82
Horizontal Add     |         35000073 |         30905302 | 1.13
Vector Min         |         15000072 |         26653637 | 0.56
Mask Blend         |         30000078 |         32798693 | 0.91
Memory Bandwidth   |         80000083 |        120064720 | 0.67
Conditional Sum    |         70000073 |         32663454 | 2.14
Loop Fusion        |         30000079 |         47302674 | 0.63
Prefix Sum         |         50000067 |         31552173 | 1.58
Mul Add Const      |         17500072 |         21308674 | 0.82
Linear Blend       |         25000074 |         32729043 | 0.76
--------------------+------------------+------------------+--------
TOTAL              |        622501264 |        679753205 | 0.92
real    0m 1.07s
user    0m 1.04s
sys     0m 0.01s

Code: [Select]
gcc -o test test.c -Ofast -march=native -fopt-info-vec-optimized -fmerge-all-constants -fno-semantic-interposition -ftree-vectorize -fipa-pta -funroll-loops -floop-nest-optimize -lm ; du test ; time ./test
test.c:104:1: optimized: loop vectorized using 32 byte vectors
test.c:185:1: optimized: loop vectorized using 32 byte vectors
test.c:167:1: optimized: loop vectorized using 32 byte vectors
test.c:136:1: optimized: loop vectorized using 32 byte vectors
test.c:130:1: optimized: loop vectorized using 32 byte vectors
test.c:122:1: optimized: loop vectorized using 32 byte vectors
test.c:116:1: optimized: loop vectorized using 32 byte vectors
test.c:104:1: optimized: loop vectorized using 32 byte vectors
test.c:97:1: optimized: loop vectorized using 32 byte vectors
test.c:91:1: optimized: loop vectorized using 32 byte vectors
test.c:85:1: optimized: loop vectorized using 32 byte vectors
test.c:79:1: optimized: loop vectorized using 32 byte vectors
test.c:73:1: optimized: loop vectorized using 32 byte vectors
test.c:110:1: optimized: loop vectorized using 32 byte vectors
test.c:47:12: optimized: basic block part vectorized using 8 byte vectors
test.c:213:23: optimized: loop vectorized using 32 byte vectors
28.0K   test
Test               | Instructions     | Cycles          | IPC
--------------------+------------------+------------------+--------
Vector Add         |          4218816 |         26519003 | 0.16
Vector Multiply    |          4218812 |         27630546 | 0.15
FMA Operation      |          5468814 |         33934769 | 0.16
Multi Operations   |          4218812 |         27554213 | 0.15
Polynomial         |          6718814 |         21462602 | 0.31
Linear Interp      |          6718815 |         33417605 | 0.20
Sqrt Approx        |          9218813 |         21445641 | 0.43
Scale Offset       |          4218813 |         21344567 | 0.20
Horizontal Add     |          4218818 |         13024590 | 0.32
Vector Min         |          4218812 |         27421033 | 0.15
Mask Blend         |          5468815 |         33027008 | 0.17
Memory Bandwidth   |         53750070 |        120068496 | 0.45
Conditional Sum    |         49380064 |         34571463 | 1.43
Loop Fusion        |          6875064 |         41591792 | 0.17
Prefix Sum         |         23333394 |         31672422 | 0.74
Mul Add Const      |          4218812 |         21295353 | 0.20
Linear Blend       |          6718815 |         33483167 | 0.20
--------------------+------------------+------------------+--------
TOTAL              |        203183173 |        569464270 | 0.36
real    0m 0.89s
user    0m 0.85s
sys     0m 0.01s

There was no change in the size of the binary or the number of compatible devices between compiling with -Os and -O2 -s, but there was a significant performance increase.
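
For anyone who wants the ratios rather than the raw totals, the arithmetic can be sketched with awk. The numbers are copied from the TOTAL rows and the real times above; ratio is just a helper name chosen for this example.

```shell
# ratio A B: print A / B with two decimals. Hypothetical helper name.
ratio() {
    LC_ALL=C awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", a / b }'
}

# Totals copied from the -Os, -O2 and -Ofast runs above.
echo "instructions, -Os vs -O2:    $(ratio 1550001354 622501264)x"
echo "cycles,       -Os vs -O2:    $(ratio 846149235 679753205)x"
echo "cycles,       -Os vs -Ofast: $(ratio 846149235 569464270)x"
echo "real time,    -Os vs -O2:    $(ratio 1.42 1.07)x"
```

Going by the posted totals, -O2 executes roughly 2.5 times fewer instructions but only about 1.24 times fewer cycles, which fits the much lower IPC in the second table.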
Title: Re: Faster packages
Post by: Rich on July 19, 2025, 05:02:04 PM
Hi Vaguiner

Please use Code Tags when posting commands and responses seen in a terminal. To use Code Tags, click on the # icon above the reply box and paste your text between the Code Tags as shown in this example:

Quote
[code][   36.176529] pcm512x 1-004d: Failed to get supply 'AVDD': -517
[   36.176536] pcm512x 1-004d: Failed to get supplies: -517
[   36.191753] pcm512x 1-004d: Failed to get supply 'AVDD': -517[/code]

It will appear like this in your post:
Code: [Select]
[   36.176529] pcm512x 1-004d: Failed to get supply 'AVDD': -517
[   36.176536] pcm512x 1-004d: Failed to get supplies: -517
[   36.191753] pcm512x 1-004d: Failed to get supply 'AVDD': -517

Code Tags serve as visual markers between what you are trying to say and the information you are posting. They also preserve spacing so column-aligned data displays properly. Code Tags also automatically add horizontal and/or vertical scrollbars to accommodate long lines and listings.