Tiny Core Linux

General TC => General TC Talk => Topic started by: Leee on October 17, 2023, 02:20:30 PM

Title: grep.tcz vs busybox grep speed
Post by: Leee on October 17, 2023, 02:20:30 PM: I have a long-running script that contains an outer loop and an inner loop where the outer loop processes each of a continuous stream of integers and the inner loop performs an ever-increasing number of tests on the integer currently under consideration.

It's clear that the whole mess will eventually decay into uselessness as the number of iterations of the inner loop reaches some as yet undetermined value that will slow it down too much. FWIW, yes, I know I could have written the whole mess in C w/o much effort but this was more of a "recreational scripting" project than a "gotta get results" project. Still, knowing that the inner loop would be the critical performance point, I did put a little thought into making sure I wasn't doing anything extravagant in the inner loop.

The script keeps its results in a text file and the terminal output is really just so that, during otherwise idle times, I can tell at a glance if the system has hung (*) so the script can be stopped and started at my whim and its progress persists across reboots.

But, with that terminal output in view, I noticed that sometimes the script runs noticeably faster than at other times. That got my curiosity up and I tracked it down to "runs faster when compiletc is loaded" then narrowed it down to "runs faster when grep.tcz is loaded". That got me back to the code in my script...

grep is used only in the outer loop and is called one, two, or three times per iteration (looking for a success but giving up if no success by the third try (**) ). grep is never used in the inner loop, yet even with that, metrics that I added showed that the entire script runs just over ten times faster when using the grep binary as opposed to the busybox grep. This implies that in raw "grep performance" the grep binary is likely a lot more than ten times faster. Given that the grep binary is about one fifth the size of the busybox binary, it's not surprising to me that the grep binary is faster to use, but it is surprising how much faster.

*) system hangs - I think this is a hardware issue but it's intermittent. (For safety, backups every ten minutes, if anything has changed, via cron)

**) there's probably a way to do this in one go with a regex, but I hadn't had enough coffee and it is, after all, in the outer loop. :)
Title: Re: grep.tcz vs busybox grep speed
Post by: Rich on October 17, 2023, 03:39:19 PM: Hi Leee
The aim of busybox is to pack the maximum number of commands
into the smallest binary possible by sharing as much code as
possible. It is optimized for size, not speed.

The GNU versions of busybox commands are normally faster, sometimes
a lot faster.
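
A minimal sketch for checking which implementation a command name currently resolves to. It rests on an assumption: that busybox applets are installed as symlinks to the busybox binary, so a resolved path ending in "busybox" means the busybox version is the one being run. The helper name resolve_tool is made up for this example.

```shell
# Print the file a command name ultimately resolves to.
# Assumption: busybox applets are symlinks to the busybox binary.
resolve_tool() {
    path=$(command -v "$1") || return 1
    readlink -f "$path"
}

case $(resolve_tool grep) in
    */busybox) echo "grep is the busybox applet" ;;
    *)         echo "grep is a standalone binary" ;;
esac
```

Running it before and after loading an extension such as grep.tcz should show whether the name now points somewhere else.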

If you post a copy of your script, maybe someone can offer some
suggestions to speed it up.
Title: Re: grep.tcz vs busybox grep speed
Post by: jazzbiker on October 17, 2023, 03:40:03 PM: Hi Leee!

One day I too noticed that the busybox versions of the coreutils tools are implemented in a way that is far from aiming at performance. Let's remember that the benefit of using busybox is saving space. And yes, the loss of performance may turn out to be incredible :-)

I guess Your grep patterns include wildcards?
Title: Re: grep.tcz vs busybox grep speed
Post by: Leee on October 18, 2023, 06:23:02 AM: Hi Rich, I didn't mean for that to sound like any kind of complaint - only to say that I was surprised at the amount of difference, speed wise.

If busybox were not optimized for size, what would be the point of it?

On the other hand, since I'm usually not unduly resource-constrained, I think I'll now be a little less reluctant to load the GNU versions of the utilities.

As far as my script is concerned, it's just over 95 percent of the way to where I'm going to shut it off anyway, so I'm not too worried about speeding it up - And it has already proved worth while by providing both a few hours of amusement and by prompting me to better grok Tiny Core.

---
@jazzbiker - Yes - the grep patterns involve wild cards, matching the full line of the input except for the last character of the input:
Code: [Select]
grep -nE "^${X}.$" ${INPUTFILE} |tail -1
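
As a small illustration of what this pattern selects: the "." matches any single character, so the search returns every line made of ${X} plus exactly one more character, and tail keeps the last of them. The sample file and the value of X below are invented for the sketch and are not taken from the real script.

```shell
# Hypothetical sample data: a few small primes, one per line.
INPUTFILE=$(mktemp)
printf '%s\n' 2 3 5 7 11 13 17 19 23 > "$INPUTFILE"

# Assumed value for the sketch: match lines that are "1" plus one more character.
X=1

# -n prefixes each match with its line number; tail keeps only the last match.
result=$(grep -nE "^${X}.$" "$INPUTFILE" | tail -n 1)
echo "$result"    # prints 8:19

rm -f "$INPUTFILE"
```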
Title: Re: grep.tcz vs busybox grep speed
Post by: Rich on October 18, 2023, 08:57:44 AM: Hi Leee
Quote from: Leee on October 18, 2023, 06:23:02 AM
... I didn't mean for that to sound like any kind of complaint - only to say that I was surprised at the amount of difference, speed wise. ...
I didn't take it as a complaint. When I first realized GNU utilities were faster, I
too was surprised by how much faster some of the commands were.

Quote
... @jazzbiker - Yes - the grep patterns involve wild cards, matching the full line of the input except for the last character of the input:
Code: [Select]
grep -nE "^${X}.$" ${INPUTFILE} |tail -1
So it appears you are searching for something and are only interested
in the most recent result. Well, I see one possibility for speeding that up.
If INPUTFILE is constantly having content appended to it, you are searching
the same sections of the file over and over again. One of the keys to speed is
don't compute the same thing over and over again. Save the result for later
use instead.

tail -n +N lets you skip the first N lines of a file.

Considering your grep command includes returning the line number, you could
use that to set the value of N with error checking for an empty string.

Or you could use wc -l to find the length of the file prior to searching it. Then
after searching the file, save the file length for the next search:
Code: [Select]
CurrentLine=1
# Other code
# loop
    LineCount=$(wc -l < "$INPUTFILE")
    tail -n +"$CurrentLine" "$INPUTFILE" | grep -nE "^${X}.$" | tail -n 1
    CurrentLine=$LineCount
    # more code
# end loop
If you need the actual line number of the file that grep returns:
Code: [Select]
FileLinenumber=$((CurrentLine + GrepLinenumber - 1))
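
Putting the pieces above together, here is a rough, self-contained sketch of a single pass. The sample file, the value of X, and the starting CurrentLine are all invented for illustration. It shows how the line number reported by grep (which counts from where tail started) is converted back to a line number in the whole file.

```shell
# Hypothetical sample data and search prefix.
INPUTFILE=$(mktemp)
printf '%s\n' a1 b2 a3 a4 b5 > "$INPUTFILE"
X=a
CurrentLine=3    # assume lines 1-2 were already searched on an earlier pass

# Search only from CurrentLine onward; grep -n now counts from that line.
hit=$(tail -n +"$CurrentLine" "$INPUTFILE" | grep -nE "^${X}.$" | tail -n 1)

if [ -n "$hit" ]; then
    GrepLinenumber=${hit%%:*}                        # text before the first colon
    FileLinenumber=$((CurrentLine + GrepLinenumber - 1))
    echo "$FileLinenumber:${hit#*:}"                 # prints 4:a4
fi

# Remember where the file ended so the next pass can start there.
CurrentLine=$(wc -l < "$INPUTFILE")
rm -f "$INPUTFILE"
```

The empty-string check matters: when nothing matches, there is no line number to do arithmetic on.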
Title: Re: grep.tcz vs busybox grep speed
Post by: Leee on October 19, 2023, 11:25:03 PM: Using tail is a good idea in general but in this case the target value is not at (nor even near) the end of the file.

I guess I should have posted the whole script per your earlier suggestion, but I wasn't (and again am not) logged in from the same machine and I was too busy/lazy to go fetch it.

The script calculates primes and saves the results to a text file. The target value of the grep is the largest prime (from the list) that is smaller than the square root of the current prime_candidate. While not critical to the determination of a number's primeness itself, the position (line number) of the target value will indicate the highest number of tests that might be required to determine if the current prime_candidate is or is not actually prime.
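
To make the square-root bound concrete, here is a minimal sketch of trial division against a list of known primes. This is not the actual script: the file contents, the PRIMES variable, and the function name is_prime are made up for the example, and it assumes the list already holds every prime up to the square root of the candidate.

```shell
# Hypothetical primes file, one prime per line in ascending order.
PRIMES=$(mktemp)
printf '%s\n' 2 3 5 7 11 13 > "$PRIMES"

# Succeeds (exit 0) if $1 has no divisor among the listed primes p with p*p <= $1.
is_prime() {
    candidate=$1
    [ "$candidate" -ge 2 ] || return 1
    while read -r p; do
        [ $((p * p)) -gt "$candidate" ] && break     # past the square root: stop testing
        [ $((candidate % p)) -eq 0 ] && return 1     # found a divisor
    done < "$PRIMES"
    return 0
}

for n in 97 91 169; do
    if is_prime "$n"; then echo "$n is prime"; else echo "$n is not prime"; fi
done
```

For 97 the loop stops at 11 (since 11*11 is greater than 97), so only four divisions are needed, which is the "highest number of tests" idea described above.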

Since the grep search is, by its nature, a file operation, I had thought that the opening and handling of the file would be a performance limiter but never dreamed that grep itself would be a factor. So I'm really glad I happened across that little tidbit.

In the meantime, I had arbitrarily decided to shut it off once it found the first prime over one million. I wasn't paying attention so it went surprisingly far beyond that before I shut it down. Of course, now I'm thinking of resuming it to test various tweaks for speed to see what else I might learn... including implementing the whole thing in C and/or MUMPS.

Speaking of MUMPS - any chance we could bring gtm5.tce forward from the 1.x repo? ;D (Just kidding)
Title: Re: grep.tcz vs busybox grep speed
Post by: Rich on October 20, 2023, 12:04:18 AM: Hi Leee
Quote from: Leee on October 19, 2023, 11:25:03 PM
... Speaking of MUMPS - any chance we could bring gtm5.tce forward from the 1.x repo? ;D (Just kidding)
I found gtm_V53003_linux_i686_pro.tar.gz in the 2.x src directory
if you are interested.
Title: Re: grep.tcz vs busybox grep speed
Post by: gadget42 on October 20, 2023, 02:02:12 AM: mostly for future readers of this thread, please review the following thread with respect to tcl 1.x/2.x/etc:

https://forum.tinycorelinux.net/index.php/topic,25955.0.html

curiosity got the best of me and a forum search for "mumps" gives the last-most-recent mention of it as this thread from 2011:

https://forum.tinycorelinux.net/index.php/topic,11678.0.html

20231020-0110am-cdt-usa-modified-added comment regarding search and an additional forum thread link
Title: Re: grep.tcz vs busybox grep speed
Post by: Leee on October 20, 2023, 05:34:11 PM: Thanks Rich, but I'm looking more eagerly into O'Kane mumps these days because of the very thing mentioned in one of the links gadget42 just mentioned... O'Kane seems to allow for using the leading '#' as a comment indicator, hence enabling the use of a normal(ish) shebang and using mumps as a regular scripting language. While I don't think O'Kane mumps has quite the name recognition that GTM has, this one thing makes it more suitable for my purposes.

I'll probably play with O'Kane mumps for a few weeks then, assuming it works the way I hope it will, package it up as a .tcz.

edit: BTW, there's now (at least) a GTM version 6 and a fork called YottaDB.
Title: Re: grep.tcz vs busybox grep speed
Post by: jazzbiker on October 21, 2023, 07:04:27 AM: Hi Leee,

I am curious about the M language, are there standalone implementations? I've downloaded GT.M 7.10 tar from sourceforge. What can be done with it? It is very big - almost 100M unpacked. Why so big? Briefly looked at yottadb - they describe M only in their own environment. Does M make sense alone?

Regards
Title: Re: grep.tcz vs busybox grep speed
Post by: jazzbiker on October 21, 2023, 07:16:04 AM: Quote from: Leee on October 18, 2023, 06:23:02 AM
Code: [Select]
grep -nE "^${X}.$" ${INPUTFILE} |tail -1
Very simple pattern. Moreover it is anchored both to the beginning and end. I took a brief glance at the busybox grep source and in my opinion they implement a straightforward use of the standard libc regex toolset. So the difference against the separate grep utility is probably achieved with grep-dedicated optimizations. In other words it is not that busybox grep is slow, but that GNU grep is fast :-) I guess Lua is faster.
Title: Re: grep.tcz vs busybox grep speed
Post by: patrikg on October 21, 2023, 10:51:08 AM: Hello @Leee

I know you can sometimes use awk instead of grep, have you tested awk performance in busybox ?

Happy hacking.
Title: Re: grep.tcz vs busybox grep speed
Post by: Rich on October 21, 2023, 12:24:45 PM: Hi Leee
I've found awk can be significantly faster than grep.

Something like this should work:
Code: [Select]
awk 'BEGIN {RS="\n"} /'"^$X\.$"'/{ print $0 }' "$INPUTFILE" | tail -n 1
GNU awk (gawk.tcz) should be faster than the busybox awk.
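
A variation on the same idea, offered only as a sketch: awk can remember the last matching line itself and print it together with its line number, so neither grep -n nor tail is needed. Here the "." is left unescaped so it matches any single character, as in the original grep pattern. The sample file and the value of X are invented for the example.

```shell
# Hypothetical sample data and search prefix.
INPUTFILE=$(mktemp)
printf '%s\n' 2 3 5 7 11 13 17 19 23 > "$INPUTFILE"
X=1

# Pass the regex in with -v so it does not have to be spliced into the awk
# program text. Note: -v processes backslash escapes, so this form suits
# patterns that contain no backslashes.
last=$(awk -v pat="^${X}.$" '
    $0 ~ pat { hit = NR ":" $0 }          # remember the most recent match
    END      { if (hit != "") print hit } # print it once, after the last line
' "$INPUTFILE")
echo "$last"    # prints 8:19

rm -f "$INPUTFILE"
```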
Title: Re: grep.tcz vs busybox grep speed
Post by: jazzbiker on October 21, 2023, 01:10:43 PM: Quote from: Rich on October 21, 2023, 12:24:45 PM
I've found awk can be significantly faster than grep.

busybox awk uses the same regex functions as busybox grep does, so the only bottleneck left is input file reading ...
Title: Re: grep.tcz vs busybox grep speed
Post by: Rich on October 21, 2023, 02:25:27 PM: Hi jazzbiker
Actually, I should have said awk is faster in more complex operations:
Quote from: Rich on March 22, 2023, 10:24:08 PM
Hi GNUser
Quote from: GNUser on March 22, 2023, 08:18:15 PM
... Given how quickly GNU awk is able to sort provides.db, I'd say this problem is more than solved. The problem is crushed.
There's a reason roberts liked to inject awk snippets into his scripts. When it
comes to data manipulation, it can be wicked fast.

I've had a few instances where I found the execution time of a script unacceptable
and was forced to add an awk function. None of my techniques could even touch
the speed of awk.

Since this appears to be a fairly simple search, I decided to run
some benchmarks. The backslashes in the search term are to
escape the forward slashes so awk does not throw an error. The
search term is a few entries before the end of the provides file.
Code: [Select]
tc@E310:~/onboot$ export X="usr\/local\/bin\/zvbi-atsc-cc"
tc@E310:~/onboot$ time busybox grep "^$X$" ../Scripting/LddCheck/provides-10.x-x86.db
usr/local/bin/zvbi-atsc-cc
real    0m 0.53s
user    0m 0.44s
sys     0m 0.02s
tc@E310:~/onboot$ time busybox awk 'BEGIN {RS="\n"} /'"^$X$"'/{ print $0 }' ../Scripting/LddCheck/provides-10.x-x86.db
usr/local/bin/zvbi-atsc-cc
real    0m 0.63s
user    0m 0.48s
sys     0m 0.12s
tc@E310:~/onboot$ time grep "^$X$" ../Scripting/LddCheck/provides-10.x-x86.db
usr/local/bin/zvbi-atsc-cc
real    0m 0.17s
user    0m 0.05s
sys     0m 0.03s
tc@E310:~/onboot$ time awk 'BEGIN {RS="\n"} /'"^$X$"'/{ print $0 }' ../Scripting/LddCheck/provides-10.x-x86.db
usr/local/bin/zvbi-atsc-cc
real    0m 0.53s
user    0m 0.35s
sys     0m 0.10s
tc@E310:~/onboot$
So it appears for a simple search like this grep is faster.
Title: Re: grep.tcz vs busybox grep speed
Post by: jazzbiker on October 21, 2023, 03:04:09 PM: Quote from: Rich on October 21, 2023, 02:25:27 PM
Since this appears to be a fairly simple search, I decided to run
some benchmarks.

Hi Rich, just for more detailed picture may I ask You to provide the tests You've already run with the pattern closer to Leee's one? Upon the same environment using wildcards and anchors, something like

Code: [Select]
tc@E310:~/onboot$ export X="usr\/local\/bin\/zvbi-atsc-c."
tc@E310:~/onboot$ export X="^usr\/local\/bin\/zvbi-atsc-c."
tc@E310:~/onboot$ export X="^usr\/local\/bin\/zvbi-atsc-c.$"
Please
Title: Re: grep.tcz vs busybox grep speed
Post by: Rich on October 21, 2023, 03:25:06 PM: Hi jazzbiker
I take it he's searching for a literal period at the end of the line.
I added a period to that entry in the provides file and repeated
the search:
Code: [Select]
tc@E310:~/onboot$ export X="usr\/local\/bin\/zvbi-atsc-cc"
tc@E310:~/onboot$ time busybox grep "^$X.$" ../Scripting/LddCheck/provides-10.x-x86.db
usr/local/bin/zvbi-atsc-cc.
real    0m 0.49s
user    0m 0.38s
sys     0m 0.03s
tc@E310:~/onboot$ time busybox awk 'BEGIN {RS="\n"} /'"^$X\.$"'/{ print $0 }' ../Scripting/LddCheck/provides-10.x-x86.db
usr/local/bin/zvbi-atsc-cc.
real    0m 0.62s
user    0m 0.49s
sys     0m 0.09s
tc@E310:~/onboot$ time grep "^$X.$" ../Scripting/LddCheck/provides-10.x-x86.db
usr/local/bin/zvbi-atsc-cc.
real    0m 0.21s
user    0m 0.03s
sys     0m 0.03s
tc@E310:~/onboot$ time awk 'BEGIN {RS="\n"} /'"^$X\.$"'/{ print $0 }' ../Scripting/LddCheck/provides-10.x-x86.db
usr/local/bin/zvbi-atsc-cc.
real    0m 0.52s
user    0m 0.35s
sys     0m 0.08s
tc@E310:~/onboot$
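
Single runs in the sub-second range can be noisy, so one option is to repeat each command many times and compare the totals. The helper below is only a sketch: the name time_runs is made up, and it uses date +%s (whole seconds), so the repeat count should be large enough that a run takes several seconds.

```shell
# Run a command N times and print the elapsed whole seconds.
time_runs() {
    n=$1
    shift
    start=$(date +%s)
    i=0
    while [ "$i" -lt "$n" ]; do
        "$@" > /dev/null 2>&1 || :     # ignore output and "no match" exit codes
        i=$((i + 1))
    done
    end=$(date +%s)
    echo $((end - start))
}

# Hypothetical usage; the pattern and file name are placeholders:
#   time_runs 50 busybox grep -E '^abc.$' somefile
#   time_runs 50 grep -E '^abc.$' somefile
elapsed=$(time_runs 3 true)
echo "3 runs of true took ${elapsed}s"
```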
Title: Re: grep.tcz vs busybox grep speed
Post by: jazzbiker on October 21, 2023, 03:39:20 PM: Hi Rich,

Thanks for this test, as we can see everything is the same. Sorry, I was not attentive enough and missed that You've already added anchors while using the pattern.

Quote from: Rich on October 21, 2023, 03:25:06 PM
I take it he's searching for a literal period at the end of the line.
Probably the period meant "any symbol", but anyway benchmark results are the same. Thank You very much!
Title: Re: grep.tcz vs busybox grep speed
Post by: Rich on October 21, 2023, 05:31:51 PM: Hi jazzbiker
I see. I'm used to using a question mark to match a single character:
Code: [Select]
tc@E310:~$ ls -l J?.jpg
-rw-r--r-- 1 tc staff 34836 Feb 2 2022 J1.jpg
-rw-r--r-- 1 tc staff 52988 Feb 2 2022 J2.jpg
-rw-r--r-- 1 tc staff 66236 Feb 2 2022 J3.jpg
tc@E310:~$
But grep seems to need a period:
Code: [Select]
tc@E310:~$ ls -l | grep "J.\.jpg"
-rw-r--r-- 1 tc staff 34836 Feb 2 2022 J1.jpg
-rw-r--r-- 1 tc staff 52988 Feb 2 2022 J2.jpg
-rw-r--r-- 1 tc staff 66236 Feb 2 2022 J3.jpg
tc@E310:~$
Title: Re: grep.tcz vs busybox grep speed
Post by: jazzbiker on October 21, 2023, 06:45:46 PM: Hi Rich,

Those "?" and "*" You mean are not regex but globbing symbols. As far as I understand the strings in the command line including such symbols are expanded by the shell interpreter according to the files existence (omg). If You'd be lucky to :-)

Here is the program to show its arguments:
Code: [Select]
tc@box:/tmp/glob$ cat ggg.c
#include <stdio.h>

int main(int argc, char *argv[])
{
    while (--argc)
        printf("%s\n", argv[argc]);
    return 0;
}
Let's compile it:
Code: [Select]
$ tcc -o ggg ggg.c
and add some spam:
Code: [Select]
$ touch ggh ggk
Then:
Code: [Select]
tc@box:/tmp/glob$ ./g*
./ggk
./ggh
./ggg.c
And:
Code: [Select]
tc@box:/tmp/glob$ ./ggg ./g*
./ggk
./ggh
./ggg.c
./ggg
But:
Code: [Select]
tc@box:/tmp/glob$ ./ggg "./g*"
./g*
:-)

As far as I know regexes are used by ed, grep, sed, find, vi (maybe some others?). Regexes may be basic, extended, perl-style, lua-style, and maybe some others.
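
The glob-versus-regex difference can be seen side by side in a short sketch. The file name is just a sample string taken from the listing earlier in the thread, and the variable names are made up for the example.

```shell
name="J1.jpg"

# Glob (shell pattern): "?" matches exactly one character, "." is literal.
case $name in
    J?.jpg) glob_result=match ;;
    *)      glob_result=nomatch ;;
esac

# Regex: "." matches exactly one character, so a literal dot needs a backslash.
if printf '%s\n' "$name" | grep -q '^J.\.jpg$'; then
    regex_result=match
else
    regex_result=nomatch
fi

# In an extended regex "?" means "the previous item is optional",
# so the glob-style pattern does not behave like a wildcard here.
if printf '%s\n' "$name" | grep -qE '^J?\.jpg$'; then
    globstyle_regex_result=match
else
    globstyle_regex_result=nomatch
fi

echo "$glob_result $regex_result $globstyle_regex_result"   # prints match match nomatch
```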

Have a nice regex!
Title: Re: grep.tcz vs busybox grep speed
Post by: Rich on October 21, 2023, 09:43:09 PM: Hi jazzbiker
Quote from: jazzbiker on October 21, 2023, 06:45:46 PM
... Then:
Code: [Select]
tc@box:/tmp/glob$ ./g*
./ggk
./ggh
./ggg.c
...
That's pretty slick. First ggg gets expanded invoking the program, then
the remaining 3 filenames get expanded and passed to the program.
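
The same effect can be reproduced without compiling anything. This is only a sketch: it recreates the four file names in a scratch directory and lets the shell expand the glob into its positional parameters, so the first word (what would be run as the command) and the rest (what would be its arguments) can be inspected separately.

```shell
workdir=$(mktemp -d)
oldpwd=$PWD
cd "$workdir" || exit 1
touch ggg ggg.c ggh ggk      # same file names as in the example above

# Let the shell expand the glob into the positional parameters.
set -- ./g*
first=$1                     # the word that would be run as the command
shift
rest=$*                      # the words that would become its arguments

echo "command: $first"       # prints command: ./ggg
echo "args:    $rest"        # prints args:    ./ggg.c ./ggh ./ggk

cd "$oldpwd" || exit 1
rm -rf "$workdir"
```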
Title: Re: grep.tcz vs busybox grep speed
Post by: patrikg on October 22, 2023, 06:44:10 AM: Just a thought with this, the shell globbing feature has some limits, like the command line max char length.

In my shell in my Arch, i can get the max value when executing this command.
This command is new to me..and when using option -a i get all values being set.
Learning every day something new. :)
Don't know if this also contain the env vars.

Code: [Select]
getconf ARG_MAX
2097152
As far as I can see, it's set to 2MB.

Code: [Select]
echo $((1024*1024*2))
2097152
If I google a bit about this, I also find that you can get this from xargs like this.
Code: [Select]
xargs --no-run-if-empty --show-limits </dev/null
Code: [Select]
Size of command buffer we are actually using: 131072
:(:(:(
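
On the env vars question: ARG_MAX is defined by POSIX as the limit on the arguments to exec including the environment data, so the two share the same budget. When a glob might expand past that limit, one common workaround is to let find do the matching and batch the names itself. A minimal sketch, with a scratch directory and file names invented for the example:

```shell
# ARG_MAX is the combined limit for arguments plus environment on exec.
limit=$(getconf ARG_MAX)
echo "ARG_MAX: $limit bytes"

workdir=$(mktemp -d)
touch "$workdir/ggg" "$workdir/ggh" "$workdir/ggk"

# Instead of expanding "$workdir"/g* onto one command line, let find match
# the names; "-exec ... {} +" splits them over as many invocations as needed.
count=$(find "$workdir" -name 'g*' -type f -exec ls -d {} + | wc -l)
echo "files found: $count"

rm -rf "$workdir"
```

Shell builtins such as echo or printf are not started through exec, so a long glob passed to them is not subject to this limit.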

Happy hacking on your keys.