Author Topic: grep.tcz vs busybox grep speed (Read 10649 times)

Leee · « **on:** October 17, 2023, 02:20:30 PM »

I have a long-running script that contains an outer loop and an inner loop, where the outer loop processes each of a continuous stream of integers and the inner loop performs an ever-increasing number of tests on the integer currently under consideration.

It's clear that the whole mess will eventually decay into uselessness as the number of iterations of the inner loop reaches some as yet undetermined value that will slow it down too much. FWIW, yes, I know I could have written the whole mess in C w/o much effort but this was more of a "recreational scripting" project than a "gotta get results" project. Still, knowing that the inner loop would be the critical performance point, I did put a little thought into making sure I wasn't doing anything extravagant in the inner loop.

The script keeps its results in a text file and the terminal output is really just so that, during otherwise idle times, I can tell at a glance if the system has hung (*). So the script can be stopped and started at my whim and its progress persists across reboots.

But, with that terminal output in view, I noticed that sometimes the script runs noticeably faster than at other times. That got my curiosity up and I tracked it down to "runs faster when compiletc is loaded" then narrowed it down to "runs faster when grep.tcz is loaded". That got me back to the code in my script...

grep is used only in the outer loop and is called one, two, or three times per iteration (looking for a success but giving up if no success by the third try (**) ). grep is never used in the inner loop, yet even with that, metrics that I added showed that the entire script runs just over ten times faster when using the grep binary as opposed to the busybox grep. This implies that in raw "grep performance" the grep binary is likely a lot more than ten times faster. Given that the grep binary is about one fifth the size of the busybox binary, it's not surprising to me that the grep binary is faster to use, but it is surprising how much faster.

*) system hangs - I think this is a hardware issue but it's intermittent. (For safety, backups every ten minutes, if anything has changed, via cron)

**) there's probably a way to do this in one go with a regex, but I hadn't had enough coffee and it is, after all, in the outer loop.

Rich · « **Reply #1 on:** October 17, 2023, 03:39:19 PM »

Hi Leee
The aim of busybox is to pack the maximum number of commands
into the smallest binary possible by sharing as much code as
possible. It is optimized for size, not speed.

The GNU versions of busybox commands are normally faster, sometimes
a lot faster.

If you post a copy of your script, maybe someone can offer some
suggestions to speed it up.

jazzbiker · « **Reply #2 on:** October 17, 2023, 03:40:03 PM »

Hi Leee!

One day I too noticed that the busybox implementations of the coreutils tools are far from aiming at performance. Let's remember that the benefit of using busybox is saving space. And yes, the loss of performance can be incredible :-)

I guess Your grep patterns include wildcards?

Leee · « **Reply #3 on:** October 18, 2023, 06:23:02 AM »

Hi Rich, I didn't mean for that to sound like any kind of complaint - only to say that I was surprised at the amount of difference, speed wise.

If busybox were not optimized for size, what would be the point of it?

On the other hand, since I'm usually not unduly resource-constrained, I think I'll now be a little less reluctant to load the GNU versions of utilities.

As far as my script is concerned, it's just over 95 percent of the way to where I'm going to shut it off anyway, so I'm not too worried about speeding it up. And it has already proved worthwhile, both by providing a few hours of amusement and by prompting me to better grok Tiny Core.

---
@jazzbiker - Yes - the grep patterns involve wild cards, matching the full line of the input except for the last character of the input:

Code: [Select]

grep -nE "^${X}.$" ${INPUTFILE} |tail -1
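For readers following along, here is a minimal sketch of what that command returns. It assumes the input file holds one number per line in ascending order and that X is every digit of the wanted value except the last. The sample data and the `last_match` helper name are made up for illustration and are not taken from the actual script.

```shell
# Hypothetical helper: print "lineno:value" for the last line that consists
# of the prefix $1 followed by exactly one more character.
last_match() {
	# Quoting the file name ($2) guards against spaces in the path.
	grep -nE "^${1}.$" "$2" | tail -n 1
}

# Made-up sample data: one number per line.
INPUTFILE=$(mktemp)
printf '%s\n' 2 3 5 7 11 13 17 19 23 29 31 > "$INPUTFILE"

last_match 1 "$INPUTFILE"    # prints 8:19
last_match 2 "$INPUTFILE"    # prints 10:29
rm -f "$INPUTFILE"
```

Because `.` must match one character, a prefix of 2 matches 23 and 29 but not the bare 2 on line 1.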

Rich · « **Reply #4 on:** October 18, 2023, 08:57:44 AM »

Hi Leee

Quote from: Leee on October 18, 2023, 06:23:02 AM

... I didn't mean for that to sound like any kind of complaint - only to say that I was surprised at the amount of difference, speed wise. ...

I didn't take it as a complaint. When I first realized GNU utilities were faster, I
too was surprised by how much faster some of the commands were.

Quote

... @jazzbiker - Yes - the grep patterns involve wild cards, matching the full line of the input except for the last character of the input:
Code: [Select]
grep -nE "^${X}.$" ${INPUTFILE} |tail -1

So it appears you are searching for something and are only interested
in the most recent result. Well, I see one possibility for speeding that up.
If INPUTFILE is constantly having content appended to it, you are searching
the same sections of the file over and over again. One of the keys to speed is
don't compute the same thing over and over again. Save the result for later
use instead.

tail -n +N lets you skip the first N lines of a file.

Considering your grep command includes returning the line number, you could
use that to set the value of N with error checking for an empty string.

Or you could use wc -l to find the length of the file prior to searching it. Then
after searching the file, save the file length for the next search:

Code: [Select]

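CurrentLine=1

# Other code

while :; do	# the real loop condition goes here
	# Redirect the file into wc so it prints only the count, not the file name.
	LineCount=$(wc -l < "$INPUTFILE")
	# Feed only the lines from CurrentLine onward to grep.
	tail -n +"$CurrentLine" "$INPUTFILE" | grep -nE "^${X}.$" | tail -n 1
	CurrentLine=$LineCount

	# more code

done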

If you need the actual line number of the file that grep returns:

Code: [Select]

FileLinenumber=$((CurrentLine + GrepLinenumber - 1))
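Putting the two pieces together, a rough sketch of a search that starts at a given line and reports the match with its line number in the whole file might look like this. The `search_from` name and its argument order are made up for the example. It assumes the pattern is a valid extended regex and that the start line is at least 1.

```shell
# Hypothetical helper: search file $3 from line $2 onward for pattern $1 and
# print the last match as "file_lineno:text". Prints nothing if no match.
search_from() {
	pattern=$1
	start=$2
	file=$3
	tail -n +"$start" "$file" | grep -nE "$pattern" | tail -n 1 |
	while IFS=: read -r rel text; do
		# grep numbers lines relative to the tail output, so add the offset.
		echo "$((start + rel - 1)):$text"
	done
}

# Usage, with CurrentLine and INPUTFILE as in the snippets above:
#   search_from "^${X}.$" "$CurrentLine" "$INPUTFILE"
```

The saving only materializes when the wanted line lies in the part of the file that has not been searched before.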

Leee · « **Reply #5 on:** October 19, 2023, 11:25:03 PM »

Using tail is a good idea in general but in this case the target value is not at (nor even near) the end of the file.

I guess I should have posted the whole script per your earlier suggestion, but I wasn't (and again am not) logged in from the same machine and I was too busy/lazy to go fetch it.

The script calculates primes and saves the results to a text file. The target value of the grep is the largest prime (from the list) that is smaller than the square root of the current prime_candidate. While not critical to the determination of a number's primeness itself, the position (line number) of the target value will indicate the highest number of tests that might be required to determine if the current prime_candidate is or is not actually prime.
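As an aside, that target could also be located without a regex at all. The sketch below is not what the script does. It swaps in a numeric comparison in awk and assumes the results file holds one prime per line in ascending order. The `sqrt_limit` name is invented for the example.

```shell
# Hypothetical helper: print "lineno:prime" for the largest listed prime that
# is smaller than the square root of the candidate $1.
# Assumes file $2 holds one prime per line, sorted ascending.
sqrt_limit() {
	awk -v n="$1" '
		$1 * $1 >= n { exit }          # reached the square root: stop reading
		{ line = NR; prime = $1 }      # latest prime still below the square root
		END { if (line) print line ":" prime }
	' "$2"
}

# Usage (file name is a placeholder):
#   sqrt_limit "$prime_candidate" primes.txt
```

Comparing squares avoids computing a square root, and the early `exit` means only the start of the file is read however long the list grows.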

Since the grep search is, by its nature, a file operation, I had thought that the opening and handling of the file would be a performance limiter but never dreamed that grep itself would be a factor. So I'm really glad I happened across that little tidbit.

In the meantime, I had arbitrarily decided to shut it off once it found the first prime over one million. I wasn't paying attention so it went surprisingly far beyond that before I shut it down. Of course, now I'm thinking of resuming it to test various tweaks for speed to see what else I might learn... including implementing the whole thing in C and/or MUMPS.

Speaking of MUMPS - any chance we could bring gtm5.tce forward from the 1.x repo?

(Just kidding)

Rich · « **Reply #6 on:** October 20, 2023, 12:04:18 AM »

Hi Leee

Quote from: Leee on October 19, 2023, 11:25:03 PM

... Speaking of MUMPS - any chance we could bring gtm5.tce forward from the 1.x repo? (Just kidding)

I found gtm_V53003_linux_i686_pro.tar.gz in the 2.x src directory
if you are interested.

gadget42 · « **Reply #7 on:** October 20, 2023, 02:02:12 AM »

mostly for future readers of this thread, please review the following thread with respect to tcl 1.x/2.x/etc:

https://forum.tinycorelinux.net/index.php/topic,25955.0.html

curiosity got the best of me and a forum search for "mumps" gives the last-most-recent mention of it as this thread from 2011:

https://forum.tinycorelinux.net/index.php/topic,11678.0.html

20231020-0110am-cdt-usa-modified-added comment regarding search and an additional forum thread link

Leee · « **Reply #8 on:** October 20, 2023, 05:34:11 PM »

Thanks Rich, but I'm looking more eagerly into O'Kane mumps these days because of the very thing mentioned in one of the links gadget42 just mentioned... O'Kane seems to allow for using the leading '#' as a comment indicator, hence enabling the use of a normal(ish) shebang and using mumps as a regular scripting language. While I don't think O'Kane mumps has quite the name recognition that GTM has, this one thing makes it more suitable for my purposes.

I'll probably play with O'Kane mumps for a few weeks then, assuming it works the way I hope it will, package it up as a .tcz.

edit: BTW, there's now (at least) a GTM version 6 and a fork called YottaDB.

jazzbiker · « **Reply #9 on:** October 21, 2023, 07:04:27 AM »

Hi Leee,

I am curious about the M language. Are there standalone implementations? I've downloaded the GT.M 7.10 tar from sourceforge. What can be done with it? It is very big - almost 100M unpacked. Why so big? I briefly looked at yottadb - they describe M only in their own environment. Does M make sense alone?

Regards

jazzbiker · « **Reply #10 on:** October 21, 2023, 07:16:04 AM »

Quote from: Leee on October 18, 2023, 06:23:02 AM

Code: [Select]
grep -nE "^${X}.$" ${INPUTFILE} |tail -1

Very simple pattern. Moreover, it is anchored at both the beginning and the end. I took a brief glance at the busybox grep source and, in my opinion, it makes straightforward use of the standard libc regex toolset. So the difference compared with the separate grep utility is probably due to optimizations specific to GNU grep. In other words, it is not that busybox grep is slow, but that GNU grep is fast :-) I guess Lua is faster.

patrikg · « **Reply #11 on:** October 21, 2023, 10:51:08 AM »

Hello @Leee

I know you can sometimes use awk instead of grep, have you tested awk performance in busybox ?

Happy hacking.

Rich · « **Reply #12 on:** October 21, 2023, 12:24:45 PM »

Hi Leee
I've found awk can be significantly faster than grep.

Something like this should work:

Code: [Select]

awk 'BEGIN {RS="\n"} /'"^$X.$"'/{ print $0 }' "$INPUTFILE" | tail -n 1

GNU awk (gawk.tcz) should be faster than the busybox awk.
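If awk is used anyway, the trailing tail can be folded into it. This is only a sketch, with an invented `awk_last_match` helper. It assumes X contains no backslashes, since awk processes escape sequences in values passed with -v.

```shell
# Hypothetical helper: awk-only equivalent of
#   grep -nE "^${X}.$" FILE | tail -n 1
# $1 = prefix (X), $2 = file.
awk_last_match() {
	awk -v pat="^${1}.\$" '
		$0 ~ pat { last = NR ":" $0 }    # keep only the newest match
		END { if (last != "") print last }
	' "$2"
}

# Usage:
#   awk_last_match "$X" "$INPUTFILE"
```

This saves one process and one pipe per call. Whether that is measurable next to the regex matching itself would need to be timed.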

jazzbiker · « **Reply #13 on:** October 21, 2023, 01:10:43 PM »

Quote from: Rich on October 21, 2023, 12:24:45 PM

I've found awk can be significantly faster than grep.

busybox awk uses the same regex functions as busybox grep does, so the only bottleneck left is input file reading ...

Rich · « **Reply #14 on:** October 21, 2023, 02:25:27 PM »

Hi jazzbiker
Actually, I should have said awk is faster in more complex operations:

Quote from: Rich on March 22, 2023, 10:24:08 PM

Hi GNUser
Quote from: GNUser on March 22, 2023, 08:18:15 PM
... Given how quickly GNU awk is able to sort provides.db, I'd say this problem is more than solved. The problem is crushed.
There's a reason roberts liked to inject awk snippets into his scripts. When it
comes to data manipulation, it can be wicked fast.

I've had a few instances where I found the execution time of a script unacceptable
and was forced to add an awk function. None of my techniques could even touch
the speed of awk.

Since this appears to be a fairly simple search, I decided to run
some benchmarks. The backslashes in the search term are to
escape the forward slashes so awk does not throw an error. The
search term is a few entries before the end of the provides file.

Code: [Select]

tc@E310:~/onboot$ export X="usr\/local\/bin\/zvbi-atsc-cc"
tc@E310:~/onboot$ time busybox grep "^$X$" ../Scripting/LddCheck/provides-10.x-x86.db
usr/local/bin/zvbi-atsc-cc
real    0m 0.53s
user    0m 0.44s
sys     0m 0.02s
tc@E310:~/onboot$ time busybox awk 'BEGIN {RS="\n"} /'"^$X$"'/{ print $0 }' ../Scripting/LddCheck/provides-10.x-x86.db
usr/local/bin/zvbi-atsc-cc
real    0m 0.63s
user    0m 0.48s
sys     0m 0.12s
tc@E310:~/onboot$ time grep "^$X$" ../Scripting/LddCheck/provides-10.x-x86.db
usr/local/bin/zvbi-atsc-cc
real    0m 0.17s
user    0m 0.05s
sys     0m 0.03s
tc@E310:~/onboot$ time awk 'BEGIN {RS="\n"} /'"^$X$"'/{ print $0 }' ../Scripting/LddCheck/provides-10.x-x86.db
usr/local/bin/zvbi-atsc-cc
real    0m 0.53s
user    0m 0.35s
sys     0m 0.10s
tc@E310:~/onboot$

So it appears for a simple search like this grep is faster.
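Single runs this short can be skewed by caching, so anyone repeating the comparison may want to loop each command a number of times. A rough sketch, with a made-up `bench` helper that uses `date +%s` and so only resolves whole seconds:

```shell
# Hypothetical helper: run a command $1 times, discard its output, and print
# the elapsed wall-clock time in whole seconds.
bench() {
	runs=$1
	shift
	start=$(date +%s)
	i=0
	while [ "$i" -lt "$runs" ]; do
		"$@" > /dev/null
		i=$((i + 1))
	done
	echo "$runs runs of '$1' took $(( $(date +%s) - start ))s"
}

# Usage (DB stands in for the path to the provides file):
#   bench 50 busybox grep "^$X$" "$DB"
#   bench 50 grep "^$X$" "$DB"
```

A helper function is used rather than prefixing the loop with `time` because an external time command cannot time a shell function.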
