grep.tcz vs busybox grep speed

Leee:
I have a long-running script that contains an outer loop and an inner loop, where the outer loop processes each of a continuous stream of integers and the inner loop performs an ever-increasing number of tests on the integer currently under consideration.

It's clear that the whole mess will eventually decay into uselessness as the number of iterations of the inner loop reaches some as yet undetermined value that will slow it down too much. FWIW, yes, I know I could have written the whole mess in C w/o much effort but this was more of a "recreational scripting" project than a "gotta get results" project. Still, knowing that the inner loop would be the critical performance point, I did put a little thought into making sure I wasn't doing anything extravagant in the inner loop.

The script keeps its results in a text file, so it can be stopped and started at my whim and its progress persists across reboots. The terminal output is really just so that, during otherwise idle times, I can tell at a glance if the system has hung (*).

But, with that terminal output in view, I noticed that sometimes the script runs noticeably faster than at other times. That got my curiosity up and I tracked it down to "runs faster when compiletc is loaded", then narrowed it down to "runs faster when grep.tcz is loaded". That got me back to the code in my script...

grep is used only in the outer loop and is called one, two, or three times per iteration (looking for a success but giving up if no success by the third try (**) ). grep is never used in the inner loop, yet even with that, metrics that I added showed that the entire script runs just over ten times faster when using the grep binary as opposed to the busybox grep. This implies that in raw "grep performance" the grep binary is likely a lot more than ten times faster. Given that the grep binary is about one fifth the size of the busybox binary, it's not surprising to me that the grep binary is faster to use, but it is surprising how much faster.

*) system hangs - I think this is a hardware issue but it's intermittent. (For safety, backups every ten minutes, if anything has changed, via cron)

**) there's probably a way to do this in one go with a regex, but I hadn't had enough coffee and it is, after all, in the outer loop. :)

Rich:
Hi Leee
The aim of busybox is to pack the maximum number of commands
into the smallest binary possible by sharing as much code as
possible. It is optimized for size, not speed.

The GNU versions of busybox commands are normally faster, sometimes
a lot faster.

If you post a copy of your script, maybe someone can offer some
suggestions to speed it up.
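
One rough way to put a number on the difference is to run the same search many times with each grep and compare wall-clock time. This is only a sketch: the `bench` helper name, the run count, and the pattern are made up for illustration, and it uses whole seconds from `date +%s`, so it needs enough runs to take a few seconds.

```shell
# bench RUNS COMMAND [ARGS...]
# Runs COMMAND the given number of times and prints the elapsed whole seconds.
bench() {
    runs=$1
    shift
    start=$(date +%s)
    i=0
    while [ "$i" -lt "$runs" ]; do
        # grep exits non-zero when nothing matches; ignore that here.
        "$@" > /dev/null || :
        i=$((i + 1))
    done
    end=$(date +%s)
    echo $((end - start))
}

# Hypothetical usage, assuming grep.tcz is loaded so that plain "grep"
# is the GNU binary and "busybox grep" is the busybox applet:
#   bench 1000 busybox grep -nE '^123.$' "$INPUTFILE"
#   bench 1000 grep -nE '^123.$' "$INPUTFILE"
```

Because the timer only has one-second resolution, small run counts will mostly print 0 or 1.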

jazzbiker:
Hi Leee!

I too noticed one day that the busybox implementations of the coreutils tools are far from aiming at performance. Let's remember that the benefit of using busybox is saving space. And yes, the loss of performance can be incredible :-)

I guess your grep patterns include wildcards?

Leee:
Hi Rich, I didn't mean for that to sound like any kind of complaint - only to say that I was surprised at the amount of difference, speed wise.

If busybox were not optimized for size, what would be the point of it?

On the other hand, since I'm usually not unduly resource-constrained, I think I'll now be a little less reluctant to load the GNU versions of utilities.

As far as my script is concerned, it's just over 95 percent of the way to where I'm going to shut it off anyway, so I'm not too worried about speeding it up. It has already proved worthwhile, both by providing a few hours of amusement and by prompting me to better grok Tiny Core.

---
@jazzbiker - Yes - the grep patterns involve wild cards, matching the full line of the input except for the last character of the input:

--- Code: ---grep -nE "^${X}.$" ${INPUTFILE} |tail -1
--- End code ---
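
To make the pattern concrete, here is a small sketch of what `^${X}.$` matches: the value of X followed by exactly one more character. The file contents and the value of X are invented for illustration and are not the real data.

```shell
# Hypothetical input file: one entry per line.
INPUTFILE=$(mktemp)
printf '%s\n' 123 1234 12345 123a 1239 > "$INPUTFILE"

X=123
# Matches lines 2 (1234), 4 (123a) and 5 (1239); "123" is too short
# and "12345" is too long. tail keeps only the last match.
last=$(grep -nE "^${X}.$" "$INPUTFILE" | tail -n 1)
echo "$last"

rm -f "$INPUTFILE"
```

Note that X is pasted into the regex as-is, so this only behaves as described while X holds plain digits or other characters with no special meaning to grep -E.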

Rich:
Hi Leee

--- Quote from: Leee on October 18, 2023, 06:23:02 AM --- ... I didn't mean for that to sound like any kind of complaint - only to say that I was surprised at the amount of difference, speed wise. ...
--- End quote ---
I didn't take it as a complaint. When I first realized GNU utilities were faster, I
too was surprised by how much faster some of the commands were.

--- Quote --- ... @jazzbiker - Yes - the grep patterns involve wild cards, matching the full line of the input except for the last character of the input:

--- Code: ---grep -nE "^${X}.$" ${INPUTFILE} |tail -1
--- End code ---

--- End quote ---
So it appears you are searching for something and are only interested
in the most recent result. Well, I see one possibility for speeding that up.
If INPUTFILE is constantly having content appended to it, you are searching
the same sections of the file over and over again. One of the keys to speed is
don't compute the same thing over and over again. Save the result for later
use instead.

tail -n +N starts output at line N of a file, so it lets you skip the first N-1 lines.
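
A quick way to see where tail -n +N starts is to run it on a file with known lines. A minimal sketch with a made-up five-line sample file:

```shell
# Build a small five-line sample file (hypothetical data).
sample=$(mktemp)
printf '%s\n' one two three four five > "$sample"

# Output begins at line 3, so lines 1 and 2 are not printed.
from_third=$(tail -n +3 "$sample")
printf '%s\n' "$from_third"

rm -f "$sample"
```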

Considering your grep command includes returning the line number, you could
use that to set the value of N with error checking for an empty string.
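
The idea of feeding grep's line number back into N, with a check for the empty string, could look something like this. It is a sketch only: the `search_from` function name and its argument order are assumptions, not part of the original script.

```shell
# search_from FILE START PATTERN
# Searches FILE from line START onward and prints the line number,
# relative to the whole file, of the last match. Prints nothing and
# returns 1 when there is no match.
search_from() {
    file=$1
    start=$2
    pattern=$3
    hit=$(tail -n +"$start" "$file" | grep -nE "$pattern" | tail -n 1)
    # Error check: grep printed nothing, so there is no number to use.
    [ -n "$hit" ] || return 1
    # grep -n prints "number:text"; keep only the number.
    rel=${hit%%:*}
    # tail started at line START, so shift back to file line numbers.
    echo $((start + rel - 1))
}

# Hypothetical usage inside the outer loop:
#   if found=$(search_from "$INPUTFILE" "$N" "^${X}.$"); then
#       N=$found
#   fi
```

Leaving N unchanged when nothing is found means the next search simply covers the same range plus whatever has been appended since.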

Or you could use wc -l to find the length of the file prior to searching it. Then
after searching the file, save the file length for the next search:

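--- Code: ---CurrentLine=1

# Other code

# start of outer loop
# Reading from stdin makes wc print only the count, without the file name
LineCount=$(wc -l < "$INPUTFILE")
# tail reads the file from CurrentLine onward; grep searches tail's output
tail -n +"$CurrentLine" "$INPUTFILE" | grep -nE "^${X}.$" | tail -n 1
CurrentLine=$LineCount

# more code

# end of outer loop
--- End code ---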
If you need the actual line number of the file that grep returns:

--- Code: ---# Use the CurrentLine value that was passed to tail for that search
FileLinenumber=$((CurrentLine + GrepLinenumber - 1))
--- End code ---
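
As a sanity check on that arithmetic, the computed number can be used to print the line back out of the file. A small worked sketch with invented data, where the search starts at line 4:

```shell
# Hypothetical six-line log file.
log=$(mktemp)
printf '%s\n' 10a 20b 30c 40d 50e 40f > "$log"

CurrentLine=4
# tail outputs lines 4-6; grep numbers them 1-3, and the last match
# ("40f") is number 3 in that numbering.
GrepLinenumber=$(tail -n +"$CurrentLine" "$log" | grep -nE '^40.$' | tail -n 1 | cut -d: -f1)
FileLinenumber=$((CurrentLine + GrepLinenumber - 1))

# Print that line of the original file to confirm it is the match.
matched=$(sed -n "${FileLinenumber}p" "$log")
echo "$FileLinenumber $matched"

rm -f "$log"
```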
