Author Topic: (m|s|p) locate / sql ("db") vs grep / awk ("db.txt") (Read 7333 times)

mocore · « **on:** November 17, 2024, 05:27:39 AM »

reading trough "How to find which extension provides a file" topic

Quote from: patrikg on November 14, 2024, 05:24:28 PM

Maybe "some one"=me right now have little time to make a script that unsquashfs -l tcefile.tce list the tce files and make a db of that.
It goes very fast with sqlite3 to search.

which appears to be reducible to

0) mk index
1) search index (... with regexp?

https://swtch.com/%7Ersc/regexp/regexp1.html ftr a link worth saving

)
--
which i think (abstractly) is the answer to the question "How to find which extension provides a file"

see also this : https://jvns.ca/blog/2015/03/05/how-the-locate-command-works-and-lets-rewrite-it-in-one-minute/
and other references within
( eg : https://www2.eecs.berkeley.edu/Pubs/TechRpts/1983/CSD-83-148.pdf - Finding Files Fast (1983) . a (e)paper about locate’s design )

Code: [Select]

#So a slightly less space-efficient version of this whole locate business would be to create a database 
#with this Very Sophisticated Command:

sudo find / > database.txt

#Then we can more or less reproduce locate’s functionality by just doing

grep fooBar database.txt

which is also afaik similar to how provides.sh works (awk + db.txt )
... with similar trade off wrt db size
(fwiw a topic i have given some thought and think likely with alittle more awk many db.txt repartitions could be reduced)

https://en.wikipedia.org/wiki/Locate_(Unix)

an explanation of (less-new?) alternative implementations (from locate to mlocate || slocate)
can be found hear https://unix.stackexchange.com/questions/273182/difference-between-locate-and-mlocate

and plocate , which seams to have "evolved" a little

Quote

plocate works by creating an inverted index over trigrams (combinations of three bytes) in the search strings, which allows it to rapidly narrow down the set of candidates to a very small list, instead of linearly scanning through every entry. It does nearly all I/O asynchronously using io_uring if available (Linux 5.1+), which reduces the impact of seek latency on systems without SSDs. Like mlocate and slocate, the returned file set is user-dependent, ie. a user will only see a file if find(1) would list it (all directories from the root have +rx permissions).

it would be interesting (in2024 (are'nt these times just.. btw))
to compare
- patrikg mksquashfs + sql script db with plocate
as both likely have many optimizations / improvements
not shore how hw dependent those years of optimization might be tbh!

- both sql/plocate with provides.sh and/or (as yet un implemented) alternative provides.db structure(smaller.db.txt) and modified/updated awk script

nick65go · « **Reply #1 on:** November 17, 2024, 08:30:59 AM »

@mocore: Nice ideas, thank you for the links. I am also curios about implementation (if ever tested) of "alternative provides.db structure".
Usually (but not always): simple is easier to change but not necesary faster. My understanding is that curaga had prefered not-compiled solutions (if I remember correctly).

Plus soon will be year 2025, implemetation is hardware dependently maybe, then compatibility with old 486 CPU (the i486 was introduced in 1989, so 36 years ago, wow; and x86_64 in 1999 !). My experience is that MOST (not all) hardware will fail after 5-8 years from new purchase, because planned for obsolence of the capitalism profitability.

PS: Personaly I am more interested on the fastest solution posible, not compatibility, because I will never buy again old hardware, even if sold as cheap/free. For me the hasle to make it work is not worth it. I prefer to use the computer for work/entertainment, not to spend time debugging it because glitches, spare parts, maintenance.

mocore · « **Reply #2 on:** November 19, 2024, 07:20:26 AM »

Quote from: nick65go on November 17, 2024, 08:30:59 AM

@mocore: Nice ideas, thank you for the links. I am also curios about implementation (if ever tested) of "alternative provides.db structure".
Usually (but not always): simple is easier to change but not necesary faster. My understanding is that curaga had prefered not-compiled solutions (if I remember correctly).

afair ...(looking at provides db) deep dir paths repeat the sub dir's
...
q)how much repetition?
a) awk ; count path depth ( split fields with "/" , count occurrences in "field depth" array )

Quote

eg: 14.x/x86 provides.db.gz
#awk -F'/' '{a[NF]++} END{ for( i in a) print i,a}' ./provides.db
0 2537
1 2545
2 8
3 89
4 2503
5 13529
6 29861
7 74329
8 54369
9 69432
10 30629
11 19964
12 8431
13 1978
14 773
15 146
16 40
17 1

something like
if "previous field" depth[n] = "current feild" depth[n] print "\t" ( or "," if csv is more applicable ?)
and always print the last field

gives a similar data/representation and reduces repetition

so

Quote

blah/blah/blah/
blah/blah/blah/foo
blah/blah/blah/bar

may be reduced to

Quote

blah/blah/blah/
\t\t\tfoo
\t\t\tbar

not shore if this format is more or less readable
perhaps that depends on the size/resolution of your screen !

also relevant (wrt how much is repeated) might be the length of the fields!

Code: [Select]

 # count field length stats in array 
awk -F'/' '{ for( x=1;x<NF;x++) { fl=length($x); if(fl > 0) fla[fl]++  } } {a[NF]++} END{ print "path depths"; for( i in a) print i,a[i] ; print "field length" ;for(y in fla) print y,fla[y] }' ./provides.db

Quote

field length
1 45514
2 23637
3 537433
4 103562
5 600283
6 102094
7 194277
8 71831
9 77091
10 60348
11 22838
12 32888
13 33783
14 10955
15 2816
16 3382
17 3430
18 1174
19 825
20 1126
21 295
22 2215
23 894
24 4698
25 549
26 146
27 354
28 79
29 57
30 89
31 660
32 90
33 29
34 6
35 22
36 25
37 14
38 99
39 110
44 2
49 2

*snow emoji*

ps my-mind is now boggling how to cross reference path depth and field size

yvs · « **Reply #3 on:** November 21, 2024, 03:07:50 AM »

I want to add that format and tools also depend on what kind of information is needed.

for example
If I'm looking for a package(extension) that contains some definite executable (for shell hints), I use once collected text file with

Quote

extension /exec1/exec2/exec3/...

lines, which easy to parse with simple grep/sed (that quite faster than awk I suppose).

In general case of filepaths' storage I think existed std tools like plocate/mlocate are enough and fast for lookups

I.e. goals define format and tools.

mocore · « **Reply #4 on:** November 21, 2024, 06:30:52 AM »

on the topic of exec speed ( and awk )
i happened to read this
( Lila: a Lil Interpreter in POSIX AWK https://beyondloom.com/blog/lila.html )
comparison between c / js and various awk's

Quote from: https://beyondloom.com/blog/lila.html

Performance Anxiety

Computers, one often hears, are pretty fast. Lila’s interpreter, however, is doing quite a bit of indirection and string acrobatics with each expression. Is it still usably performant, or are we looking at speeds rivaled by dead snails in molasses?

An experiment was performed on my 2020 M1 Macbook Air comparing the C and JS-based reference implementations of Lilt with Lila running under a variety of AWK implementations: mawk, gawk, goawk, and the BSD awk which ships with MacOS (v20200816 in this case). As my benchmarks, I used Mandel.lil, a naive Mandelbrot set renderer designed to hammer the Lil interpreter and garbage collector, and the Lil integration test suite:

Interpreter    Mandel    Tests
c-lilt (release)    0.08s    0.27s
c-lilt (debug)    0.36s    0.79s
js-lilt    0.42s    1.58s
mawk    3.61s    0.95s
gawk    6.58s    2.47s
goawk    10.66s    4.32s
awk    15.09s    162.97s

Of the AWKs tested, mawk makes a truly impressive showing. The “one-true” AWK fares much poorer, particularly with the integration suite. I ran a similar experiment on my 9-year old intel-based Macbook Air which provides awk version 20070501, and- curiously- even on that far humbler machine I was able to drastically outstrip the M1 with a more “modern” AWK release, completing the integration suite successfully in about 16 seconds.

Your mileage may vary, but overall I found that even on my slowest computers and their most ancient AWKs Lila offers a viably interactive REPL experience.

There aren’t many monolithic AWK scripts on the internet of a similar scale and complexity to lila.awk. If you’re the maintainer of an AWK- or you’re thinking of becoming one- Lila might be a useful tool for investigating performance and correctness.

exec speed aside , an often ignored metric is "speed of edit / test" ( ftr previously mentioned @ https://forum.tinycorelinux.net/index.php/topic,26145.msg168029.html#msg168029 )
another interesting instance of this observation

Quote from: https://github.com/raygard/wak/issues/7#issuecomment-2321967561

Also, thank you muchly for the tip on using tcc, it's amazing!
I am using your "script" trick and it's really nice to be able to tweak the code and run it immediately with no need for a separate compile.

the "trick" in question ftr ...

Quote from: https://github.com/raygard/wak/issues/7#issuecomment-2081189661

It works quite well with tcc too, I've just been putting

#!/usr/local/bin/tcc -run

at the top of the monolithic source file and running it like a script

( which wander how updated ttc in the repo is )

mocore · « **Reply #5 on:** November 21, 2024, 06:35:54 AM »

Quote from: yvs on November 21, 2024, 03:07:50 AM

I want to add that format and tools also depend on what kind of information is needed.
...
I.e. goals define format and tools.

compression ( currently core uses gzip for provides.db ) likely dose abetter job of de-duplication that altering the structure of the db

but if considering altering structure some of the plocate modifications (eg n-gram search) might be interesting to attempt to implement in awk (just for fun; )

yvs · « **Reply #6 on:** November 21, 2024, 01:42:29 PM »

> compression ( currently core uses gzip for provides.db ) likely dose abetter job of de-duplication that altering the structure of the db
> but if considering altering structure some of the plocate modifications (eg n-gram search) might be interesting to attempt to implement in awk (just for fun; )
>
Different goals.
In case of looking up for executable->extension pair in once collected text file (it's not provides.db and not compressed)
and using simple sed it takes too little time to get result, so that if even that takes 100-500msec I'd not bother to use db/sql or awk coding for a onetime query.
But the goal was only finding out extension that provides some executable, and no more. That what I had in mind.

If that's kindof complicated queries with many options on big amount of data, yes totally agree plocate is good.

nick65go · « **Reply #7 on:** November 22, 2024, 06:24:26 AM »

@mocore:

your reminder about tcc (tiny C compiler) is a sample of the reasons I still (from time of time) check this forum. Even if I focus (mainly) on productivity/efficiency, sometime I use time won by my efficiency for intellectual challenges (no money involved).

even to "old" tools could be improved without user intervention: https://www.phoronix.com/news/zlib-2.2-RC1
( Zlib-ng 2.2 Speeds Up Compression By ~12% On x86_64 CPUs) if not binding to restricted architecture.

mocore · « **Reply #8 on:** November 30, 2024, 03:01:50 AM »

Quote from: nick65go on November 17, 2024, 08:30:59 AM

My understanding is that curaga had prefered not-compiled solutions (if I remember correctly).

i wander for what reasons this might be?!

just happened to found some interesting "gray area" of not-compiled
https://github.com/udem-dlteam/pnut - A Self-Compiling C Transpiler Targeting Human-Readable POSIX Shell

/wtf

no clue how useful / performant / bug-free it might be
interesting nonetheless

mocore · « **Reply #9 on:** December 12, 2024, 06:43:45 AM »

wrt performance / awk / "data" ect https://benhoyt.com/writings/goawk-compiler-vm/

Quote

Why are virtual machines faster than tree-walking?

It’s not immediately obvious why compiling to virtual instructions and then executing them with a virtual machine is faster than evaluating a syntax tree (“tree-walking”).

It’s actually more work up-front: instead of just lexing and parsing into a syntax tree, we now also have a compile step. That said, virtual machine compilers (including GoAWK’s) are usually very simple and non-optimizing, so that step is fast.

One reason it’s faster to execute is this: RAM – which stands for Random Access Memory – is not actually random access on modern processors. Memory blocks are loaded into fast CPU caches as needed, so when you have to access a new block, it takes about 10x as long as if it’s in the cache. Peter Norvig’s table of timings for various operations on a typical CPU shows how fetching from level 1 cache takes about 0.5 nanosecond, fetching from level 2 cache 14x that long, and fetching from main memory another 14x!

Programming with this in mind is called “data-oriented design”. I was reminded of how much impact this makes when watching Andrew Kelley’s excellent talk, [1] A Practical Guide to Applying Data-Oriented Design. Andrew is the creator of the Zig programming language, and his talk describes how he significantly sped up the Zig compiler by applying data-oriented design techniques. That talk was what pushed me to think about this for GoAWK. But back to why a virtual machine is faster than tree-walking…

[1] "Andrew Kelley Practical Data Oriented Design " @ https://www.youtube.com/watch?v=IroPQ150F6c

other topics descending into hw spec's minucia
Re: Oldest Pc @ https://forum.tinycorelinux.net/index.php/topic,3216.15.html
Open Source Firmware Conference : I have come to bury the BIOS, not to open it @ https://forum.tinycorelinux.net/index.php/topic,25959.msg166511.html#msg166511

mocore · « **Reply #10 on:** December 26, 2024, 04:11:09 AM »

Quote from: mocore on November 30, 2024, 03:01:50 AM

Quote from: nick65go on November 17, 2024, 08:30:59 AM
My understanding is that curaga had prefered not-compiled solutions (if I remember correctly).

i wander for what reasons this might be?!

idk perhaps some part of the answer might be
to do with decide-ability
https://blog.reverberate.org/2013/08/parsing-c-is-literally-undecidable.html

yvs · « **Reply #11 on:** December 26, 2024, 04:34:12 AM »

>> prefered not-compiled solutions
> wander for what reasons this might be?!
>
if script execution time for some particular daily task is negligible, what could be a practical reason to use compiled solutions ...

mocore · « **Reply #12 on:** December 26, 2024, 06:48:49 AM »

Quote from: yvs on December 26, 2024, 04:34:12 AM

>> prefered not-compiled solutions
> wander for what reasons this might be?!
>
if script execution time for some particular daily task is negligible, what could be a practical reason to use compiled solutions ...

certainly an interesting question

attempting to articulate a coherent reply i happened to read

"Julia is not an interpreted Language" @ https://github.com/ninjaaron/administrative-scripting-with-julia/issues/7

Quote

Compilation matters, but not nearly as much as language semantics.

the discussion of definitions and related terms
and distinction between interpreted / complied
suggest each case to have its much of its own nuance

+ related
"A read–eval–print loop (REPL)" https://forum.tinycorelinux.net/index.php/topic,26193.0.html

see also : “Software is getting slower more rapidly than hardware becomes faster” https://forum.tinycorelinux.net/index.php/topic,26683.msg171907.html#msg171907

Rich · « **Reply #13 on:** December 26, 2024, 10:31:30 AM »

Hi yvs

Quote from: nick65go on November 17, 2024, 08:30:59 AM

... My understanding is that curaga had prefered not-compiled solutions (if I remember correctly). ...

I won't speak for curaga, but I can give a few reasons for this.

Transparency: The source code is the executable so it is present in
the environment it runs in. It can be examined if someone wants
to see what it's doing or how it's doing it.

I have personally learned a lot by examining the scripts roberts
created and observing how he accomplished various tasks.

Participation: A result of "Transparency" is accessibility. Anyone with
an editor can offer bug fixes or enhancements. No toolchain required.
It's as easy as:
1. Edit.
2. Save.
3. Run.
4. Repeat steps 1-3 until satisfied with the results.

Size: "Small" compiled programs (C) tend to have a minimum achievable
size when compiled. In TC10-x86 the size of waitforX is 1932 bytes. It's
about a dozen lines of code and mostly sleeps waiting for a display to be
created. With each new compiler version, the "minimum achievable size"
seems to increase. The linker scripts seem to add more padding too.
More info on gcc and program size:
https://forum.tinycorelinux.net/index.php/topic,23134.msg144804.html#msg144804

A shell script is the size of the text it is written in, always.

I'm sure there are other reasons, these are just the ones that came to mind.

yvs · « **Reply #14 on:** December 26, 2024, 03:04:38 PM »

Hi Rich,

> I'm sure there are other reasons, these are just the ones that came to mind.
> list of reasonable reasons

There's another one possible reason that sometimes forgotten: practical.
For example in scopes of this topic:
Let's say my goal is to find a package that provides a _file_. Note: not fullname, nor directoryname, just a basename. (I suppose it's the most common case).
Because I'm not interested in names that a part of directory trees, then format could be very simple "packagename flat-list-of-basenames" per line.
sed'ing time for /basename/ through text with so simple format is negligible, in my case it was much faster than I type 'sed' word.
In this case I'd not bother to think of databases at all. I.e. there was no need in a real db, and sed was more than fast for this particular practical task.

Tiny Core Linux

News:

Author Topic: (m|s|p) locate / sql ("db") vs grep / awk ("db.txt") (Read 7333 times)

mocore

(m|s|p) locate / sql ("db") vs grep / awk ("db.txt")

nick65go

Re: (m|s|p) locate / sql ("db") vs grep / awk ("db.txt")

mocore

Re: (m|s|p) locate / sql ("db") vs grep / awk ("db.txt")

yvs

Re: (m|s|p) locate / sql ("db") vs grep / awk ("db.txt")

mocore

Re: (m|s|p) locate / sql ("db") vs grep / awk ("db.txt")

mocore

Re: (m|s|p) locate / sql ("db") vs grep / awk ("db.txt")

yvs

Re: (m|s|p) locate / sql ("db") vs grep / awk ("db.txt")

nick65go

Re: (m|s|p) locate / sql ("db") vs grep / awk ("db.txt")

mocore

Re: (m|s|p) locate / sql ("db") vs grep / awk ("db.txt")

mocore

Re: (m|s|p) locate / sql ("db") vs grep / awk ("db.txt")

mocore

Re: (m|s|p) locate / sql ("db") vs grep / awk ("db.txt")

yvs

Re: (m|s|p) locate / sql ("db") vs grep / awk ("db.txt")

mocore

Re: (m|s|p) locate / sql ("db") vs grep / awk ("db.txt")

Rich

Re: (m|s|p) locate / sql ("db") vs grep / awk ("db.txt")

yvs

Re: (m|s|p) locate / sql ("db") vs grep / awk ("db.txt")