Off-Topic > Off-Topic - Tiny Tux's Corner

(m|s|p) locate / sql ("db") vs grep / awk ("db.txt")

(1/3) > >>

mocore:
reading trough "How to find which extension provides a file" topic

--- Quote from: patrikg on November 14, 2024, 05:24:28 PM ---Maybe "some one"=me right now :) have little time to make a script that unsquashfs -l tcefile.tce list the tce files and make a db of that.
It goes very fast with sqlite3 to search.


--- End quote ---

which appears to be reducible to

0) mk index
1) search index (... with regexp? ???  https://swtch.com/%7Ersc/regexp/regexp1.html  ftr a link worth saving 8)  )
--
which i think (abstractly) is the answer to the question "How to find which extension provides a file"

see also this : https://jvns.ca/blog/2015/03/05/how-the-locate-command-works-and-lets-rewrite-it-in-one-minute/
and other references within
( eg : https://www2.eecs.berkeley.edu/Pubs/TechRpts/1983/CSD-83-148.pdf - Finding Files Fast (1983) . a (e)paper about locate’s design )


--- Code: ---#So a slightly less space-efficient version of this whole locate business would be to create a database
#with this Very Sophisticated Command:

sudo find / > database.txt

#Then we can more or less reproduce locate’s functionality by just doing

grep fooBar database.txt



--- End code ---

which is also afaik similar to how provides.sh works (awk + db.txt )
... with similar trade off wrt db size
(fwiw a topic i have given some thought and think likely with alittle more awk many db.txt repartitions could be reduced)


https://en.wikipedia.org/wiki/Locate_(Unix)

an explanation of (less-new?) alternative implementations (from locate to mlocate || slocate)
can be found hear https://unix.stackexchange.com/questions/273182/difference-between-locate-and-mlocate

and  plocate , which seams to have "evolved" a little

--- Quote ---plocate works by creating an inverted index over trigrams (combinations of three bytes) in the search strings, which allows it to rapidly narrow down the set of candidates to a very small list, instead of linearly scanning through every entry. It does nearly all I/O asynchronously using io_uring if available (Linux 5.1+), which reduces the impact of seek latency on systems without SSDs. Like mlocate and slocate, the returned file set is user-dependent, ie. a user will only see a file if find(1) would list it (all directories from the root have +rx permissions).

--- End quote ---

it would be interesting (in2024 (are'nt these times just.. btw))
to compare
- patrikg mksquashfs + sql script db with plocate
  as both likely have many optimizations / improvements
  not shore how hw dependent those years of optimization might be tbh!  ::)

-  both sql/plocate with provides.sh and/or (as yet un implemented) alternative provides.db structure(smaller.db.txt) and modified/updated awk script

nick65go:
@mocore: Nice ideas, thank you for the links. I am also curios about implementation (if ever tested) of "alternative provides.db structure".
Usually (but not always): simple is easier to change but not necesary faster. My understanding is that curaga had prefered not-compiled solutions (if I remember correctly).

Plus soon will be year 2025, implemetation is hardware dependently maybe, then compatibility with old 486 CPU (the i486 was introduced in 1989, so 36 years ago, wow; and x86_64 in 1999 !). My experience is that MOST (not all) hardware will fail after 5-8 years from new purchase, because planned for obsolence of the capitalism profitability.

PS: Personaly I am more interested on the fastest solution posible, not compatibility, because I will never buy again old hardware, even if sold as cheap/free. For me the hasle to make it work is not worth it. I prefer to use the computer for work/entertainment, not to spend time debugging it because glitches, spare parts, maintenance.

mocore:

--- Quote from: nick65go on November 17, 2024, 08:30:59 AM ---@mocore: Nice ideas, thank you for the links. I am also curios about implementation (if ever tested) of "alternative provides.db structure".
Usually (but not always): simple is easier to change but not necesary faster. My understanding is that curaga had prefered not-compiled solutions (if I remember correctly).


--- End quote ---

afair ...(looking at provides db) deep dir paths repeat the sub dir's
...
q)how much repetition?
a) awk ; count path depth ( split fields with "/" , count occurrences in "field depth" array )

--- Quote ---  eg: 14.x/x86 provides.db.gz
 #awk -F'/' '{a[NF]++} END{ for( i in a) print i,a}' ./provides.db
0 2537
1 2545
2 8
3 89
4 2503
5 13529
6 29861
7 74329
8 54369
9 69432
10 30629
11 19964
12 8431
13 1978
14 773
15 146
16 40
17 1


--- End quote ---

something like
if "previous field" depth[n] = "current feild" depth[n]  print "\t" ( or "," if csv is more applicable ?) 
and always print the last field

gives a similar data/representation and reduces repetition

so

--- Quote ---blah/blah/blah/
blah/blah/blah/foo
blah/blah/blah/bar   

--- End quote ---
may be reduced to 

--- Quote ---blah/blah/blah/
\t\t\tfoo
\t\t\tbar

--- End quote ---

not shore if this format is more or less readable
perhaps that depends on the size/resolution of your screen !

also relevant (wrt how much is repeated) might be the length of the fields!


--- Code: --- # count field length stats in array
awk -F'/' '{ for( x=1;x<NF;x++) { fl=length($x); if(fl > 0) fla[fl]++  } } {a[NF]++} END{ print "path depths"; for( i in a) print i,a[i] ; print "field length" ;for(y in fla) print y,fla[y] }' ./provides.db
--- End code ---


--- Quote ---field length
1 45514
2 23637
3 537433
4 103562
5 600283
6 102094
7 194277
8 71831
9 77091
10 60348
11 22838
12 32888
13 33783
14 10955
15 2816
16 3382
17 3430
18 1174
19 825
20 1126
21 295
22 2215
23 894
24 4698
25 549
26 146
27 354
28 79
29 57
30 89
31 660
32 90
33 29
34 6
35 22
36 25
37 14
38 99
39 110
44 2
49 2

--- End quote ---

*snow emoji*

ps my-mind is now boggling how to cross reference path depth and field size ??? 

yvs:
I want to add that format and tools also depend on what kind of information is needed.

for example
If I'm looking for a package(extension) that contains some definite executable (for shell hints), I use once collected text file with

--- Quote ---extension /exec1/exec2/exec3/...

--- End quote ---
lines, which easy to parse with simple grep/sed (that quite faster than awk I suppose).

In general case of filepaths' storage I think existed std tools like plocate/mlocate are enough and fast for lookups

I.e. goals define format and tools.

mocore:
on the topic of exec speed ( and awk )
i happened to read this
( Lila: a Lil Interpreter in POSIX AWK https://beyondloom.com/blog/lila.html )
 comparison between c / js and various awk's


--- Quote from: https://beyondloom.com/blog/lila.html ---Performance Anxiety

Computers, one often hears, are pretty fast. Lila’s interpreter, however, is doing quite a bit of indirection and string acrobatics with each expression. Is it still usably performant, or are we looking at speeds rivaled by dead snails in molasses?

An experiment was performed on my 2020 M1 Macbook Air comparing the C and JS-based reference implementations of Lilt with Lila running under a variety of AWK implementations: mawk, gawk, goawk, and the BSD awk which ships with MacOS (v20200816 in this case). As my benchmarks, I used Mandel.lil, a naive Mandelbrot set renderer designed to hammer the Lil interpreter and garbage collector, and the Lil integration test suite:

Interpreter    Mandel    Tests
c-lilt (release)    0.08s    0.27s
c-lilt (debug)    0.36s    0.79s
js-lilt    0.42s    1.58s
mawk    3.61s    0.95s
gawk    6.58s    2.47s
goawk    10.66s    4.32s
awk    15.09s    162.97s

Of the AWKs tested, mawk makes a truly impressive showing. The “one-true” AWK fares much poorer, particularly with the integration suite. I ran a similar experiment on my 9-year old intel-based Macbook Air which provides awk version 20070501, and- curiously- even on that far humbler machine I was able to drastically outstrip the M1 with a more “modern” AWK release, completing the integration suite successfully in about 16 seconds.

Your mileage may vary, but overall I found that even on my slowest computers and their most ancient AWKs Lila offers a viably interactive REPL experience.

There aren’t many monolithic AWK scripts on the internet of a similar scale and complexity to lila.awk. If you’re the maintainer of an AWK- or you’re thinking of becoming one- Lila might be a useful tool for investigating performance and correctness.

--- End quote ---

exec speed aside , an often ignored metric is  "speed of edit / test" ( ftr previously mentioned @ https://forum.tinycorelinux.net/index.php/topic,26145.msg168029.html#msg168029 )
another interesting instance of this observation

--- Quote from: https://github.com/raygard/wak/issues/7#issuecomment-2321967561 ---Also, thank you muchly for the tip on using tcc, it's amazing!
 I am using your "script" trick and it's really nice to be able to tweak the code and run it immediately with no need for a separate compile.

--- End quote ---
the "trick" in question ftr ...

--- Quote from: https://github.com/raygard/wak/issues/7#issuecomment-2081189661 ---It works quite well with tcc too, I've just been putting

#!/usr/local/bin/tcc -run

at the top of the monolithic source file and running it like a script

--- End quote ---
( which wander how updated ttc in the repo is )

Navigation

[0] Message Index

[#] Next page

Go to full version