WelcomeWelcome | FAQFAQ | DownloadsDownloads | WikiWiki

Author Topic: [Solved]: how to sort provides.db?  (Read 3113 times)

Offline GNUser

  • Wiki Author
  • Hero Member
  • *****
  • Posts: 1528
[Solved]: how to sort provides.db?
« on: March 22, 2023, 04:49:51 PM »
Hello, my smart friends and Rich the wizard.

I need help solving a shell scripting problem I've been mulling over:

If I append a custom package name (and list of its contents) to the bottom of provides.db, how to then sort provides.db so that what I appended is moved to the correct place in the file? (Using awk-speak, how to sort provides.db by the first field--namely, by extension name--where each record is an extension name + its contents?)

I can think of various kludges (e.g., concatenating each extension name and its contents onto a single line, then sorting the lines by first column) but not an efficient and elegant (i.e., one-liner) solution.
« Last Edit: March 22, 2023, 08:53:56 PM by Rich »

Online Rich

  • Administrator
  • Hero Member
  • *****
  • Posts: 11667
Re: how to sort provides.db?
« Reply #1 on: March 22, 2023, 05:30:26 PM »
Hi GNUser
I think  awk  is probably the right tool for this.
This may contain some clues:
https://stackoverflow.com/questions/17048188/how-to-use-awk-sort-by-column-3
They talk about  lines (or rows)  and  columns. That's the same as  records  and  fields
just laid out differently and with different delimiters.

I think you want to sort on field 1 ($1) and print records ($0). I think some versions
of awk have  asorti  which does a lot of the work for you:
https://opensource.com/article/19/11/how-sort-awk

Offline GNUser

  • Wiki Author
  • Hero Member
  • *****
  • Posts: 1528
Re: how to sort provides.db?
« Reply #2 on: March 22, 2023, 08:18:15 PM »
Hi Rich. As usual, you're the man with the answers. The asorti function (available in GNU awk) was the key.

Code: [Select]
$ tce-load -wi gawk
...

$ cat sort.awk
#!/usr/local/bin/gawk -f
# usage: ./sort.awk -v var=FIELD FILE

BEGIN {
FS="\n";
RS="";
IGNORECASE=1;
}

{ # dump each field into an array
ARRAY[$var] = $R;
}

END {
asorti(ARRAY,SARRAY);
# get length
j = length(SARRAY);
   
for (i = 1; i <= j; i++) {
printf("%s\n\n", ARRAY[SARRAY[i]])
}
}

$ time ./sort.awk -v var=1 ./provides.db >provides.db-sorted
real 0m 0.17s
user 0m 0.12s
sys 0m 0.06s

Given how quickly GNU awk is able to sort provides.db, I'd say this problem is more than solved. The problem is crushed.


Offline CNK

  • Wiki Author
  • Sr. Member
  • *****
  • Posts: 287
Re: [Solved]: how to sort provides.db?
« Reply #3 on: March 22, 2023, 09:28:54 PM »
Ahh rats, you got there before me, and the Awk solution is much faster that the 1-2s that mine takes on my system.

Also I forgot about the appended entries and assumed the task was adding a separate extension.tcz.list to the standard provides.db.

But I'll at least claim bonus points for using only Busybox tools.

Code: [Select]
#!/bin/ash
# Usage: insert_provides.sh [extension]
# Creates new_provides.db

extension=$1
provides=/etc/sysconfig/tcedir/provides.db
headings="`grep -n '^[^/]*\.tcz$' $provides`"
line="`echo -e \"$headings\n:$extension\" | sort -t : -k 2 -f | grep -A 1 :$extension | tail -n 1 | cut -d : -f 1`"
sed -e ${line}i\\"$extension\\n`sed 's/$/\\\/g' $extension.list`"$'\n' $provides > new_provides.db

Oh, the forum's actually going to let me post a script with all those slashes in it though, so that's a win!

Online Rich

  • Administrator
  • Hero Member
  • *****
  • Posts: 11667
Re: [Solved]: how to sort provides.db?
« Reply #4 on: March 22, 2023, 10:24:08 PM »
Hi GNUser
... Given how quickly GNU awk is able to sort provides.db, I'd say this problem is more than solved. The problem is crushed.
There's a reason roberts liked to inject awk snippets into his scripts. When it
comes to data manipulation, it can be wicked fast.

I've had a few instances were I found the execution time of a script unacceptable
and was forced to add an awk function. None of my techniques could even touch
the speed of awk.

Online Rich

  • Administrator
  • Hero Member
  • *****
  • Posts: 11667
Re: [Solved]: how to sort provides.db?
« Reply #5 on: March 22, 2023, 10:39:17 PM »
Hi CNK
... But I'll at least claim bonus points for using only Busybox tools. ...
Sorry, busybox tools includes awk:
Code: [Select]
tc@E310:~$ busybox
BusyBox v1.29.3 (2018-12-19 15:29:37 UTC) multi-call binary.
BusyBox is copyrighted by many authors between 1998-2015.
Licensed under GPLv2. See source distribution for detailed
copyright notices.

 ----- Snip -----

Currently defined functions:
        [, [[, addgroup, adduser, adjtimex, ar, arp, arping, ash, awk,
 ----- Snip -----

Bonus points denied.  ::)

Offline CNK

  • Wiki Author
  • Sr. Member
  • *****
  • Posts: 287
Re: [Solved]: how to sort provides.db?
« Reply #6 on: March 22, 2023, 10:52:36 PM »
Hi CNK
... But I'll at least claim bonus points for using only Busybox tools. ...
Sorry, busybox tools includes awk:
Code: [Select]
tc@E310:~$ busybox
BusyBox v1.29.3 (2018-12-19 15:29:37 UTC) multi-call binary.
BusyBox is copyrighted by many authors between 1998-2015.
Licensed under GPLv2. See source distribution for detailed
copyright notices.

 ----- Snip -----

Currently defined functions:
        [, [[, addgroup, adduser, adjtimex, ar, arp, arping, ash, awk,
 ----- Snip -----

Bonus points denied.  ::)

Ah but I made sure to check that mine was compatible with the Busybox versions of all the tools, whereas:
Quote
The asorti function (available in GNU awk) was the key.

Bonus points reinstated?

Offline GNUser

  • Wiki Author
  • Hero Member
  • *****
  • Posts: 1528
Re: [Solved]: how to sort provides.db?
« Reply #7 on: March 23, 2023, 08:15:32 AM »
I like both solutions.

Rich's helped me solved a problem I've found daunting for a long time. Very satisfying, especially considering how fast awk dispatches the job.

I like your script, CNK, especially these three ideas:
1. use only busybox tools
2. put the .list file contents in the correct place to begin with (rather than append then sort)
3. grep -A 1

CNK, I want to add two ideas:
1. If you have a local mirror and have already added new extension to info.lst and sorted info.lst, there is no need for this:
headings="`grep -n '^[^/]*\.tcz$' $provides`"
2. The r command to sed I think would work nicely here.

Here is my variation on your idea:

Code: [Select]
#!/bin/sh
# Usage: insert_provides.sh [extension]
# Creates new_provides.db
# Example usage: insert_provides.sh foo.tcz
# This script assumes info.lst already contains foo.tcz in the correct place

extension=$1
info=/path/to/info.lst
provides=/path/to/provides.db

next_tcz=$(grep -A 1 "$1" $info | tail -1)
next_tcz_line=$(grep -n "^$next_tcz$" $provides | cut -d: -f1)
target_line=$(( next_tcz_line - 1 ))

echo "$1" >/tmp/insertion.txt
cat $1.tcz.list >>/tmp/insertion.txt
echo "" >>/tmp/insertion.txt

sed "$target_line r /tmp/insertion.txt" $provides >new_provides.db
« Last Edit: March 23, 2023, 08:21:06 AM by GNUser »

Online Rich

  • Administrator
  • Hero Member
  • *****
  • Posts: 11667
Re: [Solved]: how to sort provides.db?
« Reply #8 on: March 23, 2023, 02:52:58 PM »


Hi CNK
While busybox awk does not natively include a sort function, it can do
the heavy lifting of converting record and field separators to a format
that sort can handle and then back to their original values after sorting
has completed:
Code: [Select]
#!/bin/busybox ash

. /etc/init.d/tc-functions
useBusybox

PATH="/bin:/sbin:/usr/bin:/usr/sbin"
export PATH

# busybox sort
awk 'BEGIN {FS="\n";RS=""} {OFS=";"; $1=$1; print $0}' provides-10.x-x86.db | sort -fd | awk 'BEGIN {FS=";";RS="\n"} {OFS="\n"; ORS="\n\n"; $1=$1; print $0}' > tmp.txt

# coreutils sort
#awk 'BEGIN {FS="\n";RS=""} {OFS=";"; $1=$1; print $0}' provides-10.x-x86.db | /usr/local/bin/sort -fd | awk 'BEGIN {FS=";";RS="\n"} {OFS="\n"; ORS="\n\n"; $1=$1; print $0}' > tmp.txt

# awk without sort
#awk 'BEGIN {FS="\n";RS=""} {OFS=";"; $1=$1; print $0}' provides-10.x-x86.db | awk 'BEGIN {FS=";";RS="\n"} {OFS="\n"; ORS="\n\n"; $1=$1; print $0}' > tmp.txt

This is the timing using busybox sort:
Code: [Select]
tc@E310:~/awksort$ time ./awksort.sh
real    0m 4.87s
user    0m 4.52s
sys     0m 0.56s

This is the timing using coreutils sort:
Code: [Select]
tc@E310:~/awksort$ time ./awksort.sh
real    0m 1.44s
user    0m 0.97s
sys     0m 0.52s

This is the timing without sort in the pipeline:
Code: [Select]
tc@E310:~/awksort$ time ./awksort.sh
real    0m 1.06s
user    0m 1.16s
sys     0m 0.35s

Running a  diff  on the 2 files shows 2 small differences:
libev-dev.tcz came before libevdev.tcz in the original but sort swapped them.
gst-ffmpeg.tcz  had 2 blank lines preceding it in the original instead of 1.

... Bonus points reinstated?
Nope, reallocated to myself.  :P

Online Rich

  • Administrator
  • Hero Member
  • *****
  • Posts: 11667
Re: [Solved]: how to sort provides.db?
« Reply #9 on: March 23, 2023, 05:31:42 PM »
Hi CNK
Here are a couple of quicksort examples for busybox awk:
Code: [Select]
# http://www.victor-notes.com/qppro/awk.html

BEGIN { RS = ""; FS = "\n" }

      { A[NR] = $0 }

END   {
qsort(A, 1, NR)
for (i = 1; i <= NR; i++) {
print A[i]
if (i == NR) break
print ""
}
      }

# QuickSort
# Source: "The AWK Programming Language", by Aho, et.al., p.161
function qsort(A, left, right,   i, last) {
if (left >= right)
return
swap(A, left, left+int((right-left+1)*rand()))
last = left
for (i = left+1; i <= right; i++)
if (A[i] < A[left])
swap(A, ++last, i)
swap(A, left, last)
qsort(A, left, last-1)
qsort(A, last+1, right)
}
function swap(A, i, j,   t) {
t = A[i]; A[i] = A[j]; A[j] = t
}

Timing:
Code: [Select]
tc@E310:~/awksort$ time /usr/bin/awk -f quicksort.sh provides-10.x-x86.db > sorted.txt
real    0m 1.74s
user    0m 1.22s
sys     0m 0.52s

And a faster version:
Code: [Select]
# http://www.victor-notes.com/qppro/awk.html

BEGIN { RS = ""; FS = "\n" }

      { A[NR] = $0 }

END   {
qsort(A, 1, NR)
for (i = 1; i <= NR; i++) {
print A[i]
if (i == NR) break
print ""
}
      }


# https://community.hpe.com/t5/languages-and-scripting/sort-in-awk-is-very-slow/m-p/2720834#M2904
# qsort expects 3 arguments
# 1 - the array to be sorted
# 2 - the array lower bound (1)
# 3 - the number of elements in the array
# Note : The other arguments are local variables
#        and are not passed as parameters       

function qsort(arry,lo,hi,           i,j,tmp_x,tmp_x2)
{
  i = lo + 0
  j = hi + 0
  tmp_x = arry[int(int(lo + hi) / 2)]
  do
    {
      while (arry[i] < tmp_x) ++i
      while (tmp_x < arry[j]) --j
      if (i <= j)
        {
          tmp_x2 = arry[i]
          arry[i] = arry[j]
          arry[j] = tmp_x2
          ++i
          --j
        }
    }
  while (i <= j)
  if (lo < j) qsort(arry,lo,j)
  if (i < hi) qsort(arry,i,hi)
  return(0)
} # qsort

And timing:
Code: [Select]
tc@E310:~/awksort$ time /usr/bin/awk -f quicksort2.sh provides-10.x-x86.db > sorted2.txt
real    0m 1.03s
user    0m 0.76s
sys     0m 0.27s

Offline CNK

  • Wiki Author
  • Sr. Member
  • *****
  • Posts: 287
Re: [Solved]: how to sort provides.db?
« Reply #10 on: March 23, 2023, 06:26:11 PM »
Thanks Rich for the Awk tips. I try not to deal with too many languages because I don't have the memory for it, so if I care about performance (for something like this I wouldn't, personally) then I'll switch straight from shell scripting to C. Indeed I spent 90% of my time messing about with Sed doing the script from before because I'd forgotten a little too much of the syntax for the insertion command with that.

As it is, between Shell/Bash, C, and PHP, I find myself back at PHP 101 after a few months not touching any such code and I'm only really saved by the fact that it's so much like C (until it isn't). Awk is way too much of a niche for me to bother with (EDIT: even though it's like C as well).

@GNUser: I never knew of info.lst (info.lst.gz is found in the tcz directory on the mirrors, I see). Yep that saves lots of Sed witchcraft which had me confused earlier.
« Last Edit: March 23, 2023, 06:30:09 PM by CNK »

Online Rich

  • Administrator
  • Hero Member
  • *****
  • Posts: 11667
Re: [Solved]: how to sort provides.db?
« Reply #11 on: March 23, 2023, 08:53:29 PM »
Hi CNK
... I try not to deal with too many languages because I don't have the memory for it, ...
Anytime I need to use  sed  or  awk  I wind up using Google for help getting the syntax right.
And if I'm having trouble with a shell script, I'll often peruse roberts scripts for ideas.