site-josuah

commit e1935709a25ce4e1f6d4a3c50d20110cb8531532
parent 8387919d4d9add463004e4edf9536dcd147ace2e
Author: Josuah Demangeon <me@josuah.net>
Date:   Sun, 19 Apr 2020 15:29:38 +0200

wiki/awk: add implementation of various useful functions

Diffstat:
M wiki/awk/index.md | 225 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++------------
1 file changed, 192 insertions(+), 33 deletions(-)

diff --git a/wiki/awk/index.md b/wiki/awk/index.md
@@ -8,7 +8,7 @@ input in fields by default.
 Not everything is parsed efficiently with AWK, Type-Length-Value for instance,
 but many things are. I use it for multiple projects:
 
- * [[NotWiki]], featuring a (not)markdown [[parser]] that does two passes on
+ * [[NotWiki]], featuring a (not)markdown parser that does two passes on
    to easen-up the parsing,
  * [[ics2txt]], a basic iCal to TSV or plain text converter (two directions),
 
@@ -22,35 +22,6 @@ but many things are. I use it for multiple projects:
 [parser]: //code.z0.is/git/notwiki/files/
 [jj]: /wiki/jj/
 
-Local variables in functions
-----------------------------
-By default, all awk variables are global, which is inconvenient for writing
-functions. The solution is to add an extra function argument at the end for
-each local variable we need.
-
-Functions can be called with fewer arguments than they have.
-
-    $ awk '
-    function concat3(arg1, arg2, arg3,
-        loc)
-    {
-        loc = arg1 arg2 arg3
-        return loc
-    }
-
-    BEGIN {
-        loc = 1
-        print(concat3("a", "w", "k"))
-        print(loc)
-    }
-    '
-    awk
-    1
-
-I learned this with the [jj] project.
-
-[jj]: https://github.com/aaronNGi/jj/
-
 CSV fields with header
 ----------------------
 
@@ -70,7 +41,7 @@ without breaking the script.
     $F["domain_name"] ~ /\.com$/ {
         print $F["expiry_date"], $F["owner"], $F["domain_name"]
     }
-    '
+    ' input.txt
     2020-03 me nowhere.com
     2020-04 you perdu.com
 
@@ -135,7 +106,6 @@ for instance, to extract an abstract out of a basic iCal file:
     CATEGORIES:Internet
     LOCATION:Janson
     END:VEVENT
-
     $ awk '
     BEGIN { FS = ":" }
     { F[$1] = $2 }
@@ -143,7 +113,196 @@ for instance, to extract an abstract out of a basic iCal file:
         print F["SUMMARY"] " - " F["DESCRIPTION"]
         print F["DTSTART"], "(" F["TZID"] ")"
     }
-    '
+    ' input.txt
     State of the Onion - Building usable free software to fight surveillance and
     censorship.
     20200201T170000 (Europe-Brussels)
+
+Edit variables passed to functions
+----------------------------------
+For languages that support references, pointers, or objects, it is possible to
+edit the variable passed to a function, so that the variable also gets edited
+in the function that called it.
+
+    void increment(int *i) { (*i)++; }
+
+Awk passes numbers and strings by value, so a function cannot change them for
+its caller, but arrays are passed by reference and their fields can be edited:
+
+    function increment_first(arr) { arr[1]++ }
+
+
+Local variables in functions
+----------------------------
+By default, all awk variables are global, which is inconvenient for writing
+functions. The solution is to add an extra function argument at the end for
+each local variable we need.
+
+Functions can be called with fewer arguments than they have.
+
+    $ awk '
+    function concat3(arg1, arg2, arg3,
+        local1)
+    {
+        local1 = arg1 arg2 arg3
+        return local1
+    }
+
+    BEGIN {
+        local1 = 1
+        print(concat3("a", "w", "k"))
+        print(local1)
+    }
+    '
+    awk
+    1
+
+I learned this with the [jj] project.
+
+[jj]: https://github.com/aaronNGi/jj/
+
+
+A sort() function
+-----------------
+A very convenient feature missing from awk is support for sorting the members
+of an array. It is possible to implement sort() in awk (this is a quicksort):
+
+    function swap(array, a, b,
+        tmp)
+    {
+        tmp = array[a]
+        array[a] = array[b]
+        array[b] = tmp
+    }
+
+    function sort(array, beg, end,    a, b)  # a, b: locals, needed for recursion
+    {
+        if (beg >= end)  # end recursion
+            return
+
+        a = beg + 1  # 1st is the pivot, so +1
+        b = end
+        while (a < b) {
+            while (a < b && array[a] <= array[beg])  # beg: skip lesser
+                a++
+            while (a < b && array[b] > array[beg])  # end: skip greater
+                b--
+            swap(array, a, b)  # found 2 misplaced
+        }
+
+        if (array[beg] > array[a])  # put the pivot back
+            swap(array, beg, a)
+
+        sort(array, beg, a - 1)  # sort lower half
+        sort(array, a, end)  # sort higher half
+    }
+
+This sorts the values of the array using integer keys: array[1], array[2], ...
+It sorts the elements from array[beg] to array[end] inclusive, so the array can
+start at 0 or at 1, and it is possible to sort just a part of the array.
+
+Example usage with both functions above:
+
+    {
+        LINES[NR] = $0
+    }
+
+    END {
+        sort(LINES, 1, NR)
+        for (i = 1; i <= NR; i++)
+            print(LINES[i])
+    }
+
+Performance is far from terrible: about 4 times slower than sort(1) here.
+
+    $ od -An /dev/urandom | head -n 1000000 | time ./test.awk >/dev/null
+    real    0m 19.23s
+    user    0m 17.90s
+    sys     0m 0.12s
+
+    $ od -An /dev/urandom | head -n 1000000 | time sort >/dev/null
+    real    0m 4.39s
+    user    0m 3.00s
+    sys     0m 0.10s
+
+
+A gmtime() function
+-------------------
+POSIX awk as well as many implementations lack the [time functions][tf] present in
+GNU awk. This gmtime() function splits an epoch integer value (1587302158) into the
+fields year, mon, mday, hour, min, sec (2020-04-19T13:15:58Z):
+
+[tf]: https://www.gnu.org/software/gawk/manual/html_node/Time-Functions.html
+
+    function isleap(year)
+    {
+        return (year % 4 == 0) && (year % 100 != 0) || (year % 400 == 0)
+    }
+
+    function mdays(mon, year)
+    {
+        return (mon == 2) ? (28 + isleap(year)) : (30 + (mon + (mon > 7)) % 2)
+    }
+
+    function gmtime(sec, tm,    s)  # s: local, seconds in the current unit
+    {
+        tm["year"] = 1970
+        while (sec >= (s = 86400 * (365 + isleap(tm["year"])))) {
+            tm["year"]++
+            sec -= s
+        }
+
+        tm["mon"] = 1
+        while (sec >= (s = 86400 * mdays(tm["mon"], tm["year"]))) {
+            tm["mon"]++
+            sec -= s
+        }
+
+        tm["mday"] = 1
+        while (sec >= (s = 86400)) {
+            tm["mday"]++
+            sec -= s
+        }
+
+        tm["hour"] = 0
+        while (sec >= 3600) {
+            tm["hour"]++
+            sec -= 3600
+        }
+
+        tm["min"] = 0
+        while (sec >= 60) {
+            tm["min"]++
+            sec -= 60
+        }
+
+        tm["sec"] = sec
+    }
+
+The tm array is filled with fields named after the struct tm members of the
+[[gmtime]] function (year is the full year and mon starts at 1, unlike in C).
+
+[gmtime]: https://pubs.opengroup.org/onlinepubs/9699919799/functions/gmtime.html
+
+
+A localtime() function
+----------------------
+For printing times in the user's favorite timezone, gmtime's time needs
+to be shifted.
+This can also be done in standard awk by calling the date(1) command:
+
+    function localtime(sec, tm,
+        tz, h, m)
+    {
+        if (!TZOFFSET) {
+            "date +%z" | getline tz
+            close("date +%z")
+            h = substr(tz, 2, 2)
+            m = substr(tz, 4, 2)
+            TZOFFSET = substr(tz, 1, 1) (h * 3600 + m * 60)  # sign, then seconds
+        }
+        return gmtime(sec + TZOFFSET, tm)
+    }
+
+Note that date(1) is only run the first time localtime() is called; the result
+is cached in the TZOFFSET global variable and reused for the next calls.
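
The gmtime() function above can be checked against the epoch value used as an example in the text. The following is a minimal sketch, not part of the commit: it wraps a copy of isleap(), mdays() and gmtime() in a shell function. The `epoch2iso` name and the `epoch` variable are made up for illustration, and `s` is declared as a local variable of gmtime() here.

```shell
# Hypothetical helper (not part of the commit): print an epoch as UTC ISO 8601.
epoch2iso() {
    awk -v epoch="$1" '
    function isleap(year)
    {
        return (year % 4 == 0) && (year % 100 != 0) || (year % 400 == 0)
    }

    function mdays(mon, year)
    {
        return (mon == 2) ? (28 + isleap(year)) : (30 + (mon + (mon > 7)) % 2)
    }

    # Same algorithm as in the wiki page; "s" is an extra argument used as a local.
    function gmtime(sec, tm,    s)
    {
        tm["year"] = 1970
        while (sec >= (s = 86400 * (365 + isleap(tm["year"])))) {
            tm["year"]++
            sec -= s
        }

        tm["mon"] = 1
        while (sec >= (s = 86400 * mdays(tm["mon"], tm["year"]))) {
            tm["mon"]++
            sec -= s
        }

        tm["mday"] = 1
        while (sec >= 86400) {
            tm["mday"]++
            sec -= 86400
        }

        tm["hour"] = 0
        while (sec >= 3600) {
            tm["hour"]++
            sec -= 3600
        }

        tm["min"] = 0
        while (sec >= 60) {
            tm["min"]++
            sec -= 60
        }

        tm["sec"] = sec
    }

    BEGIN {
        gmtime(epoch, tm)
        printf("%04d-%02d-%02dT%02d:%02d:%02dZ\n",
            tm["year"], tm["mon"], tm["mday"], tm["hour"], tm["min"], tm["sec"])
    }
    '
}

epoch2iso 1587302158   # prints 2020-04-19T13:15:58Z
```

The output is in UTC. A leap day such as `epoch2iso 951782400` (2000-02-29) is a quick way to exercise isleap() and mdays() together.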