Linux 2 — Advanced command line use and Bash programming

Manipulating text

Regular expressions

In its simplest form, a regular expression is a string of symbols to match “as is”.

Regex   Matches
abc     abcdef
234     12345

grep '234'

Basic regexes vs. extended regexes

The Basic Regular Expressions (BRE) flavor standardizes a flavor similar to the one used by the traditional UNIX grep command. The only supported metacharacters are . (dot), ^ (caret), $ (dollar), and * (star); of these, * is the only quantifier. To match these characters literally, escape them with a \ (backslash).

Most modern regex flavors are extensions of the BRE flavor, and are thus called Extended Regular Expression (ERE) flavors. By today’s standards, the POSIX ERE flavor is rather bare bones. We will be using extended regexes, so:

alias grep='grep --color=auto -E'
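As a quick sanity check of the difference (a sketch, assuming GNU grep): in BRE, the + quantifier (one or more, introduced below) must be escaped, while in ERE it works as-is.

```shell
# BRE: + is a literal character unless escaped as \+
echo 12345 | grep '23\+4'      # prints: 12345
# ERE: + is a quantifier as-is
echo 12345 | grep -E '23+4'    # prints: 12345
# In ERE, \+ matches a literal plus sign instead
echo '23+4' | grep -E '23\+4'  # prints: 23+4
```

grep prints each input line that matches the regex.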

Quantifiers

To match several characters you need to use a quantifier:

Regex   Matches
23*4    1245, 12345, 123345
23?4    1245, 12345
23+4    12345, 123345

Does grep '23*4' match ‘123456’, ‘122456’ and/or ‘1233334456’?

Regexes are greedy

By default, regexes are greedy. They match as many characters as possible, as you found out in the previous question. You can define how many instances of a match you want by using ranges (interval quantifiers) such as {2,3}.
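For instance, the interval quantifier {m,n} matches between m and n repetitions of the preceding token (a sketch, assuming grep -E):

```shell
# 23{2,3}4 requires two or three 3s between the 2 and the 4
echo 1233345 | grep -E '23{2,3}4'                   # prints: 1233345
echo 12345   | grep -E '23{2,3}4' || echo no match  # prints: no match
```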

Special characters

A lot of special characters are available for regex building. Here are some of the more usual ones:

Especially anchors ^ and $ are often useful. How would you match an empty line, a line that contains no characters, or only spaces?

Regex     Matches               Does not match
1.3       1234, 1z3, 0133       13
1.*3      13, 123, 1zdfkj3
\w+@\w+   a@a, email@oy.ab      ,.-!"#€%&/
^1.*3$    13, 123, 1zdfkj3      x13, 123x, x1zdfkj3x

Character classes

You can group characters by putting them between square brackets. Such a character class matches any single one of the characters listed in it.
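For example (a small sketch using grep -E):

```shell
# [0-9] matches any single digit anywhere in the line
echo a5b | grep -E '[0-9]'                    # prints: a5b
# [^ab] is a negated class: any single character except a or b
echo ab | grep -E '[^ab]' || echo 'no match'  # prints: no match
```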

Can you find three string examples that match and three examples that do not match [^ab], ^[1-9][0-9]*$, [0-9]*[,.]?[0-9]+? Can you construct any tricky ones?

Grouping and alternatives

It might be necessary to group things together, which is done with parentheses ( and ).

Regex   Matches          Does not match
(ab)    ab, abab, aabb   aa, bb

Grouping itself usually does not do much, but combined with other features turns out to be very useful. The OR operator | may be used for alternatives.

Regex     Matches          Does not match
(aa|bb)   aa, bbaa, aabb   abab

You can grep all lines that contain either the word ‘Juha’ or ‘Thomas’ with grep -e Juha -e Thomas, by giving grep two separate regexes to match. How can you achieve the same with a single regex?

Subexpressions

With parentheses, you can also define subexpressions to store the match after it has happened and then refer to it later on.

Regex       Matches               Does not match
(ab)\1      ababcdcd              ab, abcabc
(ab)c.*\1   abcabc, abcdefabcdef  abc, ababc

The sed command uses a quite similar mechanism for search and replace.
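To illustrate that similarity (a hedged sketch, assuming GNU sed with -E): in the replacement part, \1 and \2 refer to what the groups matched.

```shell
# Swap the first two words on the line using groups and backreferences
echo 'hello world' | sed -E 's/(\w+) (\w+)/\2 \1/'   # prints: world hello
```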

Some practical(?) examples

Check for a valid format for email address:

grep -E '\w[A-Za-z0-9._+-]+[^.]@\w[A-Za-z0-9.-]+\.[A-Za-z]{2,}'


Does ‘firstname.lastname+noreply@helsinki.fi’ qualify as a valid email address? Can you send email to yourself, if you add ‘+something’ to the local part (the part before ‘@’) of your email address? Can you send email from command line, and if yes, with what command?

Text utilities

What we will work over

You can download the short text files used in the exercise with either of the commands below

wget https://jlento.github.io/linux-2/handson/{count,sheep}.txt
curl -O -O https://jlento.github.io/linux-2/handson/{count,sheep}.txt

Adding files side-by-side: paste

paste [-d del -s] file1 file2 [file3 …]

Merges lines of several input files.

Let’s try the following:

paste count.txt sheep.txt > counting_sheep_tab.txt
paste -d ' ' count.txt sheep.txt > counting_sheep_space.txt

What command can you use to make spaces and tabs “visible”, for example replacing them with _ and > in the previous paste examples?

Trimming files: cut

cut [-d del -f no -s] file1 file2 …

Extracts fields/columns from each line of files.
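A minimal sketch of -d and -f in action (the input lines here are made up for illustration):

```shell
# -d sets the field delimiter, -f selects which field(s) to keep
printf 'one:two:three\nuno:dos:tres\n' | cut -d: -f2
# prints:
# two
# dos
```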

How would you cut away the counting numbers column from counting_sheep_tab.txt?

Counting lines [and sheep]: wc

wc [-l -w -m -c] file1 [file2 …]

Counts lines, words, and characters or bytes in a file (wc stands for word count):

wc -l counting_sheep_space.txt

Combining files end to start: cat

cat [-n -E -v -T] file1 file2 …

Concatenates files and prints to stdout.

cat -n counting_sheep_space.txt > sheep_lines.txt
cat -T -E counting_sheep_tab.txt

Extracting beginning and end of files: head and tail

head [-n N] file1 [file2 …]

Extracts the head (first lines) of files.

tail [-n N -f --pid PID] file1 [file2 …]

Extracts the tail (last lines) of files.
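A quick sketch using seq-generated input instead of the exercise files:

```shell
# First two and last two lines of the numbers 1..10
seq 10 | head -n 2   # prints 1 and 2, one per line
seq 10 | tail -n 2   # prints 9 and 10, one per line
```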

How would you extract all lines after the 3rd line?

Bringing order into files: sort

sort [-d -f -g ] file1 [file2 …]
$ sort -d counting_sheep_space.txt
$ sort -g sheep_lines.txt
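The difference between dictionary and general numeric order shows up with multi-digit numbers (a sketch with made-up input):

```shell
# Dictionary order compares character by character, so "10" sorts before "2"
printf '10\n2\n1\n' | sort -d    # prints: 1, 10, 2 (one per line)
# General numeric sort compares the numeric values instead
printf '10\n2\n1\n' | sort -g    # prints: 1, 2, 10 (one per line)
```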

Removing redundancy in files: uniq

uniq [-c -f -s -w ] file1 [file2 …]

Filters adjacent matching (redundant) lines.

Skip the first field (the previously inserted line numbers), compare at most 10 characters (i.e., avoiding the later columns), and prefix each line with its number of occurrences (hint: also try with -f 2):

$ uniq -c -f 1 -w 10 sheep_lines.txt

Because uniq only compares adjacent lines, you may need to sort the text first. But is there an option in sort that drops redundant lines?

Awk - text processing

Awk was developed at Bell Labs in 1977 by Aho (not Esko, but Alfred Vainö!), Weinberger, and Kernighan. It is a versatile scripting language that resembles C (surprise! - Kernighan & Ritchie). It is powerful with spreadsheet-type / tabulated data.

Typical usage is in one-liners that match / reorder / format / calculate fields from existing tables of data, but longer programs are entirely possible.

The general idea of an Awk program is that it reads text line by line, checks which of the patterns given in the program match the line, and runs the associated block of code for each pattern that matches. In general, something like this:

pattern1{block-of-awk-commands}
pattern2{block-of-awk-commands}
...

A pattern can be a regular expression, or one of the many other kinds of expressions understood by Awk. If the pattern is omitted, the block of commands is executed for every line.
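A minimal concrete instance of the pattern–action structure (input made up for illustration):

```shell
# The action runs only for lines matching the pattern /an/
printf 'apple\nbanana\ncherry\n' | awk '/an/ {print "found:", $0}'
# prints: found: banana
```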

Remember to check man awk!

awk - command line

To print the second column (word) of a file, type the following to the terminal:

$ awk '{print $2}' /etc/mime.types

By default, Awk assumes that fields are separated by blanks (spaces and tabs). If you look at the file /etc/passwd, for example, you see that the fields are separated by colons :. How do you tell Awk to use : as the field separator and print the second column from /etc/passwd?

When you construct your own tabular data text files, which characters are “good” field separators, and why?

What do the file suffixes .csv and .tsv stand for?

awk pattern matching

Awk allows testing the input against a regular expression (enclosed in slashes / /):

awk '/regexp/ { action }' file

An exclamation mark inverts the match:

awk '!/regexp/ { action }' file

For example, to print all relevant lines in /etc/mime.types, i.e., to exclude all comment lines that start with #:

awk '!/^#/' /etc/mime.types | less

awk scripts

You can save your awk program to a separate file. Why should you?

The script is read using the -f option:

awk -f myscript.awk inputfile.txt > outputfile.txt

Special patterns

You can run actions before and after the text file is parsed. This is achieved by special patterns BEGIN and END. BEGIN is often used to initialize variables before the first input line has been read in, and END is usually used to print some summary information after input has been finished.

Let’s write a script to display all nologin accounts in the system. Use your favourite text editor to create a new file called nologin.awk. Fill it with the following contents and save it:

BEGIN {
  x=0
}
/nologin/ {
  x=x+1
  print x, " ...", $1
}
END {
  print "------------------"
  print "nologins=", x
}

Use the -f option to launch the script:

awk -f nologin.awk /etc/passwd

How would you show all users with login accounts instead?

How can you produce a similar result as with grep?

Field separator

You can use a regex to specify multiple separation characters. For example, here FS is either colon : or comma , (NF is the number of fields on the current line):

echo "0 1:2,3 4" | awk -F"[:,]" '{print "entries:" NF " last column: " $NF}'

Spot the difference when not using a regex (here the whole string ':,' is taken as the separator):

echo "0 1:2,3 4" | awk -F":," '{print "entries:" NF " last column:" $NF}'

or when also including a blank:

echo "0 1:2,3 4" | awk -F"[:, ]" '{print "entries:" NF " last column:" $NF}'

Counters of columns, rows and records

Awk fields are accessed through the variables $1, $2, …, $(NF-1), $(NF). NF (Number of Fields) is the number of fields on each line (the number of columns in the row).

echo "0 1:2,3 4" | awk -F"[:, ]" '{print "entries:" NF " first:" $1 " last:" $NF}'

Variable $0 refers to the whole input row.

awk -F":" '{printf "user: %s\n    whole line: %s\n", $1, $0}' /etc/passwd

printf enables formatted printout; we will discuss it in more detail later. NR (Number of Records) is the number of input records (lines) read so far:

awk 'END {print NR}' /etc/passwd

Much simpler still: wc -l /etc/passwd

Loops in awk

Arithmetic loops in awk are very much C-style:

for (countervar=initvalue; condition of validity; increment) {action}

e.g., displaying single fields in row:

awk -F: '{for (i=1; i<=NF; i++) {print i, $i}; print " "}' /etc/passwd

or to invert the order:

awk -F: '{for (i=NF; i>=1; i--) {print i, $i}; print " "}' /etc/passwd

or only odd-numbered fields:

awk -F: '{for (i=1; i<=NF; i=i+2) {print i, $i}; print " "}' /etc/passwd

There is a second style of for loop that is useful with array variables.

Output in awk

The generic print simply takes strings and/or variables:

awk -F: '{print "string", $2, $NF, NF, NR}' /etc/passwd

Alternatively, printf offers a wide range of C-style formatting capabilities, e.g.:

date | awk -F"[ :]" '{printf("Time=%2d hours and %2d minutes\n", $4, $5)}'

Remember to supply the newline \n in printf! The generic print adds it for you automatically. Formats are %d for integers, %f for floats, %e for scientific notation, %s for strings, etc. Field widths can be prescribed:

$ echo "1234.5678 910.16" | awk '{printf "%4.2f %1.3e \n", $1, $2}'

All in all, quite the same as printf in C or Bash.

Variables in awk

In addition to the already mentioned Awk internal variables NR, NF, $1, $2, …, Awk has user-defined variables. User-defined variables are conventionally written in lowercase.

Variables can be set inside script/command line

awk 'BEGIN{myvar="Hello !"; a=1; b=2; print myvar, a, "+", b "=", a+b}'

or can be passed to awk from outside,

awk -F: -v n=1 '{print $n}' /etc/passwd

Why is everything inside the BEGIN section in the first example?

In the second example, try with n=2, 3, …. What exactly does $n refer to?

Array variables

We can use arrays in awk,

awk 'BEGIN{t[1,1]=1; t[1,2]=2; i=1; print t[1,2], t[i,i], t[i,1]}'

Awk arrays are in fact one-dimensional associative arrays (hash tables). The index into an array does not have to be an integer; it can be (in fact, always is) a string,

awk 'BEGIN{car["sweden"]="volvo"; car["russia"]="lada"; car["usa"]="pontiac"; \
for (i in car) {print i, ":", car[i]}}'

NB: because the program is inside single quotes, the shell keeps reading it onto the next line; the \ at the end is Awk’s own line continuation. You may also type the whole command on one row.

Let’s say you give arguments to your Awk program like you would give arguments to your shell program,

awk '{prog}' arg1 arg2 arg3 ...

How do you loop conveniently over all array elements?

Can you write an Awk program prog that prints its command line arguments?
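One possible approach (a sketch) uses Awk’s built-in ARGC and ARGV variables inside a BEGIN block; since the program has only a BEGIN rule, Awk exits without trying to read the arguments as input files.

```shell
# ARGV[0] is "awk" itself; ARGV[1]..ARGV[ARGC-1] are the arguments
awk 'BEGIN {for (i = 1; i < ARGC; i++) print ARGV[i]}' arg1 arg2 arg3
# prints: arg1, arg2, arg3, one per line
```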

Built-in functions

Awk has the usual numerical functions: int, exp, log, sin, cos, sqrt,

$ for ((x=1; x<=180; x++)); do echo $x; done > angles.dat
$ awk '{print $1, cos($1*3.1415927/180.0)}' < angles.dat | tee cosine.dat

string functions (tolower, toupper, sprintf, match, …):

$ awk '{print toupper($0)}' /etc/group

and bit manipulation functions: and, or, xor, …

$ awk 'BEGIN{printf "and(1,0)=%x or(1,0)=%x \n", and(1,0), or(1,0)}'

For more details, see, e.g., the gawk manual pages.

Control statements

The if-else statement (save into sign.awk):

{
  printf "cos(%f)=%2.2f, ", $1, $2
  if ($2 > 0) {print "positive"}
  else {print "negative"}
}

awk -f sign.awk cosine.dat

Logical operators: and &&, or ||

# write awk script sign_product.awk
BEGIN {print "enter 2 numbers separated by space (end with CTRL+D)"}
{
    if (($1 == 0) || ($2 == 0)) {
        sign="zero"
    }
    else if ( (($1 < 0) && ($2 > 0)) || (($1 > 0) && ($2 < 0)) ) {
        sign="negative"
    } else {
        sign="positive"
    }
    printf "product of %f x %f is %s\n", $1, $2, sign
}
awk -f sign_product.awk

Further resources

As always, the man pages:

man awk
info awk

Awk web manual by GNU: https://www.gnu.org/software/gawk/manual

The Internet, e.g.: https://stackoverflow.com