awk

Linux

1. Awk, gawk, mawk, nawk

Awk is a kind of programming language designed for advanced text processing. Awk needs an input, performs some actions and delivers the result to standard output.

The GNU implementation of awk is called gawk. On many Linux distributions the awk command is simply a symlink to gawk, so calling awk runs gawk transparently for the end user.

mawk is a minimal implementation that focuses on speed and is currently the default awk on Debian, while gawk offers more features and extensions.

nawk : While developing the AWK language, its creators released a new version, called "new awk" (nawk) so that it would not be confused with the original one.

2. Records and fields

Awk processes input data by dividing it into records and then further splitting each record into fields.

2.1 Records: Awk treats its input as a sequence of records and handles one record at a time until it has traversed the entire input. By default, each line of the input is a separate record, with the newline character (\n) serving as the record separator. This separator is stored in the built-in variable RS. RS can be set to another character to define a different record separator; gawk and mawk also accept a regular expression, and an empty RS makes awk treat paragraphs separated by blank lines as records, which allows multi-line records.
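As a small sketch of a non-default record separator, the command below reads two "paragraphs" separated by a blank line. The input text is made up for the illustration; setting RS to an empty string and FS to a newline turns each paragraph into one record and each of its lines into one field.

```shell
# RS = "" : records are separated by blank lines (paragraph mode)
# FS = "\n": each line of a paragraph becomes one field
records_out=$(printf 'name: firas\ngrade: 85\n\nname: salah\ngrade: 95\n' |
  awk 'BEGIN { RS = ""; FS = "\n" } { print NR ": " $1 " / " $2 }')
echo "$records_out"
```

Each output line starts with the record number (NR), which shows that awk counted two records and not four lines.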

2.2 Fields : Each record is divided into segments called fields.
By default, whitespace (spaces and tabs) acts as the field separator, so fields are typically individual words or sequences of characters separated by whitespace. This default is stored in the built-in variable FS. FS can be changed to a different character or regular expression, which enables parsing of structured data such as CSV files (e.g., FS=","). Fields are accessed using a dollar sign ($) followed by their numerical position, starting from 1 (e.g., $1 for the first field, $2 for the second). The last field is $NF and the entire record is $0. Example :

fdisk -l
Device     Boot  Start      End         Sectors     Size    Id  Type
/dev/sdb2        629153595  1953520064  1324366470  631,5G  83  Linux
<--$1--->        <--$2--->  <---$3--->  <---$4--->  <-$5->  $6  $7 ($NF)   Fields
<--------------------------------$0--------------------------------->   Record

The Boot column is empty for this partition and awk splits on runs of whitespace, so this record has seven fields and $NF is $7.
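To check how awk numbers fields, here is a minimal sketch using one line taken from the students.txt and bank.csv samples used later in this article. The variable names record, fields_out and csv_out are only there for the illustration.

```shell
record='120 weslati firas 85 success'

# Default FS: fields are separated by runs of whitespace
fields_out=$(echo "$record" | awk '{ print NF; print $1; print $NF; print $0 }')

# FS set to a comma with -F: the same idea on a CSV line
csv_out=$(echo 'withdrawal,25-08-10,150,USD,2600' | awk -F',' '{ print $3, $NF }')

echo "$fields_out"
echo "$csv_out"
```

The first command prints the number of fields (5), the first field, the last field and the whole record; the second one prints the third and last comma-separated fields.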

3. Awk program : patterns and actions

To process text with awk, we write a program that tells the command what to do. This program consists of a series of rules and, optionally, user-defined functions.

Each rule consists of a pattern paired with an action. Rules are separated by newlines or semicolons (;).

The basic syntax of an awk command is :

awk [options] 'pattern {action}' inputfile

A pattern is a condition that must be met for the associated action to be executed on a given record: if the pattern matches the record, awk performs the specified action on that record.

An awk action is encapsulated in braces ({}) and is made up of various statements. Each statement indicates the operation that should be carried out.

An action can include multiple statements, which are separated by newlines or semicolons (;)

  1. Print Statement:

    print: Prints the entire current record ($0) by default, or specified fields/variables. Example: awk '{print $1, $3}' filename (prints the first and third fields of each line).

  2. Control Flow Statements:

    if (condition) {action}: Conditional execution.
    for (init; condition; increment) {action}: Looping with initialization, condition, and increment.
    while (condition) {action}: Looping while a condition is true.
    next: Skips the rest of the current record's processing and moves to the next record.
    exit: Terminates the AWK program.
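The statements listed above can be combined inside a single action. The following sketch runs on three made-up input lines: it skips the line starting with # using next, adds up the fields of the other lines with a for loop, and classifies the sum with if/else. The threshold of 20 is arbitrary.

```shell
flow_out=$(printf '3 4 5\n# skipped\n10 20\n' | awk '
/^#/ { next }                           # comment line: go to the next record
{
    sum = 0
    for (i = 1; i <= NF; i++)           # add up every field of the record
        sum += $i
    if (sum > 20)
        print "line " NR ": big (" sum ")"
    else
        print "line " NR ": small (" sum ")"
}')
echo "$flow_out"
```

Note that NR still counts the skipped line, so the second line of output is labelled "line 3".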

Common types of patterns include:

Regular expressions: Enclosed in slashes (e.g., /pattern/), they match records that contain text matching the expression.

Relational expressions: Comparisons involving field values (e.g., $3 > 10).

Range patterns: Two patterns separated by a comma (e.g., /start_pattern/, /end_pattern/), which match records from the one matching the first pattern up to and including the one matching the second.

Special patterns: BEGIN (executed before any input is processed) and END (executed after all input is processed)

Patterns can be combined using logical operators (&& for AND, || for OR, ! for NOT).

4. Simple examples

The following examples use two files: a CSV file called bank.csv and a text file called students.txt.

bank.csv :

operation,date,amount,currency,balance
withdrawal,25-08-10,150,USD,2600
withdrawal,15-08-07,100,TND,2630
purchase,25-08-07,60,TND,2650
payment,25-08-02,900,TND,2950

students.txt :

ID name forename grade result
120 weslati firas 85 success
140 dridi mohamed 80 success
145 yacoubi salah 95 success
147 benrejeb wissal 90 success
148 mekki ryan 75 success
152 mbanebe philippe 80 success
153 sako aminata 70 fail

4.1 Print the whole file :

awk '{ print $0 }' students.txt

The same command works for bank.csv and is equivalent to cat students.txt.

To prefix each line with its line number, use the NR built-in variable:

awk '{ print NR, $0 }' students.txt

4.2 Print a specific field :

awk '{ print $2 }' students.txt

This prints the second field of every record in students.txt. For the CSV file, the command needs the -F option because whitespace is the default field separator; -F"," sets the FS built-in variable to "," :

awk -F"," '{print $2}' bank.csv

To print the first and the third column :

awk -F"," '{print $1, $3}' bank.csv

4.3 Using regex patterns:

The syntax will look like :

awk '/regex pattern/{action}' students.txt

For example, let's print the first field of each record that contains "USD" :

awk -F","  '/USD/ { print $1 }' bank.csv

Print the first field if the record starts with 12 :

awk '/^12/ { print $1 }' students.txt

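Print the first field of all records whose third field is greater than 100. Adding 0 to $3 forces a numeric comparison; without it, the header line would also match, because its third field ("amount") is compared with 100 as a string :

awk -F"," '$3+0 > 100 { print $1 }' bank.csv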

Print the second and third fields of all records whose fourth field is greater than or equal to 75 (again, $4+0 forces a numeric comparison so that the header line is left out) :

awk '$4+0 >= 75 { print $2, $3 }' students.txt

4.4 Range patterns :

A range pattern is made of two patterns separated by a comma, in the form 'begpat, endpat'. The first pattern, begpat, controls where the range begins, while endpat controls where the range ends.

awk '/dridi/,/mekki/ { print $1 }' students.txt

The command above prints the first field of all records, starting from the record containing "dridi" up to and including the record containing "mekki".
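To see exactly which records a range selects, here is a self-contained sketch that feeds a shortened copy of the student list (ID and name only) to the same range pattern.

```shell
range_out=$(printf '120 weslati\n140 dridi\n145 yacoubi\n148 mekki\n152 mbanebe\n' |
  awk '/dridi/,/mekki/ { print $1 }')
echo "$range_out"
```

The records matching begpat and endpat are both part of the range, so the IDs 140, 145 and 148 are printed, while 120 and 152 are not.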

4.5 Using logical operators :

Awk provides three logical (or Boolean) operators for combining patterns and expressions :

  • && (AND): This operator returns true only if both expressions on either side of the && are true. If the first expression is false, the second expression is not evaluated (short-circuiting).

  • || (OR): This operator returns true if at least one of the expressions on either side of the || is true. If the first expression is true, the second expression is not evaluated (short-circuiting).

  • ! (NOT): This operator negates the truth value of the expression that follows it. If the expression is true, ! makes it false, and vice-versa.

awk '$1 >=140 && $1 < 150 && $4 >=75 { print $2, $3 }' students.txt

The above command prints the name and forename of students whose ID is between 140 and 149 and whose grade is 75 or higher.

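awk '$4+0 >= 75 || $5 == "success" {print $2, $3}' students.txt

The above command prints the name and forename of students whose grade is 75 or higher or whose result is success. The comparison operator is == (a single = would be an assignment), and $4+0 forces a numeric comparison so that the header line is not matched.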

awk '$1 >= 140 && $1 < 150 ' students.txt

The above command prints every record whose first field is greater than or equal to 140 and less than 150. Since no action is given, awk prints the whole record.

awk ' !($5 == "fail") { print $0 } ' students.txt

The above command prints the whole record for every line whose result is NOT fail (the header line is printed as well, since its last field is not "fail").

4.6 Special patterns :

Awk provides two special patterns, BEGIN and END, which are used to execute actions at specific phases of the Awk script's execution:

The action associated with the BEGIN pattern is executed before any input records are read or processed.

awk '
BEGIN {
    print "--- Start of Report ---"
    FS = "," # Set field separator to comma
}

# The opening brace must be on the same line as the pattern
$1 == "purchase" || $1 == "withdrawal" { print $1, $3 }
' bank.csv

The action associated with the END pattern is executed after all input records have been read and processed.

awk '
  $1 == "purchase" || $1 == "withdrawal" { expenses += $3 }
   END {
        print "Total :", expenses
        print "--- End of Report ---"
    }
' bank.csv

The same example where BEGIN and END are combined in an awk script called expense.awk :

BEGIN {
    FS = ","  # Set field separator to comma
    expenses = 0
    print "---expenses Report ---"
    print "operation\tamount"
    print "--------------------------"
}
$1 == "purchase" || $1 == "withdrawal" {

    amount = $3 + 0               # Ensure numeric addition (force conversion)

    expenses += amount

    print $1 "\t" amount

}

END {
    print "--------------------------"
    print "Total : \t" expenses
    print "--- End of Report ---"
}

Then to run this script :

awk -f expense.awk bank.csv

4.7 Assign Variable :

The -v option allows for the assignment of a value to a variable before the awk program begins execution. The assigned variable is accessible throughout the entire awk script (in BEGIN, pattern/action blocks, and END).

It provides a convenient way to pass values from the shell environment or other sources into an awk script. Each -v option sets a single variable, but the option can be repeated to assign several variables: ' awk -v x=1 -v y=2 … '.

awk -F',' -v var="Amount:" '{print var, $3}' bank.csv
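A common use of -v is to hand a shell variable to awk. In this sketch, min_grade is a hypothetical shell variable and min is the awk variable that receives its value; the input is a shortened copy of students.txt.

```shell
min_grade=80   # hypothetical shell variable holding the threshold

v_out=$(printf 'ID name forename grade result\n120 weslati firas 85 success\n148 mekki ryan 75 success\n145 yacoubi salah 95 success\n' |
  awk -v min="$min_grade" 'NR > 1 && $4 + 0 >= min { print $2, $4 }')
echo "$v_out"
```

Passing the value with -v keeps the awk program inside single quotes, which avoids having to splice shell variables into the program text.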

4.8 Arithmetic operations :

Sum :

Sum all the values of the third field:

awk -F',' '{ sum += $3 } END { print sum }' bank.csv

Average :

Calculate the average of the numeric values of the fourth field (NR > 1 skips the header line so that it is not counted in n):

awk 'NR > 1 { aver += $4; n++ } END { print aver / n }' students.txt

Maximum value :

Find the maximum of the numeric values of the fourth field :

awk 'NR == 2 || ($4+0) > max { max = $4+0 } END { print max }' students.txt

All the records are processed, including the first one, which holds column titles instead of numbers, so we have to skip it and force numeric values :

  • NR == 2 → initialize max with the first true numeric value (record 2).

  • $4+0 → force $4 to be a number (important in case of string).

  • END { print max } → displays max.

    Another option uses a regex to keep only the numeric values of the fourth field :

awk '$4 ~ /^[0-9]+$/ { if (NR==2 || $4 > max) max=$4 } END { print max }' students.txt

Minimum value :

Find the minimum of the numeric values of the fourth field :

awk 'NR == 2 || ($4+0) < min { min = $4+0 } END { print min }' students.txt

Conditional average

Calculate the average grade of the records where the grade is > 75 :

awk 'NR > 1 && ($4 + 0) > 75 { total += ($4 + 0); n++ } END { if (n > 0) printf "%.2f\n", total / n; else print "no value > 75" }' students.txt

  • NR > 1 : ignore the header.

  • ($4 + 0) > 75 : force the conversion to a number.

  • total += ($4 + 0) : numerical addition.

  • printf "%.2f" : prints the average with two decimal places.
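The sum, minimum, maximum and average shown in this section can also be computed in a single pass over the file. The sketch below does so on a shortened copy of students.txt; the variable names g, min, max, total and n are chosen for the illustration.

```shell
stats_out=$(printf 'ID name forename grade result\n120 weslati firas 85 success\n148 mekki ryan 75 success\n153 sako aminata 70 fail\n' |
  awk 'NR > 1 {                           # skip the header line
      g = $4 + 0                          # force the grade to a number
      if (n == 0 || g < min) min = g      # first data record initializes min
      if (n == 0 || g > max) max = g      # ... and max
      total += g; n++
  }
  END {
      if (n > 0) printf "n=%d min=%d max=%d avg=%.2f\n", n, min, max, total / n
      else       print "no data"
  }')
echo "$stats_out"
```

Testing n == 0 instead of NR == 2 initializes min and max on the first data record, wherever it is in the file.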

5. Output statements : print and printf.

We have used the print statement since the beginning of this article, but the average calculation in the last example needed more control over the output format, which led us to the printf statement. Let's look at the syntax and uses of both statements.

5.1. print :
  • Automatic Newline: It automatically appends a newline character to the end of its output, unless the ORS (Output Record Separator) variable is changed.

  • Output Field Separator (OFS): When printing multiple arguments separated by commas, print uses the OFS variable (defaulting to a space) to separate them.

  • Default Behavior: print by itself (without arguments) prints the entire current input record.

When printing several items, it is necessary to separate them using commas. For instance:

awk '{ print $1, $4, $5 }' students.txt

If we omit commas, the items will be concatenated with no spaces :

awk '{ print $1 $4 $5 }' students.txt

IDgraderesult
12085success
14080success
14595success
14790success
14875success
15280success
15370fail

To print a label such as "names : " before the second field, enclose it in double quotes:

awk '{ print "names : " $2 }' students.txt
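The effect of OFS is easier to see on a small input. In this sketch OFS is set to a comma in the BEGIN block: the first print uses it between its comma-separated arguments, and the assignment $1 = $1 makes awk rebuild $0 with OFS, so the second print shows the whole record with the new separator.

```shell
ofs_out=$(echo '120 weslati firas' |
  awk 'BEGIN { OFS = "," } { print $1, $2; $1 = $1; print }')
echo "$ofs_out"
```

Without the $1 = $1 assignment, a plain print would output the record unchanged, with its original spaces.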

To number the lines with ordinal suffixes, we create a num.awk script, which is called with :

awk -f num.awk students.txt

The num.awk script :

{
    # default suffix
    suffix = "th"

    # exceptions for 1st, 2nd and 3rd
    if (NR % 10 == 1 && NR % 100 != 11) suffix = "st"
    else if (NR % 10 == 2 && NR % 100 != 12) suffix = "nd"
    else if (NR % 10 == 3 && NR % 100 != 13) suffix = "rd"

    # print with suffix and complete record
    print NR suffix " line : " $0
}

5.2 printf :
  • Format String: It requires a format string as its first argument, which specifies how subsequent arguments should be formatted (e.g., width, precision, data type).

  • No Automatic Newline: printf does not automatically add a newline. If a newline is desired, it must be explicitly included in the format string using \n.

  • No OFS/ORS Influence: The OFS and ORS variables do not affect printf's output.

Example : Insert a number for each record :

awk '{ printf "%2d. %s\n", NR, $0 }' students.txt

This command produces a result close to that of awk '{ print NR, $0 }' students.txt (see 4.1). %2d right-aligns the record number in a field two characters wide, which is enough because our file has fewer than 100 lines.
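Widths and precisions in the format string make it possible to line up columns. As a minimal sketch on two made-up name/grade pairs, %-10s left-aligns the name in 10 characters and %6.2f prints the grade in 6 characters with 2 decimal places.

```shell
table_out=$(printf 'weslati 85\ndridi 80.5\n' |
  awk '{ printf "%-10s|%6.2f|\n", $1, $2 }')
echo "$table_out"
```

The | characters are only there to make the padding visible.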

5.2.1 Escape characters (sequences) :
  • \t Horizontal tab inserts a horizontal tab (spaces to next tab stop).
awk 'BEGIN { printf "ID Name\tforeName\tgrade\tresult\n" }'
ID Name foreName        grade   result
  • \v Vertical tab inserts a vertical tab (rarely visible on modern terminals).
awk 'BEGIN { printf "ID Name\vforeName\vgrade\vresult\n" }'
ID Name
       foreName
               grade
                    result                    
  • \n New line inserts a newline.
awk 'BEGIN { printf "ID\nName\nForename\ngrade\nresult\n" }'
ID
Name
Forename
grade
result
  • \b Backspace

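In the example below, a backspace follows the digit placed after each of the first four field names. On a terminal, the backspace moves the cursor one position back, so each of these digits is overwritten by the next character and only the final 5 remains visible.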

awk 'BEGIN { printf "ID 1\bName 2\bForename 3\bgrade 4\bresult 5\n" }'

ID Name Forename grade result 5
  • \\ Backslash itself
awk 'BEGIN { printf "Backslash:     \\\\ -> Hello\\World\n" }'
Backslash:     \\ -> Hello\World
  • \" Double quote
awk 'BEGIN { printf "Double quote:  \\\" -> \"Hello World\"\n" }'
Double quote:  \" -> "Hello World"
5.2.2 Format Specifier

%c

Prints a single character. If the argument used for %c is numeric, the character with that code is printed (Z for 90 in the example below).

awk 'BEGIN { printf "ASCII value 90 = character %c\n", 90 }'

%d and %i

Prints the integer part of a decimal number

awk 'BEGIN { printf "intg = %d\n", 75.22 }'

%e and %E

Prints a floating point number of the form [-]d.dddddde[+-]dd for %e and [-]d.ddddddE[+-]dd for %E

awk 'BEGIN { printf "intg = %e\n", 75.22 }'

%f

Prints a floating point number of the form [-]ddd.dddddd.

awk 'BEGIN { printf "intg = %f\n", 75.22 }'

%o

Prints an unsigned octal number

awk 'BEGIN { printf "Octal value of 10 decimal is = %o\n", 10 }'

%s

Prints a character string.

awk 'BEGIN { printf "value = %s\n", "invalid" }'

%x and %X

Prints an unsigned hexadecimal number. The %X format uses uppercase letters.

awk 'BEGIN { printf "Hexadecimal value of 13 decimal is  = %x\n", 13 }'

Hexadecimal value of 13 decimal is  = d

The same command with %X :

awk 'BEGIN { printf "Hexadecimal value of 13 decimal is  = %X\n", 13 }'

Hexadecimal value of 13 decimal is  = D

%%

Special escape sequence for a literal percent sign (%).

awk 'BEGIN { printf "intg = %d%%\n", 75.22 }'
intg = 75%

Left and right justification :

awk 'BEGIN { num = 10; printf "Left : |%-5d|\n", num; printf "Right: |%5d|\n", num }' | cat -ve
Left : |10   |$
Right: |   10|$

%-5d and %5d are format specifiers:

5 → minimum field width of 5 characters.

- → left-justify (instead of the default right-justify).

%-5d → 10 is left-aligned, 3 spaces follow.

%5d → 10 is right-aligned, 3 spaces before.

So the first printf prints the integer (10) left-aligned in a field 5 characters wide, and the second one prints it right-aligned. The output is then piped to cat to make non-printing characters visible.

cat options:

-v → show non-printing characters (like control characters) visibly.

-e → show end of line (newline) as a literal $.

So cat -ve makes it very clear where lines end, and therefore where trailing spaces are; adding -t also shows tabs as ^I.

6. Built-in Variables

Awk provides a set of built-in variables that offer information about the input data and control how Awk processes it. These variables can be used in patterns, actions, and BEGIN/END blocks.

Commonly Used Built-in Variables:

6.1 Record and Field Counters:

  • NR: Stores the number of the current record (line), counted across all input files.

The following command reports the number of lines in students.txt. In the END block, NR holds the total number of records read, so the summary line is printed once all processing is complete.

awk 'END { print "File", FILENAME, "contains", NR, "lines." }' students.txt
  • FNR: Stores the current record number within the current input file. This is useful when processing multiple input files.

In the following example, while awk reads students.txt, both NR and FNR start at 1 and increment together.

When awk switches to bank.csv, NR keeps counting (it is not reset) while FNR is reset to 1 for the new file.

awk '{ print "NR=" NR, "FNR=" FNR, "File=" FILENAME, "Line=" $0 }' students.txt bank.csv
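Because FNR goes back to 1 at the start of each file, the pattern FNR == 1 is a simple way to detect a change of input file. The sketch below creates two small throwaway files (a.txt and b.txt, hypothetical names) in a temporary directory, prints a header line before each file and shows FNR and NR side by side.

```shell
tmp=$(mktemp -d)
printf 'x\ny\n' > "$tmp/a.txt"
printf 'z\n'    > "$tmp/b.txt"

fnr_out=$(cd "$tmp" && awk '
FNR == 1 { print "== " FILENAME " ==" }   # first record of each file
{ print FNR "/" NR ": " $0 }' a.txt b.txt)
echo "$fnr_out"

rm -r "$tmp"
```

In the output, the line read from b.txt is labelled 1/3: FNR has restarted at 1 while NR has reached 3.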
  • NF: Stores the number of fields in the current input record.
awk '{ print "Line", NR, "has", NF, "fields" }' students.txt

If we'd like to print only records that contain more than three fields :

awk 'NF > 3' students.txt
  • $NF: Represents the last field of the current record.

The following command prints only the last field of each line :

awk '{ print $NF }' students.txt

6.2 Separators:

  • FS: The input field separator. By default, it is a single space, which awk interprets as any run of whitespace (spaces and tabs). It can be changed using the -F option or within the BEGIN block.
awk 'BEGIN {print "Input Field Separator = " FS}' | cat -ve
  • RS: The input record separator. By default, it's a newline character.
awk 'BEGIN {print "Input Record Separator = " RS}' | cat -ve
  • OFS: The output field separator. It's used when printing multiple fields with a comma in the print statement. The default is a space.
awk 'BEGIN {print "Output Field Separator = " OFS}' | cat -ve
  • ORS: The output record separator. It's appended to the end of each print statement. The default is a newline.
awk 'BEGIN {print "Output Record Separator = " ORS}' | cat -ve

6.3 File Information:

  • FILENAME: Stores the name of the current input file being processed. If input comes from standard input, its value depends on the implementation (commonly an empty string or "-").
awk 'END {print FILENAME}' students.txt

6.4 Command Line Arguments:

  • ARGC: Stores the number of command-line arguments. This count includes the awk command itself, variable assignments made on the command line (e.g., var=value), and the names of any input data files. It does not include awk's own options (like -f or -F) or the program text.
awk 'BEGIN {print "Number of arguments =", ARGC}' One Two Three
  • ARGV: An array that stores the command-line arguments where ARGV[0] is typically the name of the awk command, and subsequent elements (ARGV[1], ARGV[2], etc.) contain the remaining arguments in the order they were provided. The value of ARGC will always be one greater than the highest index used in ARGV (since ARGV is zero-indexed).
awk 'BEGIN { for (i=0; i < ARGC; i++) { printf "ARGV[%d] = %s\n",i, ARGV[i] }}' students.txt var=value bank.csv

ARGV[0] = awk
ARGV[1] = students.txt
ARGV[2] = var=value
ARGV[3] = bank.csv

  1. Everything after the hash mark (#) and until the end of the line is considered to be a comment.
  2. Long lines can be broken into multiple lines using the continuation character, backslash (\).
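Both points can be tried in a short BEGIN-only program, which needs no input file. This is only a sketch; the variable name total is arbitrary.

```shell
cont_out=$(awk 'BEGIN {
    # this whole line is a comment
    total = 1 + \
            2          # the backslash joins this line to the previous one
    print total
}')
echo "$cont_out"
```

The expression split over two lines is evaluated as 1 + 2, so the program prints 3.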

Sources : Linuxize, Stephane Robert, Tutorialspoint
