Awk is a programming language designed for advanced text processing. An awk program reads input, performs actions on it, and writes the result to standard output.
The GNU implementation of awk is called gawk. On systems where awk is a symlink to gawk, this is transparent to the end user: calling awk runs gawk.
mawk is a minimal implementation focused on speed and is the default awk on Debian Linux, while gawk offers more features and extensions.
nawk : During the development of the AWK language, its creators released a new version, referred to as "new awk" (nawk) to prevent any mix-ups with the original.
Awk processes input data by dividing it into records and then further splitting each record into fields.
2.1 Records: Awk treats its input as a sequence of records and handles one record at a time until it has traversed the entire input. By default, each line of the input is a separate record, with the newline character (\n) serving as the record separator. This default is stored in the built-in variable RS. RS can be changed to another character (gawk and mawk also accept a regular expression) to define a different record separator, allowing for multi-line records or records delimited by specific patterns.
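As a small illustration of multi-line records, the sketch below sets RS to the empty string, which switches awk to "paragraph mode" (records separated by blank lines). The sample data and the shell variable names input and result are made up for this example.

```shell
# Two multi-line records separated by a blank line (made-up sample data).
input='name firas
grade 85

name salah
grade 95'

# RS = "" selects paragraph mode: blank lines separate records,
# and a newline also separates fields inside each record.
result=$(printf '%s\n' "$input" | awk 'BEGIN { RS = "" } { print NR ": " $2 " -> " $4 }')
printf '%s\n' "$result"
```

Each two-line block is read as one record with four fields, so the command prints one summary line per block.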
2.2 Fields :
Each record is systematically divided into segments referred to as fields.
By default, whitespace (spaces and tabs) acts as the field separator, meaning fields are typically individual words or sequences of characters separated by whitespace. This default is stored in the built-in variable FS.
The FS variable can be changed to use a different character or regular expression as the field separator, enabling parsing of structured data like CSV files (e.g., FS=",").
Fields are accessed using dollar sign ($) followed by their numerical position, starting from 1 (e.g., $1 for the first field, $2 for the second). The entire record is represented by $0.
Example :
fdisk -l
Device     Boot  Start      End         Sectors     Size    Id  Type
/dev/sdb2        629153595  1953520064  1324366470  631,5G  83  Linux
<--$1--->  <$2>  <--$3--->  <---$4--->  <---$5--->  <-$6->  $7  $8 ($NF)    Fields
<--------------------------------$0-------------------------------->    Record
The field labels follow the header line, which has eight fields. In the data row the Boot column is empty, so awk sees only seven fields there: $2 is 629153595 and $NF is $7.
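To check how awk numbers the fields of the data row shown above, the row can be fed to awk directly. This is only a quick sketch: the shell variables row and result are names chosen for the example, and NF (the number of fields in the current record) is a built-in variable.

```shell
# The data row from the fdisk listing above (its Boot column is empty).
row='/dev/sdb2 629153595 1953520064 1324366470 631,5G 83 Linux'

# NF is the number of fields, so $NF is always the last field.
result=$(printf '%s\n' "$row" | awk '{ print NF; print $1; print $5; print $NF }')
printf '%s\n' "$result"
```

Because fields are split on runs of whitespace, the empty column does not count as a field and the row has seven fields.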
To process text with awk, we write a program that tells the command what to do. The program consists of a series of rules and, optionally, user-defined functions.
Each rule consists of a pattern paired with an action. Rules are separated by newline characters or semicolons (;).
The basic syntax of an awk command is :
awk [options] 'pattern {action}' inputfile
A pattern is a condition that must be met for the associated action to be executed on a given record, i.e., if the pattern matches the record, awk performs the specified action on that record.
An awk action is enclosed in braces ({}) and is made up of one or more statements. Each statement indicates an operation to carry out.
An action can include multiple statements, which are separated by newlines or semicolons (;).
Print Statement:
print: Prints the entire current record ($0) by default, or specified fields/variables. Example: awk '{print $1, $3}' filename (prints the first and third fields of each line).
Control Flow Statements:
if (condition) {action}: Conditional execution.
for (init; condition; increment) {action}: Looping with initialization, condition, and increment.
while (condition) {action}: Looping while a condition is true.
next: Skips the rest of the current record's processing and moves to the next record.
exit: Terminates the AWK program.
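The statements listed above can be combined in one small program. The following is a minimal sketch on made-up input (one value per line); the variable names input, result and stars are chosen for the example only.

```shell
# Made-up input: numbers mixed with two marker words.
input='3
skip
2
stop
5'

result=$(printf '%s\n' "$input" | awk '
$1 == "skip" { next }    # next: abandon this record, read the following one
$1 == "stop" { exit }    # exit: stop reading input altogether
{
    stars = ""
    for (i = 1; i <= $1; i++)    # for: append one "*" per unit of $1
        stars = stars "*"
    if (length(stars) > 2)       # if/else: label the line by its length
        print stars, "long"
    else
        print stars, "short"
}')
printf '%s\n' "$result"
```

The record "skip" produces no output because of next, and the final 5 is never processed because exit ends the program when "stop" is read.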
Common types of patterns include:
Regular expressions: Enclosed in slashes (e.g., /pattern/), they match records containing the specified text.
Relational expressions: Comparisons involving field values (e.g., $3 > 10).
Range patterns: Two patterns separated by a comma (e.g., /start_pattern/, /end_pattern/), which match records from the one matching the first pattern up to and including the one matching the second.
Special patterns: BEGIN (executed before any input is processed) and END (executed after all input is processed)
Patterns can be combined using logical operators (&& for AND, || for OR, ! for NOT).
The following examples use two files: a CSV file called bank.csv and a text file called students.txt
bank.csv :
operation,date,amount,currency,balance
withdrawal,25-08-10,150,USD,2600
withdrawal,15-08-07,100,TND,2630
purchase,25-08-07,60,TND,2650
payment,25-08-02,900,TND,2950
students.txt :
ID name forename grade result
120 weslati firas 85 success
140 dridi mohamed 80 success
145 yacoubi salah 95 success
147 benrejeb wissal 90 success
148 mekki ryan 75 success
152 mbanebe philippe 80 success
153 sako aminata 70 fail
awk '{ print $0 }' students.txt
The same command works for bank.csv; its output is equivalent to : cat students.txt
To prefix each line with its line number, use the NR built-in variable:
awk '{ print NR, $0 }' students.txt
4.2 Print a specific field :
awk '{ print $2 }' students.txt
This prints the second field of every record in students.txt. For the CSV file, the command needs the -F option: whitespace is the default field separator, and -F"," sets the FS built-in variable to "," :
awk -F"," '{print $2}' bank.csv
To print the first and the third column :
awk -F"," '{print $1, $3}' bank.csv
4.3 Using regex patterns:
The syntax will look like :
awk '/regex pattern/{action}' students.txt
For example, let's print the first field of each record that contains "USD":
awk -F"," '/USD/ { print $1 }' bank.csv
Print the first field if the record starts with 12 :
awk '/^12/ { print $1 }' students.txt
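Print the first field of all records whose third field is greater than 100. The NR > 1 test skips the header line, whose third field ("amount") is not a number and would otherwise be compared as a string and match:
awk -F"," 'NR > 1 && $3 > 100 { print $1 }' bank.csv
Print the second and third fields of all records whose fourth field is greater than or equal to 75, again skipping the header:
awk 'NR > 1 && $4 >= 75 { print $2, $3 }' students.txt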
4.4 Range patterns :
A range pattern is made of two patterns separated by a comma, in the form ‘begpat, endpat’. The first pattern, begpat, controls where the range begins, while endpat controls where the range ends.
awk '/dridi/,/mekki/ { print $1 }' students.txt
The command above prints the first field of every record from the one containing "dridi" through the one containing "mekki", inclusive.
4.5 Using logical operators :
Awk provides three logical (or Boolean) operators for combining patterns and expressions :
&& (AND): This operator returns true only if both expressions on either side of the && are true. If the first expression is false, the second expression is not evaluated (short-circuiting).
|| (OR): This operator returns true if at least one of the expressions on either side of the || is true. If the first expression is true, the second expression is not evaluated (short-circuiting).
! (NOT): This operator negates the truth value of the expression that follows it. If the expression is true, ! makes it false, and vice-versa.
awk '$1 >=140 && $1 < 150 && $4 >=75 { print $2, $3 }' students.txt
The above command prints the name and forename of students whose ID is between 140 and 149 and whose grade is 75 or higher.
awk 'NR > 1 && ($4 >= 75 || $5 == "success") { print $2, $3 }' students.txt
The above command prints the name and forename of students whose grade is 75 or higher or whose result is success. The comparison operator is == (a single = would assign "success" to $5), and NR > 1 skips the header line.
awk '$1 >= 140 && $1 < 150 ' students.txt
The above command has no action, so it prints every whole record whose first field is at least 140 and less than 150.
awk ' !($5 == "fail") { print $0 } ' students.txt
The above command will print the whole record for students whose result is NOT fail.
4.6 Special patterns :
Awk provides two special patterns, BEGIN and END, which are used to execute actions at specific phases of the Awk script's execution:
The action associated with the BEGIN pattern is executed before any input records are read or processed.
awk '
BEGIN {
print "--- Start of Report ---"
FS = "," # Set field separator to comma
}
# The action must start on the same line as its pattern
$1 == "purchase" || $1 == "withdrawal" { print $1, $3 }
' bank.csv
The action associated with the END pattern is executed after all input records have been read and processed.
awk '
$1 == "purchase" || $1 == "withdrawal" { expenses += $3 }
END {
print "Total :", expenses
print "--- End of Report ---"
}
' bank.csv
The same example where BEGIN and END are combined in an awk script called expense.awk :
BEGIN {
FS = "," # Set field separator to comma
expenses = 0
print "---expenses Report ---"
print "operation\tamount"
print "--------------------------"
}
$1 == "purchase" || $1 == "withdrawal" {
amount = $3 + 0 # Ensure numeric addition (force conversion)
expenses += amount
print $1 "\t" amount
}
END {
print "--------------------------"
print "Total : \t" expenses
print "--- End of Report ---"
}
Then to run this script :
awk -f expense.awk bank.csv
4.7 Assign Variable :
The -v option allows for the assignment of a value to a variable before the awk program begins execution. The assigned variable is accessible throughout the entire awk script (in BEGIN, pattern/action blocks, and END).
It provides a convenient way to pass values from the shell environment or other sources into an awk script. Each -v option sets a single variable, but the option can be repeated to assign several variables: ' awk -v x=1 -v y=2 … '.
awk -F',' -v var="Amount:" '{print var, $3}' bank.csv
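A common use of -v is to hand a shell variable to awk. The sketch below assumes a shell variable named threshold (a hypothetical name) and a few inline rows in the students.txt layout, so that it runs on its own.

```shell
# A few rows in the same layout as students.txt (inline so the example runs alone).
students='ID name forename grade result
120 weslati firas 85 success
148 mekki ryan 75 success
145 yacoubi salah 95 success'

threshold=80    # hypothetical shell variable holding the cut-off grade

# -v copies the shell value into the awk variable "min" before BEGIN runs.
result=$(printf '%s\n' "$students" | awk -v min="$threshold" 'NR > 1 && $4 >= min { print $2, $4 }')
printf '%s\n' "$result"
```

Quoting "$threshold" on the shell side keeps the value intact even if it contains spaces; inside the program, min behaves like any other awk variable.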
4.8 Arithmetic operations :
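Sum :
Sum all the values of the third field (the header's "amount" counts as 0):
awk -F',' '{ sum += $3 } END { print sum }' bank.csv
Average :
Calculate the average of the numeric values of the fourth field. NR > 1 skips the header so that it is not counted in n:
awk 'NR > 1 { sum += $4; n++ } END { if (n > 0) print sum / n }' students.txt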
Maximum value :
Find the maximum of the numeric values of the fourth field :
awk 'NR == 2 || ($4+0) > max { max = $4+0 } END { print max }' students.txt
Every record is processed, including the header, whose fourth field is a string rather than a number. The program therefore forces numeric conversion and starts from the first data record :
NR == 2 → initializes max with the first real numeric value (record 2).
$4+0 → forces $4 to be a number (important when the field is a string).
END { print max } → displays max.
Another option uses a regex to keep only the records whose fourth field is numeric :
awk '$4 ~ /^[0-9]+$/ { if (NR==2 || $4 > max) max=$4 } END { print max }' students.txt
Minimum value :
Find the minimum of the numeric values of the fourth field :
awk 'NR == 2 || ($4+0) < min { min = $4+0 } END { print min }' students.txt
Conditional average
Calculate the average of the records where the grade is > 75 :
awk 'NR > 1 && ($4 + 0) > 75 { total += ($4 + 0); n++ } END { if (n > 0) printf "%.2f\n", total / n; else print "no value > 75" }' students.txt
NR > 1 : ignores the header.
($4 + 0) > 75 : forces conversion to a number before comparing.
total += ($4 + 0) : numerical addition.
printf "%.2f" : prints the average with two decimal places.
We have used the print statement since the beginning of this article, but the average calculation in the last example needed more control over the output format, which led us to printf. Let's look at the syntax and uses of both statements.
Automatic Newline: print automatically appends the value of ORS (Output Record Separator) to its output; ORS is a newline unless it has been changed.
Output Field Separator (OFS): When printing multiple arguments separated by commas, print uses the OFS variable (defaulting to a space) to separate them.
Default Behavior: print by itself (without arguments) prints the entire current input record.
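The effect of OFS and ORS on print can be seen by changing both in a BEGIN block. This is a small sketch on made-up input; the separator values ";" and " |" are arbitrary choices for the illustration.

```shell
input='120 weslati firas
140 dridi mohamed'

# OFS replaces the space that print puts between comma-separated items;
# ORS replaces the newline that print puts after each record.
result=$(printf '%s\n' "$input" | awk 'BEGIN { OFS = ";"; ORS = " |\n" } { print $1, $2 }')
printf '%s\n' "$result"
```

Each output line now has the form 120;weslati | instead of the default 120 weslati.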
When printing several items, it is necessary to separate them using commas. For instance:
awk '{ print $1, $4, $5 }' students.txt
If we omit commas, the items will be concatenated with no spaces :
awk '{ print $1 $4 $5 }' students.txt
IDgraderesult
12085success
14080success
14595success
14790success
14875success
15280success
15370fail
To print the label "names : " before the second field, enclose it in double quotes:
awk '{ print "names : " $2 }' students.txt
To print line numbers with ordinal suffixes, we create a num.awk script, which is called like this :
awk -f num.awk students.txt
The num.awk script :
{
# default suffix
suffix = "th"
# exceptions for 1st, 2nd and 3rd
if (NR % 10 == 1 && NR % 100 != 11) suffix = "st"
else if (NR % 10 == 2 && NR % 100 != 12) suffix = "nd"
else if (NR % 10 == 3 && NR % 100 != 13) suffix = "rd"
# print with suffix and complete record
print NR suffix " line : " $0
}
Format String: printf requires a format string as its first argument, which specifies how the subsequent arguments are formatted (e.g., width, precision, data type).
No Automatic Newline: printf does not automatically add a newline. If a newline is desired, it must be explicitly included in the format string using \n.
No OFS/ORS Influence: The OFS and ORS variables do not affect printf's output.
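The difference between the two statements shows up when the same fields are printed both ways with OFS changed. The sketch below uses made-up input; with_print and with_printf are names chosen for the example.

```shell
input='weslati 85
dridi 80'

# print: items are joined by OFS and ORS is appended automatically.
with_print=$(printf '%s\n' "$input" | awk 'BEGIN { OFS = "=" } { print $1, $2 }')

# printf: OFS is ignored; spacing and the newline come only from the format string.
with_printf=$(printf '%s\n' "$input" | awk 'BEGIN { OFS = "=" } { printf "%-8s|%3d\n", $1, $2 }')

printf '%s\n' "$with_print" "$with_printf"
```

The print version yields weslati=85, while the printf version pads the name to 8 characters and the grade to 3, giving aligned columns.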
Example : Insert a number for each record :
awk '{ printf "%2d. %s\n", NR, $0 }' students.txt
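This command gives almost the same result as awk '{ print NR, $0 }' students.txt (see 4.1), except that the number is padded and followed by a dot.
%2d prints the record number right-aligned in a field 2 characters wide, which is enough because the file has fewer than 100 lines.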
awk 'BEGIN { printf "ID Name\tforeName\tgrade\tresult\n" }'
ID Name foreName grade result
awk 'BEGIN { printf "ID Name\vforeName\vgrade\vresult\n" }'
ID Name
foreName
grade
result
awk 'BEGIN { printf "ID\nName\nForename\ngrade\nresult\n" }'
ID
Name
Forename
grade
result
Prints a backspace after each of the first four fields. On a terminal, the backspace moves the cursor back one position, so the digit that ends each of those fields is overwritten by the next field:
awk 'BEGIN { printf "ID 1\bName 2\bForename 3\bgrade 4\bresult 5\n" }'
ID Name Forename grade result 5
awk 'BEGIN { printf "Backslash: \\\\ -> Hello\\World\n" }'
Backslash: \\ -> Hello\World
awk 'BEGIN { printf "Double quote: \\\" -> \"Hello World\"\n" }'
Double quote: \" -> "Hello World"
%c
Prints a single character. If the argument used for %c is numeric, it is treated as a character and printed.
awk 'BEGIN { printf "ASCII value 90 = character %c\n", 90 }'
%d and %i
Prints the integer part of a decimal number
awk 'BEGIN { printf "intg = %d\n", 75.22 }'
%e and %E
Prints a floating point number of the form [-]d.dddddde[+-]dd for %e and [-]d.ddddddE[+-]dd for %E
awk 'BEGIN { printf "intg = %e\n", 75.22 }'
%f
Prints a floating point number of the form [-]ddd.dddddd.
awk 'BEGIN { printf "intg = %f\n", 75.22 }'
%o
Prints an unsigned octal number
awk 'BEGIN { printf "Octal value of 10 decimal is = %o\n", 10 }'
%s
Prints a character string.
awk 'BEGIN { printf "value = %s\n", "invalid" }'
%x and %X
Prints an unsigned hexadecimal number. The %X format uses uppercase letters.
awk 'BEGIN { printf "Hexadecimal value of 13 decimal is = %x\n", 13 }'
Hexadecimal value of 13 decimal is = d
The same command with %X :
awk 'BEGIN { printf "Hexadecimal value of 13 decimal is = %X\n", 13 }'
Hexadecimal value of 13 decimal is = D
%%
Special escape sequence for a literal percent sign (%).
awk 'BEGIN { printf "intg = %d%%\n", 75.22 }'
intg = 75%
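The format specifiers described above can be tried side by side in a single BEGIN block. This is just a recap sketch; the values 90, 75.22, 10, 255 and 50 are arbitrary.

```shell
result=$(awk 'BEGIN {
    printf "%c\n", 90            # character with code 90
    printf "%d\n", 75.22         # integer part only
    printf "%e\n", 75.22         # scientific notation
    printf "%f\n", 75.22         # fixed notation, six decimals by default
    printf "%o\n", 10            # octal
    printf "%x %X\n", 255, 255   # hexadecimal, lower and upper case
    printf "%d%%\n", 50          # literal percent sign
}')
printf '%s\n' "$result"
```

On a typical Linux system this prints Z, 75, 7.522000e+01, 75.220000, 12, ff FF and 50%, one per line.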
Left and right justification :
awk 'BEGIN { num = 10; printf "Left : |%-5d|\n", num; printf "Right: |%5d|\n", num }' | cat -ve
Left : |10   |$
Right: |   10|$
%-5d → a format specifier in which:
5 = minimum field width of 5 characters.
- = left alignment, so 10 is followed by 3 spaces.
%5d → 10 is right-aligned, preceded by 3 spaces.
So, %-5d prints the integer (10) left-aligned in a field 5 characters wide. The output is then piped to cat -ve so that the padding and the end of each line become visible.
cat options:
-v → show non-printing characters (like control characters) visibly.
-e → show end of line (newline) as a literal $.
So cat -ve makes it clear where trailing spaces and line ends are. Tabs are not marked by -v; cat -t shows them as ^I.
Awk provides a set of built-in variables that offer information about the input data and control how Awk processes it. These variables can be used in patterns, actions, and BEGIN/END blocks.
Commonly Used Built-in Variables:
6.1 Record and Field Counters:
The following command reports the number of lines in students.txt by printing a summary line once all input has been processed.
awk 'END { print "File", FILENAME, "contains", NR, "lines." }' students.txt
In the following example when reading students.txt, both NR and FNR start at 1 and increment together.
When awk switches to bank.csv:
NR continues (doesn’t reset, it just keeps counting). FNR resets to 1 for the new file.
awk '{ print "NR=" NR, "FNR=" FNR, "File=" FILENAME, "Line=" $0 }' students.txt bank.csv
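The NR/FNR behaviour described above can be reproduced with two throwaway files. The sketch below creates them in a temporary directory; the file names first.txt and second.txt and their contents are made up for the example.

```shell
# Two throwaway files in a temporary directory (names are arbitrary).
dir=$(mktemp -d)
printf 'a\nb\n' > "$dir/first.txt"
printf 'c\nd\n' > "$dir/second.txt"

# FNR == 1 is true on the first line of every file; NR == FNR holds only in the first file.
result=$(cd "$dir" && awk '
FNR == 1  { print "file:", FILENAME }
          { print NR, FNR, $0, (NR == FNR ? "first file" : "later file") }
' first.txt second.txt)
printf '%s\n' "$result"
rm -r "$dir"
```

The test NR == FNR is a common way to tell whether awk is still reading the first of several input files.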
awk '{ print "Line", NR, "has", NF, "fields" }' students.txt
If we'd like to print only records that contain more than three fields :
awk 'NF > 3' students.txt
The following command prints only the last field of each line :
awk '{ print $NF }' students.txt
6.2 Separators:
awk 'BEGIN {print "Input Field Separator = " FS}' | cat -ve
awk 'BEGIN {print "Input Record Separator = " RS}' | cat -ve
awk 'BEGIN {print "Output Field Separator = " OFS}' | cat -ve
awk 'BEGIN {print "Output Record Separator = " ORS}' | cat -ve
6.3 File Information:
awk 'END {print FILENAME}' students.txt
6.4 Command Line Arguments:
awk 'BEGIN {print "Number of arguments =", ARGC}' One Two Three
awk 'BEGIN { for (i=0; i < ARGC; i++) { printf "ARGV[%d] = %s\n",i, ARGV[i] }}' students.txt var=value bank.csv
ARGV[0] = awk
ARGV[1] = students.txt
ARGV[2] = var=value
ARGV[3] = bank.csv
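In the listing above, var=value is not a file name: awk treats an operand of that form as a variable assignment, carried out when it is reached in the argument list. The sketch below illustrates this with two throwaway files; the variable name tag and the file names are hypothetical.

```shell
dir=$(mktemp -d)
printf 'x\n' > "$dir/one.txt"
printf 'y\n' > "$dir/two.txt"

# Each tag=... operand is applied when awk reaches it, just before the next file is opened.
result=$(cd "$dir" && awk '{ print tag, $0 }' tag=A one.txt tag=B two.txt)
printf '%s\n' "$result"
rm -r "$dir"
```

Unlike -v, an assignment given this way is not yet in effect inside the BEGIN block.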
Sources : Linuxize, Stephane Robert, Tutorialspoint