Don't Forget the Pipe
When we need to mangle or massage some data on a Linux/UNIX server or desktop, we usually reach for applications like Google Sheets or other fully fledged data-processing software, forgetting about pipes.
A basic description of the pipe (|) is that it allows the output of one command to be the input of another. This lets you filter data in steps as you chain commands.
Basic commands like head, which come with every Linux/UNIX installation, can take you a long way. Reading the man pages of these commands will give you creative ideas for how to use them. A few quick starters are:
grep – file pattern searcher
-c, --count : Only a count of selected lines is written to standard output.
-i, --ignore-case : Perform case insensitive matching.
-v, --invert-match : Selected lines are those not matching any of the specified patterns.

cut – cut out selected portions of each line of a file
-d delim : Use delim as the field delimiter character instead of the tab character.
-f list : The list specifies fields, separated in the input by the field delimiter character.

sort – sort or merge records (lines) of text and binary files
-f, --ignore-case : Case-independent sorting.
-r, --reverse : Sort in reverse order.
-b, --ignore-leading-blanks : Ignore leading blank characters when comparing lines.
-h, --human-numeric-sort : Sort by numerical value, but take into account the SI suffix, if present.

uniq – report or filter out repeated lines in a file
-c : Precede each output line with the count of the number of times the line occurred in the input.
-i : Case insensitive comparison of lines.

head – display first lines of a file
-n count : Display the first count lines.
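To try a few of these flags together before touching real data, here is a tiny stand-alone demonstration (the file name and its contents are invented for illustration):

```shell
# Make a small throwaway file (contents invented for this demo)
printf 'Error: timeout\nerror: refused\nOK: connected\n' > /tmp/demo.txt

# -i ignores case, -c counts matching lines, -v inverts the match
grep -ci 'error' /tmp/demo.txt    # case-insensitive count of "error" lines
grep -vc 'error' /tmp/demo.txt    # count of lines NOT matching lowercase "error"
```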
Piping is particularly simple when the data that needs massaging has some repeated or unique patterns. A CSV file is a good example. Let's use this CSV file (generic_call_data.csv), containing 350 rows of generic failed-call data. The columns are: the date and time of the call (datetime), the SIP Call-ID (callid), the SIP To header field of the call (to) and the response code of the failed call (res_code). An example row is:
2019-12-17 00:03:38,0HODFHNEIG26699@192.168.86.100,+9994234818 <sip:+firstname.lastname@example.org>;tag=gh567hkn,404
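A quick way to confirm which field number holds what is to feed that one row through cut by hand (the row below is copied from the sample above):

```shell
row='2019-12-17 00:03:38,0HODFHNEIG26699@192.168.86.100,+9994234818 <sip:+firstname.lastname@example.org>;tag=gh567hkn,404'

printf '%s\n' "$row" | cut -d ',' -f 4                    # res_code field: 404
printf '%s\n' "$row" | cut -d ',' -f 3 | cut -d ' ' -f 1  # just the number: +9994234818
```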
Using the sample data, if we needed to count how many times calls to the number +9994234804 ended with an error, using grep and the -c flag we could just do …
grep -c 9994234804 generic_call_data.csv
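One subtlety worth remembering: -c counts matching lines, not total occurrences, which is exactly what we want here since each call is one row. A quick made-up illustration:

```shell
# grep -c counts lines that match, not matches within a line
printf 'x x\ny\n' | grep -c 'x'   # one matching line, even though "x" appears twice
```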
no pipe needed. How about if we needed to find out all the different error codes? For that we would have to focus on the last (4th) column. Here cut comes in handy. To get the 4th column (field) we set the -f flag to 4; knowing that we are using a CSV file and that the column separator is ',', we set that as our delimiter:
cut -d ',' -f 4 generic_call_data.csv
We have now extracted the 4th column, but we still have 350 rows. Our next step is to sort the result (pun intended), and here we will need our first pipe …
cut -d ',' -f 4 generic_call_data.csv | sort
Now we have a sorted list of the 4th column. What is left is just to get the unique (pun intended) values …
cut -d ',' -f 4 generic_call_data.csv | sort | uniq
404
408
481
486
500
res_code
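As an aside, when you only need the distinct values (no counts), sort -u collapses duplicates in one step, so the pipeline above could end in plain sort -u instead of sort | uniq. A stand-alone sketch with made-up codes:

```shell
# sort -u is equivalent to "sort | uniq" when no counts are needed
printf '404\n500\n404\n486\n' | sort -u
```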
If we want to know the count of each error, we just add the -c flag to uniq …
cut -d ',' -f 4 generic_call_data.csv | sort | uniq -c
309 404
  5 408
  1 481
 25 486
  9 500
  1 res_code
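The stray res_code line is just the CSV header travelling through the pipeline. Assuming the header is the first row of the file, tail -n +2 drops it before cut ever sees it; a sketch on a made-up three-row stand-in for the real file:

```shell
# Invented mini-CSV standing in for generic_call_data.csv
printf 'datetime,callid,to,res_code\nd1,c1,t1,404\nd2,c2,t2,404\nd3,c3,t3,500\n' > /tmp/mini.csv

# tail -n +2 prints from line 2 onwards, skipping the header row
tail -n +2 /tmp/mini.csv | cut -d ',' -f 4 | sort | uniq -c
```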
Now using the first example, let’s find the different error codes and their count for calls to +9994234804 …
grep 9994234804 generic_call_data.csv | cut -d ',' -f 4 | sort | uniq -c
6 404
1 408
1 500
Finally, let's find the top 10 numbers that had the most errors. Looking at the data file, the numbers are in the 3rd column, so cut can be used to extract it. Looking at an example data field, we see that there is a space between the number and the SIP URI of the number, so cut can be used again (cut -d ' ' -f 1). After that, all that would be left would be sorting the numbers and then finding and counting the unique values …
cut -d ',' -f 3 generic_call_data.csv | cut -d ' ' -f 1 | sort | uniq -c
17 +9993234050
 6 +9993234100
 2 +9993234101
 1 +9993234104
27 +9993234109
 1 +9993234116
...
We now need to sort the number counts, but from the biggest to the smallest (sort -r), and then select the top 10 (head) …
cut -d ',' -f 3 generic_call_data.csv | cut -d ' ' -f 1 | sort | uniq -c | sort -r | head
34 +9998234240
27 +9993234109
17 +9993234050
16 +9993234163
11 +9993234159
 9 +9995234886
 9 +9994234834
 8 +9998234245
 8 +9994234804
 7 +9994234802
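One caveat: plain sort -r compares lines as text. It behaves here because uniq -c left-pads its counts to a fixed width, but on unpadded numbers text order and numeric order diverge, so sort -rn (reverse numeric) is the safer habit. A made-up illustration:

```shell
# Text sort: "9" sorts after "100" because lines are compared character by character
printf '9 a\n27 b\n100 c\n' | sort -r
# Numeric sort: 100 correctly comes first
printf '9 a\n27 b\n100 c\n' | sort -rn
```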
So the next time you need to quickly analyse a CSV, log or any file with some (preferably repeated) patterns or words, I challenge you to resist the temptation to use a spreadsheet application and instead fire up the terminal and get piping 😎