March 3, 2021

Don't Forget Pipe

When we need to mangle or massage some data on a Linux/UNIX server or desktop, we often reach for applications like Google Sheets or other fully fledged data processing software, forgetting about pipes.

A basic description of the pipe ( | ) is that it allows the output of one command to be the input of another. This lets you filter data in steps as you chain commands together.
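
As a quick illustration of the idea, the two commands below are chained so that the listing of /etc (one entry per line when piped) becomes the input of grep, which counts how many entries contain "conf" …

ls /etc | grep -c conf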

Basic commands

Basic commands like grep, cut, sort, uniq and head that come with every Linux/UNIX installation can take you a long way. Reading their man pages will give you ideas for creative ways to use them. A few quick starters are:

grep – file pattern searcher

Some flags:

  -c, --count : Only a count of selected lines is written to standard output.
  -i, --ignore-case : Perform case insensitive matching.
  -v, --invert-match :  Selected lines are those not matching any of the specified patterns.
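
For example, to count how many lines of a log mention a timeout regardless of case, or to keep only the lines that are not DEBUG messages (server.log is just a hypothetical file name) …

grep -ci timeout server.log
grep -v DEBUG server.log
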
cut – cut out selected portions of each line of a file

Some flags:

  -d delim : Use delim as the field delimiter character instead of the tab character.
  -f list : The list specifies fields, separated in the input by the field delimiter character.
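
For example, /etc/passwd uses ':' as its field delimiter, so the first field (the user name) can be pulled out with …

cut -d ':' -f 1 /etc/passwd
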
sort – sort or merge records (lines) of text and binary files

Some flags:

  -f, --ignore-case : Case-independent sorting.
  -r, --reverse : Sort in reverse order.
  -b, --ignore-leading-blanks : Ignore leading blank characters when comparing lines.
  -h, --human-numeric-sort : Sort by numerical value, but take into account the SI suffix, if present.
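
For example, combined with du, the -h flag keeps directory sizes in the right order even when they mix K, M and G suffixes …

du -sh * | sort -h
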
uniq – report or filter out repeated lines in a file.

Some options:

  -c : Precede each output line with the count of the number of times the line occurred in the input.
  -i : Case insensitive comparison of lines.
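
Keep in mind that uniq only collapses adjacent duplicates, so its input is usually sorted first. For a hypothetical names.txt, this counts how many times each name appears …

sort names.txt | uniq -c
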
head – display first lines of a file

Some options:

  -n count : Displays the first count lines.
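
For example, to display only the first five lines of a hypothetical access.log …

head -n 5 access.log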

This works particularly well when the data that needs massaging has some repeated or unique patterns. A csv file is a good example. Let’s use this csv file (generic_call_data.csv) containing 350 rows of some generic failed call data. The columns are: the date and time of the call (datetime), the SIP Call-ID (callid), the SIP To header field of the call (to) and the response code of the failed call (res_code). An example row is:

2019-12-17 00:03:38,0HODFHNEIG26699@192.168.86.100,+9994234818 <sip:+9994234818@192.168.86.200>;tag=gh567hkn,404
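
Before building any pipeline, it helps to peek at the header line and the first couple of rows (using head from above) …

head -n 3 generic_call_data.csv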

Filtering

Using the sample data, if we needed to count how many times calls to the number +9994234804 ended with an error, grep and its -c flag are all we need …

grep -c 9994234804 generic_call_data.csv 

Result:

    8

no pipe needed. What if we needed to find all the different error codes? For that we would have to focus on the last (4th) column, and this is where cut comes in handy. To get the 4th column or field we set the -f flag to 4, and since we are working with a csv file where the columns are separated by ',', we set that as our delimiter (-d) …

cut -d ',' -f 4 generic_call_data.csv

We have now extracted the 4th column but we still have 350 rows. Our next step is to sort the result (pun intended) and here we will need our first pipe …

cut -d ',' -f 4 generic_call_data.csv | sort

Now we have a sorted list of the 4th column. What is left is just to get the unique (pun intended) values …

cut -d ',' -f 4 generic_call_data.csv | sort | uniq

Result:

    404
    408
    481
    486
    500
    res_code

If we want to know the count of each error, we just add the -c flag …

cut -d ',' -f 4 generic_call_data.csv | sort | uniq -c

Result:

 309 404
   5 408
   1 481
  25 486
   9 500
   1 res_code

Now using the first example, let’s find the different error codes and their count for calls to +9994234804 …

grep 9994234804 generic_call_data.csv | cut -d ',' -f 4 | sort | uniq -c

Result:

   6 404
   1 408
   1 500

Finally, let's find the top 10 numbers that had the most errors. Looking at the data file, the numbers are in the 3rd column, and cut can be used to extract it. Looking at an example data field …

+9994234818 <sip:+9994234818@192.168.86.200>;tag=gh567hkn

we see that there is a space between the number and the SIP URI of the number, so cut can be used again (cut -d ' ' -f 1). After that, all that is left is to sort the numbers and then find and count the unique values …

cut -d ',' -f 3 generic_call_data.csv | cut -d ' ' -f 1 | sort | uniq -c

Result:

  17 +9993234050
   6 +9993234100
   2 +9993234101
   1 +9993234104
  27 +9993234109
   1 +9993234116
   ...
   ...

We now need to sort the counts from the biggest to the smallest (sort -r, or sort -rn for a strictly numeric sort) and then select the top 10 (head).

cut -d ',' -f 3 generic_call_data.csv | cut -d ' ' -f 1 | sort | uniq -c | sort -r | head

Result:

  34 +9998234240
  27 +9993234109
  17 +9993234050
  16 +9993234163
  11 +9993234159
   9 +9995234886
   9 +9994234834
   8 +9998234245
   8 +9994234804
   7 +9994234802

So the next time you need to quickly analyse a csv, log or any other file with some (preferably repeated) patterns or words, I challenge you to resist the temptation to reach for a spreadsheet application and instead fire up the terminal and get piping 😎