Parallelisation of data processing pipelines in shell using ‘xargs’ (and parallel)

I like scripting in the shell a lot since it is simple, available almost everywhere, and fast and efficient. Recently, while reading this, I came across a way to make data processing several times faster using the ‘xargs’ command. xargs takes input from a stdin stream and passes it to a chosen command as command-line arguments. Combined with the -P switch, it can also run these commands simultaneously as independent processes, which makes sure that all the processors/cores in the system are used at the same time.

As an example, let's consider a slightly complex counting script I wrote for my research. It is an Rscript which reads csv data from stdin, analyses it and writes a csv result to stdout. The pipeline looks something like this,

cat input.csv | ./count > output.csv

This takes around 20 seconds to complete. Now imagine there are two files, input_1.csv and input_2.csv, in a data folder and we want to run the count script on both of them. The obvious way to do this is manually,

cat input_1.csv | ./count > output_1.csv
cat input_2.csv | ./count > output_2.csv

This takes 40 seconds. It can also be written as a for loop for scalability.

for i in $(ls data); do cat data/$i | ./count > data/${i}_out; done

Update: Feedback from reddit suggests that the ls command is not appropriate here; it is better to do for i in data/*, as shown below.
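
With the glob, the loop looks something like this (a small sketch of the suggested form; the glob already includes the data/ prefix, so no extra path handling is needed):

for i in data/*; do cat "$i" | ./count > "${i}_out"; done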

This also takes 40 seconds since the files are processed serially rather than simultaneously. If we look at the output of htop, only one processor is being used for the processing. Any modern computer has anywhere from 2 to 8 cores which could be used to run the script on the files simultaneously. This can be done with ‘xargs’ as shown below,

find data/ -name "input*" -print0 | xargs -0 -n1 -P0 sh -c 'cat "$@" | ./count > "$@_out"' _

This does the exact same thing as the one before but takes only half the time (20 seconds) because it uses both processors. The find command finds all the files starting with "input" in the data folder and prints them out separated by null characters (-print0). xargs reads this null-separated list (-0), starts as many processes as possible (-P0) with one argument each (-n1), and executes the sh -c 'pipeline' command for each of them. The pipelines are started simultaneously, and how many can actually run in parallel depends on the number of cores available in the machine. Within the pipeline we can refer to the argument (the file name) as $@.
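
If we would rather cap the number of parallel jobs at the core count instead of letting xargs start as many as it can, -P accepts an explicit number. A minimal variant, assuming the coreutils nproc command is available:

# Same command, but limiting the parallel jobs to the core count reported by nproc
find data/ -name "input*" -print0 | xargs -0 -n1 -P"$(nproc)" sh -c 'cat "$@" | ./count > "$@_out"' _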

The change from 40 seconds to 20 seconds doesn't seem like much here, but when I used this on a server with 24 cores for 500 files, the difference was from 2:48 hours to 7 minutes!

Update: Feedback from reddit suggests that there is a cleaner way of doing this using GNU parallel.

find data/ -name "input*" | parallel "cat {} | ./count > {}_out"

This can be extended to multiple machines via ssh as well, thus giving us a simple cluster!
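
A minimal sketch of what that could look like, assuming passwordless ssh to two hypothetical hosts server1 and server2 and a copy of the count script in the corresponding location on each of them (--trc transfers each input file to the remote host, returns the named output file and cleans up afterwards; ":" adds the local machine to the pool):

find data/ -name "input*" | parallel -S server1,server2,: --trc {}_out "cat {} | ./count > {}_out"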


Downloading or streaming BBC podcasts from the Linux shell

I like listening to podcasts when I code. Not only do I get to learn something while doing mundane tasks, but it also acts as white noise when I am concentrating on the task at hand. So recently I wanted to download this podcast from the BBC as mp3 files onto my laptop so that I can listen to them offline during my commute. I would usually have done it manually, but with my newly learned shell knowledge, I did it with this one-liner instead.

for i in {0..5}; do if [ $i = 0 ]; then i=""; else i="?page=$i"; fi; echo "https://www.bbc.co.uk/programmes/p02pc9qx/episodes/downloads$i"; done | xargs curl | grep -Po 'href="\K.*?(?=")' | grep ".mp3" | grep "low" | xargs wget

The for loop generates a URL for each of the six pages in the downloads section; the if/else condition appends the page number and formats the URLs. xargs then passes these URLs to curl, which fetches the contents of each page. grep then searches the returned pages for links. The second and third greps keep only the links containing ".mp3" and "low". The second xargs converts the matches into arguments and passes them to wget, which downloads them to disk. When we run the command, we go through all six pages and download the low-quality mp3 for each episode.

As a bonus, we can then rename the resulting files based on the information in their ID3 tags, using the id3 utility.

id3 -f “%y_%t.mp3” *.mp3

This will rename all the files with the format “year_title.mp3”.

If we don't want to download them to disk but instead send them to vlc as a playlist for streaming, we can end the pipeline with xargs vlc rather than xargs wget.
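
In other words, the streaming version is the same one-liner with only the last command swapped (a sketch; everything up to the final xargs is unchanged):

for i in {0..5}; do if [ $i = 0 ]; then i=""; else i="?page=$i"; fi; echo "https://www.bbc.co.uk/programmes/p02pc9qx/episodes/downloads$i"; done | xargs curl | grep -Po 'href="\K.*?(?=")' | grep ".mp3" | grep "low" | xargs vlc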

jq – manipulating JSON in shell

jq is amazing. It is a unique combination of a concise JSON query language and the Linux shell, which gives an immensely powerful tool for working with JSON files (this post gives an introduction to the JSON format). It plays really well with the existing shell tools and has quickly become one of the most used tools in my data analysis/processing pipeline.

jq is like sed (the stream editor): it takes an input stream, applies an expression to it and returns an output stream. It does not modify files directly. The syntax is,

 input stream | jq 'expression' | output stream

Input and Outputs

The input and output streams are just plain text streams. They can be a file, a program, an http request, etc. For example, consider the following commands,

curl "https://jsonplaceholder.typicode.com/posts/" | jq '.[0:5]'  > posts.json

cat posts.json | jq '.[].id' > post_ids.json

cat post_ids.json | jq '.' | curl -X POST -d "$(</dev/stdin)" "http://ptsv2.com/t/5jo6w-1522072388/post"

The first one gets JSON data from the url, selects the first 5 elements and puts them in the posts.json file. The second one takes this posts.json file, filters just the ids from each element and puts them in the post_ids.json file. The third one takes this post_ids.json file and posts all of it to an http api as a POST request (the results are here). In all these examples, jq does nothing but transform the input stream and send it to the output text stream. This makes it extremely efficient and versatile.

Expressions

The expression part of jq is essentially a tiny functional language purpose-built for querying and transforming JSON. This is really powerful. A full list of things that can be done is available in the manual. I'll just outline some basic selection and filtering.

selection expressions
. - shows the original object
.keyname - selects the specified field in the object
.[] - selects all elements (if the object is an array)
.[index] - selects the element at the given index
.[start:end] - selects a slice from index start up to (but not including) end

function expressions (in addition to basic arithmetic)
length - returns the length of an array (or the number of keys in an object)
keys - returns the fields in an object
map - applies a function to all the elements in an array
del - deletes a key or element
select - returns an element only if the given condition is met
test - regex pattern matching

All these can be combined, nested and piped to each other (yes, these are pipes within pipes) indefinitely to manipulate JSON. For example consider the following JSON file named data.json

[
 {
   "id": 1,
   "title": "sunt aut facere",
   "body": "quia et t architecto"
 },
 {
   "id": 2,
   "title": "qui est esse",
   "body": "est rerum tempore"
 },
 {
   "id": 3,
   "title": "ea molestias quasi",
   "body": "et iusto sed quo"
 },
 {
   "id": 4,
   "title": "eum et est occaecati",
   "body": "ullam et saepe"
 },
 {
   "id": 5,
   "title": "nesciunt quas odio",
   "body": "repudiandae veniam quaerat"
 }
]

This can be filtered in the following ways,

'.' - all the data.

'.[0]' - first element of the data

'.[1:3]' - two elements starting at index 1 (ie, the second and third elements)

'.[0].title' - title of the first element

'.[].id' - ids of all elements "1,2,3,4,5"

'[.[].id]' - ids of all elements as an array. "[1,2,3,4,5]"

'. | length' - number of elements (5)

'.[] | length' - number of fields in each object of the array [3,3,3,3,3]

'.[0] | keys' - the fields/keys in the first element

'.[] | select(.id==3)' - the element with id as 3

'. | del(.[2])' - everything but third element

'. | del((.[] | select(.id==3)))' - everything but the element with id as 3

'. | map(.id = .id+1)' - increase the id variable for all elements by 1

'. | map(del(.id))' - remove the field id from all elements

'.[] | select(.body | test("et"))' - elements with 'et' in the body fields

Combining all these, we can easily explore and process JSON files right from the Linux terminal, and finally the data can be organised into an array and exported as a csv using the @csv filter. For example,

cat data.json | jq -r '.[] | [.id, .title, .body] | @csv' > data.csv

The -r flag is important since it makes jq output raw csv text rather than JSON-encoded strings.
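
If the csv needs a header row, it can be emitted as an extra array before the data rows (a small sketch, assuming the same data.json and field names as above):

cat data.json | jq -r '["id","title","body"], (.[] | [.id, .title, .body]) | @csv' > data.csv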

Running Windows inside Arch Linux with VirtualBox

Even though I moved over to Linux completely quite some time ago, every now and then I encounter situations in which I really have to use Windows. The last time was because of a form: a Word document set up in a way that I had to use MS Word to fill it in. Initially I planned never to go back to Windows and to just borrow a Windows computer in such situations. Then I realised it is better to have a Windows installation loaded with commonly used software, ready to go whenever I need it, rather than depending on someone else. So I installed Windows on my desktop using VirtualBox. The only thing which needs to be sourced is the Windows installation disc (.iso), which you can either borrow or buy; I used my university's license on this one.

The steps are straightforward with Arch: install virtualbox, virtualbox-host-modules-arch and virtualbox-ext-oracle (this last one is from the AUR). Open VirtualBox, create your virtual machine following the step-by-step GUI and start the machine. That's it, we have a working Windows installation. The first thing we need to do in the guest system is install the "guest additions", which can be inserted as a disc from Devices > Insert Guest Additions CD Image. The way to make the virtual machine look seamless with the host OS is to set the same wallpaper, set the guest to auto-resize to the host window, and hide the menu bar and status bar.
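
The package installation could look something like this (a sketch; yay is just one example of an AUR helper, use whichever you prefer):

# Install VirtualBox and the host kernel modules from the official repositories
sudo pacman -S virtualbox virtualbox-host-modules-arch
# virtualbox-ext-oracle comes from the AUR, so it needs an AUR helper (yay assumed here)
yay -S virtualbox-ext-oracle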

My configuration for i3 is available at https://github.com/sbmkvp/config

Sending mail from command line using mutt

Sometimes you just don't have the patience to open a GUI. Imagine you are working on a terminal remotely through a very feeble internet connection and, after hours of data wrangling, you have your results in one small package. Now all you want is to email this 200kb document (the average size of a 20k-word .txt document). You can either load a GUI, open a browser, open gmail (the login page itself is 2MB), attach the file and send the email, or just execute a one-line command which does everything for you. With some minimal setup you can do the latter – sending email via the CLI just like any other shell command. You can even include this in your scripts (send mail when a script finishes running, etc.).

We will do this using a terminal program called "mutt", which has a brilliant CLI interface, and configure it to use Gmail over IMAP and SMTP. The first step is to install mutt using a package manager (apt/yum/pacman for Linux, brew for macOS). I am doing this on Arch with pacman, installing mutt and smtp-forwarder and then creating the necessary folders and files for mutt.

sudo pacman -S mutt smtp-forwarder
mkdir -p ~/.mutt/cache/headers
mkdir ~/.mutt/cache/bodies
touch ~/.mutt/certificates
touch ~/.mutt/muttrc

Edit the muttrc file with your favourite text editor and add these configurations (make sure to change the username to your own, and if you are using two-factor authentication with Gmail the password has to be generated from App passwords).

set ssl_starttls=yes
set ssl_force_tls=yes
set imap_user = 'username@gmail.com'
set imap_pass = 'yourpassword'
set from= 'username@gmail.com'
set realname='yourname'
set folder = imaps://imap.gmail.com/
set spoolfile = imaps://imap.gmail.com/INBOX
set postponed="imaps://imap.gmail.com/[Gmail]/Drafts"
set header_cache = "~/.mutt/cache/headers"
set message_cachedir = "~/.mutt/cache/bodies"
set certificate_file = "~/.mutt/certificates"
set smtp_url = 'smtps://username@smtp.gmail.com:465/'
set smtp_pass = 'yourpassword'
set move = no
set imap_keepalive = 900
set editor = vim
bind pager j next-line
bind pager k previous-line
set sort = threads
set sort_aux = reverse-date-sent
unset imap_passive
set imap_check_subscribed
set mail_check=60
set timeout=10
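
With the imap settings above, simply running mutt with no arguments should open the Gmail inbox over IMAP:

mutt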

That is it! Now we can send mail from the terminal by just passing some text, or a file containing the text, to the mutt command,

echo "email body"  | mutt -s "email-subject" -- recipient@gmail.com
mutt -s "email-subject" -- recipient@gmail.com <  file_with_body_text.txt

We can even attach files like this,

echo "please find attached"  | mutt -s "email-subject" -a "attachment.pdf" -- recipient@gmail.com


Running a simple static HTTP server

I have been really busy the past 3 days so there were no posts, which means there are going to be 3 unrelated small posts on small utilities I use. The first thing is an http server. Since web browsers are locked down these days, it is not easy to read files off the local machine when you are testing even a simple website. For example, if I have an html file where I want to load a csv, parse it and display it, serving the html from an http server is the only way to allow Chrome/Firefox to read the file. At the same time, I really don't want to install a full Apache web server to serve two html files.

The solution to this is a node package – ‘http-server’. It is a tiny http server which, when run from a folder in the CLI, serves the folder's contents over http at localhost. All we need to do is,

# Install nodejs and node package manager (npm)
sudo pacman -S node npm
# Install http-server package through npm globally
npm install -g http-server
# start the server
http-server

That is it! Whichever folder you ran http-server in will be accessible at the ip/port shown. We can combine this with forever (another node package) or run it under a gnu-screen session to keep it going in the background.
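
For example, a detached screen session could look like this (a sketch; the port and session name are arbitrary choices):

# Serve the current folder on port 8080 inside a detached screen session named "web"
screen -dmS web http-server -p 8080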

Installing Arch Linux (and R) on an Android device

This is a really recent development and I am very excited about it. I finally found a way to have a somewhat proper Linux installation on my phone. Though it might not be the best place to have a CLI, it is really promising and I can rely on it to do some small stuff on the go. As the tools I use get simpler (Photoshop vs Imagemagick) and the hardware of the phones I own gets better, it should be possible for my phone to do the things my 5-year-old laptop could handle, given the right environment.

This is done by installing a full Arch system on an Android phone under the termux environment, using the installer from TermuxArch. The installation here is actually way easier than installing Arch on a normal desktop. We start by installing the termux Android app. When we open termux we get a bash shell. From here we install wget by running pkg install wget. When that is complete, we download and run the Arch installation script,

# Download the script
wget https://raw.githubusercontent.com/sdrausty/TermuxArch/master/setupTermuxArch.sh 
# Adding execute permissions
chmod a+x setupTermuxArch.sh
# Run the script
./setupTermuxArch.sh

Now we can just follow the instructions in the script, which will download and unpack a base Arch Linux system and ask you to edit the mirror list. At this point, just un-comment (remove the # from) the closest mirrors, then save and exit the file. When the installation is complete you have a vanilla Arch system on your mobile! Now we can theoretically install and use on my phone any program I have on my work desktop, including ssh, vim, git, latex, R, node, postgres, mongodb, etc. I can even ssh into my work desktop straight from here. Below are some screenshots of the system (the chart is done entirely on the phone!).
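
As for the R mentioned in the title, it installs inside the Arch environment just like on a desktop system (a minimal sketch; depending on how the TermuxArch setup configures users, the command may need to be run as root or prefixed with sudo):

# Sync the package database and install R from the official Arch repositories
pacman -Syu r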