Parallelisation of data processing pipelines in shell using ‘xargs’ (and parallel)

I like scripting with shell a lot since it is simple, available almost everywhere, and fast and efficient. Recently, while reading this, I came across a way to make data processing several times faster using the ‘xargs’ command. xargs takes input from a stdin stream and passes it to a chosen command as command line arguments. When combined with the -P switch, it can also run the command on those arguments simultaneously as independent processes, which makes sure that all the processors/cores in the system are used at the same time.
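
As a minimal illustration of what -P does (the letters here are just placeholder arguments), the following runs up to four copies of echo at once, one argument each:

printf '%s\n' a b c d | xargs -n1 -P4 echo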

As an example, let's consider a slightly complex counting algorithm I have made for my research. It is an Rscript which reads csv data from stdin, analyses it and writes a csv result to stdout. The pipeline looks something like this,

cat input.csv | ./count > output.csv

This takes around 20 seconds to complete. Now imagine there are 2 files, input_1.csv and input_2.csv, in a data folder and we want to run the count script on both of them. The obvious way to do this is manually,

cat input_1.csv | ./count > output_1.csv
cat input_2.csv | ./count > output_2.csv

This takes 40 seconds. This can also be written in a for loop for scalability.

for i in $(ls data); do cat data/$i | ./count > data/${i}_out; done

Update: feedback from reddit suggests that the ls command is not appropriate here; it is better to do for i in data/*, as in the sketch below.
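
A sketch of that suggested form (narrowing the glob to input* so previously generated output files are not re-processed):

for i in data/input*; do cat "$i" | ./count > "${i}_out"; done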

This also takes 40 seconds since the processing of the files is done serially rather than simultaneously. If we look at the output of htop, only one processor is used for the processing. Any modern computer has anywhere from 2-8 cores which could be used to run the script on the files simultaneously. This can be done with ‘xargs’ as shown below,

find data/ -name "input*" -print0 | xargs -0 -n1 -P0 sh -c 'cat "$@" | ./count > "$@_out"' _

This does exactly the same thing as the one before but takes only half the time (20 seconds) because it uses both processors. The find command finds all the files starting with "input" in the data folder and prints them out separated by null characters (-print0). xargs takes this as input, reads the null separated list (-0), creates as many processes as possible (-P0) with one argument each (-n1), and executes the sh -c 'pipeline' command for each. The pipelines are started simultaneously, and how many actually execute at the same time is ultimately limited by the number of cores available in the machine. Within the pipeline we can refer to the argument (file name) as $@.
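
If we would rather cap the parallelism explicitly at the number of cores instead of letting -P0 launch everything at once, a variant along these lines should work (nproc reports the core count; the rest of the pipeline is unchanged):

find data/ -name "input*" -print0 | xargs -0 -n1 -P"$(nproc)" sh -c 'cat "$@" | ./count > "$@_out"' _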

The change from 40 seconds to 20 seconds doesn't seem like much here, but when I use this on a server with 24 cores on 500 files, the difference is from 2 hours 48 minutes to 7 minutes!

Update: Feedback from reddit suggests that there is a cleaner way of doing this using GNU parallel.

find data/ -name "input*" | parallel "cat {} | ./count > {}_out"

This can be extended to multiple machines via ssh as well, thus giving us a simple cluster!
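
For example, something along these lines should spread the work over two machines reachable over ssh (server1 and server2 are placeholder host names; --transfer copies each input file to the remote machine, --return brings the output back and --cleanup removes the remote copies):

find data/ -name "input*" | parallel --sshlogin server1,server2 --transfer --return {}_out --cleanup "cat {} | ./count > {}_out"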


Is Java or Python the Best Programming Language for Android Apps?

The following guest post comes courtesy of Michael Kordvani from fueled.nyc

***

Though Objective-C, Swift, Java, Python and HTML5 are widely regarded as the most popular mobile app programming languages, when it comes to Android apps it is Java and Python that rule them all.

But, perhaps you are an Android app developer who does not have a need for two programming languages. Perhaps, you just want the Big Dawg of all of the Android app programming languages in your toolbox. If so, read on to learn whether Python or Java is the best Android app programming language for you and your career.

Python

Python has been one of the most popular Android app programming languages for years because of these core tenets:

  • Readability Matters
  • Beautiful, not Ugly
  • Simple, not Complex
  • Complex, not Complicated
  • Explicit, not Implicit

So, really, when it comes time to learn how to create an Android app, Python deserves serious consideration.

Now, the majority of Android app developers do not learn this programming language because they are interested in Python's core philosophy; they probably don't even know about the philosophy. Android app developers choose Python because it is very simple to learn and is perhaps the most readable of all the Android app development languages in use today. Another main reason for an Android app developer to code in Python is that the language supports dynamic typing, which can be a real game changer.

Now, you must not make the mistake of thinking that Python's simplicity means that it cannot be used to create sophisticated apps. It can. In fact, Spotify, Dropbox, Quora and YouTube were all developed with Python. That's an amazing portfolio, yeah?

Java

Though Python has long been one of the most popular app development languages on the planet, it is not the most popular. Java is the most popular language.

Its popularity is one reason Android app developers ought to learn and work with it. After all, there is no other Android app programming language that has nearly as many open-source tools and Java-based libraries supporting it.

Another quality that Java is known for may be viewed by some developers as a negative. It is not concise. This is a great drawback for beginning Android app developers; however, skilled Android app developers tend to enjoy working with Java’s large vocabulary because it allows them to be as precise and creative as they could ever hope to be.

Java, Python: Which Should You Choose?

The bottom line is that only you can decide which Android app language is best for you and your career as an Android app developer. Novice developers tend to prefer Python for its simplicity, while experienced developers often choose Java for its extensive vocabulary. However, since there are currently close to 3 million Android apps available, you can rest assured that there is indeed a robust market for you and your apps, regardless of the language that may be the best fit for your project. Happy Android app building!

A Bloomberg-like ‘Terminal’ for open data?


Recently I had a tour of Bloomberg HQ in London from my friend who started a job there. It was the first time I actually saw a Bloomberg terminal and had a go at it. There are three things I found very fascinating and which the Bloomberg terminal has done right.

  1. The User Interface.
  2. Data completeness.
  3. Community.

User Interface (UI)

The Bloomberg terminal interface is neither command line nor GUI, it is a hybrid. If I have seen any other software which masters this kind of hybrid interface, it would be AutoCAD. I worked with AutoCAD almost exclusively for 4 years, from 2006 to 2010. In that time almost everything I did was in AutoCAD – presentations, maps, drawing, simple mathematical calculations, geometric problems. I used it for anything that required visual thinking or communication. Since then I have used other suites of software aimed at other things – Adobe Suite for graphic design, Max for 3D modelling, Unity for game development, etc. – but I have never met an interface that matches up to even half of what AutoCAD has (or used to have). The key thing at which AutoCAD excels is taking the best of the two interface types, combining them in all possible ways and letting the users figure out how to use this blend effectively for their needs.

There are two major types of interfaces when it comes to productivity software – the Graphical User Interface (GUI) and the Command Line Interface (CLI). In a GUI every action you do has an on-screen visual element which you can interact with. The best example of this is Microsoft Office. Literally everything you do in Office has a dialog, menu or button somewhere. The other extreme is the command line interface, where everything you do is, or is part of, a command/function. Even though there is a lot of debate about which one is better and a tremendous amount of personal preference (I am a command line guy through and through), I have to admit that both have their advantages and disadvantages.

A GUI is easy to learn but incredibly hard to master; simple things can be done quickly but repetitive complex tasks cause problems; and it handles spatial or graphical information well but struggles with factual information. For example, Microsoft Word is easy to learn – anybody can pick it up within an hour – but Word "experts" are hard to come by. Typing a simple letter in Word is terribly easy, but complex tasks such as making customised labels for 200 items are truly horrifying. Drawing things and moving features which have 2/3 dimensions to them is very clear, but logic such as if-then-else and concepts such as routines, functions and recursion are almost impossible. I still haven't seen any visual/GUI programming tool that is worth using; I think beyond a certain level, abstract concepts cannot be visualised with a GUI. The command line is the complete opposite of this: it can do complex, abstract, repetitive tasks very easily but struggles with anything that has a spatial element to it, e.g. graphic design, CAD etc. are extremely hard without a GUI. It has huge value for a power user but offers little ease of use.

Except for very few old ones, I don't think any software sticks with just one or the other philosophy anymore. With web and mobile apps, things have gotten more and more ambiguous. GUI applications tend to provide some command-line-like functionality with shortcuts (e.g. Excel, Adobe) and some form of programming or scripting interface for complex tasks (e.g. VB scripting). CLI applications tend to solve their spatial information problem by adopting layouts (vim, mutt) and also tend to be modular so that a GUI can be built on top of them if necessary. Usually well thought out software borrows from both and tries to strike a balance, but very few are successful in my opinion. AutoCAD and the Bloomberg terminal got this so perfect that once you get used to them it is hard to move to anything else.

From what I have seen, the Bloomberg terminal seems to be inherently command-line based while having a powerful GUI alongside. The GUI seems to be strictly utilitarian (maps, charts etc.) and all the GUI elements are mapped to some key or other. It is easy for new users to get started with the GUI, and at the same time it rewards them immensely if they learn to use the CLI.

Data Completeness

If you are in the realm of financial markets, the Bloomberg terminal provides a perfect bottom line for all your data and visualisation needs. There are no extra plugins to buy, no external tools to use with it and no major data cleaning or transformation needed to use it. One can get the data, analyse it, plot it and infer from it all in the terminal without using anything else. The data is ready to use, the vis tools are ready to use, and all you concentrate on is extracting the meaning out of it.

Bloomberg does have the advantage of working with data from the highly regulated financial markets, but it also faces an immense challenge with other sources of data, for example news and social media, which can be quite varied and diverse. Aggregating all these sources and providing them comprehensively in a single place creates amazing value. 80% of all data analysis is formatting the data, and the terminal takes that out of it. They also go for comprehensiveness: in their field they pretty much have everything covered and constantly have people covering new things that pop up. The general attitude is – if it is not on the terminal, either it is irrelevant or it doesn't exist.

The open-data stores miss this regularly by being in a multitude of places, embracing a multitude of formats and using directory based storage which is specific to a certain set of activities. For example, London and the UK have their own data stores which are full of static datasets that need to be accessed using their own authentication methods in at least 5 different formats. Simple things like plotting the boundary of a borough of London will take all your knowledge of GIS and spatial data structures. The Bloomberg terminal just takes all this away from the user. You don't even have to know what a csv is to see how the markets are doing. When we take this task off the user's shoulders we can empower the user in ways we might never have thought about.

For example, suppose I ask how many people are in the Camden borough of London right now. There is no one place to quickly get the answer. Yes, of course I can Google it, but that is equivalent to searching a library rather than a database. The data is open, available and free. The only thing preventing it from being used by anyone is the way the systems are set up to access the data. This is where the terminal shines. If you want to know something, ask the terminal; if you cannot find it, call the 24×7 support and ask for it. If it is not there, it most probably doesn't exist.

Social Network

This is a big one. It is absolutely amazing how terminal users are connected not only to the data and customer service but also to each other. They can mail each other, they can chat with each other, they have forums, groups etc. With Bloomberg TV they can even communicate with people who are being interviewed on TV. Social media streams into the terminal. If you are a trader working with other traders, the terminal is a one-stop solution. No need for 15 other services to do the basic job (which is just data analysis and communication). I use at least 10 different communication tools to collaborate with the people I work with, and each tool has its own workflow. Though I tend to get everything down to a unix shell interface for my own sanity, I can imagine how cool it must be to have a consistent workflow with everyone within one single application. Just by connecting the users to each other, the terminal adds exponential value for every user!

Open data terminal?

I personally want these advantages to be available for open data. Imagine an open data terminal built on a hybrid user interface, with a comprehensive set of data available to the public along with tools to interpret it and communicate with other users simultaneously. It would be such a game changer for so many people. For example, public institutions who make policies based on the data, research institutions who use the data to advance our understanding of the world, and volunteers/journalists who report on the data could do their jobs quickly and easily without spending large amounts of time and resources on data cleaning and verification. I would like to see all the open data stores work together to make a common platform for making all their products available to users in a simple, ubiquitous way. Just imagine doing,

data [dataname]@ visualisation [vis-type] parameters

to just about any data source available. Imagine adding the vast spatial data from OpenStreetMap to this mix! It would be phenomenal. Maybe as a hobby project, in the coming months I should pick a small unit (a borough or small town), try to gather all the open data possible and build something similar to what I have in mind. Let's see how it goes…

Downloading or streaming podcasts from the BBC from the Linux shell.

I like listening to podcasts when I code. Not only do I get to learn something while doing mundane tasks, but it also acts as white noise when I am concentrating on the task at hand. So recently I wanted to download this podcast from the BBC as mp3 files onto my laptop so that I can listen to them offline during my commute. I would have usually done it manually, but with my newly learned shell knowledge, I did it with this one-liner instead.

for i in {0..5}; do if [ $i = 0 ]; then i=""; else i="?page=$i"; fi; echo "https://www.bbc.co.uk/programmes/p02pc9qx/episodes/downloads$i"; done | xargs curl | grep -Po 'href="\K.*?(?=")' | grep ".mp3" | grep "low" | xargs wget

The for loop iterates six times, once for each page in the downloads section. The if-else condition appends the page number and formats the urls. xargs then passes each url to curl to get the contents of the page. grep then searches the returned page for links. The second and third greps keep only the links with ".mp3" and "low" in them. The second xargs converts them into arguments and passes them to wget, which then downloads those links to disk. When we run the command, we go through all six pages and download the low quality mp3 for each episode.

As a bonus we can then rename the resulting files based on the information in their ID3 tags, using the id3 utility.

id3 -f "%y_%t.mp3" *.mp3

This will rename all the files with the format “year_title.mp3”.

If we don't want to download them to disk but instead send them to vlc as a playlist for streaming, then we can do xargs vlc rather than xargs wget at the end.
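
In other words, the streaming variant of the same one-liner would look something like this (vlc simply receives the mp3 urls as a playlist instead of wget downloading them):

for i in {0..5}; do if [ $i = 0 ]; then i=""; else i="?page=$i"; fi; echo "https://www.bbc.co.uk/programmes/p02pc9qx/episodes/downloads$i"; done | xargs curl | grep -Po 'href="\K.*?(?=")' | grep ".mp3" | grep "low" | xargs vlc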

jq – manipulating JSON in shell

jq is amazing. It is a unique combination of a javascript-like query language and the linux shell, giving an immensely powerful tool for working with JSON files (this post gives an introduction to the JSON format). It plays really well with the existing shell tools and has quickly become one of the most used tools in my data analysis/processing pipeline.

jq is like sed (the streaming editor): it takes an input stream, applies the expression to it and returns an output stream. It does not modify files directly. The syntax is,

 input stream | jq 'expression' | output stream

Input and Outputs

The input and output streams are just plain text streams. They can be a file, a program, an http request, etc. For example, consider the following commands,

curl "https://jsonplaceholder.typicode.com/posts/" | jq '.[0:5]'  > posts.json

cat posts.json | jq '.[].id' > post_ids.json

cat post_ids.json | jq '.' | curl -X POST -d "$(</dev/stdin)" "http://ptsv2.com/t/5jo6w-1522072388/post"

The first one gets json data from the url, filters the first 5 elements and puts them in the posts.json file. The second one takes this posts.json file, filters just the ids from each element and puts them in the post_ids.json file. The third one takes this post_ids.json file and posts all of it to an http api as a POST request (the results are here). In all these examples, jq does nothing but transform the input text stream and send it to the output text stream. This makes it extremely efficient and versatile.

Expressions

The expression part of jq is essentially a tiny expression language for querying and transforming JSON, and it is really powerful. A full list of things that can be done is available in the manual. I'll just outline some basic selection and filtering.

selection expressions
. - Shows the original object
.keyname - selects the specific field in the object
.[] - selects all elements (if the object is an array)
.[index] / .[start:end] - selects the element at the given index, or a slice from start up to (but not including) end

function expressions (in addition to basic arithmetic)
length - returns length of the array
keys - returns fields in an object
map - applies a function to all the elements in an array
del - deletes a key or element
select - returns the element only if the condition is met
test - regex-style pattern matching

All these can be combined, nested and piped to each other (yes, these are pipes within pipes) indefinitely to manipulate JSON. For example consider the following JSON file named data.json

[
 {
   "id": 1,
   "title": "sunt aut facere",
   "body": "quia et t architecto"
 },
 {
   "id": 2,
   "title": "qui est esse",
   "body": "est rerum tempore"
 },
 {
   "id": 3,
   "title": "ea molestias quasi",
   "body": "et iusto sed quo"
 },
 {
   "id": 4,
   "title": "eum et est occaecati",
   "body": "ullam et saepe"
 },
 {
   "id": 5,
   "title": "nesciunt quas odio",
   "body": "repudiandae veniam quaerat"
 }
]

This can be filtered in the following ways,

'.' - all the data.

'.[0]' - first element of the data

'.[1:3]' - a slice from index 1 up to (but not including) index 3, i.e. the second and third elements

'.[0].title' - title of the first element

'.[].id' - ids of all elements, output one per line (1, 2, 3, 4, 5)

'[.[].id]' - ids of all elements as an array. "[1,2,3,4,5]"

'. | length' - number of elements (5)

'.[] | length' - number of keys in each object of the array (3, 3, 3, 3, 3)

'.[0] | keys' - the fields/keys in the first element

'.[] | select(.id==3)' - the element with id as 3

'. | del(.[2])' - everything but third element

'. | del((.[] | select(.id==3)))' - everything but the element with id as 3

'. | map(.id = .id+1)' - increase the id variable for all elements by 1

'. | map(del(.id))' - remove the field id from all elements

'.[] | select(.body | test("et"))' - elements with 'et' in the body fields

Combining all these, we can easily explore and process json files right from the Linux terminal, and finally the data can be organised into arrays and exported as csv using the @csv filter. For example,

cat data.json | jq -r '.[] | [.id, .title, .body] | @csv' > data.csv

the -r flag is important since it makes jq output raw text (plain csv) rather than JSON-encoded strings.
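
If we also want a header row in the csv, something like this should work (the header names simply mirror the field names in data.json):

cat data.json | jq -r '(["id","title","body"], (.[] | [.id, .title, .body])) | @csv' > data.csv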

Running Windows inside Arch Linux with VirtualBox

Even though I moved over to Linux completely quite some time ago, every now and then I encounter situations in which I really have to use Windows. The last time was because of a form distributed as a Word document, set up in a way that I had to use MS Word to fill it in. Initially I planned to never go back to Windows and, in such situations, just borrow a Windows computer for that purpose. Then I realised it is better to have a Windows installation loaded with commonly used software, ready to go whenever I need it, rather than depending on someone else. So I installed Windows on my desktop using VirtualBox. The only thing which needs to be sourced is the Windows installation disk (.iso), which someone can loan you or you can buy. I used my university's license on this one.

The steps are straightforward with Arch. Install virtualbox, virtualbox-host-modules-arch and virtualbox-ext-oracle (this one is from the AUR). Open VirtualBox, create your virtual machine following the step by step GUI and start the machine. That's it. We have a working Windows installation. The first thing we need to do in the guest system is install the "guest additions", which can be inserted as a disc from Devices > Insert Guest Additions CD Image. The way to make the virtual machine look seamless with the host OS is to set the same wallpaper, set the guest to auto-resize to the host window, and hide the menu bar and status bar.
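
For reference, a rough sketch of the install commands (assuming yay is the AUR helper in use; any other helper or a manual makepkg build works just as well):

sudo pacman -S virtualbox virtualbox-host-modules-arch   # official repo packages
yay -S virtualbox-ext-oracle                             # extension pack from the AUR
sudo modprobe vboxdrv                                    # load the kernel module without rebooting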

My configuration for i3 is available at https://github.com/sbmkvp/config

Tunnelling internet through an ssh server in Mac OS X

This is a neat trick I use to tunnel my internet traffic on my MacBook through an ssh server. It involves setting up a SOCKS proxy and pointing it at an ssh connection, in two steps. You can make aliases for these in your .bashrc (.zshrc) file and use them from the terminal.

alias mac_sst_start='ssh -D 8080 -f -q -C -N username@serveraddress'
alias mac_proxy_on="sudo networksetup -setsocksfirewallproxy Wi-Fi localhost 8080"
alias mac_proxy_off="sudo networksetup -setsocksfirewallproxystate Wi-Fi off"

The first command, mac_sst_start, opens a SOCKS proxy on local port 8080 and forwards all the traffic presented to it through the ssh connection to the server. When you run this, there will be a prompt for a password, which is the ssh account password on the server.

The second command, mac_proxy_on, changes the Wi-Fi preference on the MacBook to use this port 8080 as a SOCKS proxy and forward all the traffic to it. This will also ask for a password, but this one is the local MacBook password. Once these two are run, the internet is tunnelled through the server, so if you check your IP it will show up as the server's IP. The third one switches the proxy off when you want to return to the normal internet connection.
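
A typical session then looks something like this (the curl check goes through the proxy explicitly, since curl on its own does not pick up the system proxy setting; ifconfig.me is just one of many services that echo back your public IP):

mac_sst_start                                        # start the tunnel; asks for the ssh password
mac_proxy_on                                         # route Wi-Fi traffic via localhost:8080; asks for the local password
curl --socks5-hostname localhost:8080 ifconfig.me    # should print the server's public IP
mac_proxy_off                                        # back to the normal connection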

I use this with my university servers, which gives me access to university resources from all over the world. I can access the library, journal articles, servers in the university etc. as if I am connected to the university network (just like a vpn).