jq – manipulating JSON in shell

jq is amazing. It is a unique combination of a small JSON-oriented scripting language and the Linux shell, which gives you an immensely powerful tool to work with JSON files (This post gives an introduction to the JSON format). It plays really well with the existing shell tools and has quickly become one of the most used tools in my data analysis/processing pipeline.

jq is like sed (the stream editor). It takes an input stream, applies an expression to it and returns an output stream. It does not modify files directly. The syntax is,

 input stream | jq 'expression' | output stream

Inputs and Outputs

The input and output streams are just plain text streams. They can be a file, another program, an HTTP request, etc. For example, consider the following commands,

curl "https://jsonplaceholder.typicode.com/posts/" | jq '.[0:5]'  > posts.json

cat posts.json | jq '.[].id' > post_ids.json

cat post_ids.json | jq '.' | curl -X POST -d "$(</dev/stdin)" "http://ptsv2.com/t/5jo6w-1522072388/post"

The first one gets JSON data from the URL, filters the first 5 elements and puts them in the posts.json file. The second one takes this posts.json file, filters just the ids from each element and puts them in the post_ids.json file. The third one takes this post_ids.json file and posts all of it to an HTTP API as a POST request (the results are here). In all these examples, jq does nothing but transform the input text stream and send it on to the output text stream. This makes it extremely efficient and versatile.
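As a side note, jq can also read a file passed as an argument, so the cat in the second command is not strictly necessary:

jq '.[].id' posts.json > post_ids.json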

Expressions

The expression part of jq is essentially a tiny functional language purpose-built for manipulating JSON. This is really powerful. A full list of things that can be done is available in the manual. I'll just outline some basic selection and filtering.

selection expressions
. - returns the original object unchanged
.keyname - selects the specified field in the object
.[] - selects all elements (if the object is an array)
.[index] or .[start:end] - selects the element at an index, or a slice from start up to (but not including) end

function expressions (in addition to basic arithmetic)
length - returns the length of an array (or the number of keys in an object)
keys - returns the fields in an object
map - applies an expression to every element of an array
del - deletes the field or element at the given path
select - returns an element only if the condition is met
test - regex pattern matching on strings

All these can be combined, nested and piped to each other (yes, these are pipes within pipes) indefinitely to manipulate JSON. For example consider the following JSON file named data.json

[
 {
   "id": 1,
   "title": "sunt aut facere",
   "body": "quia et t architecto"
 },
 {
   "id": 2,
   "title": "qui est esse",
   "body": "est rerum tempore"
 },
 {
   "id": 3,
   "title": "ea molestias quasi",
   "body": "et iusto sed quo"
 },
 {
   "id": 4,
   "title": "eum et est occaecati",
   "body": "ullam et saepe"
 },
 {
   "id": 5,
   "title": "nesciunt quas odio",
   "body": "repudiandae veniam quaerat"
 }
]

This can be filtered in the following ways,

'.' - all the data.

'.[0]' - first element of the data

'.[1:3]' - elements from index 1 up to (but not including) index 3 (ie, the second and third elements)

'.[0].title' - title of the first element

'.[].id' - ids of all elements, one per line: 1 2 3 4 5

'[.[].id]' - ids of all elements as an array. "[1,2,3,4,5]"

'. | length' - number of elements (5)

'.[] | length' - number of keys in each object of the array: 3 3 3 3 3

'.[0] | keys' - the fields/keys in the first element

'.[] | select(.id==3)' - the element with id as 3

'. | del(.[2])' - everything but third element

'. | del((.[] | select(.id==3)))' - everything but the element with id as 3

'. | map(.id = .id+1)' - increase the id variable for all elements by 1

'. | map(del(.id))' - remove the field id from all elements

'.[] | select(.body | test("et"))' - elements with 'et' in the body fields

Combining all these, we can easily explore and process JSON files right from the Linux terminal, and finally the data can be organised into arrays and exported as CSV using the @csv filter. For example,

cat data.json | jq -r '.[] | [.id, .title, .body] | @csv' > data.csv

The -r (raw output) flag is important: without it, jq would wrap each CSV line in a quoted, escaped JSON string.
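For the data.json above, the resulting data.csv should look like this (note that @csv quotes the text fields and leaves the numeric ids bare):

1,"sunt aut facere","quia et t architecto"
2,"qui est esse","est rerum tempore"
3,"ea molestias quasi","et iusto sed quo"
4,"eum et est occaecati","ullam et saepe"
5,"nesciunt quas odio","repudiandae veniam quaerat"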


Tunnelling internet traffic through an SSH server on macOS

This is a neat trick I use to tunnel the internet traffic on my MacBook through an SSH server. It works by setting up a SOCKS proxy on top of an SSH connection, in two steps. You can make aliases for these in your .bashrc (.zshrc) file and use them from the terminal.

alias mac_sst_start='ssh -D 8080 -f -q -C -N username@serveraddress'
alias mac_proxy_on="sudo networksetup -setsocksfirewallproxy Wi-Fi localhost 8080"
alias mac_proxy_off="sudo networksetup -setsocksfirewallproxystate Wi-Fi off"

The first command, mac_sst_start, opens an SSH connection that listens on local port 8080 as a SOCKS proxy and forwards all the traffic presented to it through the SSH server. When you run this, there will be a prompt for a password, which is the SSH account password on the server.

The second command, mac_proxy_on, changes the Wi-Fi preferences on the MacBook to use this port 8080 as a SOCKS proxy and forward all traffic through it. This will also ask for a password, but this one is the local MacBook password. Once these two are run, the internet is tunnelled through the server, so if you check your IP it will show up as the server's IP. The third one switches off the proxy when you want to return to the normal internet connection.
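A quick way to confirm the tunnel is working is to compare the public IP reported with and without the SOCKS proxy (a small sketch; ifconfig.me is just one of many services that echo your IP back):

# apparent IP when going through the SSH tunnel
curl --socks5-hostname localhost:8080 https://ifconfig.me
# apparent IP over the normal connection
curl https://ifconfig.me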

I use this with my university servers, which gives me access to my university resources from all over the world. I can access the library, journal articles, servers in the university, etc. as if I were connected to the university network (just like a VPN).


Sending mail from the command line using mutt

Sometimes you just don't have the patience to open a GUI. Imagine you are working on a terminal remotely through a very feeble internet connection, and after hours of data wrangling you have your results in one small package. Now all you want is to email this 200 KB document (the average size of a 20k-word .txt document). You can either load a GUI, open a browser, open Gmail (the login page itself is 2 MB), attach the file and send the email, or just execute a one-line command which does everything for you. With some minimal setup you can do the latter: sending email via the CLI just like any other shell command. You can even include it in your scripts (send mail when a script finishes running, etc.).

We will do this using a terminal program called "mutt", which has a brilliant CLI interface, and configure it to use Gmail via IMAP. The first step is to install mutt using a package manager (apt/yum/pacman for Linux and brew for macOS). I am doing this on Arch with pacman, installing mutt and smtp-forwarder and then creating the necessary folders and files for mutt.

sudo pacman -S mutt smtp-forwarder
mkdir -p ~/.mutt/cache/headers
mkdir ~/.mutt/cache/bodies
touch ~/.mutt/certificates
touch ~/.mutt/muttrc

Edit the muttrc file with your favourite text editor and add these configurations (make sure to change the username to your username, and if you are using two-factor authentication with Gmail the password has to be generated from App passwords):

set ssl_starttls=yes
set ssl_force_tls=yes
set imap_user = 'username@gmail.com'
set imap_pass = 'yourpassword'
set from= 'username@gmail.com'
set realname='yourname'
set folder = imaps://imap.gmail.com/
set spoolfile = imaps://imap.gmail.com/INBOX
set postponed="imaps://imap.gmail.com/[Gmail]/Drafts"
set header_cache = "~/.mutt/cache/headers"
set message_cachedir = "~/.mutt/cache/bodies"
set certificate_file = "~/.mutt/certificates"
set smtp_url = 'smtps://username@gmail.com@smtp.gmail.com:465/'
set smtp_pass = 'yourpassword'
set move = no
set imap_keepalive = 900
set editor = vim
bind pager j next-line
bind pager k previous-line
set sort = threads
set sort_aux = reverse-date-sent
unset imap_passive
set imap_check_subscribed
set mail_check=60
set timeout=10

That is it! Now we can send mail from the terminal by just passing some text, or a file with the text, to the mutt command,

echo "email body"  | mutt -s "email-subject" -- recipient@gmail.com
mutt -s "email-subject" -- recipient@gmail.com <  file_with_body_text.txt

We can even attach files like this,

echo "please find attached"  | mutt -s "email-subject" -a "attachment.pdf" -- recipient@gmail.com
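As mentioned above, this fits naturally into scripts as well. For example, a hypothetical long-running job (long_job.sh is just a placeholder name) could mail you when it finishes:

./long_job.sh && echo "finished at $(date)" | mutt -s "job done" -- recipient@gmail.com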


Installing Arch Linux (and R) on an Android device

This is a really recent development and I am very excited about it. I finally found a way to have a somewhat proper Linux installation on my phone. Though it might not be the best place to have a CLI, it is really promising and I can rely on it to do some small stuff on the go. As the tools I use get simpler (Photoshop vs ImageMagick) and the hardware of the phones I own gets better, it should be possible for my phone to do the things my 5-year-old laptop could handle, given the right environment.

This is done by installing a full Arch Linux system on an Android phone under the Termux environment, using the installer from TermuxArch. The installation here is actually way easier than installing Arch on a normal desktop. We start by installing the Termux Android app. When we open Termux we get a bash shell. From here we install wget by running pkg install wget. When this is complete we download and run the Arch installation script:

# Download the script
wget https://raw.githubusercontent.com/sdrausty/TermuxArch/master/setupTermuxArch.sh 
# Adding execute permissions
chmod a+x setupTermuxArch.sh
# Run the script
./setupTermuxArch.sh

Now we can just follow the instructions in the script, which will download and unpack a base Arch Linux system and ask you to edit the mirror list. At this point, just un-comment (remove the # from) the closest mirrors, then save and exit the file. When the installation is complete you have a vanilla Arch system on your mobile! Now I can theoretically install and use any program I have on my work desktop on my phone, including ssh, vim, git, LaTeX, R, node, postgres, mongodb, etc. I can even ssh into my work desktop straight from here. Below are some screenshots of the system (the chart is done entirely on the phone!).
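To actually get R (as promised in the title), we can install it with pacman once we are inside the Arch environment. A minimal sketch, assuming the standard Arch repositories (the R package on Arch is simply named r):

# update the package databases and install R inside the Arch environment
pacman -Syu
pacman -S r
# quick check that it runs
R --version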


Mapping the distribution of National Institutions of Higher Education in India [R + tidyverse + tmap]

Since I started learning R and moved away from proprietary data analysis and GIS packages, I have been amazed by the availability of free datasets and tools enabling people to make awesome analyses and visualisations. Today we'll look at a simple exercise of taking open data from different sources and combining them using open-source and free tools to produce maps which help us understand the data better.

We'll use data from two sources: a tabular file on the distribution of national institutes of higher education in India from here, and a shapefile containing geographic information on the boundaries of Indian states from here. I have cleaned and prepared both datasets, which can be downloaded here. First we need to download the archive and extract/unzip it into our working directory. Once that is done we can start combining and plotting the data.

# Loading the data
data <- read.csv("institutions.csv")
library(rgdal) # library for reading shape files
states <- readOGR(".", "india_state")

Note that, in readOGR, the first parameter is the folder in which the shapefile is kept (it should be "." if we unzipped the files directly into the working directory) and the second one is the name of the shapefile.

# calculating the total number of institutions and 
# the number of institutions per 10 million people
library(tidyverse) # provides %>% and mutate
data <- data %>%
    mutate(Total = rowSums(.[, 3:10])) %>%
    mutate(Totalppm = Total / Population)

# merging the data into the shapefile
states <- states %>%
    merge(data, by = "state")

Now that we have successfully merged the data into the shapefile (we can check this by printing states@data at the console), we need to map the data using tmap. First we load the library by running library(tmap). This is a complex package and has a lot of dependencies, so it might take a while to download and install. As a first step we plot just the borders of the states by running,

tm_shape(states) +
    tm_borders(col = "grey")

[Figure 1: plain state borders]

We can add labels to the map by adding a tm_text layer to it. Notice that the order of the layers is important, since later layers are drawn on top of earlier ones.

tm_shape(states) +
    tm_borders(col = "grey") +
    tm_text("state", root = 10, size = "AREA")

[Figure 2: state borders with labels]

Now we can plot a single variable on the map as the fill colour for the states. For example, if we want to highlight all the states with an IIT, we do,

tm_shape(states) +
    tm_fill("IIT") +
    tm_borders(col = "grey") +
    tm_text("state", root = 10, size = "AREA")

[Figure 3: states with an IIT highlighted]

We can plot multiple variables side by side on the same plot by just passing a vector of variables to compare. Notice that we switched off the text size legend here by setting legend.size.show = FALSE in the tm_text layer.

tm_shape(states) +
    tm_fill(c("IIT", "IIM")) +
    tm_borders(col = "grey") +
    tm_text("state", root = 10, size = "AREA",
            legend.size.show = FALSE)

[Figure 4: IIT and IIM maps side by side]

Finally, we can plot the number of institutions per 10 million people by,

tm_shape(states) +
    tm_fill("Totalppm",
            title = "Institutions per 10m people") +
    tm_borders(col = 'grey') +
    tm_text("state", root = 10, size = "AREA",
            legend.size.show = FALSE)

[Figure 5: institutions per 10 million people]

This map is not very informative because the small Himalayan states with low populations skew the whole distribution. So we have to classify the data ourselves by giving a custom breaks parameter,

tm_shape(states) +
    tm_fill("Totalppm", 
            breaks = (1:10) / 2,
            title = "Institutions per 10m people") +
    tm_borders(col = 'grey') +
    tm_text("state", root = 10, size = "AREA",
            legend.size.show = FALSE)

[Figure 6: institutions per 10 million people, with custom breaks]

Now this shows a possible north-south divide in the distribution of institutions per person. This may be because most of the national institutions in the North are located in Delhi, while in the South, Bangalore, Mumbai and Chennai compete for them.

That completes the post for today. To summarise: we took tabular data, joined it with geographic data, plotted it, and uncovered new information that is not present in either dataset on its own!

Data manipulation basics with tidyverse – Part 2 – Basic functions

In part 1 we saw how to use pipes to pass data between functions so that we can write R code like a sentence. The second impressive thing about tidyverse is its grammar for manipulating data. The way the functions are structured and named in tidyverse gives us a consistent way of writing R code which is clear, concise and readable. Except for very few cases, I almost always find myself using just 5 basic functions with tidyverse,

  • select – select columns
  • filter – select records
  • mutate – modify columns
  • summarise – combine records
  • arrange – arrange records

Consider the sample table below,

| name  |  year | sex | town   |
 ----------------------------
|  A    |  1998 |  M  | London |
|  B    |  1995 |  M  | Berlin |
|  C    |  1994 |  F  | London |
|  D    |  2000 |  F  | Madrid |
|  E    |  1995 |  M  | Berlin |

1) The select function selects vertical columns from the table. For example, select(year, sex, town) will return a table with just the three columns selected. We can even rename columns as we select them (or use rename() as well): select(year, sex, city = town)

|  year | sex | city   |
 ---------------------
|  1998 |  M  | London |
|  1995 |  M  | Berlin |
|  1994 |  F  | London |
|  2000 |  F  | Madrid |
|  1995 |  M  | Berlin |

2) The filter function selects records based on criteria. For example, filter(year < 2000) will select only the records where year is less than 2000. We can even combine multiple criteria with the logical operators & (and) and | (or),

|  year | sex | city   |
 ---------------------
|  1998 |  M  | London |
|  1995 |  M  | Berlin |
|  1994 |  F  | London |
|  1995 |  M  | Berlin |

3) The mutate function modifies columns. For example, mutate(age = 2018 - year) will create a new column named age and calculate it from year,

|  year | sex | city   |  age |
 ------------------------------
|  1998 |  M  | London |  20  |
|  1995 |  M  | Berlin |  23  |
|  1994 |  F  | London |  24  |
|  1995 |  M  | Berlin |  23  |

4) Summarise is a two-part function which combines records, grouped by one or more columns, using a summary formula (function). For example, if we need the average age of people in each city by gender, group_by(city, sex) %>% summarise(average.age = mean(age)) gives us,

|  city  | sex |  average.age |
-------------------------------
| Berlin |  M  |       23     |
| London |  F  |       24     |
| London |  M  |       20     |

5) The arrange function orders records based on the values in the columns specified. For example, arrange(average.age) gives us,

|  city  | sex |  average.age |
-------------------------------
| London |  M  |      20      |
| Berlin |  M  |      23      |
| London |  F  |      24      |

I have found that most data manipulation can be done by combining these 5 functions in tidyverse, and the best part is that the resulting code translates really well to English. All the stuff we did earlier can be written down in a single statement, clearly, without any intermediate objects or repeatedly referring to the data we are working on. For example,

people %>% 
    select(year, sex, city) %>%
    filter(year < 2000) %>%
    mutate(age = 2018 - year) %>%
    group_by(city, sex) %>%
    summarise(average.age = mean(age)) %>%
    arrange(average.age)

Translates to,
Take the people table, select the year, sex and city columns, filter for records where year is less than 2000, calculate the age column from year, group the table by city and sex, find the average age for each group, and arrange the records by average age.

Minimal LaTeX environment for Linux (vim + entr + mupdf)

In the past year I have completely moved to the Linux environment and it has been great for everything I do except document writing. I have read about LaTeX and played with it for some time, but the whole process of write file > build pdf > view pdf looked tedious, and using an IDE was not my preference (I could just use Office on Wine if that were the case). All I wanted was a way to edit the text file using vim and see the resulting PDF in a window in real time.

After playing with a few PDF viewers and utilities, I think I have the perfect solution. It has three components:

  • vim (text editor)
  • entr (simple linux program to monitor files)
  • mupdf (lightweight pdf viewer)

Open the .tex file in vim and open the final PDF file in mupdf. Then run the following command in the background. That's all.

ls your.tex | entr /bin/sh -c "pdflatex your.tex && pkill -HUP mupdf"

To explain: 'ls your.tex' lists the file to be monitored, and entr runs the command between the double quotes with /bin/sh whenever the monitored file changes. The commands inside the quotes build the PDF using pdflatex and tell mupdf to reload. So now every time you write (save) your .tex file in vim, mupdf will show the final output. The entire environment is shown in the video below.
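If you use this often, the whole thing can be wrapped into a small shell function (a sketch assuming pdflatex, entr and mupdf are installed; watchtex is just a name I made up):

# watchtex: live-rebuild a .tex file and refresh mupdf on every save
watchtex() {
    pdflatex "$1"               # initial build so mupdf has something to open
    mupdf "${1%.tex}.pdf" &     # open the viewer in the background
    ls "$1" | entr /bin/sh -c "pdflatex $1 && pkill -HUP mupdf"
}
# usage: watchtex your.tex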

P.S. I am using i3 for the split between the terminal and the PDF viewer, and screen for the split inside the terminal.