Data manipulation basics with tidyverse – Part 2 – Basic functions

In part 1 we saw how to use pipes to pass data between functions so that we can write R code like a sentence. The second impressive thing about tidyverse is its grammar for manipulating data. The way the functions are structured and named gives us a consistent way of writing R code that is clear, concise and readable. Except in very few cases, I find myself using just 5 basic functions from tidyverse:

  • select – select columns
  • filter – select records
  • mutate – modify columns
  • summarise – combine records
  • arrange – arrange records

Consider the sample table below,

| name  |  year | sex | town   |
 ----------------------------
|  A    |  1998 |  M  | London |
|  B    |  1995 |  M  | Berlin |
|  C    |  1994 |  F  | London |
|  D    |  2000 |  F  | Madrid |
|  E    |  1995 |  M  | Berlin |

1) The select function picks columns from the table. For example, select(year, sex, town) returns a table with just the three selected columns. We can also rename columns as we select them (rename() does the same without dropping the other columns), for example select(year, sex, city = town),

|  year | sex | city   |
 ---------------------
|  1998 |  M  | London |
|  1995 |  M  | Berlin |
|  1994 |  F  | London |
|  2000 |  F  | Madrid |
|  1995 |  M  | Berlin |
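As a minimal sketch, the selection above can be reproduced on the sample table. The object name `people` is borrowed from the pipeline later in this post, and building it with a plain `data.frame()` is just my assumption about how the data might be entered.

```r
library(dplyr)

# Sample table from above; entering it by hand is an assumption for illustration
people <- data.frame(
  name = c("A", "B", "C", "D", "E"),
  year = c(1998, 1995, 1994, 2000, 1995),
  sex  = c("M", "M", "F", "F", "M"),
  town = c("London", "Berlin", "London", "Madrid", "Berlin"),
  stringsAsFactors = FALSE
)

# Keep three columns, renaming town to city on the way
selected <- people %>% select(year, sex, city = town)

# rename() changes the name but keeps every column, including name
renamed <- people %>% rename(city = town)

names(selected)
names(renamed)
```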

2) The filter function selects records based on a criterion. For example, filter(year < 2000) keeps only the records where year is less than 2000. We can also combine multiple criteria with the logical operators & (and) and | (or),

|  year | sex | city   |
 ---------------------
|  1998 |  M  | London |
|  1995 |  M  | Berlin |
|  1994 |  F  | London |
|  1995 |  M  | Berlin |
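Here is a rough sketch of combining criteria, starting from the table as it looks after the select step. The particular conditions are made up for illustration.

```r
library(dplyr)

# The table after the select step (typed in by hand for this sketch)
people <- data.frame(
  year = c(1998, 1995, 1994, 2000, 1995),
  sex  = c("M", "M", "F", "F", "M"),
  city = c("London", "Berlin", "London", "Madrid", "Berlin"),
  stringsAsFactors = FALSE
)

# AND: born before 2000 and living in London
old_londoners <- people %>% filter(year < 2000 & city == "London")

# Conditions separated by commas are also combined with AND
same_thing <- people %>% filter(year < 2000, city == "London")

# OR: either female or living in Berlin
f_or_berlin <- people %>% filter(sex == "F" | city == "Berlin")

nrow(old_londoners)
nrow(f_or_berlin)
```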

3) The mutate function creates or modifies columns. For example, mutate(age = 2018 - year) creates a new column named age, calculated from year,

|  year | sex | city   |  age |
 -------------------------------
|  1998 |  M  | London |  20  |
|  1995 |  M  | Berlin |  23  |
|  1994 |  F  | London |  24  |
|  1995 |  M  | Berlin |  23  |
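A small sketch of mutate both adding a column and overwriting an existing one. Upper-casing the city is purely an illustrative assumption, not something the analysis needs.

```r
library(dplyr)

# The four records left after filtering (typed in by hand for this sketch)
people <- data.frame(
  year = c(1998, 1995, 1994, 1995),
  sex  = c("M", "M", "F", "M"),
  city = c("London", "Berlin", "London", "Berlin"),
  stringsAsFactors = FALSE
)

with_age <- people %>%
  mutate(
    age  = 2018 - year,     # new column computed from year
    city = toupper(city)    # reusing an existing name overwrites that column
  )

with_age$age
```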

4) Summarise works in two parts: group_by() splits the records into groups based on one or more columns, and summarise() collapses each group into a single record using a formula (function). For example, if we need the average age of people in each city by sex, group_by(city, sex) %>% summarise(average.age = mean(age)) gives us,

|  city  | sex |  average.age |
-------------------------------
| Berlin |  M  |       23     |
| London |  F  |       24     |
| London |  M  |       20     |
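To make the two parts concrete, here is a minimal sketch on the four records with their ages. The extra `people` count column using `n()` is my own addition to show that one summarise call can compute several things per group.

```r
library(dplyr)

# Records after the mutate step (typed in by hand for this sketch)
ages <- data.frame(
  city = c("London", "Berlin", "London", "Berlin"),
  sex  = c("M", "M", "F", "M"),
  age  = c(20, 23, 24, 23),
  stringsAsFactors = FALSE
)

averages <- ages %>%
  group_by(city, sex) %>%            # part 1: define the groups
  summarise(
    average.age = mean(age),         # part 2: one value per group
    people      = n()                # number of records in each group
  )

averages
```

The two Berlin records collapse into a single row, which is why the result has three rows rather than four.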

5) The arrange function sorts records based on the values in the specified columns. For example, arrange(average.age) gives us,

|  city  | sex |  average.age |
-------------------------------
| London |  M  |      20      |
| Berlin |  M  |      23      |
| London |  F  |      24      |
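Two variations that often come in handy, sketched on the summary table above: `desc()` reverses the order, and listing several columns sorts by the first and breaks ties with the next.

```r
library(dplyr)

# Summary table from the previous step (typed in by hand for this sketch)
averages <- data.frame(
  city        = c("Berlin", "London", "London"),
  sex         = c("M", "F", "M"),
  average.age = c(23, 24, 20),
  stringsAsFactors = FALSE
)

# Oldest group first
oldest_first <- averages %>% arrange(desc(average.age))

# Sort by city, then by average age within each city
by_city <- averages %>% arrange(city, average.age)

oldest_first$average.age
by_city$average.age
```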

I have found that most data manipulation can be done by combining these 5 functions, and the best part is that the resulting code translates really well to English. All the stuff we did earlier can be written down as a single pipeline, without any intermediate objects and without repeatedly referring to the data we are working on. For example,

people %>%
    select(year, sex, city = town) %>%
    filter(year < 2000) %>%
    mutate(age = 2018 - year) %>%
    group_by(city, sex) %>%
    summarise(average.age = mean(age)) %>%
    arrange(average.age)

Translates to,
Take the people table, select the year, sex and town columns (renaming town to city), filter for records where year is less than 2000, calculate the age column from year, group the table by city and sex, find the average age for each group and arrange the records by average age.

Minimal Arch-Linux desktop environment with i3+rofi

Another short post showing off the stuff I have been working on. As I mentioned earlier, I moved to Linux a couple of years back, starting with Ubuntu, and now I have moved to Arch. It is as minimal as it gets and, along with the AUR, stuff just works. Though I am not a big fan of GUIs, I definitely need at least a window manager to display the browser and the graphics/documents I work on. So I need a system which is minimal and bloat free, with exactly the things I want and efficient keyboard-based navigation. The video below shows my current setup (note: I mostly reside inside the terminal in tiling mode, so the floating windows here are just to make things look nice).

Mapping building footprints – European cities

Since Mapzen is shutting down, the metro extracts are going to be shut down by the end of this month. So I am going to make all the maps with the scripts I put together earlier. Here are some building footprints of the city centres of European cities – London, Madrid, Helsinki, Berlin and Amsterdam. I am currently running a batch job with larger extents and more cities on one of the high performance servers at the university, and will update with the results when it is complete.

tmap installation in R under Arch linux

This is a small quirk I solved recently, so I wanted to document it here. Recently I moved my desktop at work from Ubuntu to Arch Linux. To be honest, it went surprisingly well. Every single thing I had in Ubuntu migrated nicely to Arch with just one exception – the tmap package in R.

The “tmap” package depends on the “v8” package, which in turn depends on the v8 library in Linux, specifically version 3.14. This library is not available through pacman (the Arch package manager) and had to be compiled from an AUR package. It seemed easy enough until it turned out that the package needed to update gyp, which had moved its source code from svn to git. While I could still compile it by modifying the PKGBUILD file, it resulted in segmentation faults when actually loading the v8 package in R. After meddling with it for some time I eventually gave up and decided to just wait for an update to tmap rather than waste time on it. But I fixed it this week!

The trick is a two-step install with yaourt (a separate package manager for the AUR). First we need to install yaourt, and then install both the v8-3.14 and v8-3.14-bin packages using yaourt, after which tmap installs and works perfectly in R.

yaourt -S v8-3.14
yaourt -S v8-3.14-bin

 

Data manipulation basics with tidyverse in R – Part 1 – Piping data between functions

Last July I attended the talk at LSE by Hadley Wickham on tidyverse and since then I have been working with tidyverse sparingly. My opinion on the whole thing is that it is brilliant! Along with ggplot2 it is clear, concise and powerful for the analysis and visualisation of most of the data I encounter on a daily basis. The things I like about tidyverse are,

  1. Pipes: similar to Linux, each function does one thing and does it well, and the data can be piped from one function to another with the “%>%” operator. For me it bridges one of the major gaps between shell scripting and R (which are very similar to begin with).
  2. Grammar: Similar to vim, when dealing with tabular data, the grammar to manipulate them is very clear and consistent.

In this post I’ll look into the pipe operator in detail,

Piping is primarily done with the operator “%>%”, which is similar to the “|” operator in the Linux shell. It takes the output of the previous function and uses it as input for the next function. Previously I did this in R by storing the output of the first function in an intermediate object, but that gets tedious really quickly.

## Instead of doing this,
sand <- dig(earth)
bricks <- bake(sand)
wall <- lay(bricks)

## We can do this
wall <-
    earth %>% 
    dig() %>%
    bake() %>% 
    lay()

Special cases of the piping are,
%T>% This one (the tee operator) passes its input to the next function but then returns that same input rather than the function's result, so the rest of the pipe carries on as if that function had been skipped. I usually use this when I have to keep working with the data after plotting it, as shown below,

 data %>%
    modify_1() %T>%
    plot() %>%
    modify_2() %>% 
    plot()
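A small runnable sketch of the difference, using `str()` in place of `plot()` since it also returns nothing useful (NULL). Note that `%T>%` is not attached by `library(tidyverse)`, so the magrittr package has to be loaded explicitly. The numbers are made up for illustration.

```r
library(magrittr)

# With the tee operator, str() prints a summary but sum() still
# receives the output of sqrt()
with_tee <- c(1, 4, 9) %>%
  sqrt() %T>%
  str() %>%
  sum()

# With the ordinary pipe, sum() receives the NULL returned by str()
without_tee <- c(1, 4, 9) %>%
  sqrt() %>%
  str() %>%
  sum()

with_tee     # 6
without_tee  # 0, because sum(NULL) is 0
```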

%<>% This one sends the result at the end of the pipe back to the object at the start of it. This is useful when we are trying to transform the object itself rather than creating a new one. For example,

## This, 
data <- 
    data %>%
    modify_1() %>%
    modify_2()

## Can be simplified to,
data %<>%
    modify_1() %>%
    modify_2()
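The same idea as a runnable sketch, with `sqrt()` and `sort()` standing in for the hypothetical `modify_1()` and `modify_2()`. Like the tee operator, `%<>%` comes from magrittr and needs that package loaded.

```r
library(magrittr)

x <- c(9, 4, 1)

# Equivalent to x <- x %>% sqrt() %>% sort()
x %<>%
  sqrt() %>%
  sort()

x  # 1 2 3
```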

The pipes pass the output as the first argument of the next function. This can be a problem with functions that expect the data somewhere else. To overcome this we use “.” to denote where the data should go in the function following the pipe. For example,

## This, 
1:5 %>%
    data.frame(
        x = .,
        y = .^2 )

## returns the data frame,
#     x   y
# 1   1   1
# 2   2   4
# 3   3   9
# 4   4  16
# 5   5  25
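A common case where this matters is `lm()`, which takes the formula first and the data second. A minimal sketch using the built-in `mtcars` dataset (the choice of dataset and model is only for illustration):

```r
library(magrittr)

# Without the dot, mtcars would be passed as the formula argument
fit <- mtcars %>%
  lm(mpg ~ wt, data = .)

coef(fit)
```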

That covers the basics of pipes. In the next post I’ll talk about the data manipulation functions within tidyverse in detail.

Executing commands from vim to R in terminal similar to R-Studio

As I have mentioned earlier in the blog, I really don’t like GUI based Integrated Development Environments. The reasons I find them pointless are,

  • I work with a lot of programming languages and environments simultaneously with no intention of being an expert in any one of them. For example, I program in java-android, java-processing, latex, R, javascript-node, javascript-general, shell scripting, sql-postgres all in one day. I cannot gain expertise with all of them and I don’t find investing time in just one of them worthwhile.
  • Since I never deal with anything bigger than a one person project, I almost never use any advanced IDE features like debugging etc. The only things I find useful in a GUI based IDE are syntax highlighting and some of the WYSIWYG stuff.
  • I work with headless systems where I don’t have the permissions to install my favourite GUI systems, so irrespective of how good an IDE is I cannot use them in the most powerful machines I have at my disposal. I need my workflow to be portable to the least common denominator.

Because of this I had to give up on one of the best GUI IDEs I have ever seen – RStudio. Though I could get by with most of the stuff using vim and screen, the biggest feature of RStudio I missed was the ability to execute commands from the script in the console. Today I figured out how to implement the same in my vim+screen+R setup.

map <C-L> "kyy:echo system("screen -S $STY -p R -X stuff ".escape(shellescape(@k),"$"))<CR>j 

vmap <C-L> "xy:echo system("screen -S $STY -p R -X stuff ".escape(shellescape(@x."\n"),"$"))<CR>j

 

Having these two lines in the .vimrc maps Ctrl-L in normal mode and in visual mode to send commands from the current file to the window named R (where the R console is open) in the current screen session. In normal mode the entire line is sent to R, and in visual mode whatever is selected is sent to R. A demo of these mappings is shown below.