Extracting multiple linear sequences out of two dimensional data

This is the problem I am currently working on. I don’t know how to explain it properly, but I’ll try. A single time series of data has been collected in which multiple increasing linear sequences are present. For example, consider this series of numbers (y) collected at times (x), generated from three linear equations.

x = 2, 2, 3, 4, 4, 5, 5, 6, 6, 7, 8, 8, 8, 9, 9, 10, 10, 11, 11, 11, 12, 12, 13, 14, 14, 15, 16, 18, 18, 19

y = 14, 12, 37, 22, 14, 26, 15, 73, 30, 34, 97, 38, 18, 109, 19, 121, 20, 133, 50, 21, 145, 54, 23, 169, 62, 181, 26, 217, 78, 29

When we plot this data on a chart we can see that there are three sequences in them.

2018-01-15-223825

The problem is to isolate these three clusters. Since I had no idea how to do this, I first tried a k-means clustering algorithm with 3 clusters, which gave me this,

2018-01-15-223833
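For reference, a k-means run like this can be sketched in R as below. This is only a sketch: I am assuming base R's kmeans() on the raw (x, y) pairs, and the seed and nstart values are arbitrary choices, so the exact clusters may differ from the picture above.

```r
# the example data from above
x <- c(2, 2, 3, 4, 4, 5, 5, 6, 6, 7, 8, 8, 8, 9, 9, 10, 10, 11, 11, 11,
       12, 12, 13, 14, 14, 15, 16, 18, 18, 19)
y <- c(14, 12, 37, 22, 14, 26, 15, 73, 30, 34, 97, 38, 18, 109, 19, 121,
       20, 133, 50, 21, 145, 54, 23, 169, 62, 181, 26, 217, 78, 29)

# k-means with 3 clusters on the raw coordinates
# (seed and nstart are arbitrary, only there for repeatability)
set.seed(1)
km <- kmeans(cbind(x, y), centers = 3, nstart = 10)

# colour each point by the cluster it was assigned to
plot(x, y, col = km$cluster, pch = 19)
```

k-means only looks at distances to cluster centres, so it knows nothing about the ordering of points in time.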

This is clearly wrong: the series moves forward on both the x axis and the y axis, so the blue cluster cannot possibly be a linear sequence. This is when I thought a graph-based clustering algorithm might help. I can encode all my rules when building the graph, so that only points which could belong to the same linear sequence are connected, and then just partition the graph. If it is too dense, I might be able to run a community detection algorithm to get the clusters out of it.

As an initial experiment, I made a graph of all these points (nodes), where the distance between two nodes is the Euclidean distance between the points. Then I applied the rules that, for two nodes (points) a and b, a link can exist from a to b only if

  1. b(x) > a(x),
  2. b(y) > a(y) and
  3. b(x) – a(x) is not more than 5
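The three rules can be sketched in base R as a logical adjacency matrix. This is just one way of writing it, not necessarily how the graph in the picture was built; the names dx, dy, adj, d and edges are made up for this example.

```r
x <- c(2, 2, 3, 4, 4, 5, 5, 6, 6, 7, 8, 8, 8, 9, 9, 10, 10, 11, 11, 11,
       12, 12, 13, 14, 14, 15, 16, 18, 18, 19)
y <- c(14, 12, 37, 22, 14, 26, 15, 73, 30, 34, 97, 38, 18, 109, 19, 121,
       20, 133, 50, 21, 145, 54, 23, 169, 62, 181, 26, 217, 78, 29)

# dx[i, j] = x[j] - x[i], and likewise for dy
dx <- outer(x, x, function(a, b) b - a)
dy <- outer(y, y, function(a, b) b - a)

# adj[i, j] is TRUE when a link from point i to point j is allowed:
# rule 1 (later in x), rule 2 (higher in y), rule 3 (at most 5 apart in x)
adj <- dx > 0 & dy > 0 & dx <= 5

# Euclidean distance between every pair of points
d <- as.matrix(dist(cbind(x, y)))

# list the allowed links as (from, to, distance) rows
idx <- which(adj, arr.ind = TRUE)
edges <- data.frame(from = idx[, 1], to = idx[, 2], distance = d[idx])
head(edges)
```

For example, the first point (2, 14) can link to the third point (3, 37), but not the other way round, and not to the point (18, 217), which is more than 5 steps away in x.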

The resulting graph looks like this,

2018-01-15-224153

This seems like good progress, since I have 2 connected components (ignoring the lone node), one of which is a clear linear sequence. Then when I ran a random walk on the graph, I got three clusters,

2018-01-15-224142
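A rough sketch of this step with the igraph package is below. The choice of cluster_walktrap() as the random walk method is an assumption for this example, as is treating the links as undirected and unweighted, so the number of clusters it finds may not match the picture.

```r
library(igraph)

x <- c(2, 2, 3, 4, 4, 5, 5, 6, 6, 7, 8, 8, 8, 9, 9, 10, 10, 11, 11, 11,
       12, 12, 13, 14, 14, 15, 16, 18, 18, 19)
y <- c(14, 12, 37, 22, 14, 26, 15, 73, 30, 34, 97, 38, 18, 109, 19, 121,
       20, 133, 50, 21, 145, 54, 23, 169, 62, 181, 26, 217, 78, 29)

# allowed links under the three rules (adj[i, j]: link from i to j)
dx <- outer(x, x, function(a, b) b - a)
dy <- outer(y, y, function(a, b) b - a)
adj <- dx > 0 & dy > 0 & dx <= 5

# ignore direction: make the matrix symmetric and numeric
sym <- (adj | t(adj)) * 1
g <- graph_from_adjacency_matrix(sym, mode = "undirected")

# connected components of the graph
comp <- components(g)
comp$no

# random walk based community detection (walktrap)
wt <- cluster_walktrap(g)
membership(wt)
```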

We seem to be able to cluster linear sequences out of the data, except where these linear sequences are really close together. This looks very promising for the stuff I am working on! I will see how this works with real data and post an update.

PS. I would really like to know if there is already a method which can extract multiple linear sequences out of data similar to what I am trying here. Please mention in the comments if you think anything is relevant.


Installing Arch Linux (and R) on an Android device

This is a really recent development and I am very excited about it. I finally found a way to have a somewhat proper Linux installation on my phone. Though it might not be the best place to have a CLI, it is really promising and I can rely on it to do some small stuff on the go. As the tools I use are getting simpler (Photoshop vs ImageMagick) and the hardware of the phones I own is getting better, it should be possible for my phone to do the things my 5 year old laptop could handle, provided it has the right environment.

This is done by installing a full Arch system on an Android phone under the Termux environment using the installer from TermuxArch. The installation here is actually way easier than installing Arch on a normal desktop. We start by installing the Termux Android app. When we open Termux we get a bash shell. From here we install wget by running pkg install wget. When this is complete we download and run the Arch installation script by,

# Download the script
wget https://raw.githubusercontent.com/sdrausty/TermuxArch/master/setupTermuxArch.sh 
# Adding execute permissions
chmod a+x setupTermuxArch.sh
# Run the script
./setupTermuxArch.sh

Now we can just follow the instructions in the script, which will download and unpack a base Arch Linux system and ask you to edit the mirror list. At this point, just un-comment (remove the # from) the closest mirrors, then save and exit the file. When the installation is complete you have a vanilla Arch system on your mobile! Now I can theoretically install and use any program I have on my work desktop on my phone, including ssh, vim, git, latex, R, node, postgres, mongodb, etc. I can even ssh into my work desktop straight from here. Below are some screenshots of the system (the chart is done entirely on the phone!).

 

Mapping distribution of National Institutions in Higher Education in India [R + tidyverse + tmap]

Since I started learning R and moved away from proprietary data analysis and GIS packages, I have been amazed by the availability of free data-sets and tools enabling people to make awesome analyses and visualisations. Today we’ll look into a simple exercise of taking open data from different sources and combining them using open-source and free tools to produce maps which help us understand the data better.

We’ll use data from two sources: a tabular file on the distribution of national institutes of higher education in India from here, and a shape file containing geographic information on the boundaries of Indian states from here. I have cleaned and prepared both data-sets, which can be downloaded here. First we need to download the archive and extract/unzip it to our working directory. Once that is done we can start combining and plotting the data.

# Loading the data
data <- read.csv("institutions.csv")
library(rgdal) # library for reading shape files
states <- readOGR(".", "india_state")

Note that, in readOGR, the first parameter is the folder at which the shape file is kept (it should be “.” if we directly unzipped the files to working directory) and second one is the name of the shape file.

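# loading tidyverse for the pipe (%>%) and mutate
library(tidyverse)

# calculating the total number of institutions and
# the number of institutions per 10 million people
# (this assumes Population is recorded in units of 10 million)
data <- data %>%
    mutate(Total = rowSums(.[, 3:10])) %>%
    mutate(Totalppm = Total / Population)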

# merging the data into the shapefile
states <- states %>%
merge(data,by="state")
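A merge like this depends on the state names being spelled identically in both sources, so a quick check of the keys can save some confusion. Below is a small sketch with made-up state names and counts (none of this is from the real files); with the real objects the same idea would compare states$state against data$state.

```r
# made-up stand-ins for the state names in the shape file
shape_states <- c("Alpha", "Beta", "Gamma")

# made-up stand-in for the institutions table
inst <- data.frame(state = c("Alpha", "beta", "Gamma"),
                   Total = c(3, 1, 2),
                   stringsAsFactors = FALSE)

# names in the shape file with no match in the table
missing_in_table <- setdiff(shape_states, inst$state)
# names in the table with no match in the shape file
missing_in_shape <- setdiff(inst$state, shape_states)

missing_in_table
missing_in_shape
```

Here the lower-case "beta" fails to match "Beta", so that state would not pick up its values in a merge. If both results are empty, every state has a match.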

Now we have successfully merged the data into the shape file; we can see this by inspecting states@data at the console. Now we need to map the data using tmap. First of all we load the library by running library(tmap). This is a complex package with a lot of dependencies, so it might take a while to download and install. As a first step we plot just the borders of the states by running,

tm_shape(states) +
    tm_borders(col = "grey")


We can add labels to the map by adding a tm_text layer to it. Notice that the order of the layer is important since things overlap on each other.

tm_shape(states) +
    tm_borders(col = "grey") +
    tm_text("state", root = 10, size = "AREA")


Now we can plot a single variable on the map as the fill colour for the states. For example, if we want to highlight all the states with an IIT, we do,

tm_shape(states) +
    tm_fill("IIT") +
    tm_borders(col = "grey") +
    tm_text("state", root = 10, size = "AREA")


We can plot multiple variables on the same plot side by side by just passing a vector of variables to compare. Notice that we switched off the legend on one of the layers here by setting legend.size.show = FALSE.

tm_shape(states) +
    tm_fill(c("IIT", "IIM")) +
    tm_borders(col = "grey") +
    tm_text("state", root = 10, size = "AREA",
            legend.size.show = FALSE)


Finally we can plot the number of institutions per population by,

tm_shape(states) +
    tm_fill("Totalppm",
            title = "Institutions per 10m people") +
    tm_borders(col = 'grey') +
    tm_text("state", root = 10, size = "AREA",
            legend.size.show = FALSE)


This map is very uninformative because the small Himalayan states with low populations skew the whole distribution. So we have to classify this data ourselves by giving a custom breaks parameter,

tm_shape(states) +
    tm_fill("Totalppm", 
            breaks = (1:10) / 2,
            title = "Institutions per 10m people") +
    tm_borders(col = 'grey') +
    tm_text("state", root = 10, size = "AREA",
            legend.size.show = FALSE)


Now this shows a possible north-south divide in the distribution of institutions per person. This may be because most of the national institutions in the North are located in Delhi, while in the South Bangalore, Mumbai and Chennai compete for them.
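Going back to the breaks for a moment, here is a small sketch of why custom breaks help when one value is far away from the rest. It uses base R's cut() rather than tmap, and the values are made up for illustration, so take it as an analogy for the classification step and not as what tmap does internally.

```r
# made-up values: five ordinary ones and one extreme outlier
vals <- c(0.6, 1.2, 1.4, 2.1, 4.8)
vals_out <- c(vals, 40)

# five equal-width classes over the whole range:
# the outlier stretches the range, so all ordinary values share one class
equal_classes <- as.integer(cut(vals_out, breaks = 5))
equal_classes

# custom breaks every 0.5 between 0.5 and 5, as used for the map
breaks <- (1:10) / 2
custom_classes <- as.integer(cut(vals, breaks = breaks))
custom_classes
```

With equal-width classes the five ordinary values all land in the first class, while the custom breaks spread them over four different classes.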

That completes the post for today. To summarise, we took tabular data, joined it with geographic data and plotted it, uncovering new information which is not present in either of them individually!

Visualising flows as Sankey diagrams with R

This one is on making quick and easy Sankey diagrams with R (and the networkD3 package) for exploring data. All we need to do is understand how to convert data into a network, and the rest is really easy. We’ll create a random sample data-set which shows the room people were in at three instances – morning, afternoon and evening – and go on to visualise how people flow between rooms over time. We’ll use the tidyverse stuff which I mentioned in this and this post.

First we need to create a random set of data. We do this by generating 100 random names and assigning them randomly to 5 rooms for each of the three instances.

# load required libraries
library(randomNames)
library(tidyverse)

# generate people names
people <- randomNames(100, which.names = 'first')
# generate a set of rooms
rooms <- paste(rep("Room", 5), 1:5)
# populate data-set by combining both
morning <- sample(rooms, 100, replace=TRUE)
afternoon <- sample(rooms, 100, replace=TRUE)
evening <- sample(rooms, 100, replace=TRUE)
data <- data.frame( people, morning, afternoon, evening)

head(data) #gives us
  people   morning afternoon evening
1 Symone    Room 3  Room 3    Room 4
2 Adrian    Room 5  Room 1    Room 2
3 Orlando   Room 3  Room 4    Room 2
4 Cristal   Room 5  Room 4    Room 2
5 Emily     Room 4  Room 1    Room 4
6 Elizabeth Room 4  Room 2    Room 4

Now that we have the data, we will try to calculate how people move between rooms from morning to evening. We’ll create a network of rooms at a time period with number of people moving between them as links.

# first we calculate number of people moving 
# between morning to afternoon for each room
# we label the rooms uniquely for morning and
# afternoon by adding "m_" and "a_"
mor_to_aft <- data %>% 
    mutate(
          from = paste0("m_", morning),
          to = paste0("a_", afternoon)) %>% 
    group_by(from, to) %>% 
    summarise(people = length(people))

# we do the same for afternoon to evening
aft_to_eve <- data %>% 
    mutate(
          from = paste0("a_", afternoon),
          to = paste0("e_", evening)) %>% 
    group_by(from, to) %>% 
    summarise(people = length(people))

# and we combine both to create links data
links <- bind_rows(mor_to_aft, aft_to_eve)
links # gives us
      from       to   people
1 m_Room 1 a_Room 1      6
2 m_Room 1 a_Room 2      2
3 m_Room 1 a_Room 3      1
4 m_Room 1 a_Room 4      6
5 m_Room 1 a_Room 5      2
6 m_Room 2 a_Room 1      3
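As an aside, the group_by() and summarise() pair used for counting can also be written with dplyr's count(). The sketch below uses a tiny made-up table (toy and flows are names invented for this example) so the result is easy to check by eye.

```r
library(dplyr)

# a tiny made-up table: three people, two time periods
toy <- data.frame(
    morning = c("Room 1", "Room 1", "Room 2"),
    afternoon = c("Room 2", "Room 2", "Room 1"),
    stringsAsFactors = FALSE)

# label the rooms by time period and count people per (from, to) pair
flows <- toy %>%
    mutate(
          from = paste0("m_", morning),
          to = paste0("a_", afternoon)) %>%
    count(from, to, name = "people")

flows
```

Two people went from Room 1 to Room 2 and one person from Room 2 to Room 1, so flows has two rows. count() also returns an ungrouped table, so no ungroup() is needed afterwards.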

Now we need to make the nodes. We do that by finding all the unique rooms in the links and indexing them from 0 (this is because of d3 and the way JavaScript indexes from 0).

nodes <- c(links$from, links$to) %>% 
    unique() %>% 
    data.frame(name = .) %>% 
    mutate(id = as.numeric(row(.)) - 1)

Now we have to join these indexes into the links so that the network package understands the relationship between these two objects.

links <- links %>%
    left_join(nodes,by=c("from"="name")) %>%
    left_join(nodes,by=c("to"="name")) %>%
    ungroup() %>%
    select(from=id.x,to=id.y,people)
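The same zero-based indexing can be sketched in base R with match(), which may make it clearer what the two joins are doing. The small links table here is made up, and the source and target column names are my own choice for this example.

```r
# a small made-up links table
links <- data.frame(
    from = c("m_Room 1", "m_Room 2", "a_Room 1"),
    to = c("a_Room 1", "a_Room 1", "e_Room 2"),
    people = c(3, 2, 5),
    stringsAsFactors = FALSE)

# every room label that appears in the links, in order of first appearance
nodes <- data.frame(name = unique(c(links$from, links$to)),
                    stringsAsFactors = FALSE)
# zero-based ids: 0, 1, 2, ...
nodes$id <- seq_len(nrow(nodes)) - 1

# match() gives the 1-based position of each label in nodes$name,
# so subtracting 1 gives the zero-based node id
links$source <- match(links$from, nodes$name) - 1
links$target <- match(links$to, nodes$name) - 1

links
```

The node "a_Room 1" appears both as a target and as a source, and it gets the same id (2) in both columns, which is what ties the two time steps together in the diagram.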

That completes the data preparation. Now we have a network of time_rooms which are linked by people moving between them. This can be plotted by,

library(networkD3)
sankeyNetwork(links, nodes, "from", "to", "people", NodeID = "name")

which produces,

2018-01-11-210848

Here we can clearly see which rooms had the most people at a given time, where those people came from and where they went in the next session. We can use the same technique to produce amazing complex diagrams visualising complex interactions at multiple levels like these ones 1, 2, 3, 4.

Data manipulation basics with tidyverse – Part 2 – Basic functions

In part 1 we saw how to use pipes to pass data between functions so that we can write R code like a sentence. The second impressive thing with tidyverse is the grammar for manipulating data. The way the functions are structured and named in tidyverse gives us a consistent way of writing R code which is clear, concise and readable. Except for very few cases, I almost always find myself using just 5 basic functions with tidyverse,

  • select – select columns
  • filter – select records
  • mutate – modify columns
  • summarise – combine records
  • arrange – arrange records

Consider the sample table below,

| name  |  year | sex | town   |
 ----------------------------
|  A    |  1998 |  M  | London |
|  B    |  1995 |  M  | Berlin |
|  C    |  1994 |  F  | London |
|  D    |  2000 |  F  | Madrid |
|  E    |  1995 |  M  | Berlin |
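
To follow along in R, the sample table can be created as a data frame. This is a minimal sketch; the object name people is my choice here, and the select() call at the end previews the first function.

```r
library(dplyr)

# the sample table as a data frame
people <- data.frame(
    name = c("A", "B", "C", "D", "E"),
    year = c(1998, 1995, 1994, 2000, 1995),
    sex = c("M", "M", "F", "F", "M"),
    town = c("London", "Berlin", "London", "Madrid", "Berlin"),
    stringsAsFactors = FALSE)

# keep three columns and rename town to city on the way
selected <- people %>%
    select(year, sex, city = town)

selected
```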

1) The select function selects columns from the table. For example, select(year, sex, town) will return a table with just the three columns selected. We can even rename columns as we select them (or use rename() as well), select(year, sex, city = town)

|  year | sex | city   |
 ---------------------
|  1998 |  M  | London |
|  1995 |  M  | Berlin |
|  1994 |  F  | London |
|  2000 |  F  | Madrid |
|  1995 |  M  | Berlin |

2) The filter function selects records based on a criterion. For example, filter(year < 2000) will select only records where year is less than 2000. We can even combine multiple criteria with the logical operators & (and) and | (or),

|  year | sex | city   |
 ---------------------
|  1998 |  M  | London |
|  1995 |  M  | Berlin |
|  1994 |  F  | London |
|  1995 |  M  | Berlin |
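
Combining criteria can be sketched as below, using the original sample table (so the column is still called town). The two conditions are just examples I picked to show & and |.

```r
library(dplyr)

people <- data.frame(
    name = c("A", "B", "C", "D", "E"),
    year = c(1998, 1995, 1994, 2000, 1995),
    sex = c("M", "M", "F", "F", "M"),
    town = c("London", "Berlin", "London", "Madrid", "Berlin"),
    stringsAsFactors = FALSE)

# both conditions must hold: born before 2000 AND male
men_before_2000 <- people %>%
    filter(year < 2000 & sex == "M")

# either condition may hold: born before 1995 OR living in Madrid
early_or_madrid <- people %>%
    filter(year < 1995 | town == "Madrid")

men_before_2000$name
early_or_madrid$name
```

The first filter keeps A, B and E, and the second keeps C and D.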

3) The mutate function modifies columns. For example, mutate(age = 2018 - year) will create a new column named age and calculate it from year,

|  year | sex | city   |  age |
 -------------------------------
|  1998 |  M  | London |  20  |
|  1995 |  M  | Berlin |  23  |
|  1994 |  F  | London |  24  |
|  1995 |  M  | Berlin |  23  |

4) Summarise is used in two parts, group_by() followed by summarise(), and combines records grouped on one or more columns using a formula (function). For example, if we need the average age of people in each city by gender, group_by(city, sex) %>% summarise(average.age = mean(age)) gives us,

|  city  | sex |  average.age |
-------------------------------
| Berlin |  M  |       23     |
| London |  F  |       24     |
| London |  M  |       20     |

5) The arrange function arranges records based on the values in the columns specified. For example, arrange(average.age) gives us,

|  city  | sex |  average.age |
-------------------------------
| London |  M  |      20      |
| Berlin |  M  |      23      |
| London |  F  |      24      |

I have found that most data manipulation can be done by combining these 5 functions in tidyverse, and the best part is that the resulting code translates really well to English. All the stuff we did earlier can be written down as a single statement, clearly, without any intermediate objects or repeatedly referring to the data we are working on. For example,

people %>% 
    select(year, sex, city = town) %>%
    filter(year < 2000) %>%
    mutate(age = 2018 - year) %>%
    group_by(city, sex) %>%
    summarise(average.age = mean(age)) %>%
    arrange(average.age)

Translates to,
Take the people table, select the year, sex and city columns, filter for records where year is less than 2000, calculate the age column from year, group the table by city and sex, find the average age for each group and arrange the records by average age.
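
For anyone who wants to run the whole chain, here is a self-contained sketch including the sample table. The only things added are the data frame definition, an ungroup() before arranging (to drop the leftover grouping), and the result name, which is made up.

```r
library(dplyr)

people <- data.frame(
    name = c("A", "B", "C", "D", "E"),
    year = c(1998, 1995, 1994, 2000, 1995),
    sex = c("M", "M", "F", "F", "M"),
    town = c("London", "Berlin", "London", "Madrid", "Berlin"),
    stringsAsFactors = FALSE)

result <- people %>%
    select(year, sex, city = town) %>%
    filter(year < 2000) %>%
    mutate(age = 2018 - year) %>%
    group_by(city, sex) %>%
    summarise(average.age = mean(age)) %>%
    ungroup() %>%
    arrange(average.age)

result
```

The result has three rows: London M (20), Berlin M (23) and London F (24), matching the last table above.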

Mapping building footprints – European cities

Since Mapzen is shutting down, the metro extracts are going to be shut down by the end of this month. So I am going to make all the maps with the scripts I put together earlier. Here are some building footprints of the city centres of European cities – London, Madrid, Helsinki, Berlin and Amsterdam. I am currently running a batch job with larger extents and more cities on one of the high performance servers at the university, and will update with results when they are complete.