Extracting multiple linear sequences out of two-dimensional data

This is the current problem I am working on. I don’t know how to explain it properly, but I’ll try. There is a single linear time series of data in which multiple increasing sequences are present. For example, consider this series of numbers (y) collected at times (x), generated with three linear equations.

x = 2, 2, 3, 4, 4, 5, 5, 6, 6, 7, 8, 8, 8, 9, 9, 10, 10, 11, 11, 11, 12, 12, 13, 14, 14, 15, 16, 18, 18, 19

y = 14, 12, 37, 22, 14, 26, 15, 73, 30, 34, 97, 38, 18, 109, 19, 121, 20, 133, 50, 21, 145, 54, 23, 169, 62, 181, 26, 217, 78, 29

When we plot this data on a chart, we can see that there are three sequences in it.

[Figure: scatter plot of the data showing three linear sequences]

The problem is to isolate these three clusters. Since I have no idea how to do this, I first went for a k-means clustering algorithm with 3 clusters, which gave me this:

[Figure: k-means clustering of the data with 3 clusters]

This is clearly wrong: since the series moves forward on both the x and y axes, the blue cluster cannot possibly occur as a linear sequence. This is when I thought a graph-based clustering algorithm might help. I can encode all my rules while building the graph, so that only points which could plausibly form a linear sequence are connected, and then just partition the graph. If it is too dense, I might be able to run a community detection algorithm to get the clusters out of it.

As an initial experiment, I made a graph from all these points (nodes), where the edge weight is the Euclidean distance between the points. Then I applied the rule that for two nodes a, b (points), a link can exist from a to b only if

  1. b(x) > a(x),
  2. b(y) > a(y) and
  3. b(x) – a(x) is not more than 5
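The post doesn't include code for this step, so here is a minimal Python sketch of the rule-based graph construction, assuming the networkx library and taking the three rules exactly as stated (the original implementation may well differ):

```python
import math
import networkx as nx

# data from the post
x = [2, 2, 3, 4, 4, 5, 5, 6, 6, 7, 8, 8, 8, 9, 9,
     10, 10, 11, 11, 11, 12, 12, 13, 14, 14, 15, 16, 18, 18, 19]
y = [14, 12, 37, 22, 14, 26, 15, 73, 30, 34, 97, 38, 18, 109, 19,
     121, 20, 133, 50, 21, 145, 54, 23, 169, 62, 181, 26, 217, 78, 29]

G = nx.DiGraph()
G.add_nodes_from(range(len(x)))
for a in range(len(x)):
    for b in range(len(x)):
        # rules 1-3: forward in x, forward in y, x-gap of at most 5
        if x[b] > x[a] and y[b] > y[a] and x[b] - x[a] <= 5:
            G.add_edge(a, b, weight=math.dist((x[a], y[a]), (x[b], y[b])))

# connected components of the undirected view give the candidate sequences
components = list(nx.weakly_connected_components(G))
```

Since every edge moves strictly forward in x, the resulting graph is a DAG, which is what makes the "forward moving" argument above work.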

The resulting graph looks like this:

[Figure: graph built from the linking rules]

This seems like good progress, since I have 2 connected components (ignoring the lone node), one of which is a clear linear sequence. Then, when I ran a random-walk community detection on the graph, I got three clusters:

[Figure: three clusters found by random-walk community detection]

So we seem to be able to cluster linear sequences out of the data, except when the sequences are really close to each other. This looks very promising for the stuff I am working on! I will see how this works with real data and post an update.
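The random-walk method isn't available in networkx, so as a rough stand-in here is a sketch using greedy modularity maximisation over the same rule-built graph (the algorithm substitution is mine, not what the post used, so the resulting partition may differ):

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# data from the post
x = [2, 2, 3, 4, 4, 5, 5, 6, 6, 7, 8, 8, 8, 9, 9,
     10, 10, 11, 11, 11, 12, 12, 13, 14, 14, 15, 16, 18, 18, 19]
y = [14, 12, 37, 22, 14, 26, 15, 73, 30, 34, 97, 38, 18, 109, 19,
     121, 20, 133, 50, 21, 145, 54, 23, 169, 62, 181, 26, 217, 78, 29]

# undirected view of the rule-based graph (rules 1-3 above)
G = nx.Graph()
for a in range(len(x)):
    for b in range(len(x)):
        if x[b] > x[a] and y[b] > y[a] and x[b] - x[a] <= 5:
            G.add_edge(a, b)

# each community is a candidate linear sequence
communities = greedy_modularity_communities(G)
```

Any community detection method that partitions the graph could be swapped in here; the heavy lifting is done by the linking rules, not the partitioner.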

PS: I would really like to know if there is already a method which can extract multiple linear sequences out of data, similar to what I am trying here. Please mention it in the comments if you think anything is relevant.


Visualising flows as Sankey diagrams with R

This one is on making quick and easy Sankey diagrams with R (and the networkD3 package) for exploring data. All we need to do is understand how to convert the data into a network; the rest is really easy. We’ll create a random sample data set which shows the room each person was in at three times of day – morning, afternoon and evening – and go on to visualise how people flow between rooms over time. We’ll use the tidyverse stuff which I mentioned in this and this post.

First we need to create a random set of data. We do this by generating 100 random names and assigning each of them to one of 5 rooms at random for each of the three times of day.

# load required libraries
library(randomNames)
library(tidyverse)

# generate people names
people <- randomNames(100, which.names = 'first')
# generate a set of rooms
rooms <- paste(rep("Room", 5), 1:5)
# populate data-set by combining both
morning <- sample(rooms, 100, replace=TRUE)
afternoon <- sample(rooms, 100, replace=TRUE)
evening <- sample(rooms, 100, replace=TRUE)
data <- data.frame( people, morning, afternoon, evening)

head(data) #gives us
  people   morning afternoon evening
1 Symone    Room 3  Room 3    Room 4
2 Adrian    Room 5  Room 1    Room 2
3 Orlando   Room 3  Room 4    Room 2
4 Cristal   Room 5  Room 4    Room 2
5 Emily     Room 4  Room 1    Room 4
6 Elizabeth Room 4  Room 2    Room 4

Now that we have the data, we will calculate how people move between rooms from morning to evening. We’ll create a network where the nodes are rooms at a given time of day and the links carry the number of people moving between them.

# first we calculate number of people moving 
# between morning to afternoon for each room
# we label the rooms uniquely for morning and
# afternoon by adding "m_" and "a_"
mor_to_aft <- data %>% 
    mutate(
          from = paste0("m_", morning),
          to = paste0("a_", afternoon)) %>% 
    group_by(from, to) %>% 
    summarise(people = length(people))

# we do the same for afternoon to evening
aft_to_eve <- data %>% 
    mutate(
          from = paste0("a_", afternoon),
          to = paste0("e_", evening)) %>% 
    group_by(from, to) %>% 
    summarise(people = length(people))

# and we combine both to create links data
links <- bind_rows(mor_to_aft, aft_to_eve)
links # gives us
      from       to   people
1 m_Room 1 a_Room 1      6
2 m_Room 1 a_Room 2      2
3 m_Room 1 a_Room 3      1
4 m_Room 1 a_Room 4      6
5 m_Room 1 a_Room 5      2
6 m_Room 2 a_Room 1      3

Now we need to make the nodes. We do that by finding all unique instances of rooms in the links and indexing them from 0 (this is because of d3 and the way JavaScript indexes arrays).

nodes <- c(links$from, links$to) %>% 
    unique() %>% 
    data.frame(name = ., stringsAsFactors = FALSE) %>% 
    mutate(id = row_number() - 1)

Now we have to join these indices into the links so that networkD3 understands the relationship between the two objects.

links <- links %>%
    left_join(nodes,by=c("from"="name")) %>%
    left_join(nodes,by=c("to"="name")) %>%
    ungroup() %>%
    select(from=id.x,to=id.y,people)

That completes the data preparation. Now we have a network of time–room nodes linked by the people moving between them. This can be plotted with:

library(networkD3)
sankeyNetwork(links, nodes, "from", "to", "people", NodeID = "name")

which produces,

[Figure: Sankey diagram of people flowing between rooms over the day]

Here we can clearly see which rooms had the most people at a given time, where those people came from, and where they went in the next session. We can use the same technique to produce amazingly complex diagrams visualising interactions at multiple levels, like these: 1, 2, 3, 4.

Complete graph creator using Raphael.js

update 27 Dec 2017: fixed mistake in code and fixed link.

Complete graph creator

This one is very similar to A Simple Gravity Model, except that this one is made with JavaScript (Raphael.js) and does not have the gravity model for the width of the links. I made it as a demonstration of how easy it is to make interactive graphics with JavaScript. In less than 40 lines of Raphael and JavaScript we can create this complete graph generator, where you click the canvas to create nodes and click and drag the nodes to move them. The links are generated and updated based on the positions of the nodes. I am planning to create a full suite of tools for making and analysing networks online, for which this is the first step. [ link]

Code:

<!DOCTYPE html>
<html>
<head>
	<title>jsgraph</title>
	<script type="text/javascript" src="https://cdnjs.cloudflare.com/ajax/libs/raphael/2.2.7/raphael.min.js"></script>
</head>
<body >
	<script>
		var paper = Raphael(0,0,'100%','100%');
		var background = paper.rect(0,0,'100%','100%').attr({'fill':'#ddd','stroke-width':'0'});
		var circles = [];
		var lines = [];
		background.click(function(event){
			circles.push(new circle(event.clientX,event.clientY));
			refreshLines(circles);
		});
		function refreshLines(arr) {
			// remove the old lines and reset the array, so we never
			// call remove() twice on the same (already removed) element
			for (var i in lines) {
				lines[i].remove();
			}
			lines = [];
			// draw a line between every pair of distinct nodes
			for (var i in arr) {
				for (var j in arr) {
					if (i === j) continue;
					lines.push(new line(arr[i].attrs.cx,arr[i].attrs.cy,arr[j].attrs.cx,arr[j].attrs.cy));
				}
			}
		}
		function circle (mx,my) {
			return paper.circle(mx,my,10).attr({fill:'#444',stroke:'#444'}).drag(
				function(da,db,x,y){
					this.attr({cx:x,cy:y});
				},
				function(){
					var color = this.attrs.fill=="#444" ? '#f00' : '#444';
					this.attr({fill:color});
				},
				function(){
					var color = this.attrs.fill=="#444" ? '#f00' : '#444';
					this.attr({fill:color});
					refreshLines(circles);
				}
			);
		}
		function line (sx,sy,ex,ey) {
			return  paper.path( "M"+sx+","+sy+" "+"L"+ex+","+ey ).attr({stroke:'#444'});
		}
	</script>
</body>
</html>

New Project – RefNet

After trying to organise the reading for my research last week, I realised that the research process in my mind is organised not as a list or a checklist, but as a network of interconnected ideas from various sources. This is where I felt the reference managers I was using were failing miserably. Though they did a good job of organising the metadata on the papers, books and articles I was reading, and of including them as references in my write-up, they did not help me with the research process itself. My research still remained an exercise where I go through search engines and the reference lists of other papers manually, trying to put everything together in my mind by myself. This is when I decided that if I cannot find the tool I want, I would rather build one myself, and that all the things I learned about networks and web development in the past year had to be put to use somewhere.

So here it is: RefNet – a reference manager which organises the references/bibliography as a network of objects rather than a list. The idea is to build a tool where you can drag and drop papers and books as objects, which are then organised into a network of interconnected ideas based on the citations in them. I started a GitHub repository and, using the VivaGraph library (inspired from here), put together a very preliminary working concept, adding some data on the things I have been reading the past week. The result is below (click the image for the interactive version).

[Figure: RefNet prototype – references organised as a network]

The plan going forward is to make the tool more dynamic with drag-and-drop, automatic citation importing from a database such as Web of Knowledge, possibly a suggestion tool which recommends which papers to read next based on network properties, and finally a plugin to integrate this with Google Docs/MS Word. As mentioned earlier, the project and code are up on GitHub (here), and I would be really happy to collaborate with interested people on building this.

Dissertation – Part 1

It’s been a month since I submitted my dissertation, and as promised in the previous posts (http://goo.gl/XqYvuP and http://goo.gl/c7Lvu6) it’s now time to start blogging about the dissertation and its outputs. So, starting from today, I will try to do one post every day for at least 2 weeks, each talking about one aspect of my dissertation, leading up to the final results and conclusions.

To give a brief introduction: in my dissertation (titled “Network theory approach on modelling occurrence of Road Accidents – Case study UK”) I set out to create a dual graph of UK roads, attach the road accident data to the graph, find patterns in the distribution of accidents in the dual graph, and find correlations between this distribution and the properties of the network.
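As a rough illustration of the dual-graph idea, roads become nodes and shared intersections become edges. The road names and junctions below are hypothetical, and networkx is my choice for the sketch, not necessarily what the dissertation used:

```python
import networkx as nx
from itertools import combinations

# hypothetical input: which roads meet at which junction
junctions = {
    "J1": ["M25", "M1"],
    "J2": ["M25", "M4"],
    "J3": ["M1", "A5"],
}

# dual graph: each road is a node; two roads are linked
# whenever they share a junction
dual = nx.Graph()
for roads in junctions.values():
    dual.add_edges_from(combinations(roads, 2))
```

Note the inversion relative to the usual (primal) road graph, where junctions are the nodes and road segments are the edges.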

Today, as a start, I want to flash-forward through the whole process and share one of the key outputs produced as part of the dissertation: a static visualisation of the dual graph of the entire road network of the UK (England, Scotland and Wales), showing the betweenness centrality of each and every node.

[Figure: Betweenness centralities – road network, UK]

Explanation: The above visualisation shows all the roads in the UK as points (simplified by averaging the coordinates of their constituent segments) and the intersections between roads as lines connecting the corresponding points. The size of a point shows the betweenness centrality of the corresponding road in this information space. To put it simply, the visualisation highlights the most central roads in the UK when the network is resolved as a dual graph. We can see that the M25 is the most central road in the whole network, with the highest probability of lying on a shortest path between any two randomly chosen roads. The visualisation was produced using Processing.
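To make the metric concrete, here is a toy betweenness computation on a tiny hypothetical dual graph (invented road names; the dissertation’s network of all UK roads is of course vastly larger, and its tooling may differ):

```python
import networkx as nx

# toy dual graph: nodes are roads, edges are intersections between them
G = nx.Graph()
G.add_edges_from([
    ("M25", "M1"), ("M25", "M4"), ("M25", "A10"),
    ("M1", "A5"), ("M4", "A4"), ("A10", "A406"),
])

# betweenness centrality: the share of shortest road-to-road paths
# that pass through each road
bc = nx.betweenness_centrality(G)
most_central = max(bc, key=bc.get)  # the hub road of this toy network
```

In this toy network the node playing the M25’s role sits between all three branches, so every cross-branch shortest path passes through it, giving it the highest betweenness.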

Though it is a simple plot of a series of data on a 2D plane, a lot of work had to be done in the background to create this abstract representation of the UK road network, which I will blog about in detail starting tomorrow. I hope this series of posts is interesting to readers and helps me identify all the holes, drawbacks and errors in my research process.