Following up on the last post, where I outlined my overall understanding and plan for the intended function, today I finished the first working code of the aosm( ) function and, as promised, I am posting the code along with explanations and instructions. Before going into the function, I would like to stress the importance of getting the system set up for running it. Getting all the software installations right, with the system environment variable updated with the locations of their executable files, was the biggest problem I faced while trying to get the function running on other systems. So please read the system requirements section carefully before trying to run the script.
1. Windows operating system (preferably Windows 7) with administrator rights, since installation of the software programs listed below is required.
2. R 2.15.2 (http://www.r-project.org/) installed and, if possible, RStudio (http://www.rstudio.com/), which provides a better user interface (RStudio needs an R installation to work). Since there is nothing else in the script file, the function can be loaded into the R workspace directly by running the source("c:\\%location%\\aosm.r") command. [update (27 Mar '13): the R session must be run in administrator mode for the system() commands to work in Windows 8.]
3. Osmosis (http://wiki.openstreetmap.org/wiki/Osmosis) installed in the system, with the location of the ‘osmosis.bat’ file (which is inside the bin folder) added to the system path variable. This is really important: the function relies heavily on osmosis and will never work without it. Osmosis is based on Java, so make sure you have Java installed and that the system path variable is updated with the Java location as well. To check that everything is OK, open the command prompt and run the commands “java” and “osmosis” to see if they are recognized.
4. 7zip (http://www.7-zip.org/) installed in the system, with the location of the ‘7z.exe’ file added to the system path variable. This is equally important if you don’t have the OSM data locally in .osm format. After using the installer to install the program, manually add the 7-zip folder to the system path. Again, to check, run “7z” in the command prompt and see if it is recognized.
5. Since a lot of data needs to be downloaded and extracted, it is recommended to have at least 2 GB of free space on the hard drive. The OSM data for London is around 150 MB when zipped and almost 1.5 GB when extracted, so make sure you don’t run out of space. If you have multiple drives, please set your working directory to a drive with at least 2 GB of free space before running the function (the function does not change the working directory at all, so all the downloads and temporary files are kept in the current working directory).
6. Finally, the most important one is an internet connection. The function was envisioned as a way to get data directly from the internet, and the option to read local data was built to reduce the running time in consecutive runs. So the function makes a lot of references to the internet and strictly requires internet connectivity to work. (I know this is absurd and am trying to build a workaround. I realized this was a real problem when an internet outage lasting half a day in the UCL Halls left me paralyzed while developing the function.)
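The PATH checks in items 3 and 4 can also be done from inside R before running the function. The following is a minimal sketch, not part of the aosm( ) script; the helper name check_tools is hypothetical, and it relies only on base R's Sys.which(), which returns an empty string for a program it cannot find.

```r
# Hypothetical helper (not part of aosm): report which of the external
# command-line tools can be found on the system PATH.
check_tools <- function(tools = c("java", "osmosis", "7z")) {
  paths <- Sys.which(tools)            # "" for anything that is not found
  data.frame(tool  = tools,
             found = unname(nzchar(paths)),
             path  = unname(paths),
             stringsAsFactors = FALSE)
}

status <- check_tools()
print(status)
if (!all(status$found)) {
  message("Not on the PATH: ",
          paste(status$tool[!status$found], collapse = ", "))
}
```

If any tool is reported as missing, fix the system path variable first; the check says nothing about whether the installed versions are compatible.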
name.object <- aosm("world", "geo-filter", "tag-filter", "analysis", "type")
“world” – A string, which is the name of the city for which the data is to be prepared. The string has to be in lowercase, and any spaces in the city name have to be replaced with “-” (e.g. “london”, “san-francisco”). The script checks for a local file in the current working directory and, if it cannot find one, downloads it from the OSM extracts. There are three formats in which this data can be supplied locally: a ‘.osm’ file, a ‘.osm.pbf’ file or an ‘osm.bz’ archive.
“geo-filter” – A string, which is the name of the area within the city for which the data has to be extracted. The string has to be in lowercase, and any spaces in the name have to be replaced with “-” (e.g. “islington”, “city-of-london”). It denotes the name of a .poly file, which can be supplied locally or downloaded by the function from the internet. As of now, I have the boundaries of all the boroughs in London hosted on the server (http://balaspa.50webs.com/poly/) and will be adding more as I get time. The other way to supply the boundary is to have a shape file named “boundary.shp” in the current working directory, containing the polygon you want to use, with the name of the polygon in the attribute table under the header “NAME”. By default, any shape file in the current directory named “boundary.shp” will be converted into individual polygon files, with the strings in the “NAME” column as file names. For example, if you keep a London borough boundary shape file (named “boundary.shp”) in the working directory, the function will extract all the boroughs as .poly files.
“tag-filter” – A string denoting the filter definition. The syntax is “switch_name”, where “switch” is either “d” or “t”, denoting whether the name is a definition file or the tag filter itself. OSM has a really straightforward way of tagging its features: every tag has a key and a value. For example, a way can have a key of “building” and a value of “yes”, which marks it as a building, or a key of “highway” and a value of “residential”, which makes it a residential street. So there are two ways of building a tag-based filter. One is to write the key and values directly in the function call (t_highway,residential or t_building,yes). The other, for more complex filters, is to write a definition file, keep it as a text file in the working directory and point to it in the function call (for example d_landuse, where the function will search for a file named “landuse.txt” for the definitions). I have hosted two sample definition files on the server, which the function can download (http://balaspa.50webs.com/def/). If the tag-filter definition contains only one item (the key), then all the features with that key are extracted regardless of their values. You can see all the keys and values used in OSM by the volunteers in the wiki page or taginfo, which will give an idea of how things are organised in OSM.
“analysis” – A string, which defines the type of analysis to be done on the extracted data and the type of result expected from the function. It currently supports the values “default” (for an sp object), “utm” (for an sp object with CRS), “cn” (for the count of features), “ar” (for the sum of the areas of all the features) and “len” (for the sum of the lengths/perimeters of all the features). A detailed explanation can be found in the outputs section.
“type” – A string with one of these three values: “points”, “lines” or “polygons”. This determines the type of sp object which is returned by the function.
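The naming rule for “world” and “geo-filter” and the “switch_name” syntax of “tag-filter” can be illustrated with a short sketch. This is not taken from the aosm( ) source; the helper names normalise_name and parse_tag_filter are hypothetical, and the definition file is only named here, not read.

```r
# Hypothetical helpers, not taken from the aosm() source.

# "San Francisco" -> "san-francisco"
normalise_name <- function(x) {
  x <- gsub("^[[:space:]]+|[[:space:]]+$", "", x)   # trim both ends
  gsub("[[:space:]]+", "-", tolower(x))
}

# Split a tag-filter string such as "t_highway,residential" or "d_landuse"
parse_tag_filter <- function(x) {
  parts <- regmatches(x, regexec("^([dt])_(.+)$", x))[[1]]
  if (length(parts) != 3) {
    stop("tag-filter must look like 't_key,value' or 'd_name': ", x)
  }
  if (parts[2] == "d") {
    # Definition file: only its expected file name is worked out here
    return(list(switch = "d", file = paste0(parts[3], ".txt")))
  }
  fields <- strsplit(parts[3], ",", fixed = TRUE)[[1]]
  # An empty 'values' vector means "every value of this key"
  list(switch = "t", key = fields[1], values = fields[-1])
}

normalise_name("City of London")
str(parse_tag_filter("t_highway,residential"))
str(parse_tag_filter("d_landuse"))
```

Returning an empty values vector for a key-only filter mirrors the rule that a filter with only a key extracts every feature carrying that key.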
The output from the function differs significantly based on the “analysis” string in the input. The possible strings and corresponding analysis are given below.
“default” – returns an sp class object (SpatialPointsDataFrame, SpatialLinesDataFrame, SpatialPolygonsDataFrame) with the added attributes showing the key-value tags and the name tags without any CRS information
“utm” – returns a similar object to above but with the CRS information using Universal Transverse Mercator and WGS84
“cn” – returns the count of features
“area” – returns the sum of the areas of all the features in the resulting data, in square meters
“len” – returns the sum of the lengths/perimeters of all the features in the resulting data, in meters
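To make the projected outputs concrete, here is a minimal base R sketch of the arithmetic behind them. It is not how aosm( ) is implemented; all helper names are hypothetical, the UTM zone rule ignores the special zones around Norway and Svalbard, and the area and length helpers assume coordinates that are already projected in meters.

```r
# Hypothetical helpers in base R, for illustration only.

# UTM zone number for a longitude in degrees
utm_zone <- function(lon) floor((lon + 180) / 6) %% 60 + 1

# PROJ.4 string for that zone on the WGS84 datum
utm_proj4 <- function(lon, lat) {
  paste0("+proj=utm +zone=", utm_zone(lon),
         if (lat < 0) " +south" else "",
         " +datum=WGS84 +units=m +no_defs")
}

# Shoelace formula: area of a simple ring given as a two-column (x, y) matrix
ring_area <- function(xy) {
  n <- nrow(xy)
  nxt <- c(2:n, 1)                      # index of the following vertex
  abs(sum(xy[, 1] * xy[nxt, 2] - xy[nxt, 1] * xy[, 2])) / 2
}

# Length of a path; closed = TRUE adds the closing segment (a perimeter)
path_length <- function(xy, closed = FALSE) {
  if (closed) xy <- rbind(xy, xy[1, ])
  sum(sqrt(rowSums(diff(xy)^2)))
}

square <- cbind(x = c(0, 100, 100, 0), y = c(0, 0, 100, 100))
ring_area(square)                     # 10000 square meters
path_length(square, closed = TRUE)    # 400 meters
utm_proj4(-0.1276, 51.5)              # central London falls in zone 30
```

Areas and lengths only come out in meters because the coordinates are projected; the same formulas applied to raw longitude and latitude would give meaningless numbers, which is why the unprojected “default” object is not suitable for these measures.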
The function aosm( ) takes 5 inputs and applies 16 sub functions to them to generate the results. Since I have attached the source code of the script and this post is getting really long, I would like to keep the explanation brief.
The function first sets up the environment by installing all the required packages. It then checks the working directory for a boundary shape file and, if one is found, converts it to .poly files. It then evaluates the inputs to see where and in what formats the required data exists and creates a data frame describing the situation. This involves checking for locally available data and for data available from internet sources, in all compatible formats. The next step examines this data frame and evaluates whether the function can continue. If it finds any errors or missing information, it reports the error and stops the function before any intensive task is started. Once the validity of the inputs is confirmed, the function gathers the data from the available formats, downloading it if necessary, and converts it to the desired format. Here local data is given preference over data on the internet. Once the data is in place, the function invokes osmosis for the filtering process through a system() call built from the inputs. Once the filtering is complete, osmar is used to import the filtered data file into an sp object, and the extra attribute information is attached to it. The resulting object is then projected using the UTM projection and WGS84 datum. As the final step, based on the inputs, the function applies the appropriate analysis to the sp object and returns the results.
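As an illustration of the filtering step, the following sketch assembles an osmosis command line from the inputs. It is an assumption about what such a call could look like, not the command aosm( ) actually builds; the helper name build_osmosis_cmd, the output file name and the choice to filter ways only are all hypothetical. The osmosis tasks used (--read-xml, --read-pbf, --bounding-polygon, --tag-filter, --used-node, --write-xml) are standard ones.

```r
# Hypothetical helper: build (but do not run) an osmosis command line that
# clips an extract to a .poly boundary and keeps ways matching one key.
build_osmosis_cmd <- function(input, poly, key, values = character(0),
                              output = "filtered.osm") {
  # Pick the reader task from the file extension
  reader <- if (grepl("\\.pbf$", input)) "--read-pbf" else "--read-xml"
  # key=* accepts every value of the key; otherwise list the wanted values
  tag <- if (length(values) > 0) {
    paste0(key, "=", paste(values, collapse = ","))
  } else {
    paste0(key, "=*")
  }
  q <- function(x) shQuote(x, type = "cmd")   # Windows-style quoting
  paste("osmosis",
        reader, paste0("file=", q(input)),
        "--bounding-polygon", paste0("file=", q(poly)),
        "--tag-filter", "accept-ways", tag,
        "--used-node",
        "--write-xml", paste0("file=", q(output)))
}

cmd <- build_osmosis_cmd("london.osm", "islington.poly",
                         "highway", "residential")
cat(cmd, "\n")
# system(cmd) would hand the command to osmosis, provided it is on the PATH
```

Building the command as a string first makes it easy to print and inspect before handing it to system(), which helps when osmosis rejects the arguments.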
If you are really interested in how the script works, the following chart explains how all the sub functions are tied together and how they process the inputs. I would also recommend consulting the instructions file referred to at the start, which explains all the functions as well.
will return the following plot,
will return the value 1364
will return the value 1312264.90810641
will return the value 337414.610756376
So, concluding this extremely long and drab post, I would request the readers to give it a try and share the results and problems in the comments section below. Also feel free to put in your suggestions, point out any mistakes and tell me about any other existing solutions which may serve the same purpose.
It is highly unlikely that anyone who has been working with geographic data in the last 5 years has not come across the OpenStreetMap project. The project, started in 2004 as a crowd-sourced solution to create an open, free geographic database of the world, has now exploded into a movement in itself with more than a million users and volunteers, and in some cases even provides better coverage and quality of data than some commercial data providers. The project is even more exciting for a person with a background of working in developing countries, where the biggest problem faced is the availability of geographic data itself.
OpenStreetMap – Advantages
The biggest strength of OSM is that it is open, which gives every person in the world complete freedom in creating, editing and consuming the database. It is also a non-profit project, funded through donations and run by a community of volunteers, which frees it from the influences and pressures of the market and also gives the project access to potentially unlimited amounts of data which, in my opinion, cannot be matched by any commercial survey. It also remains the only hope for professionals working in geographic information sciences for accessing geographic data for developing countries.
OpenStreetMap – Disadvantages
In spite of the clear advantages mentioned above, the OSM project also has its own share of challenges. The prime one is its quality and coverage. Being a crowd-sourced project, it is impossible for OSM to maintain the quality of the database within strict standards. The second is the problem of standardization stemming from its free tagging policy, which is itself the backbone of its richness. The general (yet largely true) assumption is that the community will monitor and balance itself in the long run to maintain the quality and standard of the database. The third is the size and complexity of the data generated. Being a global and general-purpose project, OSM generates a huge and complex database compared to the regional and specific data-sets collected and distributed by commercial institutions and governments.
As noted in the last post, after putting lots of energy into understanding the database and the project for the last three months, it was time to find ways to utilize the database by extracting meaningful data for geographic analysis and to finalize a proper toolkit for doing it. The toolkit had to be open and free like the OSM data itself, flexible and versatile enough to accommodate a variety of formats, and powerful enough to handle the scale and complexity of the data-set; requirements which ‘R’ seemed to fulfill perfectly.
R Project for Statistical Computing
R is a programming language for statistical analysis and visualization, developed in 1993 by Robert Gentleman and Ross Ihaka. It was introduced to me in the GISS module of the M.Res course with CASA, UCL as a free, command-line alternative to ArcGIS for geographic analysis and visualization. Though it was a bit difficult in the beginning to grasp the concept of a command-line based system and to get past the steep learning curve demanded by the language, the advantages of R were apparent after using it for some time. The first and foremost advantage is that it is open and free (as in both lunch and freedom) compared to the equally powerful, super-costly commercial GIS packages. The second is the flexibility and versatility it offers in terms of supported input and output data formats used in almost every field of study (Biology, Economics, Geography, etc.), made possible by the extensive support from the community of developers who write specialized packages extending the functionality of the language. Considering all these advantages, R is a clear choice as the central tool of the toolkit intended for the extraction, manipulation and analysis of OSM data.
I have to admit that my first attempt to use R to analyze and visualize OSM data was a complete disaster. It was done for a piece of course work, where I was trying to produce a land use map of a city using R and the OSM extract for that city. I was working with the package ‘osmar’, and within a few hours of experiments it was apparent that there were a lot of problems arising from my approach. The first problem was that the data-set was huge at a city level. Though I knew this beforehand, I never expected it to be unmanageable. During my initial runs, R used to take as long as 15 minutes to load the data and sometimes gave “memory not sufficient” errors as well. So I had to restrict my attempt to smaller cities, which had smaller data-sets. The second problem was the coverage: as you can clearly see in the map below, the land use information was not complete and left huge holes in the map I was trying to create. The final one was the tags, which were neither standardized nor consistent. So when I tried printing a land use map, it had 42 categories of land use and made no sense at all. If I wanted to make any sense of the data, I had to manually sort all these categories into a standard classification, which is not feasible for bigger cities like London.
After this exercise I realized the need to find a way to break down and filter the OSM data into more manageable parts in terms of geography and tags, and also to calculate basic statistics on the filtered data. For example, even though ‘osmar’ provides ways to import OSM data in XML format and convert it to ‘sp’ objects, it cannot filter the data geographically beyond a bounding box. Moreover, the OSM API for downloading such data restricts itself to 20,000 features, which is too small a scale for many practical purposes. ‘osmar’ also imports all the features at once, without any option to select the features you want to import based on the tags or type (polygons, lines etc.). The resulting object is also devoid of any projection, which makes it harder to do any geographic analysis on it.
This experience, along with the final course work in the GISS module, gave me the opportunity to develop a function (aosm, Analyze OpenStreetMap) combining the functionality provided by the ‘osmar’ package in R with OSM tools such as osmosis (a Java-based tool to manipulate OSM data). The original plan was to create a function that produces a precise plot of geographic data from OSM data, with options to filter the data geographically (polygons) and by tags (highways, amenities etc.), but I decided to keep the plotting part out of the function to keep it flexible, and to add a small analysis component to make it more useful. So the overall plan now is as shown in the flow chart below. I wanted the function to be aware of the environment (checking whether local data is present and converting it to suitable formats), flexible (in terms of input data) and extendable, with an option to add more functionality later on. The toolkit envisioned here is regional OSM extracts for the base data, osmosis for filtering the data, ‘osmar’ for importing and converting the data, and ‘maptools’/‘rgeos’ for geographic analysis of the data, with the expected output being an ‘sp’ object with the data preserved in it for further analysis.
The function will look like this:
object.name <- aosm("city", "filter-polygon", "tags", "analysis", "type")
where ‘city’ is the name of the city for which the data has to be downloaded, ‘filter-polygon’ is the name of the .poly file which denotes the specific geographic area one is concerned with (e.g. boroughs in London), ‘tags’ is the key, and the values under that key, which need to be extracted, ‘analysis’ is the type of analysis you want to do on the data (e.g. ‘default’ will return an ‘sp’ object, ‘utm’ will return a projected ‘sp’ object, ‘area’ will return the sum of the areas of all the features etc.) and ‘type’ denotes the type of the features (lines, points, polygons).
I have already started working on the function and I am in the final stages of producing a first usable draft. I will update the blog with the code and the results as soon as the code is usable.