Wrangle-OpenStreetMap-Data


In this project, I use data munging techniques to assess the quality of OpenStreetMap (OSM) data for the center of Singapore with regard to consistency and uniformity. The wrangling takes place programmatically, using Python for most of the process and SQL for items that need further attention once the data is in PostgreSQL.

The dataset describes the center of Singapore, covering an area from Clementi on the west to Bedok on the east, and from Serangoon on the north to Sentosa Island on the south. The dataset is 96 MB in size and can be downloaded from here

About the project

Scope

OpenStreetMap (OSM) is a collaborative project to create a free editable map of the world. The creation and growth of OSM have been motivated by restrictions on use or availability of map information across much of the world, and the advent of inexpensive portable satellite navigation devices.

In this project, I use data from https://www.openstreetmap.org and data munging techniques to assess the quality of the data in terms of validity, accuracy, completeness, consistency and uniformity.
The biggest part of the wrangling takes place programmatically using Python; the dataset is then loaded into a PostgreSQL database for further examination of any remaining elements that need attention. Finally, I perform some basic exploration and suggest ideas for additional improvements.

Skills demonstrated

  • Assessment of the quality of data for validity, accuracy, completeness, consistency and uniformity.
  • Parsing and gathering data from popular file formats such as .xml and .csv.
  • Processing data from very large files that cannot be cleaned with spreadsheet programs.
  • Storing, querying, and aggregating data using SQL.

The Dataset

OpenStreetMap's data are structured in well-formed XML documents (.osm files) that consist of the following elements:

  • Nodes: "Nodes" are individual dots used to mark specific locations (such as a postal box). Two or more nodes are used to draw line segments or "ways".
  • Ways: A "way" is a line of nodes, displayed as connected line segments. "Ways" are used to create roads, paths, rivers, etc.
  • Relations: When "ways" or areas are linked in some way but do not represent the same physical thing, a "relation" is used to describe the larger entity they are part of. "Relations" are used to create map features such as cycling routes, turn restrictions, and areas that are not contiguous. The multiple segments of a long way, such as an interstate or a state highway, are grouped into a "relation" for that highway. Another example is a national park with several locations that are separated from each other; those are also grouped into a "relation".

All these elements can carry tags describing the name, type of road, and other attributes.

For this project, I am using a custom .osm file for the center of Singapore, which I exported using the Overpass API. The dataset is 96 MB in size and can be downloaded from this link


Data Preparation

Imports and Definitions

%matplotlib inline

import xml.etree.cElementTree as ET
from collections import defaultdict
import re
import pprint
from operator import itemgetter
from difflib import get_close_matches

#For export to csv and data validation
import csv
import codecs
import cerberus

#For reverse geocoding
from geopy.geocoders import GoogleV3
geolocator = GoogleV3()
from geopy.exc import GeocoderTimedOut
#OSM downloaded from openstreetmap
SG_OSM = '../Helper/Singapore.osm'
#The following .csv files will be used for data extraction from the XML.
NODES_PATH = "../Helper/nodes.csv"
NODE_TAGS_PATH = "../Helper/nodes_tags.csv"
WAYS_PATH = "../Helper/ways.csv"
WAY_NODES_PATH = "../Helper/ways_nodes.csv"
WAY_TAGS_PATH = "../Helper/ways_tags.csv"
#Regular expressions
PROBLEMCHARS = re.compile(r'[=\+/&<>;\'\"\?%#$@\,\.\t\r\n]')
#A list to save elements that need further attention
PROBLEMATICS = []

Parsing the Data

The size of the dataset allows me to parse it into memory to speed up processing.
For a significantly bigger XML file, I would have to use the iterparse() function instead.

tree = ET.parse(SG_OSM)
root = tree.getroot()
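If the file outgrew memory, the iterparse() approach could look roughly like the sketch below; the tiny in-memory document and the counting logic are illustrative assumptions, while clearing each element after its 'end' event is what keeps memory bounded:

```python
import io
import xml.etree.ElementTree as ET

# A tiny stand-in for a large .osm file (illustrative data, not from the dataset).
OSM_SAMPLE = io.BytesIO(b"""<osm>
  <node id="1" lat="1.30" lon="103.85"/>
  <node id="2" lat="1.31" lon="103.86"/>
  <way id="3"><nd ref="1"/><nd ref="2"/></way>
</osm>""")

counts = {}
# 'end' events fire once an element is complete; clear() releases it,
# so memory use stays flat regardless of the file size.
for event, elem in ET.iterparse(OSM_SAMPLE, events=('end',)):
    if elem.tag in ('node', 'way'):
        counts[elem.tag] = counts.get(elem.tag, 0) + 1
        elem.clear()

print(counts)  # {'node': 2, 'way': 1}
```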

Data Assessment

An initial exploration of the dataset revealed the following problems:

  • Abbreviations of street types like ‘Av’ instead of ‘Avenue’ and ‘Rd’ instead of ‘Road’.
  • All lowercase letters like ‘street’ instead of ‘Street’.
  • Postcodes including the first letter (S xxxxxx) or the whole name (Singapore xxxxxx) of the country.
  • Postcodes omitting the leading ‘0’ (probably because they were declared as integers at some point before their import to OpenStreetMap).
  • Multi-abbreviated amenity names.

The problems with the amenity names were few, so they were corrected directly in the database. Most of the remaining problems were resolved programmatically using Python, while a small portion needed further assessment and was resolved in the database.

Auditing Street Types

To audit the street names, I first need to extract them from the XML.
The street names appear in two forms in the dataset:
In node and way elements, in the form: <tag k="addr:street" v="street_name"/>

<node id="337171253" lat="1.3028023" lon="103.8599300" version="3" timestamp="2015-08-01T01:38:25Z" changeset="33022579" uid="741163" user="JaLooNz">
    <tag k="addr:city" v="Singapore"/>
    <tag k="addr:country" v="SG"/>
    <tag k="addr:housenumber" v="85"/>
    <tag k="addr:postcode" v="198501"/>
    <tag k="addr:street" v="Sultan Gate"/>
    <tag k="fax" v="+65 6299 4316"/>
    <tag k="name" v="Malay Heritage Centre"/>
    <tag k="phone" v="+65 6391 0450"/>
    <tag k="tourism" v="museum"/>
    <tag k="website" v="http://malayheritage.org.sg/"/>
</node>

In way elements that have a <tag k="highway" .../> whose 'v' attribute is one of ['living_street', 'motorway', 'primary', 'residential', 'secondary', 'tertiary'], as <tag k="name" v="street_name"/>.

<way id="4386520" version="23" timestamp="2016-11-07T12:03:39Z" changeset="43462870" uid="2818856" user="CitymapperHQ">
    <nd ref="26778964"/>
    <nd ref="247749632"/>
    <nd ref="1275309736"/>
    <nd ref="1275309696"/>
    <nd ref="462263980"/>
    <nd ref="473019059"/>
    <nd ref="4486796339"/>
    <nd ref="1278204303"/>
    <nd ref="3689717007"/>
    <nd ref="246494174"/>
    <tag k="highway" v="primary"/>
    <tag k="name" v="Orchard Road"/>
    <tag k="oneway" v="yes"/>
</way>
def chk_for_street(element):
    '''Extracts addresses from elements.
    
    Args:
        element (element): An element of the XML tree
        
    Returns:
        Element: If the element has an address, returns the tag element carrying it; otherwise returns nothing.
        
    '''
    highway_types = [
        'living_street', 'motorway', 'primary', 'residential', 'secondary',
        'tertiary'
    ]
    tag = element.find("./tag[@k='addr:street']")
    if tag is None:
        if element.tag == 'way':
            tag = element.find("./tag[@k='highway']")
            try:
                if tag.get('v') in highway_types:
                    return element.find("./tag[@k='name']")
            except AttributeError:
                return
    else:
        return tag
    return
def get_street_names(tree):
    '''Creates a dictionary of street names for all elements in a given tree.
    
    Args:
        tree (ElementTree): An ElementTree object for which I want to find the street names
        
    Returns:
        dict: A dictionary with the following structure: {element_id:street_name}
        
    '''
    result = {}
    for path in ["./node", "./way"]:
        for element in tree.findall(path):
            try:
                result[element.get('id')] = chk_for_street(element).get('v')
            except (AttributeError):  #chk_for_street() returns nothing
                continue
    return result
street_names = get_street_names(root)
#Sample of the dictionary
pprint.pprint(dict(street_names.items()[:10]))
{'173767759': 'Lornie Road',
 '178436854': 'Upper Weld Road',
 '201275515': 'Jalan Novena',
 '241208738': 'Race Course Road',
 '241208739': 'Race Course Road',
 '260092923': 'Jalan Novena Utara',
 '334112229': 'Joo Chiat Road',
 '388268000': 'Waringin Park',
 '46818743': 'Bukit Purmei Avenue',
 '9590561': 'Merchant Road'}

I am searching for multiple versions of the same street type. The different versions include different abbreviations, like Street/St, or different letter cases, like Avenue/avenue.

Although most of the Singaporean street names end with the street type (e.g., “Serangoon Road” or “Arab Street”) it is also common to end with a number instead (e.g. “Bedok North Avenue 1”).
Thus, by using regular expressions, I extracted the last word that does not contain numbers from the street name.

st_types_re = re.compile(r'[a-zA-Z]+[^0-9]\b\.?')
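A quick sanity check of this pattern (the sample names are illustrative; the outputs follow from the regular expression itself):

```python
import re

st_types_re = re.compile(r'[a-zA-Z]+[^0-9]\b\.?')

# The last match is the street type, even when the name ends with a number.
print([m.strip() for m in st_types_re.findall('Bedok North Avenue 1')][-1])  # Avenue
print([m.strip() for m in st_types_re.findall('Arab Street')][-1])           # Street

# Names without any Latin-letter word produce no match at all; such elements
# raise an IndexError downstream and land in the PROBLEMATICS list.
print(st_types_re.findall('2'))  # []
```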

The result will be a dictionary with the format: {street_type:(set_of_street_names)}
I am also adding unexpected street names to the "PROBLEMATICS" list for further assessment.

def audit_st_types(streets):
    '''Extracts the "street type" part from an address
    
    Args:
        streets (dict): A dictionary containing street names in the form of {element_id:street_name}
        
    Returns:
        dict: A dictionary of street types in the form of 
        {street_type:(street_name_1,street_name_2,...,street_name_n)}
    
    '''
    result = defaultdict(set)
    for key, value in streets.iteritems():
        try:
            street_type = st_types_re.findall(value)[-1].strip()
        except (IndexError):  #One word or empty street names
            PROBLEMATICS.append((key, 'street name', value))
            continue
        result[street_type].add(value)

    return result
streets = audit_st_types(street_names)
#Sample of the dictionary
pprint.pprint(dict(streets.items()[:7]))
{'Fyover': set(['Ophir Fyover']),
 'Melor': set(['Jalan Melor']),
 'Pelatina': set(['Jln Pelatina']),
 'Riang': set(['Jalan Riang']),
 'Rise': set(['Ascot Rise',
              'Binchang Rise',
              'Binjai Rise',
              'Cairnhill Rise',
              'Canning Rise',
              'Clover Rise',
              'Dover Rise',
              'Goldhill Rise',
              'Greenleaf Rise',
              'Holland Rise',
              'Matlock Rise',
              'Mount Sinai Rise',
              'Novena Rise',
              'Oxley Rise',
              'Siglap Rise',
              'Slim Barracks Rise',
              'Telok Blangah Rise',
              'Toa Payoh Rise',
              'Watten Rise']),
 'Satu': set(['Jalan Satu', 'Lengkong Satu']),
 'Taman': set(['Jalan Taman'])}

It would be easy to populate the list of Common Street Types with some obvious values like Street or Avenue, but guessing does not take into account any local peculiarities. Instead, I searched the dataset for all the different types and used the 12 with the most occurrences (from the 13th position on, abbreviations start to appear).

def sort_street_types(street_types):
    '''Counts the number of appearances of each street type and sorts them.
    
    Args:
        street_types (dict): A dictionary of street types in the form of 
        {street_type:(street_name_1,street_name_2,...,street_name_n)}
        
    Returns:
        list: A sorted list of tuples where each tuple includes a 
        street type and the number of occurrences in the dataset.
    '''
    result = []
    for key, value in street_types.iteritems():
        result.append((key, len(value)))
    return sorted(result, key=itemgetter(1), reverse=True)
street_types = sort_street_types(streets)
#Print a sample of the list
street_types[:15]
[('Road', 576),
 ('Avenue', 145),
 ('Street', 139),
 ('Drive', 87),
 ('Lane', 79),
 ('Geylang', 42),
 ('Crescent', 42),
 ('Walk', 40),
 ('Park', 39),
 ('Close', 37),
 ('Link', 35),
 ('Terrace', 30),
 ('Ave', 29),
 ('Hill', 25),
 ('Place', 23)]
def populate_expected(street_types, threshold):
    '''Populates the Expected list
    
    Args:
        street_types (list): A sorted list of (street_type, #_of_appearances).
        threshold (int): The number of the top elements I want to put in the "expected" list.
        
    Returns:
        list: Returns a list of the x most frequent street types (x defined by "threshold")
    
     '''
    
    expected = []
    for i in street_types[:threshold]:
        expected.append(i[0])

    return expected
EXPECTED = populate_expected(street_types, 12)
EXPECTED
['Road',
 'Avenue',
 'Street',
 'Drive',
 'Lane',
 'Geylang',
 'Crescent',
 'Walk',
 'Park',
 'Close',
 'Link',
 'Terrace']

To find the street names that need correction, I used the “get_close_matches()” function from the difflib module to find “close matches” of the 12 Common Street Types. This is what I found:

def find_abbreviations(expected, data):
    """Uses get_close_matces() to find similar text
    
    Args:
        expected (list): A list of the expected street types.
        data (list): A list of all the different street types.
        
    Retturns: nothing
        
    """
    
    for i in expected:
        print i, get_close_matches(i, data, 4, 0.5)
find_abbreviations(EXPECTED, list(streets.keys()))
Road ['Road', 'road', 'Rd', 'Ria']
Avenue ['Avenue', 'Aenue', 'Avebue', 'Ave']
Street ['Street', 'street', 'See', 'Stangee']
Drive ['Drive', 'Grove', 'Grisek', 'Bridge']
Lane ['Lane', 'Lana', 'Lateh', 'Layang']
Geylang ['Geylang', 'Pelangi', 'Selangat', 'Selanting']
Crescent ['Crescent', 'Cresent', 'Cres', 'Green']
Walk ['Walk', 'walk', 'Wajek', 'Wakaff']
Park ['Park', 'park', 'Parkway', 'Paras']
Close ['Close', 'Cross', 'Circle', 'Flyover']
Link ['Link', 'Minyak', 'Bingka', 'Limu']
Terrace ['Terrace', 'Terrance', 'Ter', 'Tenteram']

Now I can map each variation to the form it was meant to be and correct all the different abbreviations of street types.

mapping = {
    'road': 'Road',
    'Rd': 'Road',
    'street': 'Street',
    'Ave': 'Avenue',
    'Avebue': 'Avenue',
    'Aenue': 'Avenue',
    'park': 'Park',
    'walk': 'Walk',
    'link': 'Link',
    'Cresent': 'Crescent',
    'Terrance': 'Terrace',
    'Ter': 'Terrace'
}
def update_street_type(tree):
    '''Corrects the dataset's street name according to the mapping
    
    Args:
        tree (ElementTree): An ElementTree object for which I want to clean the street names
    
    Returns: nothing
           
    '''
    changes = {}
    for path in ["./node", "./way"]:  #"relation" elements do not have street names.
        for element in tree.findall(path):
            try:
                tag = chk_for_street(element)
                street_name = tag.get('v')
            except AttributeError:  #In case the element doesn't have a "street name" attribute
                continue
            try:
                street_type = st_types_re.findall(street_name)[-1].strip()
            except (IndexError):
                #Leaves the problematic street names as is.
                #They are already in the PROBLEMATICS list.
                street_type = street_name

            if street_type in mapping:
                tag.attrib['v'] = tag.attrib['v'].replace(street_type,
                                                          mapping[street_type])

                if street_name not in changes:
                    changes[street_name] = [tag.attrib['v'], 1]
                else:
                    changes[street_name][1] += 1
    counter = 0
    for key, value in changes.iteritems():
        counter += value[1]
        if value[1] == 1:
            print key + ' ==> ' + value[0]
        else:
            print key + ' ==> ' + value[0] + " (" + str(value[1]) + " occurrences)"
    print str(counter) + " street names were fixed"
    update_street_type.called = True #Function attribute to track if a function has been called.
update_street_type.called = False
update_street_type(root)
Greenwood Ave ==> Greenwood Avenue (2 occurrences)
Smith street ==> Smith Street
Eunos Ave 7A ==> Eunos Avenue 7A
Arumugam Rd ==> Arumugam Road (3 occurrences)
Yarwood Ave ==> Yarwood Avenue (2 occurrences)
Eunos Ave 6 ==> Eunos Avenue 6
Eunos Ave 5 ==> Eunos Avenue 5
Bukit Batok East Ave 6 ==> Bukit Batok East Avenue 6 (2 occurrences)
Roseburn Ave ==> Roseburn Avenue
Read Cresent ==> Read Crescent
Sophia Rd ==> Sophia Road
Bedok North road ==> Bedok North Road (4 occurrences)
Chee Hoon Ave ==> Chee Hoon Avenue (2 occurrences)
Holland Grove Ter ==> Holland Grove Terrace
Serangoon Aenue 1 ==> Serangoon Avenue 1
Hua Guan Ave ==> Hua Guan Avenue (3 occurrences)
Elite Park Ave ==> Elite Park Avenue
Malcolm Rd ==> Malcolm Road
Eng Neo Ave ==> Eng Neo Avenue (2 occurrences)
Bukit Timah Rd ==> Bukit Timah Road (2 occurrences)
Stockport Rd ==> Stockport Road
Greenmead Ave ==> Greenmead Avenue
Ross Ave ==> Ross Avenue
Nanson road ==> Nanson Road
Raeburn park ==> Raeburn Park
Upper Wilkie Rd ==> Upper Wilkie Road
Tai Thong Cresent ==> Tai Thong Crescent
Clementi Ave 2 ==> Clementi Avenue 2 (3 occurrences)
Clementi Ave 1 ==> Clementi Avenue 1
First Hospital Ave ==> First Hospital Avenue
Wareham Rd ==> Wareham Road
31 Lower Kent Ridge Rd ==> 31 Lower Kent Ridge Road
Towner Rd ==> Towner Road
Greenleaf Ave ==> Greenleaf Avenue
1013 Geylang East Ave 3 ==> 1013 Geylang East Avenue 3
Lakme Terrance ==> Lakme Terrace
Ubi Ave 1 ==> Ubi Avenue 1
Daisy Ave ==> Daisy Avenue
Bayfront Avebue ==> Bayfront Avenue
Eunos Ave 5A ==> Eunos Avenue 5A
Commonwealth Cresent ==> Commonwealth Crescent
Pine walk ==> Pine Walk
Elite Terrance ==> Elite Terrace
Sian Tuan Ave ==> Sian Tuan Avenue
Tai Keng Ave ==> Tai Keng Avenue
Orchard Rd ==> Orchard Road
Wilmonar Ave ==> Wilmonar Avenue
Parkstone Rd ==> Parkstone Road
Kent Ridge Cresent ==> Kent Ridge Crescent (9 occurrences)
Vanda Ave ==> Vanda Avenue
Gloucester road ==> Gloucester Road (5 occurrences)
Tanjong Pagar Rd ==> Tanjong Pagar Road
Hougang Ave 3 ==> Hougang Avenue 3
Hougang Ave 1 ==> Hougang Avenue 1
Chempaka Ave ==> Chempaka Avenue
Greendale Ave ==> Greendale Avenue
83 street names were fixed

Auditing Postcodes

Postcodes in Singapore consist of 6 digits, with the first two denoting the Postal Sector and taking values between 01 and 80, excluding 74 (https://www.ura.gov.sg/realEstateIIWeb/resources/misc/list_of_postal_districts.htm).
I search the dataset for this pattern, correct whatever can be addressed automatically, and add the rest to "PROBLEMATICS" for further examination.
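The sector rule translates into the regular expression used below; a quick check against the kinds of values that appear in the data ('740123' is a made-up value for the excluded sector):

```python
import re

# First two digits between 01 and 80 excluding 74, followed by four more digits.
postcode_re = re.compile(r'(([0-6][0-9])|(7([0-3]|[5-9]))|80)[0-9]{4}')

print(postcode_re.search('S118556').group(0))           # 118556
print(postcode_re.search('Singapore 408564').group(0))  # 408564
print(postcode_re.search('740123'))                     # None (sector 74 is excluded)
```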

def fix_pcodes():
    """Tries to find an integer between 01 and 80, excluding 74 in the postcode field and
    if needed change the field value accordingly
    
    Args: No args
    
    Returns: Nothing
    """
    postcode_re = re.compile(
        r'(([0-6][0-9])|(7([0-3]|[5-9]))|80)[0-9]{4}')# first two digits between 01 and 80, excluding 74, followed by four more digits
    for element in root.findall(".//*[@k='addr:postcode']/.."):
        tag = element.find("./*[@k='addr:postcode']")
        postcode = tag.attrib['v']
        try:
            new_tag = postcode_re.search(postcode).group(0)
            if new_tag != postcode:
                tag.attrib['v'] = new_tag
                print postcode + ' ==> ' + tag.attrib['v']
        except (AttributeError):  # If you cannot extract a valid postcode, add the element to PROBLEMATICS
            PROBLEMATICS.append((element.get('id'), 'postcode', postcode))
    fix_pcodes.called = True #Function attribute to track if a function has been called.
fix_pcodes.called = False
fix_pcodes()
Singapore 408564 ==> 408564
S118556 ==> 118556
S 278989 ==> 278989

Postcodes were much more consistent than the street types, with 3 problems fixed programmatically and 8 pending further inspection.


Importing Dataset to Database

After performing most of the cleaning in Python, I can store the dataset in the database to examine the PROBLEMATIC elements and explore it further.
I am using PostgreSQL to present a generic solution, although a lightweight database like SQLite might be more appropriate.
First, I export the data to .csv files using the schema below, then create the tables and import the .csvs.

Exporting dataset to .CSVs

SCHEMA = {
    'node': {
        'type': 'dict',
        'schema': {
            'id': {'required': True, 'type': 'integer', 'coerce': int},
            'lat': {'required': True, 'type': 'float', 'coerce': float},
            'lon': {'required': True, 'type': 'float', 'coerce': float},
            'user': {'required': True, 'type': 'string'},
            'uid': {'required': True, 'type': 'integer', 'coerce': int},
            'version': {'required': True, 'type': 'string'},
            'changeset': {'required': True, 'type': 'integer', 'coerce': int},
            'timestamp': {'required': True, 'type': 'string'}
        }
    },
    'node_tags': {
        'type': 'list',
        'schema': {
            'type': 'dict',
            'schema': {
                'id': {'required': True, 'type': 'integer', 'coerce': int},
                'key': {'required': True, 'type': 'string'},
                'value': {'required': True, 'type': 'string'},
                'type': {'required': True, 'type': 'string'}
            }
        }
    },
    'way': {
        'type': 'dict',
        'schema': {
            'id': {'required': True, 'type': 'integer', 'coerce': int},
            'user': {'required': True, 'type': 'string'},
            'uid': {'required': True, 'type': 'integer', 'coerce': int},
            'version': {'required': True, 'type': 'string'},
            'changeset': {'required': True, 'type': 'integer', 'coerce': int},
            'timestamp': {'required': True, 'type': 'string'}
        }
    },
    'way_nodes': {
        'type': 'list',
        'schema': {
            'type': 'dict',
            'schema': {
                'id': {'required': True, 'type': 'integer', 'coerce': int},
                'node_id': {'required': True, 'type': 'integer', 'coerce': int},
                'position': {'required': True, 'type': 'integer', 'coerce': int}
            }
        }
    },
    'way_tags': {
        'type': 'list',
        'schema': {
            'type': 'dict',
            'schema': {
                'id': {'required': True, 'type': 'integer', 'coerce': int},
                'key': {'required': True, 'type': 'string'},
                'value': {'required': True, 'type': 'string'},
                'type': {'required': True, 'type': 'string'}
            }
        }
    }
}
NODE_FIELDS = ['id', 'lat', 'lon', 'user', 'uid', 'version', 'changeset', 'timestamp']
NODE_TAGS_FIELDS = ['id', 'key', 'value', 'type']
WAY_FIELDS = ['id', 'user', 'uid', 'version', 'changeset', 'timestamp']
WAY_TAGS_FIELDS = ['id', 'key', 'value', 'type']
WAY_NODES_FIELDS = ['id', 'node_id', 'position']
def shape_element(element):
    """Clean and shape node or way XML element to Python dict
    
    Args:
        element (element): An element of the XML tree
        
    Returns:
        dict: if element is a node, the node's attributes and tags.
              if element is a way, the way's attributes and tags along with the nodes that form the way.
    """
    node_attribs = {}
    way_attribs = {}
    way_nodes = []
    tags = []  # Handle secondary tags the same way for both node and way elements
    if element.tag == 'node':
        node_attribs['id'] = element.get('id')
        node_attribs['lat'] = element.get('lat')
        node_attribs['lon'] = element.get('lon')
        node_attribs['user'] = element.get('user')
        node_attribs['uid'] = element.get('uid')
        node_attribs['version'] = element.get('version')
        node_attribs['changeset'] = element.get('changeset')
        node_attribs['timestamp'] = element.get('timestamp')
        for child in element:
            if child.tag == 'tag':
                tag = {'id': node_attribs['id']}
                k = child.get('k')
                if not PROBLEMCHARS.search(k):
                    k = k.split(':', 1)
                    tag['key'] = k[-1]
                    tag['value'] = child.get('v')
                    if len(k) == 1:
                        tag['type'] = 'regular'
                    elif len(k) == 2:
                        tag['type'] = k[0]
                    tags.append(tag)
        return {'node': node_attribs, 'node_tags': tags}
    elif element.tag == 'way':
        counter = 0
        way_attribs['id'] = element.get('id')
        way_attribs['user'] = element.get('user')
        way_attribs['uid'] = element.get('uid')
        way_attribs['version'] = element.get('version')
        way_attribs['changeset'] = element.get('changeset')
        way_attribs['timestamp'] = element.get('timestamp')
        for child in element:
            if child.tag == 'tag':
                tag = {'id': way_attribs['id']}
                k = child.get('k')
                if not PROBLEMCHARS.search(k):
                    k = k.split(':', 1)
                    tag['key'] = k[-1]
                    tag['value'] = child.get('v')
                    if len(k) == 1:
                        tag['type'] = 'regular'
                    elif len(k) == 2:
                        tag['type'] = k[0]
                    tags.append(tag)
            if child.tag == 'nd':
                nd = {'id': way_attribs['id']}
                nd['node_id'] = child.get('ref')
                nd['position'] = counter
                way_nodes.append(nd)
                counter += 1
        return {'way': way_attribs, 'way_nodes': way_nodes, 'way_tags': tags}
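The key-splitting convention in shape_element ('addr:street' becomes type 'addr' and key 'street'; keys without a colon get type 'regular') can be exercised in isolation. This is a self-contained sketch of that convention on a fragment of the node shown earlier, not a call to shape_element itself:

```python
import re
import xml.etree.ElementTree as ET

PROBLEMCHARS = re.compile(r'[=\+/&<>;\'\"\?%#$@\,\.\t\r\n]')

node = ET.fromstring("""<node id="337171253">
  <tag k="addr:street" v="Sultan Gate"/>
  <tag k="tourism" v="museum"/>
</node>""")

shaped = []
for child in node.findall('tag'):
    k = child.get('k')
    if PROBLEMCHARS.search(k):
        continue  # in this sketch, keys with problematic characters are skipped
    parts = k.split(':', 1)  # split only on the first colon
    shaped.append({
        'id': node.get('id'),
        'key': parts[-1],
        'value': child.get('v'),
        'type': parts[0] if len(parts) == 2 else 'regular',
    })

print(shaped[0]['type'], shaped[0]['key'])  # addr street
print(shaped[1]['type'], shaped[1]['key'])  # regular tourism
```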
def validate_element(element, validator, schema=SCHEMA):
    """Raise ValidationError if element does not match schema
    
    Args:
        element (element): An element of the tree
        validator (cerberus.validator): a validator
        schema (dict): The schema to validate element against.
        
    Returns:
        Nothing
        
        """
    if validator.validate(element, schema) is not True:
        field, errors = next(validator.errors.iteritems())
        message_string = "\nElement of type '{0}' has the following errors:\n{1}"
        error_string = pprint.pformat(errors)

        raise Exception(message_string.format(field, error_string))
class UnicodeDictWriter(csv.DictWriter, object):
    """Extend csv.DictWriter to handle Unicode input"""

    def writerow(self, row):
        super(UnicodeDictWriter, self).writerow({
            k: (v.encode('utf-8') if isinstance(v, unicode) else v)
            for k, v in row.iteritems()
        })

    def writerows(self, rows):
        for row in rows:
            self.writerow(row)
def process_map(validate=True):
    """Iteratively process each XML element and write to csv(s)
    
    Args:
        validate (bool): Whether to validate the data before writing them to csv
        
    Returns:
        Nothing
    """

    with codecs.open(NODES_PATH, 'w') as nodes_file, \
         codecs.open(NODE_TAGS_PATH, 'w') as nodes_tags_file, \
         codecs.open(WAYS_PATH, 'w') as ways_file, \
         codecs.open(WAY_NODES_PATH, 'w') as way_nodes_file, \
         codecs.open(WAY_TAGS_PATH, 'w') as way_tags_file:

        nodes_writer = UnicodeDictWriter(nodes_file, NODE_FIELDS)
        node_tags_writer = UnicodeDictWriter(nodes_tags_file, NODE_TAGS_FIELDS)
        ways_writer = UnicodeDictWriter(ways_file, WAY_FIELDS)
        way_nodes_writer = UnicodeDictWriter(way_nodes_file, WAY_NODES_FIELDS)
        way_tags_writer = UnicodeDictWriter(way_tags_file, WAY_TAGS_FIELDS)

        nodes_writer.writeheader()
        node_tags_writer.writeheader()
        ways_writer.writeheader()
        way_nodes_writer.writeheader()
        way_tags_writer.writeheader()

        validator = cerberus.Validator()
        
        #Check that the dataset has been cleaned
        if update_street_type.called is not True:
            update_street_type(root)
            
        if fix_pcodes.called is not True:
            fix_pcodes()            

        for element in root.findall("./*"):
            el = shape_element(element)
            if el:
                if validate is True:
                    validate_element(el, validator)

                if element.tag == 'node':
                    nodes_writer.writerow(el['node'])
                    node_tags_writer.writerows(el['node_tags'])
                elif element.tag == 'way':
                    ways_writer.writerow(el['way'])
                    way_nodes_writer.writerows(el['way_nodes'])
                    way_tags_writer.writerows(el['way_tags'])
process_map()

Connection to the database

To connect Jupyter Notebook to PostgreSQL, I am using ipython-sql

"""Loading the ipython-sql module"""
%load_ext sql

"""Disabling the printing of number of rows affected by each query"""
%config SqlMagic.feedback=False

"""Connecting to the database"""
%sql postgresql://jupyter_user:notebook@localhost/Project_3
u'Connected: jupyter_user@Project_3'

Creation of the Tables

I am using ON DELETE CASCADE on the tables with foreign keys to ensure that "tags" are dropped along with their related "elements".

%%sql
CREATE TABLE public.nodes
(
  id bigint NOT NULL,
  lat real,
  lon real,
  "user" text,
  uid integer,
  version integer,
  changeset integer,
  "timestamp" text,
  CONSTRAINT nodes_pkey PRIMARY KEY (id)
);

CREATE TABLE public.nodes_tags
(
  id bigint,
  key text,
  value text,
  type text,
  CONSTRAINT nodes_tags_id_fkey FOREIGN KEY (id)
      REFERENCES public.nodes (id) MATCH SIMPLE
      ON UPDATE NO ACTION ON DELETE CASCADE
);

CREATE TABLE public.ways
(
  id bigint NOT NULL,
  "user" text,
  uid integer,
  version text,
  changeset integer,
  "timestamp" text,
  CONSTRAINT ways_pkey PRIMARY KEY (id)
);

CREATE TABLE public.ways_nodes
(
  id bigint NOT NULL,
  node_id bigint NOT NULL,
  "position" integer NOT NULL,
  CONSTRAINT ways_nodes_id_fkey FOREIGN KEY (id)
      REFERENCES public.ways (id) MATCH SIMPLE
      ON UPDATE NO ACTION ON DELETE NO ACTION,
  CONSTRAINT ways_nodes_node_id_fkey FOREIGN KEY (node_id)
      REFERENCES public.nodes (id) MATCH SIMPLE
      ON UPDATE NO ACTION ON DELETE CASCADE
);

CREATE TABLE public.ways_tags
(
  id bigint NOT NULL,
  key text NOT NULL,
  value text NOT NULL,
  type text,
  CONSTRAINT ways_tags_id_fkey FOREIGN KEY (id)
      REFERENCES public.ways (id) MATCH SIMPLE
      ON UPDATE NO ACTION ON DELETE CASCADE
);
[]

Importing the data

After copying the files to the remote server, I can import them into the database.

%%sql
COPY public.nodes
FROM '/home/yannis/Projects/Data_Analysis/Wrangle-OpenStreetMap-Data/Helper/nodes.csv'
CSV HEADER;

COPY public.nodes_tags
FROM '/home/yannis/Projects/Data_Analysis/Wrangle-OpenStreetMap-Data/Helper/nodes_tags.csv'
CSV HEADER;

COPY public.ways
FROM '/home/yannis/Projects/Data_Analysis/Wrangle-OpenStreetMap-Data/Helper/ways.csv'
CSV HEADER;

COPY public.ways_nodes
FROM '/home/yannis/Projects/Data_Analysis/Wrangle-OpenStreetMap-Data/Helper/ways_nodes.csv'
CSV HEADER;

COPY public.ways_tags
FROM '/home/yannis/Projects/Data_Analysis/Wrangle-OpenStreetMap-Data/Helper/ways_tags.csv'
CSV HEADER;
[]

Data assessment in the database

pprint.pprint(PROBLEMATICS)
[('453243296', 'street name', '2'),
 ('46649997', 'street name', u'\u516c\u53f865'),
 ('453227146', 'street name', u'\u516c\u53f865'),
 ('453253763', 'street name', '2'),
 ('169844052', 'street name', '310074'),
 ('1318498347', 'postcode', '135'),
 ('3026819436', 'postcode', '2424'),
 ('3756813987', 'postcode', '05901'),
 ('4338649392', 'postcode', '88752'),
 ('4496749591', 'postcode', '#B1-42'),
 ('23946435', 'postcode', '437 437'),
 ('169844052', 'postcode', '74'),
 ('172769494', 'postcode', '05901')]

I am querying the database for the above elements.

def element_type(element_id):
    """From the element's id, returns the element's type.
    (I need to know the element's type, to run the appropriate query.)
    
    Args:
        element_id (str): The 'id' of the element
        
    Returns:
        (str): The tag of the element.
    """
    path = str('./*[@id="' + element_id + '"]')
    
    return root.find(path).tag

def get_element_tags(element_id):
    """Returns all the tags for a specific element
    
    Args:
        element_id (str):  The 'id' of the element
        
    Returns (sql.run.ResultSet): The result of the query
    
    """
    if element_type(element_id) == 'node':
        tag = %sql SELECT 'Node' AS el_type, * FROM nodes_tags WHERE nodes_tags.id = $element_id
    elif element_type(element_id) == 'way':
        tag = %sql SELECT 'Way' AS el_type, * FROM ways_tags WHERE ways_tags.id = $element_id
    return tag
get_element_tags(PROBLEMATICS[0][0])
el_type id key value type
Way 453243296 street 2 addr
Way 453243296 building yes regular
get_element_tags(PROBLEMATICS[2][0])
el_type id key value type
Way 453227146 housenumber 65 addr
Way 453227146 postcode 119936 addr
Way 453227146 street 公司65 addr
Way 453227146 building yes regular
Way 453227146 name Habourlink regular

For elements 453243296 and 453253763, it looks like the "street" value ("2") is actually a house number.

%%sql
UPDATE ways_tags SET key = 'housenumber'
WHERE (id = '453243296' OR id = '453253763') AND key = 'street'
[]
get_element_tags(PROBLEMATICS[1][0])
el_type id key value type
Way 46649997 housenumber 65 addr
Way 46649997 postcode 119936 addr
Way 46649997 street 公司65 addr
Way 46649997 building yes regular
Way 46649997 name Habourlink regular
get_element_tags(PROBLEMATICS[3][0])
el_type id key value type
Way 453253763 building yes regular
Way 453253763 housenumber 2 addr

"Habourlink" is a place of worship on "Alexandra Terrace", so I am correcting the street value for these two elements.

%%sql
UPDATE ways_tags SET value = 'Alexandra Terrace'
WHERE (id = '453227146' OR id = '46649997') AND key = 'street'
[]
get_element_tags(PROBLEMATICS[4][0])
el_type id key value type
Way 169844052 postcode 74 addr
Way 169844052 street 310074 addr
Way 169844052 building yes regular
Way 169844052 name Toa Payoh Vista Market regular

"Toa Payoh Vista Market" is at 74 Lor 4 Toa Payoh (https://goo.gl/maps/UpoYE2Q4owm).

%%sql
INSERT INTO ways_tags VALUES ('169844052','housenumber','74','addr');

UPDATE ways_tags SET value = '310074' WHERE id ='169844052' AND key = 'postcode';

UPDATE ways_tags SET value = 'Lor 4 Toa Payoh' WHERE id = '169844052' AND key = 'street'
[]
get_element_tags(PROBLEMATICS[5][0])
el_type id key value type
Node 1318498347 postcode 135 addr
Node 1318498347 street Jln Pelatina addr
Node 1318498347 building yes regular

It looks like "135" is a house number, not a postcode. To find missing parts of an address (like a postcode), I query the Google Maps API.

def complete_address(address):
    """Tries to find the full address from a partial address
    (e.g. one missing the postcode) and prints the result.
    
    Args:
        address (str): Partial address
    """
    while True:  # Retry the call in case of a GeocoderTimedOut error
        try:
            location = geolocator.geocode(address)
            print location.raw['formatted_address']
        except GeocoderTimedOut:
            continue
        break
complete_address(' 135 Jln Pelatina, Singapore')
135 Jln Pelatina, Singapore

Since I cannot find the postcode, I am just changing the key from "postcode" to "housenumber".

%sql UPDATE nodes_tags SET key = 'housenumber' WHERE id = '1318498347' AND value = '135'
[]
get_element_tags(PROBLEMATICS[6][0])
el_type id key value type
Node 3026819436 city Singapore addr
Node 3026819436 housenumber 136 addr
Node 3026819436 postcode 2424 addr
Node 3026819436 street Orchard Road addr
Node 3026819436 landuse retail regular
Node 3026819436 name Liana Medic Ltd regular
Node 3026819436 bitcoin yes payment
Node 3026819436 phone +65 2424666 regular
Node 3026819436 website http://www.lianamedic.com/ regular
complete_address(' 136 Orchard Road, Singapore')
136 Orchard Rd, Singapore

I cannot find a valid postcode. I am deleting the specific tag.

%sql DELETE FROM nodes_tags WHERE id = '3026819436' AND key = 'postcode'
[]
get_element_tags(PROBLEMATICS[7][0])
el_type id key value type
Node 3756813987 city Singapore addr
Node 3756813987 country SG addr
Node 3756813987 housenumber 6 addr
Node 3756813987 postcode 05901 addr
Node 3756813987 street Sago Street addr
Node 3756813987 amenity restaurant regular
Node 3756813987 name Wonderful Food and Beverage regular
Node 3756813987 phone +65 9108 5572 regular
complete_address(' 6 Sago Street, Singapore')
6 Sago St, Singapore 059011
%sql UPDATE nodes_tags SET value = '059011' WHERE id = '3756813987' AND key = 'postcode'
[]
get_element_tags(PROBLEMATICS[8][0])
el_type id key value type
Node 4338649392 postcode 88752 addr
Node 4338649392 name Exit Plan regular
Node 4338649392 tourism attraction regular

The specific amenity is at "279 New Bridge Road, Singapore 088752" (https://goo.gl/maps/K7JzQ3Ujsvq)

%%sql
UPDATE nodes_tags SET value = '088752' WHERE id = '4338649392' AND key = 'postcode';

INSERT INTO nodes_tags VALUES 
('4338649392','housenumber','279','addr'),
('4338649392','street','New Bridge Road','addr')
[]
get_element_tags(PROBLEMATICS[9][0])
el_type id key value type
Node 4496749591 city Singapore addr
Node 4496749591 postcode #B1-42 addr
Node 4496749591 street Bukit Timah Shopping Centre addr
Node 4496749591 name GP Tuition regular

This amenity does not exist anymore (https://www.bukittimahshoppingcentre.sg/directory/). I'm deleting the whole node.

%sql DELETE FROM nodes WHERE id = '4496749591'
[]
get_element_tags(PROBLEMATICS[10][0])
el_type id key value type
Way 23946435 city Singapore addr
Way 23946435 country SG addr
Way 23946435 housenumber 80 addr
Way 23946435 postcode 437 437 addr
Way 23946435 street Rhu Cross addr
Way 23946435 description 18 hole Par 72 regular
Way 23946435 leisure golf_course regular
Way 23946435 name Marina Bay Golf Course regular
Way 23946435 phone +65 6345 7788 regular
Way 23946435 website http://www.mbgc.com.sg/ regular
complete_address(' 80 Rhu Cross, Singapore')
80 Rhu Cross, Singapore 437437
%sql UPDATE ways_tags SET value = '437437' WHERE id = '23946435' AND key = 'postcode'
[]

"Way" element 169844052 has been corrected already.

get_element_tags(PROBLEMATICS[12][0])
el_type id key value type
Way 172769494 city Singapore addr
Way 172769494 country SG addr
Way 172769494 housenumber 6 addr
Way 172769494 postcode 05901 addr
Way 172769494 street Sago Street addr
Way 172769494 building yes regular
Way 172769494 source Bing regular
complete_address('6 Sago Street, Singapore')
6 Sago St, Singapore 059011
%sql UPDATE ways_tags SET value = '059011' WHERE id = '172769494' AND key = 'postcode'
[]

Data Exploration

Dataset Specific

Below you may find some basic attributes of the dataset.

Size of the database

size_in_bytes = %sql SELECT pg_database_size('Project_3');
print "DB size: " + str(size_in_bytes[0][0]/1024**2) + ' MB'
DB size: 96 MB

Number of Unique Users

%%sql
SELECT count(DISTINCT(uid)) AS "Unique Users"
FROM (SELECT uid FROM nodes 
      UNION 
      SELECT uid FROM ways) AS elements;
Unique Users
848

Top 10 Users (by number of entries)

%%sql
SELECT nodes_ways."user" AS "User", COUNT(*) AS "Entries"
FROM (SELECT "user" FROM nodes
      UNION ALL
      SELECT "user" FROM ways) AS nodes_ways
GROUP BY nodes_ways."user"
ORDER BY "Entries" DESC
LIMIT 10;
User Entries
JaLooNz 155426
rene78 32937
Luis36995 32150
cboothroyd 20478
calfarome 16615
ridixcr 13830
nikhilprabhakar 13082
Paul McCormack 12620
matx17 12000
yurasi 8868

Number of Nodes and Ways

n_nodes = %sql SELECT COUNT(*) FROM nodes
n_ways = %sql SELECT COUNT(*) FROM ways

print "Number of 'nodes': " + str(n_nodes[0][0])
print "Number of 'ways': " + str(n_ways[0][0])
Number of 'nodes': 409352
Number of 'ways': 66404

Area Specific

Most popular streets

%%sql
SELECT street_names.value AS "Street", COUNT(street_names.value) AS "Times Referred"
FROM
	(SELECT nodes_tags.value
	FROM nodes_tags
	WHERE type = 'addr' AND key = 'street'
	UNION ALL
	SELECT ways_tags.value
	FROM ways_tags
	WHERE 	(type = 'addr' AND key = 'street')
		OR
		(id IN
			(SELECT id
			FROM ways_tags
			WHERE key = 'highway')
		AND key = 'name')) AS street_names
GROUP BY street_names.value
ORDER BY "Times Referred" DESC
LIMIT 10
Street Times Referred
Jalan Senang 229
Joo Chiat Road 204
Bedok Reservoir Road 185
Tanjong Pagar Road 166
South Bridge Road 165
Serangoon Road 164
North Bridge Road 135
Seah Street 134
Neil Road 133
Dunlop Street 128

Most frequent amenities

Anyone who has lived in Singapore knows Singaporeans' love of food. No surprises here; restaurants are, by far, at the top of the results.

%%sql
SELECT value AS "Amenity", COUNT(value) AS "Occurrences"
FROM	(SELECT *
	FROM nodes_tags
	UNION ALL
	SELECT *
	FROM ways_tags) as tags
WHERE key = 'amenity'
GROUP BY value
ORDER BY "Occurrences" DESC
LIMIT 10
Amenity Occurrences
restaurant 1562
parking 654
taxi 524
cafe 396
fast_food 252
atm 194
toilets 190
bar 176
bank 120
police 120

Most popular cuisine

%%sql
SELECT value AS "Cuisine", COUNT(*) AS "Restaurants" 
FROM (SELECT * FROM nodes_tags 
      UNION ALL 
      SELECT * FROM ways_tags) tags
WHERE tags.key='cuisine'
GROUP BY value
ORDER BY "Restaurants"  DESC
LIMIT 10
Cuisine Restaurants
chinese 99
japanese 42
korean 36
coffee_shop 34
burger 33
italian 32
indian 28
asian 27
pizza 17
french 15

ATMs

%%sql
SELECT value AS "Bank", COUNT(value) AS "ATMs"
FROM nodes_tags
WHERE id in
    (SELECT id
    FROM nodes_tags
    WHERE value = 'atm')
    AND
    key = 'operator'
GROUP BY value
ORDER BY "ATMs" DESC
Bank ATMs
POSB 18
UOB 12
OCBC 9
Citibank 8
DBS 7
singapore room home 1
Quantified Assets, Pte. Ltd. 1
Posb 1
HSBC 1
Uob 1
DSS 1
Overseas Chinese Banking Corporation 1
DBS / UOB 1
DBS/POSB 1
home 1
Standard Chartered Bank 1

There are different abbreviations of bank names and some records that are not banks.

%%sql
UPDATE nodes_tags
SET value = 'POSB'
WHERE value = 'Posb';

UPDATE nodes_tags
SET value = 'UOB'
WHERE value = 'Uob';

UPDATE nodes_tags
SET value = 'OCBC'
WHERE value = 'Overseas Chinese Banking Corporation';
[]
%%sql
DELETE FROM nodes
WHERE id IN
	(SELECT id
	FROM nodes_tags
	WHERE key = 'operator'
	AND (value = 'singapore room home'
	OR
	value = 'home'))
[]
%%sql
SELECT value AS "Bank", COUNT(value) AS "ATMs"
FROM nodes_tags
WHERE id in
    (SELECT id
    FROM nodes_tags
    WHERE value = 'atm')
    AND
    key = 'operator'
GROUP BY value
ORDER BY "ATMs" DESC
Bank ATMs
POSB 19
UOB 13
OCBC 10
Citibank 8
DBS 7
DSS 1
HSBC 1
DBS / UOB 1
DBS/POSB 1
Standard Chartered Bank 1
Quantified Assets, Pte. Ltd. 1

Religion

Singapore is well-known for its multicultural environment. People with different religious and ethnic heritages are forming the modern city-state. This is reflected in the variety of temples that can be found in the country.

%%sql
SELECT tags.value AS "Religion", COUNT(*) AS "Temples" 
FROM (SELECT * FROM nodes_tags
      UNION ALL 
      SELECT * FROM ways_tags) tags
WHERE tags.key='religion'
GROUP BY tags.value
ORDER BY "Temples" DESC;
Religion Temples
christian 73
muslim 38
buddhist 30
hindu 9
taoist 6
jewish 1
sikh 1

Ideas for additional improvements.

There are several areas where the project could be improved in the future. The first one is the completeness of the data. All the above analysis is based on a dataset that reflects a big part of Singapore, but not the whole country. The reason for this is the lack of a way to download a dataset for the whole of Singapore without including parts of the neighboring countries. The analyst has to either select a part of the island/country or select a wider area that includes parts of Malaysia and Indonesia. Also, because ways and relations reference nodes outside the selection, the downloaded data extend much further than the actual selection. Below you can see a plot of the coordinates of the nodes of a dataset from a tight selection of Singapore. You can notice that huge parts of the nearby countries were included.

initial selection

As a future improvement, I would download a wider selection, or the metro extract from Mapzen, and filter out the non-Singaporean nodes and their references. The initial filtering could take place by introducing latitude/longitude limits in the code to sort out most of the "non-SG" nodes.

filter to square
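The lat/lon pre-filter described above can be sketched as a simple bounding-box check. The limits below are rough, assumed values for Singapore, used only for illustration:

```python
# Rough bounding box for Singapore (assumed values, for illustration only).
SG_BOUNDS = {
    "min_lat": 1.15, "max_lat": 1.48,
    "min_lon": 103.55, "max_lon": 104.10,
}

def inside_bounds(lat, lon, bounds=SG_BOUNDS):
    """First-pass filter: keep only nodes whose coordinates fall inside
    the rectangular limits. Cheap, but still keeps some non-SG nodes."""
    return (bounds["min_lat"] <= lat <= bounds["max_lat"] and
            bounds["min_lon"] <= lon <= bounds["max_lon"])

# Example: a node in central Singapore passes, one near Kuala Lumpur does not.
nodes = [(1.290, 103.852), (3.139, 101.687)]
kept = [(lat, lon) for lat, lon in nodes if inside_bounds(lat, lon)]
```

This is only a coarse cut; nodes in nearby Malaysian and Indonesian territory that fall inside the rectangle still survive, which is why the polygon-based second pass is needed.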

Then, I would download a shapefile for Singapore (e.g. from http://www.diva-gis.org/gdata), use a GIS library like Fiona to create a polygon, and then use a geometry library like Shapely to compare each node's coordinates against this polygon. Finally, I would clean all the ways and relations of the "non-SG" nodes, and remove those that become childless, to conclude with a dataset of all (and only) Singapore.

After GIS
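For reference, the core point-in-polygon test that a library like Shapely performs can be sketched with a plain ray-casting routine. The polygon below is a toy rectangle standing in for the real Singapore boundary, and the coordinates are illustrative:

```python
def point_in_polygon(lon, lat, polygon):
    """Ray casting: count how many polygon edges a horizontal ray from
    (lon, lat) crosses; an odd count means the point is inside.
    `polygon` is a list of (lon, lat) vertices."""
    inside = False
    j = len(polygon) - 1
    for i in range(len(polygon)):
        xi, yi = polygon[i]
        xj, yj = polygon[j]
        if (yi > lat) != (yj > lat):  # edge straddles the ray's latitude
            x_cross = (xj - xi) * (lat - yi) / (yj - yi) + xi
            if lon < x_cross:
                inside = not inside
        j = i
    return inside

# Toy stand-in for the boundary polygon built from the shapefile.
boundary = [(103.6, 1.2), (104.1, 1.2), (104.1, 1.45), (103.6, 1.45)]
```

A real country boundary has thousands of vertices, which is exactly why testing every node this way is slow and why Shapely's optimized geometry predicates (or prepared geometries) are preferable in practice.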

The drawback of the above technique is that comparing each node against the polygon is a very time-consuming procedure, with my initial tests taking 17-18 hours to produce a result. This is the reason the above approach was left as a future improvement, probably along with the use of multithreading techniques to speed up the process.

The second area with room for future improvement is the exploratory analysis of the dataset. To mention just some of the explorations that could take place:

  • Distribution of commits per contributor.
  • Plotting of element creation per type, per day.
  • Distribution of distances between different types of amenities.
  • Popular franchises in the country (fast food, convenience stores, etc.).
  • Selection of a bank based on the average distance you have to walk to an ATM.
  • Which areas have the biggest parks and recreation spaces.
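Several of the ideas above (distances between amenities, average walk to an ATM) reduce to computing great-circle distances between coordinate pairs. A minimal haversine helper would look like this; the example coordinates are illustrative:

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points in kilometres."""
    r = 6371.0  # mean Earth radius in km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2 +
         math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))

# Roughly Raffles Place to Toa Payoh (illustrative coordinates).
walk_km = haversine_km(1.2841, 103.8515, 1.3343, 103.8563)
```

Applying this between each node tagged `amenity=atm` and its nearest neighbours per operator would directly answer the "average walk to an ATM" question.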

The scope of the current project was the wrangling of the dataset, so all the above have been left for future improvement.

Finally, open data are here to make the average person's life better. For the non-data analyst, it would be nice if there were an application (mobile or web) that could evaluate the suitability of a potential rental home. The user would enter the work addresses of all family members and importance weights for several amenities like supermarkets, convenience stores, cafes, public transportation, etc., and the application would calculate a suitability score for each potential rental. The user would be able to sort the candidates by score and compare them.
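As a rough illustration of how such an application might rank rentals, assume we already have, per rental, the walking distance to the nearest instance of each amenity; nearer amenities with higher importance weights raise the score. The amenity names, weights, and damping formula below are all hypothetical:

```python
def suitability(distances_km, weights):
    """Hypothetical score: each amenity contributes its importance
    weight, damped by the walking distance to its nearest instance."""
    score = 0.0
    for amenity, weight in weights.items():
        distance = distances_km.get(amenity)
        if distance is not None:
            score += weight / (1.0 + distance)
    return score

# Illustrative inputs: user preferences and two candidate rentals.
weights = {"supermarket": 3.0, "mrt_station": 5.0, "cafe": 1.0}
rental_a = {"supermarket": 0.3, "mrt_station": 0.5, "cafe": 0.2}
rental_b = {"supermarket": 1.2, "mrt_station": 2.0, "cafe": 0.1}
ranked = sorted([("A", suitability(rental_a, weights)),
                 ("B", suitability(rental_b, weights))],
                key=lambda pair: pair[1], reverse=True)
```

The distances themselves would come from queries like the ones in this project (nearest `amenity` node per rental coordinate), which is what makes the wrangled OSM data the natural backend for such a tool.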


References

Udacity - https://www.udacity.com/
Wikipedia - https://www.wikipedia.org/
OpenStreetMap - https://www.openstreetmap.org
Overpass API - http://overpass-api.de/
Python Software Foundation - https://www.python.org/
Urban Redevelopment Authority of Singapore - https://www.ura.gov.sg
Catherine Devlin's GitHub repository - https://github.com/catherinedevlin/ipython-sql
Google Map APIs - https://developers.google.com/maps/