Standardization of Street Names

Find groups of different street names that might be alternative representations of the same street. This is an example of the key collision clustering supported by openclean. It uses the NYC Parking Violations Issued - Fiscal Year 2014 dataset.

[1]:
# Download the full 'Parking Violations Issued - Fiscal Year 2014' dataset.
# Note that this is a file of ~ GB!

import gzip
import os

from openclean.data.source.socrata import Socrata

datafile = './jt7v-77mi.tsv.gz'

# Download file only if it does not exist already.
if not os.path.isfile(datafile):
    with gzip.open(datafile, 'wb') as f:
        ds = Socrata().dataset('jt7v-77mi')
        print('Downloading ...\n')
        print(ds.name + '\n')
        print(ds.description)
        ds.write(f)


# As an alternative, you can also use the smaller dataset sample that is
# included in the repository.
#
# datafile = './data/jt7v-77mi.tsv.gz'
[2]:
# Use streaming function to avoid having to load the full dataset
# into memory.

from openclean.pipeline import stream

df = stream(datafile)
[3]:
# Get the distinct set of street names together with their frequency
# counts. Computing the distinct set first means that the key for each
# street name is generated only once, not once per row.

streets = df.select('Street').distinct()

print('{} distinct streets (for {} total values)'.format(len(streets), sum(streets.values())))
115567 distinct streets (for 9100278 total values)
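The streaming-and-distinct step above can be illustrated with the standard library alone. This is a minimal sketch, not openclean's implementation: the helper `count_distinct` and the `Plate` column in the demo data are hypothetical names introduced for illustration. It reads one row at a time and keeps only a frequency counter in memory.

```python
import csv
import io
from collections import Counter


def count_distinct(lines, column, delimiter='\t'):
    """Count how often each value occurs in one column of a delimited file.

    `lines` is any iterable of text lines (e.g. an open file). Rows are
    consumed one at a time, so only the counter is held in memory.
    """
    reader = csv.DictReader(lines, delimiter=delimiter)
    counts = Counter()
    for row in reader:
        counts[row[column]] += 1
    return counts


# Tiny in-memory stand-in for the real file (the 'Plate' column is made up).
demo = io.StringIO('Street\tPlate\n2nd Ave\tA1\n2ND AVE\tB2\n2nd Ave\tC3\n')
demo_counts = count_distinct(demo, 'Street')

print('{} distinct streets (for {} total values)'.format(
    len(demo_counts), sum(demo_counts.values())))
```

For a gzipped TSV file, the same helper could be given `gzip.open(path, 'rt', newline='')` in place of the in-memory buffer.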
[4]:
# Cluster street names using key collision (with the default key generator).
# Remove clusters that contain fewer than seven distinct values (for display
# purposes). Use multiple threads (4) to generate value keys in parallel.

from openclean.cluster.key import key_collision

# Minimum cluster size. Use seven as the default for the full dataset (to
# limit the number of clusters that are printed in the next cell).
minsize = 7

# Use minimum cluster size of 2 when using the dataset sample
# minsize = 2

clusters = key_collision(values=streets, minsize=minsize, threads=4)

print('{} clusters of size {} or greater'.format(len(clusters), minsize))
13 clusters of size 7 or greater
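To make the idea behind key collision concrete, here is a minimal sketch in plain Python. It assumes a fingerprint-style key (lowercase, drop punctuation, sort the unique tokens); this is an assumption for illustration and not necessarily what openclean's default key generator does. The names `fingerprint` and `cluster_by_key` are hypothetical helpers, not part of the openclean API. Values whose keys collide end up in the same cluster.

```python
import string
from collections import Counter, defaultdict

# Translation table that deletes all ASCII punctuation characters.
_NO_PUNCT = str.maketrans('', '', string.punctuation)


def fingerprint(value):
    """Assumed key function: lowercase, strip punctuation, and join the
    sorted set of whitespace-separated tokens."""
    tokens = value.lower().translate(_NO_PUNCT).split()
    return ' '.join(sorted(set(tokens)))


def cluster_by_key(counts, minsize=2):
    """Group values that share a key; keep groups with at least
    `minsize` distinct values."""
    groups = defaultdict(Counter)
    for value, count in counts.items():
        groups[fingerprint(value)][value] += count
    return [group for group in groups.values() if len(group) >= minsize]


# A few values and counts taken from the output in this notebook.
sample = Counter({
    'LAWRENCE ST': 165,
    'ST LAWRENCE': 34,
    'Lawrence St': 2368,
    'ST. LAWRENCE ST': 1,
    'JOHN ST': 186,
})
sample_clusters = cluster_by_key(sample)

for group in sample_clusters:
    print(dict(group))
```

Because tokens are sorted and de-duplicated, word order and repeated tokens do not affect the key. That makes the method tolerant of spacing and punctuation noise, but it also means values such as `ST LAWRENCE` and `LAWRENCE ST` collide even though they may refer to different streets.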
[5]:
# For each cluster print cluster values, their frequency counts,
# and the suggested common value for the cluster.

def print_cluster(cnumber, cluster):
    print('Cluster {} (of size {})\n'.format(cnumber, len(cluster)))
    for val, count in cluster.items():
        print('{} ({})'.format(val, count))
    print('\nSuggested value: {}\n\n'.format(cluster.suggestion()))

# Sort clusters by decreasing number of distinct values.
clusters.sort(key=len, reverse=True)

for cnumber, cluster in enumerate(clusters, start=1):
    print_cluster(cnumber, cluster)

Cluster 1 (of size 8)

2ND AVE (4075)
2nd Ave (67751)
2ND  AVE (5)
2ND AVE. (1)
AVE 2ND (1)
2ND      AVE (1)
2ND    AVE (2)
2ND       AVE (1)

Suggested value: 2nd Ave


Cluster 2 (of size 8)

ST NICHOLAS AVE (2451)
ST. NICHOLAS AVE (125)
St Nicholas Ave (23462)
ST, NICHOLAS AVE (1)
ST NICHOLAS  AVE (9)
ST NICHOLAS   AVE (1)
ST  NICHOLAS AVE (4)
ST. NICHOLAS  AVE (1)

Suggested value: St Nicholas Ave


Cluster 3 (of size 8)

LAWRENCE ST (165)
ST LAWRENCE (34)
LAWRENCE  ST (1)
Lawrence St (2368)
ST. LAWRENCE (2)
ST LAWRENCE ST (1)
LAWRENCE ST. (1)
ST. LAWRENCE ST (1)

Suggested value: Lawrence St


Cluster 4 (of size 8)

ST NICHOLAS (847)
ST NICHOLAS ST (31)
NICHOLAS ST (27)
ST. NICHOLAS (27)
ST  NICHOLAS (2)
ST NICHOLAS  ST (1)
Nicholas St (79)
ST. NICHOLAS ST (1)

Suggested value: ST NICHOLAS


Cluster 5 (of size 7)

W 125 ST (3365)
W 125    ST (1)
W. 125 ST. (1)
W .125 ST (5)
W  125 ST (2)
W 125  ST (1)
W. 125 ST (3)

Suggested value: W 125 ST


Cluster 6 (of size 7)

FERRY LOT 2 (743)
FERRY LOT #2 (140)
FERRY  LOT #2 (1)
FERRY LOT  2 (3)
FERRY LOT # 2 (121)
FERRY LOT  # 2 (2)
FERRY LOT  #2 (1)

Suggested value: FERRY LOT 2


Cluster 7 (of size 7)

3RD AVE (11554)
3rd Ave (148186)
3RD  AVE (8)
3RD AVE. (1)
3RD       AVE (1)
3RD     AVE (2)
3RD      AVE (1)

Suggested value: 3rd Ave


Cluster 8 (of size 7)

CONEY ISLAND AVE (3618)
CONEY ISLAND  AVE (9)
CONEY  ISLAND AVE (9)
Coney Island Ave (35776)
CONEY ISLAND   AVE (1)
CONEY ISLAND AVE . (1)
CONEY ISLAND AVE. (1)

Suggested value: Coney Island Ave


Cluster 9 (of size 7)

W TREMONT AVE (110)
W. TREMONT AVE (17)
W W TREMONT AVE (1)
W Tremont Ave (848)
W  TREMONT AVE (1)
W. TREMONT  AVE (1)
W .TREMONT AVE (1)

Suggested value: W Tremont Ave


Cluster 10 (of size 7)

LGA TERMINAL B (26)
LGA, TERMINAL B (1)
LGA/ TERMINAL B (1)
TERMINAL B LGA (20)
TERMINAL B - LGA (2)
TERMINAL B -LGA (1)
LGA TERMINAL B, (1)

Suggested value: LGA TERMINAL B


Cluster 11 (of size 7)

EL GRANT HWY (67)
E.L GRANT HWY (10)
E.L. GRANT HWY (19)
EL GRANT    HWY (1)
EL. GRANT HWY (2)
E/L/ GRANT HWY (1)
E-L GRANT HWY (1)

Suggested value: EL GRANT HWY


Cluster 12 (of size 7)

JOHN ST (186)
ST JOHN (10)
John St (4192)
ST JOHN ST (8)
ST. JOHN ST (1)
ST. JOHN (1)
JOHN ST. (1)

Suggested value: John St


Cluster 13 (of size 7)

ST JOHNS PL (1478)
ST. JOHNS PL (77)
St Johns Pl (4816)
ST JOHNS PL. (1)
ST. JOHNS PL. (1)
ST  JOHNS PL (1)
ST JOHNS  PL (2)

Suggested value: St Johns Pl
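

In the output above, each suggested value coincides with the most frequent value in its cluster. Under that assumption, a standardization mapping could be sketched in plain Python as follows. The helper `standardization_mapping` is a hypothetical name, and the clusters are modeled as `Counter` objects; the sketch does not use the openclean API.

```python
from collections import Counter


def standardization_mapping(value_clusters):
    """Map every non-canonical value to the most frequent value of its
    cluster. Each cluster is a Counter of value -> frequency."""
    mapping = {}
    for cluster in value_clusters:
        target = cluster.most_common(1)[0][0]
        for value in cluster:
            if value != target:
                mapping[value] = target
    return mapping


# Subset of Cluster 12 from the output above.
example_clusters = [
    Counter({'JOHN ST': 186, 'ST JOHN': 10, 'John St': 4192, 'JOHN ST.': 1}),
]
street_map = standardization_mapping(example_clusters)

# Values without a mapping entry are left unchanged.
cleaned = [street_map.get(s, s) for s in ['ST JOHN', 'John St', 'BROADWAY']]
print(cleaned)
```

Since a cluster only groups values that might denote the same street, a mapping like this is best reviewed by hand before it is applied to the data.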