ErikHoel/SpatialJoin

SpatialJoin

Perform Spatial Join on BigData using Hadoop MapReduce

Sample Density Run

In this example, we are joining a set of points from HDFS with a small set of polygons that is loaded from the distributed cache.
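
The core of such a density run is a point-in-polygon test followed by a per-polygon count. The Python sketch below illustrates the idea only; it is not the repository's code. The function names, polygon ids and sample coordinates are assumptions made for illustration, and the small polygon set is held in an in-memory dict to stand in for data loaded from the distributed cache.

```python
from collections import Counter

def point_in_polygon(x, y, ring):
    """Ray-casting test. `ring` is a list of (x, y) vertices (closing vertex optional)."""
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can be crossed.
        if (y1 > y) != (y2 > y):
            # x coordinate where the edge crosses that horizontal line.
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def density(points, polygons):
    """Count how many points fall inside each polygon (hypothetical helper)."""
    counts = Counter()
    for x, y in points:
        for poly_id, ring in polygons.items():
            if point_in_polygon(x, y, ring):
                counts[poly_id] += 1
    return counts

# Hypothetical sample data: two adjacent squares and four points.
polygons = {
    "A": [(0, 0), (10, 0), (10, 10), (0, 10)],
    "B": [(10, 0), (20, 0), (20, 10), (10, 10)],
}
points = [(1, 1), (5, 5), (15, 5), (25, 5)]
result = density(points, polygons)
```

Looping over every polygon for every point is only reasonable because the polygon set is small; a larger set would call for a spatial index.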

Create the polygon sample data in the local file system:

$ awk -f data.awk | awk -f polygons.awk > /tmp/polygons.txt

Create the point sample data in HDFS:

$ awk -f data.awk | hadoop fs -put - points.txt

Build, package and run the tool:

$ mvn -PDensityTool clean package
$ hadoop jar target/SpatialJoin-1.0-SNAPSHOT-job.jar points.txt file:///tmp/polygons.txt output

Save the summary output to the local file system:

$ hadoop fs -cat output/part-* > /tmp/relate.txt

Convert the polygon set into a shapefile using ogr2ogr so it can be viewed in ArcMap and related to the above file for visualization:

$ awk -f featurecollection.awk /tmp/polygons.txt > /tmp/fc.json
$ ogr2ogr -f "ESRI Shapefile" /tmp/polygons /tmp/fc.json

[Image: Spatial Join ArcMap]

Sample Airport/Route Run

In this example of a map-side spatial join, I have a list of airport code pairs forming flight routes between two cities. In addition, I have a list of airports with their code, city and location in lat/lon format. The task at hand is to count the number of times an airport is overflown by the flight routes.
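
A map-side join of this kind keeps the small table (the airports) in memory and streams the large one (the routes) past it. The Python sketch below rests on stated assumptions and is not the tool's actual logic: it treats an airport as "overflown" when it lies within a hypothetical `tolerance` of the straight lon/lat segment between a route's two endpoints, and the airport codes and coordinates are made up.

```python
from collections import Counter

def dist_point_to_segment(px, py, ax, ay, bx, by):
    """Planar distance from point P to segment AB."""
    dx, dy = bx - ax, by - ay
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0.0:
        # Degenerate segment: A and B coincide.
        return ((px - ax) ** 2 + (py - ay) ** 2) ** 0.5
    # Project P onto the line through A and B, clamped to the segment.
    t = ((px - ax) * dx + (py - ay) * dy) / seg_len_sq
    t = max(0.0, min(1.0, t))
    cx, cy = ax + t * dx, ay + t * dy
    return ((px - cx) ** 2 + (py - cy) ** 2) ** 0.5

def count_overflown(airports, routes, tolerance):
    """airports: code -> (lon, lat); routes: iterable of (origin, dest) codes."""
    counts = Counter()
    for origin, dest in routes:
        if origin not in airports or dest not in airports:
            continue  # skip routes with unknown airport codes
        ax, ay = airports[origin]
        bx, by = airports[dest]
        for code, (px, py) in airports.items():
            if code in (origin, dest):
                continue  # endpoints are not counted as overflown
            if dist_point_to_segment(px, py, ax, ay, bx, by) <= tolerance:
                counts[code] += 1
    return counts

# Hypothetical airports and routes.
airports = {"AAA": (0.0, 0.0), "BBB": (10.0, 0.0), "CCC": (5.0, 0.1), "DDD": (5.0, 5.0)}
routes = [("AAA", "BBB"), ("BBB", "AAA"), ("AAA", "DDD")]
overflown = count_overflown(airports, routes, tolerance=0.5)
```

A straight segment in lon/lat space is a simplification; real flight paths are closer to great circles, so a production version would need geodesic distance.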

Place the data into HDFS:

$ hadoop fs -put data/airports.csv airports.csv
$ hadoop fs -put data/routes.csv routes.csv

Build, package and run the tool:

$ mvn -PJoinMapTool clean package
$ hadoop jar target/SpatialJoin-1.0-SNAPSHOT-job.jar routes.csv output

See the output:

$ hadoop fs -cat output/part-00000 | sort -n -r --key=2 | head -10

Sample Polygon/Polygon Spatial Join

In this example of a reduce-side spatial join, we will join two polygon sets and return the intersection set.

Create the two polygon sets:

$ awk -f data.awk | hadoop fs -put - data1.txt
$ awk -f data.awk | hadoop fs -put - data2.txt

Build, package and run the tool - here I am specifying a cell size of 5 units:

$ mvn -PJoinRedTool clean package
$ hadoop jar target/SpatialJoin-1.0-SNAPSHOT-job.jar -D com.esri.size=5 data1.txt data2.txt output
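
One common way to use a cell size in a reduce-side spatial join is grid partitioning: the map phase keys each polygon by every grid cell its bounding box touches, and the reduce phase only compares polygons that share a cell. The Python sketch below illustrates that general technique under this assumption; it is not taken from the repository. All function names and sample polygons are hypothetical, and it stops at bounding-box candidates, leaving the exact intersection to a geometry library.

```python
import math
from collections import defaultdict
from itertools import product

def bbox(ring):
    """Bounding box (xmin, ymin, xmax, ymax) of a list of (x, y) vertices."""
    xs = [x for x, _ in ring]
    ys = [y for _, y in ring]
    return min(xs), min(ys), max(xs), max(ys)

def cells_for(box, size):
    """All (col, row) grid cells of width `size` touched by a bounding box."""
    xmin, ymin, xmax, ymax = box
    cols = range(math.floor(xmin / size), math.floor(xmax / size) + 1)
    rows = range(math.floor(ymin / size), math.floor(ymax / size) + 1)
    return list(product(cols, rows))

def boxes_overlap(a, b):
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def candidate_pairs(set1, set2, size):
    # "Map" step: key each polygon id by every cell its bounding box touches.
    grid = defaultdict(lambda: ([], []))
    for pid, ring in set1.items():
        for cell in cells_for(bbox(ring), size):
            grid[cell][0].append(pid)
    for pid, ring in set2.items():
        for cell in cells_for(bbox(ring), size):
            grid[cell][1].append(pid)
    # "Reduce" step: compare only polygons that landed in the same cell.
    pairs = set()  # a set removes duplicates from pairs sharing several cells
    for left, right in grid.values():
        for a in left:
            for b in right:
                if boxes_overlap(bbox(set1[a]), bbox(set2[b])):
                    pairs.add((a, b))
    return pairs

# Hypothetical polygon sets, joined with a cell size of 5 units.
set1 = {"p1": [(1, 1), (4, 1), (4, 4), (1, 4)], "p2": [(11, 11), (13, 11), (13, 13), (11, 13)]}
set2 = {"q1": [(3, 3), (7, 3), (7, 7), (3, 7)], "q2": [(20, 20), (22, 20), (22, 22), (20, 22)]}
pairs = candidate_pairs(set1, set2, size=5)
```

The cell size is a trade-off: small cells replicate large polygons into many cells, while large cells push more comparisons into each reducer.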

To view the result (as long as it is small :-), I am using the Esri-Leaflet project.

Create a folder under your web application and extract the data into that folder as GeoJSON formatted files:

$ hadoop fs -cat data1.txt | awk -f geojson.awk -v N=data1 > data1.js
$ hadoop fs -cat data2.txt | awk -f geojson.awk -v N=data2 > data2.js
$ hadoop fs -cat output/part-* | awk -f union.awk > union.js

Copy the files esri-leaflet.js and index.html from the webapp folder to that folder, then launch your browser and view the result at http://localhost/myfolder/index.html

[Image: Esri Leaflet]

Unit Testing

I highly recommend that you look at the test folder. There are two types of unit test cases in there. The first type tests the mapper by itself, the reducer by itself, and the combination of mapper and reducer. The second type launches a mini cluster that runs a name node, a data node, a job tracker and a task tracker all in one VM. This enables the test case to write some well-known data into HDFS at startup time and read the result of the MapReduce job back from HDFS to assert the values.
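
The first type of test can be pictured with a small Python sketch. It does not reproduce the repository's test classes; the mapper, the reducer and the tab-separated `x<TAB>y` record format are hypothetical stand-ins. The point is the shape of the three checks: mapper alone, reducer alone, then both wired together through a simulated shuffle.

```python
from itertools import groupby

def mapper(line):
    """Hypothetical mapper: emit (cell_key, 1) for an 'x<TAB>y' point record."""
    x, y = (float(v) for v in line.split("\t"))
    yield (int(x // 5), int(y // 5)), 1

def reducer(key, values):
    """Hypothetical reducer: sum the counts for one key."""
    yield key, sum(values)

def run_job(lines):
    """Mimic shuffle and sort: group mapper output by key, then reduce."""
    pairs = sorted(kv for line in lines for kv in mapper(line))
    out = {}
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        for k, total in reducer(key, (v for _, v in group)):
            out[k] = total
    return out

# Mapper by itself, reducer by itself, then the combination.
mapper_out = list(mapper("7.5\t2.0"))
reducer_out = list(reducer((1, 0), [1, 1, 1]))
job_out = run_job(["7.5\t2.0", "9.0\t4.9", "0.1\t0.1"])
```

Keeping the mapper and reducer as pure functions over records is what makes this kind of isolated test cheap; the mini cluster test then covers the HDFS input and output paths.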
