When More is Less…. lessons from processing large data files

My good friend Stuart Hamilton gave me a fun conundrum to try out. He has a file of province boundaries (400 areas) and lidar derived mangrove locations (37 million points – 2.2GB in size). He wants to find the number of mangroves that are contained in each area.  He also want to know which country a mangrove location is in.  An overview of the area is here:

allstu

but, as you zoom in, you can see that there are a tremendous number of points:

stuzoom

The problem

You would think that overlaying 37 million points with 400 polygons wouldn’t be too much trouble – but, it was.  Big time trouble.  In fact, after running for days in ArcGIS, Manifold GIS, PostGRES/PostGIS, and spatial Hadoop, it simply would not complete.

I created a spatial index, so, what was the problem?  There are two main things going on:

1. The polygons had an enormous number of vertices in them – the selected polygon in tselectedpolyhe next image had 5,800 vertices, and the entire layer had about 2 million vertices.  So, the point-in-polygon test required thousands of calculations per polygon.  That takes time.

2. Individual polygons covered a large geographic area – The imagbboxe on the right shows that the spatial index for these polygons covered such a large area that there were millions of points potentially in the bounding box of each individual polygon.  So, we had to test that polygon against all of those points.

The Solution

What was the solution?  Well, knowing what I know about computational geometry, and how indexes work, I decided to decompose the polygons into convex parts – this took my 400 polygons and exploded them to over 700,000 polygons – see image on the right (Manifold GIS has a great tool that allows you to decompose geometries into Delaunay triangles or convex parts).  stuconvex

Dumb idea, right?  Well, stay with me on this one…

 By decomposing the polygons to convex parts, we achieved two things:

1. Polygons only had about 4 or 5 vertices – therefore, any polygon evaluated for a point-in-polygon overlay only required a couple of calculations.

2. The convex parts covered very small areas – with a smaller footprint, many, if not almost all the 37M points were no longer even considered, as they fell outside of the bounding box of an individual area.

So, going from 400 polygons to 700,000 polygons actually reduced the computation time, for reasons I explained above. I teach this concept in my GIS Programming class every fall, but now I have a really dramatic example to demonstrate it on. I’m probably going to create a video lecture on it, as I think it represents a “best practice” for dealing with large data.

The Result

Are you sitting down?  I ran the analysis in the Radian engine, and it completed in 40 minutes.  That’s days to under an hour, just by exploiting some of what we know about how point-in-polygon and spatial indexes work.  I had similar results using spatial Hadoop.  For the Radian engine, I issued the following SQL command:

-- $manifold$
SELECT [stubound Table].[ID_2],[A2012 Table].[pointid]
INTO stucontain
FROM [stubound Table],[A2012 Table]
WHERE GeomContains([stubound Table].[Geom],[A2012 Table].[Geom (I)],0.001)
THREADS 4;

so, the query ran using the 4 CPUs on my computer.

Oh, and did I mention I have two new books out about spatial SQL in Manifold and PostGIS?

2 thoughts on “When More is Less…. lessons from processing large data files

    • No. I’ve put in a request to ESRI to do it, but I don’t know where things stand. You should show their sales person this post, maybe word will get around.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s