When More is Less…. lessons from processing large data files

My good friend Stuart Hamilton gave me a fun conundrum to try out. He has a file of province boundaries (400 areas) and lidar derived mangrove locations (37 million points – 2.2GB in size). He wants to find the number of mangroves that are contained in each area.  He also want to know which country a mangrove location is in.  An overview of the area is here:

but, as you zoom in, you can see that there are a tremendous number of points:

The problem

You would think that overlaying 37 million points with 400 polygons wouldn’t be too much trouble – but, it was.  Big time trouble.  In fact, after running for days in ArcGIS, Manifold GIS, PostGRES/PostGIS, and spatial Hadoop, it simply would not complete.

I created a spatial index, so, what was the problem?  There are two main things going on:

1. The polygons had an enormous number of vertices in them – the selected polygon in the next image had 5,800 vertices, and the entire layer had about 2 million vertices.  So, the point-in-polygon test required thousands of calculations per polygon.  That takes time.

2. Individual polygons covered a large geographic area – The image on the right shows that the spatial index for these polygons covered such a large area that there were millions of points potentially in the bounding box of each individual polygon.  So, we had to test that polygon against all of those points.

The Solution

What was the solution?  Well, knowing what I know about computational geometry, and how indexes work, I decided to decompose the polygons into convex parts – this took my 400 polygons and exploded them to over 700,000 polygons – see image on the right (Manifold GIS has a great tool that allows you to decompose geometries into Delaunay triangles or convex parts).

Dumb idea, right?  Well, stay with me on this one…

By decomposing the polygons to convex parts, we achieved two things:

1. Polygons only had about 4 or 5 vertices – therefore, any polygon evaluated for a point-in-polygon overlay only required a couple of calculations.

2. The convex parts covered very small areas – with a smaller footprint, many, if not almost all the 37M points were no longer even considered, as they fell outside of the bounding box of an individual area.

So, going from 400 polygons to 700,000 polygons actually reduced the computation time, for reasons I explained above. I teach this concept in my GIS Programming class every fall, but now I have a really dramatic example to demonstrate it on. I’m probably going to create a video lecture on it, as I think it represents a “best practice” for dealing with large data.

The Result

Are you sitting down?  I ran the analysis in the Radian engine, and it completed in 40 minutes.  That’s days to under an hour, just by exploiting some of what we know about how point-in-polygon and spatial indexes work.  I had similar results using spatial Hadoop.  For the Radian engine, I issued the following SQL command:

```-- \$manifold\$
SELECT [stubound Table].[ID_2],[A2012 Table].[pointid]
INTO stucontain
FROM [stubound Table],[A2012 Table]
WHERE GeomContains([stubound Table].[Geom],[A2012 Table].[Geom (I)],0.001)