A Poor Man’s Parallel Processor for GIS

In addition to SQL, I am also interested in processing large volumes of spatial data. One of the newest rages in “big data” is Hadoop. According to Wikipedia:

Apache Hadoop is an open-source software framework written in Java for distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware.

One way this is implemented is a programming model called MapReduce. Don’t get too excited: it doesn’t have anything to do with maps or GIS, but it is very clever and powerful for certain types of problems. The concept is that if you have a really large dataset, you divide and conquer it in a number of steps. For example, say we wanted to count all the people named “John” in the phonebook, and say we had 26 computers in a cluster. We might solve this by:

1.  Use each computer (1-26) to find all the “Johns” whose last name starts with the corresponding letter (A-Z). That way, you have effectively broken the problem into 26 smaller units.

2.  Once each computer has counted up the number of Johns, you have reduced the dataset (hence, MapReduce) to 26 variables.

3.  Now, count up the total of the 26 variables.
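
In SQL terms, the idea might look something like the sketch below (a toy example, assuming a hypothetical phonebook table with firstname and lastname columns):

-- "Map" step: count the Johns per last-name initial. In a real cluster,
-- each initial would be handled by a different machine.
CREATE TABLE john_counts AS
SELECT substr(lastname, 1, 1) AS initial, count(*) AS johns
FROM phonebook
WHERE firstname = 'John'
GROUP BY substr(lastname, 1, 1);

-- "Reduce" step: collapse the 26 partial counts into a single total.
SELECT sum(johns) AS total_johns FROM john_counts;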

That is an oversimplified version, of course, but it helps to illustrate what we want to do. I understand that the University of Minnesota has created a set of functions for Hadoop called SpatialHadoop. I want to test it over the summer, but for now I decided to create my own poor man’s version using Postgres.

The Problem

I have 37 million points and 240 polygons. For each point, I want to find out which polygon the point is in. Easy enough. However, the data size is prohibitive. After running the process for days in ArcGIS, Manifold, QGIS, and Postgres, the process either did not finish or simply crashed the computer. Our traditional ways of doing this just weren’t working.

I figured that if I could somehow replicate the MapReduce approach, we might be in business.

The Solution

I decided to try to solve this problem with Postgres and PostGIS. My initial test was to work with 1 million records. If that worked, I would try scaling it up to 37 million.

Now, Postgres does not parallelize a single query: each connection is served by one backend process, so if you have 8 CPUs on your computer, one query is only going to use one of them. But if you run 5 sessions at once, each gets its own process, and each process can run on its own CPU.

Therefore, I created an SQL script that performed the spatial containment query on a portion of the data and wrote the results to a table. For instance, this query will INSERT the point ID and polygon ID into a table for each point contained in a polygon:

INSERT INTO testtable
SELECT points.pointid, polygons."ID_2"
FROM points, polygons
WHERE ST_Contains(polygons.geometry, points.geometry)
AND points.pointid BETWEEN 15000000 AND 15100000;

The above query effectively performs the containment query on only 100,000 records. I then ran the query in 5 different sessions, each evaluating its own 100,000-record chunk of points: the BETWEEN clause was changed to 15100001 AND 15200000 for the second session, 15200001 AND 15300000 for the third, and so on. Together, the 5 separate sessions processed all 500,000 records, 100,000 records (1/5th of the data) at a time.
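
Spelled out, each of the five query windows ran the same statement over its own range (the later ranges follow the pattern described above):

-- Window 1; windows 2-5 change only the BETWEEN range
INSERT INTO testtable
SELECT points.pointid, polygons."ID_2"
FROM points, polygons
WHERE ST_Contains(polygons.geometry, points.geometry)
AND points.pointid BETWEEN 15000000 AND 15100000;
-- Window 2: BETWEEN 15100001 AND 15200000
-- Window 3: BETWEEN 15200001 AND 15300000
-- Window 4: BETWEEN 15300001 AND 15400000
-- Window 5: BETWEEN 15400001 AND 15500000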

The Results

I noticed a couple of interesting things: the change in speed and the utilization of the CPU.

Speed

It turns out that running multiple instances of the SQL statement completes the job faster than running a single instance. For example, running the above query on all 500,000 records at once

INSERT INTO testtable
SELECT points.pointid, polygons."ID_2"
FROM points, polygons
WHERE ST_Contains(polygons.geometry, points.geometry)
AND points.pointid BETWEEN 15000000 AND 15500000;

completes in 365 seconds.

But running 5 instances of the query, each on its own 100,000-record chunk, takes only 130 seconds!

[Figure: execution times reported in each of the five query windows]

CPU Usage

As expected, the CPU usage when one instance of Postgres was running showed only a single CPU firing at its maximum. However, when I ran the five instances, you can see that many more of the CPUs were firing.

[Figure: CPU utilization with one instance vs. five instances of Postgres]

Early Conclusion

This showed me that we can actually make full use of the available CPUs with Postgres for a computationally intensive spatial task by breaking the problem into different chunks.

The main bottleneck was inserting into the table. As you can see from the figure above, the execution time increased in each window, with a maximum of 358 seconds. I don’t believe the queries were running serially; rather, the INSERT portion of each query had to wait its turn to write to the table.
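
If the shared table is the choke point, one workaround worth testing would be to give each window its own target table and stitch the pieces together afterwards. A rough sketch (the table names are hypothetical):

-- Each window writes to its own table, so no INSERT waits on another...
CREATE TABLE testtable_1 AS
SELECT points.pointid, polygons."ID_2"
FROM points, polygons
WHERE ST_Contains(polygons.geometry, points.geometry)
AND points.pointid BETWEEN 15000000 AND 15100000;

-- ...and a single final statement glues the five chunks together.
CREATE TABLE testtable_all AS
SELECT * FROM testtable_1
UNION ALL SELECT * FROM testtable_2
UNION ALL SELECT * FROM testtable_3
UNION ALL SELECT * FROM testtable_4
UNION ALL SELECT * FROM testtable_5;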

Where to go next

I’m not ready to tackle the 37 million points just yet; I want to see if there are ways to speed this up even more. In my next post, I will use an UPDATE statement instead of the INSERT statement to see if there is any difference in speed. After that, I want to try some other tasks, like determining how many points fall in each polygon. That task would let us run the query in five threads, obtain 5 result tables, and then run a final query to add up the results of the five tables. The question on my mind is which scenarios give us the best opportunities to parallelize our work.
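
That counting task decomposes nicely into the same map-and-reduce shape. Each window would compute per-polygon counts for its own chunk (again, hypothetical table names):

-- "Map": window 1 of 5 counts its chunk's points per polygon
CREATE TABLE counts_1 AS
SELECT polygons."ID_2", count(*) AS n
FROM points, polygons
WHERE ST_Contains(polygons.geometry, points.geometry)
AND points.pointid BETWEEN 15000000 AND 15100000
GROUP BY polygons."ID_2";

-- "Reduce": one final query adds up the five partial counts
SELECT "ID_2", sum(n) AS total_points
FROM (
    SELECT * FROM counts_1
    UNION ALL SELECT * FROM counts_2
    UNION ALL SELECT * FROM counts_3
    UNION ALL SELECT * FROM counts_4
    UNION ALL SELECT * FROM counts_5
) AS partials
GROUP BY "ID_2";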

If this looks promising, I have 24 computers in my lab, and if I use 5 threads on each computer, that gives me 120 threads to run the query.

3 thoughts on “A Poor Man’s Parallel Processor for GIS”

  1. With points, I have found that pre-filtering the data by testing x and y against the edges of an area’s bounding box, instead of using st_contains, yields good results; it is quicker than st_contains. Then, of course, the pre-selected ‘positives’ must be checked again. Depending on how many of the points actually fall into a polygon, this may seriously speed things up.
    A spatial index on both datasets and the same SRID would also help. That’s kinda obvious, though.

    • Yes, I played around with that as well. I even think it might be worthwhile to have the lowerX, lowerY, higherX, higherY of the bounding box as indexed fields and use them in the WHERE clause. (I’ve sketched these ideas below, after the thread.)

      Nonetheless, I find it pretty cool that we can make effective use of those CPUs just sitting there doing nothing.

      Thanks for the comments. See you back over on georeference :-)

      • Speaking of indexes: indexed x and y for points also help ;)

        Anyway, being able to launch a couple of processes at once indeed helps quite a lot. And splitting data into chunks is also valid for some Manifold ops. Working out the best chunk size can also be crucial: very often a smaller dataset is better, but at some point it stops paying off.

        Since you mention inserts tend to be slow, maybe you could try unlogged tables? This is a short excerpt from the pgsql docs:

        UNLOGGED
        If specified, the table is created as an unlogged table. Data written to unlogged tables is not written to the write-ahead log (see Chapter 29), which makes them considerably faster than ordinary tables. However, they are not crash-safe: an unlogged table is automatically truncated after a crash or unclean shutdown. The contents of an unlogged table are also not replicated to standby servers. Any indexes created on an unlogged table are automatically unlogged as well.

        Georeference – never been too far away ;) It’s just that I tend to comment when I have something to say; otherwise I’d rather stay silent ;)
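
For reference, here is roughly what the suggestions from this thread look like in Postgres/PostGIS. This is only a sketch: the index names and column types are assumptions, && is PostGIS’s bounding-box overlap operator (which can use a spatial index as a cheap pre-filter), and an unlogged table skips the write-ahead log for faster inserts at the cost of crash safety.

-- Spatial (GiST) indexes on both datasets
CREATE INDEX points_geom_idx ON points USING gist (geometry);
CREATE INDEX polygons_geom_idx ON polygons USING gist (geometry);

-- Unlogged results table: faster inserts, but truncated after a crash
CREATE UNLOGGED TABLE testtable (pointid integer, "ID_2" integer);

INSERT INTO testtable
SELECT points.pointid, polygons."ID_2"
FROM points, polygons
WHERE polygons.geometry && points.geometry            -- cheap bounding-box pre-filter
AND ST_Contains(polygons.geometry, points.geometry)   -- exact containment check
AND points.pointid BETWEEN 15000000 AND 15100000;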
