Did you know that you can perform GIS analysis using parallel processing on a desktop PC? For years now, computers have had multicore processors, but unless you were using gaming software, you didn't have the ability to tap into that advantage.
In recent years, more non-gaming software has begun to use multicore computers for parallel processing, and this includes GIS. You can now leverage parallel processing using:
Although Esri has highlighted GeoAnalytics Server to harness the power of distributed processing, much of the same functionality exists on your desktop through the GeoAnalytics Desktop tools in ArcGIS Pro.
Postgres has supported parallel queries since version 9.6, and starting around Postgres 12 with PostGIS 3, the developers of PostGIS have been able to leverage parallel workers for some key analytical tasks.
And for a good 5 years, Manifold 9 has parallelized its SQL and native GIS functions.
I recently purchased an Alienware gaming laptop with 14 cores (20 logical processors) and decided to compare the offerings of each software product on a standard benchmark dataset.
The Data
To perform the analysis, I used a dataset of parking tickets and neighborhood zones in Philadelphia that Paul Ramsey explored in his excellent blog post on parallel processing and Postgres. Who knew there were so many parking tickets in Philadelphia! In fact, there were over 9 million tickets distributed over around 150 neighborhoods!
The Software Solutions
For each software product, I wanted to determine the number of parking tickets in each neighborhood. I used GeoAnalytics Desktop, Postgres/PostGIS, and Manifold GIS. Each solution had its benefits and limitations. But whether one product was faster or slower than another wasn't the important point. What mattered was just how easy it was to perform parallel processing using each of them.
To run the tests, I used my Alienware gaming laptop: a 2.3 GHz i7 with 32GB of RAM and 14 cores (20 logical processors).
ArcGIS Pro GeoAnalytics Desktop
ArcGIS Pro GeoAnalytics Desktop is available to anyone with an Advanced ArcGIS Pro license. It's there on your PC. It's just sitting there. What are you waiting for? Use it!
For this task, I used the Summarize Within (GeoAnalytics Desktop) tool. Notice that there are two Summarize Within tools - a standard one that does not utilize parallel processing, and a corresponding one that uses the Spark engine.
They pretty much look the same, which is good - if you know how to use one, you can use the other. Running the process with the standard Summarize Within tool completed the job in 10m 44s. We'll use this as our baseline test.
The Task Manager showed that there weren't many processing cores utilized which makes sense - ArcGIS Pro does not natively use multiple processing cores!
======== Questions about QGIS ==========
Based on a Twitter follower's question, I ran the same query in QGIS using the Count Points in Polygon tool, and QGIS completed the task in 5m 58s (~6m).
A few days later, @nyalldawson recommended using the FlatGeobuf file format as a way to speed the process up. Sure enough, having QGIS use the FlatGeobuf format reduced the processing time to 40s. Note that the FlatGeobuf format dramatically increased the file size (it was now 2GB), but given the cost of disk space, I think it's worth exporting the data as a .fgb from QGIS and then running the Count Points in Polygon command.
======== Back to Esri ================
Using the GeoAnalytics Desktop tool completed the job in an astounding 37s - over 17x faster than the non-parallelized base case. And, when I opened up my Task Manager I could see all the cores that were being utilized.
And the best thing was, all that improvement was achieved with me not doing anything other than using the GeoAnalytics tool. Easy peasy, parallel processing without any heavy lifting. But, can we do better?
Postgres and PostGIS
As an enterprise-class database, Postgres has developed true parallel query functionality. And, most of you know that I'm an SQL junkie, and find that to be the perfect language for geographers to learn. The SQL to determine the number of parking ticket locations in each zone is relatively easy to write:
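-- Count the parking tickets that fall inside each neighborhood zone.
-- The zone name is selected so each count can be tied to its neighborhood.
SELECT zones.name, count(*) AS numtickets
INTO temptable
FROM zones, parking
WHERE ST_Contains(zones.geom, parking.geom)
GROUP BY zones.name;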
Now, having learned a bunch from Paul's blog post, I began to issue similar commands with Postgres - however, I made a few adjustments. Paul set his minimum table scan size to 1kB - very reasonable as the neighborhood zones were only about 256kB (if a table is smaller than the minimum table scan size, Postgres will not perform parallel processing on it, as it thinks it would not be beneficial to split the data up across multiple cores). I got a similar query plan to Paul's when using the EXPLAIN command in Postgres - it used 3 cores and completed in 26 seconds. Not bad. (The EXPLAIN command asks Postgres to weigh a number of different query plans and pick the best one, so it reports back to you what it plans to do.) Nonetheless, I was disappointed that my query plan was only going to use 3 workers. I certainly think that more worker processes are in order: even though we only have 150 polygons (small), we have over 9 million points to overlay (big).
Query plan with 3 cores
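The setup described above can be sketched roughly as follows. This is a minimal sketch, not the exact commands behind the timings: the table and column names (zones, parking, geom, name) come from the query earlier in this post, while the worker ceiling of 8 is an assumed value for illustration.

```sql
-- Assumed session settings; the exact commands used for the timings are not listed in this post.
SET min_parallel_table_scan_size = '1kB';  -- tables larger than this are candidates for a parallel scan
SET max_parallel_workers_per_gather = 8;   -- assumed ceiling; the planner may still choose fewer

-- EXPLAIN prints the plan Postgres intends to use without running the query.
-- A parallel plan shows a "Gather" node with a "Workers Planned: N" line.
EXPLAIN
SELECT zones.name, count(*) AS numtickets
FROM zones, parking
WHERE ST_Contains(zones.geom, parking.geom)
GROUP BY zones.name;
```

Both SET commands only affect the current session, so they are safe to experiment with and disappear when you disconnect.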
Increasing the number of cores
I increased the minimum table scan size to 1MB and oddly enough, Postgres then planned to utilize 7 cores.
Query plan with 7 cores
I say oddly enough because Postgres should not have gone parallel, since the minimum table scan size was greater than the size of the neighborhood zones table. I'm not entirely sure why that happened, and I even consulted Paul on it - I guess there is a bunch of stuff Postgres is doing behind the scenes, so something is peculiar. But it worked: I was now utilizing 7 cores. And best of all, the new parameters reduced the processing time to 14s!!! That's nearly twice as fast as the run with 3 worker processes, and about 2.6x faster than GeoAnalytics Desktop. However, it did require me to know a bit about worker processes, table scan sizes, workers per gather, etc., whereas ArcGIS just required me to use the tool.
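For readers who have not met those settings, here is a rough sketch of how they might be inspected and adjusted in a session. The specific values (8 and 7) are assumptions for illustration, not the configuration behind the 14s result, and the per-table override at the end is an alternative I did not use for the timings.

```sql
-- Inspect the ceilings that limit how many workers a query can get.
SHOW max_worker_processes;             -- server-wide pool of background processes
SHOW max_parallel_workers;             -- how many of those may serve parallel queries
SHOW max_parallel_workers_per_gather;  -- limit for a single Gather node in a plan

-- Raise the per-query limit for this session (assumed value).
SET max_parallel_workers_per_gather = 8;

-- The larger scan-size threshold described above.
SET min_parallel_table_scan_size = '1MB';

-- EXPLAIN ANALYZE runs the query and reports "Workers Launched"
-- next to "Workers Planned", along with the actual execution time.
EXPLAIN ANALYZE
SELECT zones.name, count(*) AS numtickets
FROM zones, parking
WHERE ST_Contains(zones.geom, parking.geom)
GROUP BY zones.name;

-- Optional alternative: ask the planner to assume a fixed worker count for one table
-- instead of deriving it from table size. Still capped by max_parallel_workers_per_gather.
ALTER TABLE parking SET (parallel_workers = 7);
```

Comparing "Workers Planned" with "Workers Launched" is a quick way to see whether the server-wide limits, rather than the planner, are what is holding a query back.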
Manifold GIS
Over the years I've kind of gotten a reputation as the Manifold guy. But I'm simply a guy who has enjoyed using Manifold GIS for almost 20 years now. There are a lot of knowledgeable people I've learned from who use Manifold GIS, and you can find many of them on georeference.org. Manifold is a low-cost (about $150) parallel processing GIS, but they also have a free version of their software product called Manifold Viewer. If you don't know anything about this software, do yourself a favor on the next rainy weekend and check out the product on their website. They are doing some really innovative things. In addition to being a GIS, Manifold allows you to issue spatial SQL commands or use their GUI to perform parallel tasks like the one you see below:
SELECT s_name, count(*)
FROM CALL GeomOverlayContainedPar([zones.Shape], [parking.Shape], 0, ThreadConfig(SystemCpuCount()))
GROUP BY s_name
Manifold did a great job of utilizing the cores on my computer as you can see from the Task Manager
However, the results were slower than ArcGIS Pro and Postgres. One of the real joys of exploring questions like this and seeing how other software products solve problems is that every now and then you find a bottleneck in the system. In this case, there appears to be a bottleneck in the Manifold GROUP BY command - the selection actually runs really fast. I understand that the developers of Manifold GIS are aware of this, and working to improve the results. When that is done, I will update this post because I think we'll see dramatic improvement.
The simple SQL code and the GUI Join Tool in Manifold completed the task in 1m 26s.
Which one is best?
Honestly, I don't really care. They all do it, and that's the important part. With ArcGIS Pro and Manifold, it's easy to use with their GUI and requires virtually no knowledge of parallel processing to achieve dramatic improvements (that's not completely true, more about that later). And, whether it is a minute and a half or half a minute, it's roughly an order of magnitude faster than the baseline case of 10m 44s. Also, both ArcGIS and Manifold are GIS products in their own right, so you not only get speed improvements, you get all the other nice GIS things like cartography, digitizing, import/export, consumption of cloud services, etc.
With Postgres and PostGIS, we see the best performance - 46x our baseline case. But of course, Postgres isn't a full GIS product, and the results did require a bit of understanding about how Postgres works. However, for a 46x improvement over the baseline case, about 2.6x the speed of GeoAnalytics Desktop (14s versus 37s), and about 6x the speed of Manifold (14s versus 1m 26s), I find that learning a little bit about tuning the query is worth the effort.
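If you want to check the arithmetic yourself, the timings reported in this post can be turned into speedup factors with a small query. This is just a sketch: the run times are the ones quoted above converted to seconds (10m 44s is 644s), and the labels are my own shorthand.

```sql
-- Speedup of each run relative to the 644s (10m 44s) baseline.
SELECT tool,
       seconds,
       round(644.0 / seconds, 1) AS speedup_vs_baseline
FROM (VALUES
        ('Summarize Within (standard)',  644),
        ('QGIS Count Points in Polygon', 358),
        ('Manifold',                      86),
        ('QGIS with FlatGeobuf',          40),
        ('GeoAnalytics Desktop',          37),
        ('Postgres/PostGIS, 3 workers',   26),
        ('Postgres/PostGIS, 7 workers',   14)
     ) AS t(tool, seconds)
ORDER BY seconds DESC;
```

The query needs no tables, so it runs in any Postgres session.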
Finally, there is some advantage of using a language like SQL so that you aren't limited to a strict count of points in a polygon but can embed other mathematical or spatial operations within the SQL.
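As one illustration of that flexibility, the count could be extended to a ticket density per neighborhood. This is a hedged sketch rather than something I benchmarked: it assumes that zone names are unique and that zones.geom is stored in longitude/latitude (for example SRID 4326) so that the cast to geography returns areas in square meters.

```sql
-- Tickets per square kilometer for each neighborhood.
WITH counts AS (
    -- Same point-in-polygon count as before.
    SELECT zones.name, count(*) AS numtickets
    FROM zones, parking
    WHERE ST_Contains(zones.geom, parking.geom)
    GROUP BY zones.name
)
SELECT c.name,
       c.numtickets,
       -- ST_Area on geography returns square meters; divide to get km2.
       c.numtickets / (ST_Area(z.geom::geography) / 1000000.0) AS tickets_per_km2
FROM counts AS c
JOIN zones AS z ON z.name = c.name   -- assumes name uniquely identifies a zone
ORDER BY tickets_per_km2 DESC;
```

Doing the count first and the area calculation afterwards keeps the expensive spatial join identical to the original query, so the area math only runs once per neighborhood.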
Conclusion
What was once reserved for a bunch of teenagers playing first-person shooter games is now available to GIS professionals. Parallel processing is here, on the desktop, and reachable from many GIS software products. Right out of the box you can achieve really fast results over traditional GIS processes, and with a little knowledge of how your computer and the software works, you can achieve dramatic results. There is no reason you should not be using the parallel capabilities of your computer to perform GIS analysis.
Want to learn more? Warning, shameless plug coming up.
Obtaining faster results using parallel processing on your desktop PC is pretty easy - but, there is way more to the story. There are a lot of best practices you can learn to better prepare your data and software to take advantage of your PC's multicore system. There are a lot of considerations like understanding your RAM, HDD, indexes, coordinate systems, and how to call off the dogs (so to speak) to not use too many cores that could be sub-optimal (that is, finding the sweet spot). If you want to learn more about how to utilize parallel processing with ArcGIS Pro, Postgres, and Manifold, you can check out my Udemy class on Parallel Processing with Desktop GIS. It's only $12.99 with the coupon from the link, and provides over 3 hours of instruction and hands-on demonstration, along with the Philadelphia data in a geodatabase, a Postgres .backup file, and a Manifold GIS .map file. Please note, this is a work in progress, and I haven't posted the Manifold videos yet - they should be up next week.