GIS Analysis of Overlapping Layers

My friend is attempting to quantify the area of different land use classes within the areas that are upstream from her sample points. This means she needs sample points, land use, and upstream areas (i.e. sub-watersheds). The problem is, her watersheds overlap, the buffer distances around the sample points overlap one another AND the watersheds, and she then needs to summarize the results. It's a tricky problem because of the overlaps: GIS software doesn't really like it when features within a single layer overlap one another, and if a buffer for a sample point spans two different watersheds, things get even trickier.

Sure, you could solve it with a few for loops and insert the results into a new table, but that is a real hassle. Also, I have to do it for different distances and different land cover types.

So, I once again turned to SQL. Remember what I keep telling you: spatial is not special, it's just another data type. This video steps you through performing a multi-ring buffer on overlapping objects from three different layers: sample points, watersheds, and land use. As we step through the SQL, you'll see how easy it is to put the query together, and at the end, you'll see how flexible the query is should you want to change your objectives. For good measure, we'll throw in a little bit of parallel processing.
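To give you a feel for the shape of that query, here is a PostGIS-flavored sketch run from Python with psycopg2. It is not the query from the video (which may use a different SQL dialect), and every table and column name in it is made up, but it shows the pattern: buffer each sample point at several distances, clip each buffer to that point's upstream watershed, intersect with land use, and sum the areas by sample point, distance, and land use class.

# Hypothetical layer names: sample_points, watersheds, landuse; hypothetical
# columns: sample_id, lu_class, geom. A sketch of the logic, not the video's query.
import psycopg2

SQL = """
SELECT p.sample_id,
       r.dist AS buffer_m,
       l.lu_class,
       SUM(ST_Area(
             ST_Intersection(
               ST_Intersection(ST_Buffer(p.geom, r.dist), w.geom),
               l.geom))) AS area_m2
FROM sample_points p
JOIN watersheds w ON w.sample_id = p.sample_id          -- each point's upstream area
CROSS JOIN (VALUES (100), (500), (1000)) AS r(dist)     -- the multi-ring distances
JOIN landuse l ON ST_Intersects(l.geom, ST_Buffer(p.geom, r.dist))
GROUP BY p.sample_id, r.dist, l.lu_class
ORDER BY p.sample_id, r.dist, l.lu_class;
"""

with psycopg2.connect("dbname=gis") as conn, conn.cursor() as cur:
    cur.execute(SQL)
    for sample_id, buffer_m, lu_class, area_m2 in cur:
        print(sample_id, buffer_m, lu_class, area_m2)

Because every overlapping combination of buffer, watershed, and land use polygon gets its own intersection before the GROUP BY, the overlaps that trip up a desktop overlay tool are handled for free, and changing the distances or adding a land cover filter is a one-line edit.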

QGIS GPU Processing Update

Today my students got to present some of their work at the Maryland Geographic Information Committee Quarterly meeting.  On Monday we head to the University of Maryland at College Park to present our work with the Computer Science Department.

If you recall, I outlined our objective for the summer here, and provided some updated information here.  Also, recall that we’ve posted the code on Github here.  We’ve made good progress this week, including modifying the code to utilize the GDAL libraries for reading the raster DEM, and also have things working as a valid QGIS plug-in as well as a command line function.  Stop by the github site and have a look at the code, or try running it on your own data.
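For those curious what the GDAL-based read looks like, here is a minimal row-at-a-time sketch using the GDAL Python bindings. It is a simplification, not the plug-in code on GitHub, and the filename is a placeholder.

# Read a DEM one row at a time with GDAL; "dem.tif" is a placeholder filename.
from osgeo import gdal

ds = gdal.Open("dem.tif")                      # open the raster DEM
band = ds.GetRasterBand(1)                     # elevation values are in band 1
cols, rows = ds.RasterXSize, ds.RasterYSize

for y in range(rows):
    row = band.ReadAsArray(0, y, cols, 1)      # one row as a 1 x cols NumPy array
    # ... hand the row off to the slope/hillshade computation here ...

ds = None                                      # close the dataset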

If you recall, my concern was that GIS processes are big on data volume but light on calculations per data element. With that concern in the back of my mind, we decided to generate a hillshade, since it appears to be more computationally expensive than slope.

Well, some early results shed some light on what is going on. As you can see from Table 1, the CUDA version of our slope computation is around 1.5x faster than the serial version. That is good, but nothing to write home about. You'll notice that the Total Time is less than the sum of the parts: we are not only using the GPU to parallelize the slope computation, but also using multi-threading so that we read data in, perform the computations, and write data out at the same time. So, this QGIS plug-in is doing massive GPU parallel processing AND multi-core processing!

More QGIS/GPU progress – it’s getting faster…

This summer I am working with three talented undergraduates as part of a National Science Foundation (NSF) program called Research Experience for Undergraduates (REU): Alex Fuerst (Xavier University), Charlie Kazer (Swarthmore College), and Billy Hoffman (Salisbury University).  This post continues to present their work in performing parallel processing with QGIS.

Today I want to continue discussing our project to create a QGIS plug-in for terrain analysis.  In my previous post, I discussed our goals and some early results.  Our first cut showed that pyCUDA was running about as fast as a serial C++ version of our code.  And, at that time, QGIS was still about as fast as our own serial C++ code and our pyCUDA code when generating slope.

My students decided to rewrite the code to create a scheduler to manage the I/O, CPU, and GPU activities.  The scheduler is divided into three modules:

dataLoader – The dataLoader takes in a GeoTIFF input file (originally an ESRI ASCII grid) and sends data one line at a time to a buffer it shares with gpuCalc.

gpuCalc – The gpuCalc grabs data from the buffer it shares with dataLoader and moves it into CUDA page-locked memory. When the memory is full, the program transfers the data to the GPU and executes the CUDA kernel for the calculation.  When the calculation completes, the program writes the results to a second buffer it shares with dataSaver.

dataSaver – The dataSaver gets the data from the buffer it shares with gpuCalc and writes the results out to a GeoTIFF file (again, originally an ESRI ASCII grid).

An example of the architecture we are using is shown below:

[Figure: the dataLoader / gpuCalc / dataSaver pipeline architecture]
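As a rough illustration of that architecture (not the students' actual code, which is on GitHub), here is a minimal three-stage pipeline using Python threads, with queues standing in for the shared buffers; the CUDA kernel is replaced by a stub so the structure is easy to see.

# Minimal sketch of the dataLoader / gpuCalc / dataSaver pipeline.
# Queues play the role of the shared buffers; the "kernel" is a stub.
import threading
import queue

SENTINEL = None                                   # marks end-of-data between stages

def data_loader(rows, to_calc):
    """dataLoader: push raster rows into the buffer shared with gpuCalc."""
    for row in rows:
        to_calc.put(row)
    to_calc.put(SENTINEL)

def gpu_calc(to_calc, to_saver):
    """gpuCalc: pull rows, run the (stubbed) kernel, pass results to dataSaver."""
    while True:
        row = to_calc.get()
        if row is SENTINEL:
            to_saver.put(SENTINEL)
            break
        result = [v * 2 for v in row]             # stand-in for the CUDA slope/hillshade kernel
        to_saver.put(result)

def data_saver(to_saver, out):
    """dataSaver: pull results from the shared buffer and write them out."""
    while True:
        result = to_saver.get()
        if result is SENTINEL:
            break
        out.append(result)                        # stand-in for writing a GeoTIFF row

if __name__ == "__main__":
    rows = [[1, 2, 3], [4, 5, 6]]                 # toy "raster"
    out = []
    buf1, buf2 = queue.Queue(maxsize=64), queue.Queue(maxsize=64)
    threads = [threading.Thread(target=data_loader, args=(rows, buf1)),
               threading.Thread(target=gpu_calc, args=(buf1, buf2)),
               threading.Thread(target=data_saver, args=(buf2, out))]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(out)                                    # [[2, 4, 6], [8, 10, 12]]

The point of the design is that all three stages run at once: while gpuCalc is busy, dataLoader is already reading the next rows and dataSaver is writing finished ones, which is why the total time ends up being less than the sum of the parts.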

The results indicated that the GPU was running faster for slope, but still not enough to make me too happy. It was about 25% faster, and I wasn’t going to be satisfied until we were an order of magnitude faster.

We tried running our algorithms with ESRI ASCII files because they are easier to read than a GeoTIFF. Unfortunately, the input time to read the file was horrendous (like 15 minutes!). So, we spent a little time rewriting the algorithm to work with GeoTIFFs (much thanks to some generous souls on gis.stackexchange.com who helped Alex figure out the GeoTIFF read/write), and found them to run substantially faster.  Also, we decided to run the hillshade algorithm, which includes many more computations than a simple slope or aspect.  In this case, the results are shown below.

We had a breakthrough with PyCUDA, so I want to wait another day or so before rewriting the serial version in C++. For now, we'll use QGIS and the terrain plugin to calculate the hillshade on an 8.3 GB raster file (1.5 GB as a GeoTIFF):

                 PyCUDA     QGIS
Input              8:30     2:30
Output             3:48     5:00
Computation        5:09     9:00
TOTAL TIME*        7:03    15:30

We’ve placed a version of the code up on GitHub here.  I hope you get a chance to try it out, and maybe even help us make this a legitimate plug-in for QGIS.

* The PyCUDA total time is not the sum of its parts because we are using multi-threading in our code: while we are reading data in, we are also writing data out in another thread, and performing the computations in yet another thread.


Parallel Processing with QGIS

Once again, I am continuing my role as a mentor in a National Science Foundation (NSF) Research Experience for Undergraduates program.  This year we’ve decided to build a QGIS plug-in for terrain analysis, as it is embarrassingly parallel (slope, aspect, etc.).  We are doing four things as we generate slope for different-sized digital elevation models:

  1. A pure python implementation (for an easy plug-in)
  2. A serial-based C++ implementation (as a baseline for a non-parallel solution)
  3. A pyCUDA implementation (using the GPU for parallel processing)
  4. A C++ based parallel solution using the GPU

We plan to put our results on a shared GitHub site (we are in the process of cleaning up the code) so that people can start experimenting with it, and use our example to begin generating more parallel solutions for QGIS (or GDAL for that matter).
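To make “slope” concrete, here is a stripped-down sketch of what the pure Python version (item 1 above) has to compute for every cell: finite differences on the elevation grid, then the arctangent of the gradient magnitude. A production slope algorithm works on a full 3x3 window and handles edges and nodata; this sketch uses NumPy's central differences instead, and the cell size and toy DEM are made up, so treat it only as an illustration.

# Slope in degrees from a 2D elevation array; an illustration, not the plug-in code.
import numpy as np

def slope_degrees(dem, cellsize):
    """Return a slope grid (degrees) from a 2D elevation array."""
    dz_dy, dz_dx = np.gradient(dem, cellsize)     # rise over run in y and x
    return np.degrees(np.arctan(np.hypot(dz_dx, dz_dy)))

dem = np.array([[10.0, 10.0, 10.0],
                [10.0, 12.0, 10.0],
                [10.0, 10.0, 10.0]])              # toy 3 x 3 DEM
print(slope_degrees(dem, cellsize=30.0))

Every cell's slope depends only on its immediate neighbors, which is exactly why the problem is embarrassingly parallel: the grid can be split across GPU threads (or CPU cores) with almost no coordination.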

Here are some early results:

From days to seconds: experiences with parallel processing and GIS (Part I, the team)

Many of you know that I have been working with parallel processing for GIS in the form of general-purpose computing on graphics processing units (GPGPU) using video cards.  However, this year I decided to change things up a bit and focus more on CPU-based parallel processing.  To that end, we began working with Hadoop along with SpatialHadoop.

I plan to have around 7 or 8 blog posts on this over the next 4 or 5 weeks.  My initial outline is:

Part I, the team (this post)
Part II, the point-in-polygon problem
Part III, solving the problem in hours, not days
Part IV, solving the problem in minutes, not hours
Part V, solving the problem in seconds, not minutes
Part VI, lessons learned and challenge to the GIS community
Part VII, advice on building your own server

In the posts, I will tell you what we did and how we did it, and we will also share our code.

So, spoiler alert: yes, we took a classic GIS process that required days to complete, made some adjustments to complete it in hours, built our own cluster of 4 computers with 16 CPUs to complete it in minutes using parallel processing, and finally went all-out and rented time on Amazon EC2 to complete the job in the realm of seconds (BTW, the EC2 rental time to complete the job cost around $5.00).

But first, I want to introduce you to the undergraduates who worked with me on this project.

My Research Interns

This summer, as part of a National Science Foundation (NSF) Research Experience for Undergraduates (REU), I worked with two very smart guys: Alan Young and Robbie Stancil.

Robbie Stancil

A Junior Math and Computer Science major, Robbie is an extraordinary student.  If there was something you could do in college, Robbie did it.  Just some of the things Robbie has done in 3 years at Salisbury University include being part of the Bellevance Honors Program, serving as a Presidential Citizens Scholar, a Bellevance Honors Student Ambassador, Resident Assistant (RA) for the Honors Living Learning Community, a Salisbury University student member of the Alumni Association, and the 2013-2014 Manokin Hall Residents Council Treasurer.  Outside of academics, Robbie has also held internships at NASA Wallops Flight Facility and ADNET Systems.

As you might imagine, Robbie’s name has become a fixture on the Dean’s List at Salisbury University.  He sort of reminds me of when I was a student, the only difference being that my writeup would have said, “Art Lembo was a mediocre college student, drank beer his freshman year, and was not smart enough to get a scholarship in college.”  Other than that, we’re practically identical.

Alan Young

Alan Young comes to us from Berry College in Georgia, where he is a Sophomore Math and Computer Science major.  Alan holds multiple scholarships at Berry, including the Berry Academic Scholarship and the Georgia Zell Miller Scholarship.  He has a job in the IT Department at Berry College, and is also involved in a number of organizations on campus, including the math club, the computer science club, and the Baptist Collegiate Ministries.  Alan has participated in a number of computer science programming competitions.  It is hard to believe he has only completed his Sophomore year in college!

Both of these guys are exceptionally smart, hard-working students.  I loved working with them.  While I am thrilled that Robbie will be here next semester as a student, I am sad to say goodbye to Alan as he returns to Berry College.  I can’t wait to start writing recommendation letters for these two in the next couple of years.

My next post will go over the problem we faced, the data we used, and the roadmap we set for ourselves. Stay tuned.

When More is Less… lessons from processing large data files

My good friend Stuart Hamilton gave me a fun conundrum to try out. He has a file of province boundaries (400 areas) and lidar-derived mangrove locations (37 million points – 2.2 GB in size). He wants to find the number of mangrove points contained in each area.  He also wants to know which country each mangrove location is in.  An overview of the area is here:

[Figure: overview map of the province boundaries and mangrove points]

but, as you zoom in, you can see that there are a tremendous number of points:

[Figure: zoomed-in view showing the density of the mangrove points]

The problem

You would think that overlaying 37 million points with 400 polygons wouldn’t be too much trouble – but it was.  Big time trouble.  In fact, after running for days in ArcGIS, Manifold GIS, PostGRES/PostGIS, and SpatialHadoop, it simply would not complete.
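For reference, the textbook formulation of the job is just a point-in-polygon join with a count per polygon. Here it is as a PostGIS-flavored query run from Python; the table and column names (provinces, mangrove_pts, prov_id, country, geom) are hypothetical, and something along these lines is presumably what ran for days without finishing at this scale.

# The naive point-in-polygon count; hypothetical table and column names.
import psycopg2

SQL = """
SELECT a.prov_id,
       a.country,
       COUNT(p.geom) AS mangrove_count
FROM provinces a
LEFT JOIN mangrove_pts p
       ON ST_Contains(a.geom, p.geom)    -- which polygon each point falls in
GROUP BY a.prov_id, a.country
ORDER BY mangrove_count DESC;
"""

with psycopg2.connect("dbname=gis") as conn, conn.cursor() as cur:
    cur.execute(SQL)
    for prov_id, country, n in cur:
        print(prov_id, country, n)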

Another Radian Test – Finding the distance between lines and areas

Following up on my previous post with ArcGIS and the Near Table, I created an SQL query in Manifold 8 to both compute the near distances and group the points into distance bands (I grouped them every 50 km).  The entire process took 47 seconds, or about 9 times faster than ArcGIS 10.1.

But, to keep things on the same playing field, I just computed the NEAR part of the query, and it ran in 40 seconds.  So, Manifold 8 was way faster than ArcGIS 10.1, but 3x slower than ArcGIS Pro.

I then wrote the following query in the Radian engine:

SELECT count(*) AS CNT, 
       first(floor(GeomDistance([L Table].[Geom (I)], 
       [P Table].[Geom (I)], 1)/50000)*50000+50000) AS DistZone, 
       [UNIQUE_ID] 
INTO bobo 
FROM [P Table] 
RIGHT JOIN [L Table] 
ON GeomWithin([L Table].[Geom (I)],[P Table].[Geom (I)], 500000,1) 
GROUP BY [UNIQUE_ID] 
THREADS 4

This query took 30 seconds, down from 47 seconds in Manifold 8.

Once again, to level the playing field, I created a query to just run the NEAR aspect:

SELECT GeomDistance([L Table].[Geom (I)], [P Table].[Geom (I)], 1) AS DistZone, [UNIQUE_ID]
INTO bobo2
FROM [P Table]
RIGHT JOIN [L Table] ON 
GeomWithin([L Table].[Geom (I)],[P Table].[Geom (I)], 500000,1)
THREADS 4
BATCH 64

This ran in 20 seconds.  In this case, ArcGIS Pro ran slightly faster than Manifold 9 – but remember, I am still working with an alpha/beta release of Radian, and not all of the optimizations have been turned on. I can’t wait to see what the next beta will reveal.

Again, the simplicity of SQL in conjunction with the parallel nature of the Radian engine provides some very interesting opportunities for working with complex processes and large amounts of data.

A Poor Man’s Parallel Processor for GIS

In addition to SQL, I am also interested in processing large volumes of spatial data.  One of the newest rages in “big data” is Hadoop.  According to Wikipedia:

Apache Hadoop is an open-source software framework written in Java for distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware.

One way this is implemented is through a programming model called MapReduce.  Don’t get too excited, it doesn’t have anything to do with maps or GIS – but it is very clever and powerful for certain types of problems.  The concept is that if you have a really large dataset, you divide and conquer it in a number of steps.  For example, say we wanted to count all the people named “John” in the phonebook, and say we had 26 computers in a cluster – we might solve this by:

1.  Use each computer (1-26) to find all the “Johns” whose last name starts with one letter of the alphabet (A-Z).  That way, you have effectively broken the problem into 26 smaller units.

2.  Once each computer has counted up the number of Johns, you have reduced the dataset (hence, MapReduce) to 26 variables.

3.  Now, count up the total of the 26 variables.

That is an oversimplified version, of course, but it helps to illustrate what we want to do.  I understand that the University of Minnesota has created a set of functions called SpatialHadoop.  I want to test this over the summer, but for now I decided to create my own poor man’s version using PostGRES.
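To make the phonebook example a little more concrete, here is a toy version in Python, with a multiprocessing pool standing in for the 26 computers: the map step counts the Johns in one last-name bucket, and the reduce step sums the per-bucket counts. The phonebook data is obviously made up.

# Toy MapReduce: count the Johns, split by the first letter of the last name.
import string
from multiprocessing import Pool

PHONEBOOK = [("Smith", "John"), ("Jones", "Mary"),
             ("Adams", "John"), ("Zimmer", "John")]      # stand-in data

def count_johns(letter):
    """Map step: count the Johns whose last name starts with this letter."""
    return sum(1 for last, first in PHONEBOOK
               if last.startswith(letter) and first == "John")

if __name__ == "__main__":
    with Pool(4) as pool:
        per_letter = pool.map(count_johns, string.ascii_uppercase)   # 26 buckets
    print(sum(per_letter))                               # reduce step: total Johns -> 3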