Big Data Results

I wanted to revisit the taxi data example that I previously blogged about.  I had a 6GB file of 16 million taxi pickup locations and 263 taxi zones.  I wanted to determine the number of pickups in each zone, along with the sum of all the fares.  Below is a more in-depth review of what was done, but for those of you not wanting to read ahead, here are the result highlights:

Platform | Command | Time
ArcGIS 10.4 | AddJoin_Management | Out of memory
ArcGIS Pro | Summarize Within | 1h 27m*
ArcGIS Server (Big Data GeoAnalytics with Big Data File Share) | Summarize Within / Aggregate Points | ~2m
Manifold 9 | GeomOverlayContained | 3m 27s
Postgres/PostGIS | ST_Contains | 10m 30s
Postgres/PostGIS (optimized) | ST_Contains | 1m 40s
*I’m happy ArcGIS Pro ran at this speed, but I think it can do better.  This is a geodatabase straight out of the box.  I think we can fiddle with indexes and even restructure the data to get things to run faster.  That is something I’ll work on next week.

I was sufficiently impressed with how some of the newer approaches were able to improve the performance.  Let’s dive in:

The Data and Computer

The data was obtained from the NYC Taxi and Limousine Commission for October 2012.  The approximately 16 million taxi pickup locations and 263 taxi zone polygons required around 6GB of storage.  I have the data in a geodatabase here.  You can see below that this is a lot of data:

[Figure: the taxi pickup data]

I used my Cyberpower gaming PC running Windows 10, with a 3.0GHz i7 processor (4 cores), 12GB of RAM, and a solid-state drive.   So, pretty much what every teenager has in their bedroom.

The Question

The question I wanted to answer was: how many taxi pickups were there for each zone, and what was the total fare amount?  Fair question (no pun intended!).  So, I decided to try to answer this question with ArcGIS, Manifold, and Postgres.

ArcGIS 10.4

As most of you know, ArcGIS 10.4 is a 32-bit application.  So, I wondered how well it could tackle this problem.  I attempted to perform a spatial table join (AddJoin_Management) between the taxi pickup locations and the taxi zones.  In order to give ArcGIS a fighting chance, I moved the data into a geodatabase (that way, the layers would have spatial indexes).  After running the join for a few hours, ArcGIS 10.4 reported an Out of Memory error.

ArcGIS Pro

Next, I moved on to ArcGIS Pro, which is a true 64-bit application.  Also, ArcGIS Pro has a number of tools to do exactly what I want.  One was Summarize Within.   ESRI makes it really easy to ask these sorts of questions in ArcGIS Pro.  So, I ran the function, and got a resulting table in 1h 27m.  At this point in my experiment, I was fairly pleased – at least I got an answer, and it is something I could do over a lunch break.

ArcGIS Server with GeoAnalytics Server

I knew that ESRI was touting their new GeoAnalytics Server, so I wanted to give that a try.   Unfortunately, I do not own GeoAnalytics Server.  Fortunately, a friend owns it, and was able to test it out on his computer.  To my amazement, he ran the query in about 2m.  I was astounded – hats off to ESRI.  This product is designed for big data for sure.  I would say if you have an ArcGIS Server license, this is something worth checking out for big data processing.  Nothing cryptic like Hadoop – the same ArcGIS Pro interface is there to run the data under the GeoAnalytics Server.

Manifold 9

As most of you know, I am a big fan of Manifold GIS, and have often discussed my work with the product.  Manifold 9 is designed for big data analytics.  They have a query engine that makes use of parallel processing.  The function I used was GeomOverlayContainedPar.  It actually works as a GUI, but I bypassed that and just wrote a straight-up SQL query which is a bit more flexible:

SELECT s_mfd_id AS [mfd_id], sgeom AS [Geom], sumfare, avgfare, s_zone, numrides
INTO sumtable
FROM (
    SELECT s_mfd_id, count(o_mfd_id) AS numrides, avg([o_fare_amount]) AS avgfare,
           sum([o_fare_amount]) AS sumfare, first(s_geom) AS sgeom, first(s_zone) AS s_zone
    FROM (
        SELECT s_zone, o_fare_amount, s_mfd_id, s_geom, o_mfd_id
        FROM CALL GeomOverlayContainedPar([taxi_zones] ([mfd_id], [zone], [Geom]),
                                          [pickup Drawing] ([pu_geom], [mfd_id], [fare_amount]), 0,
                                          ThreadConfig(SystemCpuCount()))
    )
    GROUP BY s_mfd_id
)

I won’t go into detail on the query, but in this case, I was using all 4 cores (actually 8, when you consider the hyperthreading) to process the data.  The query ran and returned the table in 3m 27s.  Again, I was sufficiently impressed, given that Manifold 9 sells for around $425.

I like to needle my friends at Manifold, so I sent them the data and the results.  Stay tuned – I’d be willing to bet that we see them get under 3 minutes fairly soon.

Postgres/PostGIS

It’s no secret that I’m also a fan of FOSS4g software like Postgres, and I teach a number of courses in the use of Postgres.  So, I wanted to see how this would run in Postgres with PostGIS.  The first thing I did was create a straight-up SQL statement:

SELECT count(*) AS totrides, taxizones.zone, sum(taxisandy.fare_amount)
FROM taxizones, taxisandy
WHERE ST_Contains(taxizones."Geom",taxisandy.pu_geom)
GROUP BY zone

Good grief, it doesn’t get much simpler than that.   This query ran in 10m 27s.  I was pleased with this result.  I mean, after all, it’s free!  And that query is super simple to write.  But I wasn’t done yet.  I knew there were some ways to optimize things.

Postgres/PostGIS optimized

I had already created a spatial index, so that was good.  But there were two more things I was hoping to do: vacuum the table, and cluster the data.  So, what do these commands do?

VACUUM reclaims storage occupied by dead tuples. In normal PostgreSQL operation, tuples that are deleted or obsoleted by an update are not physically removed from their table; they remain present until a VACUUM is done.

CLUSTER physically reorders the data on the disk so that data that should be near one another in the database are actually near one another on the disk.  In other words, points in Brooklyn are now physically stored on the disk near other points in Brooklyn, and the same is true for all the other boroughs.  I wasn’t sure if this would do anything, since I already had a solid-state drive.  A friend of mine in the Computer Science Department told me that it would.  I would tell you what he said, but quite frankly his explanation was too technical for me!
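
For reference, here is roughly what the spatial indexes I mentioned look like – a minimal sketch, assuming GiST indexes on the geometry columns (the index names are the ones the CLUSTER commands below refer to):

CREATE INDEX pugeom_idx ON taxisandy USING GIST (pu_geom);   -- index on the 16 million pickup points
CREATE INDEX "Geom_x" ON taxizones USING GIST ("Geom");      -- index on the taxi zone polygons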

So, how did I do this?  First, I vacuumed and clustered the data:

VACUUM ANALYZE taxizones ("Geom"); 
VACUUM ANALYZE taxisandy (pu_geom);
CLUSTER taxisandy USING pugeom_idx; 
CLUSTER taxizones USING "Geom_x";

Now, running CLUSTER on the pickup locations did in fact take time – 18 minutes.  That is a one-time expense we pay; after that, we can run whatever query we want, over and over again.  The query is a little more involved than the previous one because I wanted to write the results to a new table, so I had to rejoin the results with the zones:

SELECT taxizones."Geom", sumfare, a.zone
INTO sumtable
FROM taxizones,
    (SELECT taxizones.zone, sum(taxisandy.fare_amount) AS sumfare
     FROM taxizones
     JOIN taxisandy
       ON ST_Contains("Geom", pu_geom)
     GROUP BY zone) AS a
WHERE taxizones.zone = a.zone

Drum roll, please. The query completed in 1m 40s.  Wow!  Of course, with PostGIS you have to factor in the cost: $0.  I guess you get what you pay for????

So, what is the takeaway?  Well, GIS products are evolving, and are now positioned to handle really large data sets in ways we hadn’t been able to do before.  I’m impressed with each of these products.

Two final notes:

If you live in the Denver area, please come and visit with me as I teach two workshops on FOSS4g, big data geoanalytics, Python and SQL: one at Colorado State University in Fort Collins on October 25, and one in Denver at Metropolitan State University of Denver (October 26).  I’d love to see you there!

And as always, all my online video courses are available at www.gisadvisor.com.   You might find Spatial SQL with PostGIS and Big Data Analytics with GIS to be two especially useful courses for pursuing this kind of work further.


Finding “Dangles” with PostGIS

Do you have a set of lines for which you need to determine whether there are any “dangle” nodes?  A dangle is a line segment that overhangs another line segment.  Now, some dangles are valid, like a pipe that terminates in a cul-de-sac.

A few people have posted about this already, but I figured I would give it a shot as well, as I think my SQL is a little more terse.  Anyway, here is the query, and we’ll talk about it line by line:

SELECT DISTINCT g1 AS geom
INTO dangles
FROM plines, 
    (SELECT g AS g1 FROM  
         (SELECT g, count(*) AS cnt  
          FROM  
              (SELECT  ST_StartPoint(g) AS g FROM plines
               UNION ALL
               SELECT  ST_EndPoint(g) AS g FROM plines ) AS T1 
         GROUP BY g) AS T2
     WHERE cnt = 1) AS T3
WHERE ST_Distance(g1, g) BETWEEN 0.01 AND 2;

The first thing to notice is the innermost SELECT statement.  We are using ST_StartPoint and ST_EndPoint to grab the endpoints of the lines – these we’ll call nodes.

The next thing to notice is where we get the count of the nodes.  We grab all the nodes, then use the GROUP BY clause to count how many nodes occupy each location in space.  Now, an intersection of two lines will have 2 nodes at that location (one from the first line and one from the second line).  But a “dangle” will have only one node occupying its location.  This is where the next section of SQL comes into play.

What we want to do is select only those nodes where the count (cnt) is equal to 1.  That means the node is just sitting there in space – it is a “dangle”.  But not all dangles are created equal, as I said above.  The final WHERE clause lets me specify how far a dangle node can be from another line: in the example above, we are choosing nodes within 2m of a line (the 0.01 lower bound keeps a node from matching the line it belongs to, since that distance is 0).  The last bit of SQL to consider is the DISTINCT clause.  A node can be near more than one line, and we don’t want to double count it, so DISTINCT eliminates the duplicates.

That’s it!  Pretty easy.  Think of the ST_Distance function as a variant of the basic SQL to find dangles.  There are other variants we could add if we’d like – for example, requiring that the line near the dangle be shorter than 5m, or something like that.  That is just a matter of adding another WHERE condition, as in the sketch below.
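
Here is a minimal sketch of that variant, assuming the same plines table as above (the output table name is just for illustration):

SELECT DISTINCT g1 AS geom
INTO short_line_dangles
FROM plines,
    (SELECT g AS g1 FROM
         (SELECT g, count(*) AS cnt
          FROM
              (SELECT ST_StartPoint(g) AS g FROM plines
               UNION ALL
               SELECT ST_EndPoint(g) AS g FROM plines) AS T1
          GROUP BY g) AS T2
     WHERE cnt = 1) AS T3
WHERE ST_Distance(g1, g) BETWEEN 0.01 AND 2
  AND ST_Length(g) < 5;   -- extra condition: only consider lines shorter than 5 map units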


Multi-Ring (non-overlapping) Buffers with PostGIS

I was interested in creating multi-ring buffers, but with a twist: I didn’t want the buffers to overlap one another.  In other words, if I had concentric buffers with distances of 100, 200, and 300 around a point, I want those buffers to reflect distances of 0-100, 100-200, and 200-300.  I don’t want them overlapping one another.  You can actually do that with the PostGIS function ST_SymDifference, but there are a few nuances that you have to be aware of.

Unlike some of my longer videos, this one will start out with the answer, and then we’ll walk through all the SQL.  You’ll see it isn’t so bad.  And, you continue to see that spatial is not special!  It’s only 20 minutes long, but the answer is shown in the first minute.

In the video I’ll slowly walk you through all the spatial SQL to create buffers for the points and trim all the overlaps so that there are no overlapping buffers.  You’ll learn some really cool Postgres commands  including:

ST_Buffer, ST_SymDifference, DISTINCT ON, and SET WITH OIDS.

I found myself amazed that with a few SQL tweaks, we were able to turn ordinary buffers into more useful non-overlapping buffers.  I hope you enjoy the video.
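
If you just want the flavor of the approach without watching, here is a minimal sketch (not the exact SQL from the video) that builds non-overlapping 100/200/300-unit rings around a hypothetical points table pts(id, geom):

SELECT p.id,
       d.dist,
       CASE WHEN d.dist = 100
            THEN ST_Buffer(p.geom, 100)                              -- innermost ring is just the plain buffer
            ELSE ST_SymDifference(ST_Buffer(p.geom, d.dist),
                                  ST_Buffer(p.geom, d.dist - 100))   -- outer buffer minus the buffer inside it
       END AS geom
INTO ring_buffers
FROM pts AS p
CROSS JOIN (VALUES (100), (200), (300)) AS d(dist);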

I’d like to create more videos like that – please leave some comments below so that I know whether others want me to continue these kinds of tutorials.

 If you want to learn more about SQL, programming, open source GIS, or Manifold GIS, check out courses at www.gisadvisor.com.  

Cartography in ArcGIS and QGIS

Today I want to introduce you to another one of my students, Meghan Murphy.  Meghan is an outstanding student, and one of the top undergraduates I have ever worked with (I know, I say that a lot, but they just keep getting better and better).  Even as a Sophomore, Meghan was always helping other students out, even the Seniors – students would wait for Meghan to organize everyone to study for upcoming exams.

She also has an innate ability to work with GIS and pick up new things: one day she has never programmed in Python, and the next day she has a couple hundred lines of Python written and running in ArcGIS!  So, I was so happy when Meghan said she wanted to take a special course in Open Source GIS that I was offering this semester.  We covered QGIS, Postgres/PostGIS, GDAL, and GeoServer.  For her final project, Meghan decided she wanted to compare the cartography capabilities of ArcGIS and QGIS, and make a video about it (maybe she was inspired by my videos, or maybe she just figured after watching Lembo’s videos, how could I do worse!).

Whatever her reason, like everything else she does, this turned out great, especially since she had never done a live tutorial like this.  So, I encourage you to watch the side-by-side comparisons for creating a basic cartographic product in both ArcGIS and QGIS.  It’s about 40 minutes long, but worth every minute: I found that I learned some things I hadn’t known about some of the cartographic tools.  And, on that note, I’ll have more videos from my undergraduates shortly (some built web maps, others built an enterprise GIS with Postgres).

If you want to learn more about open source GIS, Python programming, Spatial SQL, or Spatial Statistics, check out my online courses at www.gisadvisor.com.  

The Web Duel – Last Thoughts

This is a continuation of Mark Balwanz’s blog posts on his creation of web mapping sites using both ESRI and GeoServer.  Today he shares his final thoughts on both implementations.

Over the first three parts of this blog series (here, here, and here), I have laid out my project plan, walked you through my ESRI implementation, and my Open Source implementation. During this fourth and final blog I plan on sharing with you my overall thoughts on both implementations in regards to what I liked and disliked. As I mentioned in Part 1 of this series, everything I share here is just my opinion and is based on this one project. I also want to point out that most of my previous experience, both academic and professional, is based in ESRI and is probably shaping some of my opinions.

I will start by sharing my opinions about my experience building the ESRI version. I found the ESRI build to be quite simple thanks to the enormous amount of information that can be found online. Since the entire stack is from the same company, the integration between ArcGIS Desktop, ArcGIS Server, and the API was seamless and very easy to navigate. Publishing data to ArcGIS Server was a breeze thanks to the publishing tools included in the desktop software. I also found it extremely easy to pull ArcGIS Server map services into the application that was built with the API. None of this should come as a surprise, since these things were built by the same company and are meant to work together. Another positive I found with ESRI is their documentation. The ArcGIS for Developers website, and specifically the JavaScript samples and API reference page, contained everything I needed to build my application. The online community of users is quite large as well, which provides a tremendous wealth of information within forum posts. I think one of the huge benefits that the ESRI implementation provided was the ArcGIS Server capabilities: the identify task, geometry service (although I did not use it), and many others. I found using ArcGIS for Server to be much easier than GeoServer when it came to writing JavaScript code to query the services. I really did not run into anything with the ESRI implementation that frustrated me. Of course, if I were building something like this for a company I would have to factor in the cost of the systems as well. The ESRI stack has the advantage of being built by a single company with a single vision of how everything should fit together, and that is something that, by its nature, the Open Source stack will never have.

So as you can tell, I enjoyed my time building the ESRI version, but I was really excited to see how the Open Source one would compare. I think I mentioned this in Part 3 of this series, but I was once again very impressed with QGIS. I found it very intuitive, and the number of plugins that you can add is very impressive. The GeoServer Manager plugin made publishing a WMS to GeoServer a trivial task. As mentioned above, I do prefer ArcGIS Server over GeoServer, but that being said, the GeoServer manager page is very easy to use and seems to be well thought out. The open source JavaScript libraries were also quite impressive. Although I ended up switching from Leaflet to OpenLayers 3, I still came away impressed by how easy Leaflet was to use and will be looking to use it in the future. The one problem I had with OpenLayers was how hard it was to find accurate information about OpenLayers 3 online. Most everything I found was for OpenLayers 2, and that does not migrate very well to OpenLayers 3. Even the book I bought for OpenLayers 3 had some code that is not included in the library anymore and therefore did not work. The samples page on the OpenLayers 3 website was helpful, but overall I was not impressed by their documentation. Another area that I felt was lacking was the integration of GeoServer and OpenLayers. I found it complicated to perform relatively simple spatial queries against a GeoServer WMS from within OpenLayers. What made this more difficult is that I had to search through two sets of documentation to solve my problem (GeoServer and OpenLayers 3) rather than just one. I do think that some of these problems would probably decrease the more I used GeoServer and OpenLayers, but better documentation would make it easier for people to jump into the Open Source world. After completing this project, though, I am pretty confident that the Open Source world has the capabilities to match ESRI, and the fact that the software is free is a huge benefit. Additional work to more seamlessly integrate these Open Source projects would go a long way toward making them more user friendly.

As someone who has spent most of his GIS life working with ESRI, I have to admit it was a little uncomfortable to move into the Open Source world. However, being uncomfortable was a good thing as it pushed me to learn a lot of new technologies. I think Open Source can be a great way for organizations to start using GIS as there is a lot less upfront cost and with a little research you can find the Open Source project that best fits your need.

I hope you have enjoyed this blog series and have maybe learned something that you can use. Please feel free to leave comments if you have any questions and thanks for reading!

Workshops at the Maryland Geospatial Conference

Maryland’s Geospatial Conference (TUgis) is on March 20-21, 2017.  I first attended TUgis in 1990, and it is always a great conference.  It is not too large, so it is a great way to have extended time with people.  So, if you had a technical question for someone from, say, ESRI, you could simply stop by their booth and have a chat.

This year I was asked to support the pre-conference workshops.  I will be presenting two workshops with the help of my students.  If you recall, my students are quite good at instructing others about GIS technology.  I’m really looking forward to the conference and interacting with people during the workshop.  Keep in mind, this is not something we are just throwing together – we’ve been spending a lot of time thinking about how to effectively move people through the material so that beginners do not get lost, and more technically savvy people are sufficiently challenged.  We are fanatical about making sure people’s learning experience is excellent.

A description of the courses follows:

Spatial SQL: A Language for Geographers:  Are you stuck in a rut of only knowing how to use a GIS GUI? Do you want to learn how to automate tasks, but are afraid of computer programming? If so, SQL is the most powerful tool you can learn to help you perform complex GIS tasks. This hands-on course is designed to teach you how SQL can replicate many database and GIS tasks. We will start with a very basic overview and then proceed to more advanced topics related to GIS.

Topics to include:

  • Spatial is NOT Special
  • SQL Data Types
  • Traditional SQL
  • Spatial SQL for Vector and Raster Analysis
  • Spatial SQL for Classic Geographic Analysis

For this class, we’ll be using SpatiaLite, which is the spatial extension for SQLite.  This is a great way to get started, as it is very similar in functionality to Postgres/PostGIS.  If you want to move to enterprise GIS with Postgres, or even Oracle or SQL Server, you’ll be in really good shape.

Python for Geospatial: If you are in the field of GIS, you’ve probably heard everyone talking about Python, whether it’s Arcpy in ArcGIS or special Python packages for doing things in open source.  In this hands-on workshop you will learn how Python is used to perform GIS analysis. The workshop will be an introduction to Python, with emphasis on integrating multiple Python plug-ins with ArcGIS and open source GIS.

Topics to include:

  • An overview of Python (variables, statements, I/O, writing code)
  • Python plug-ins for Geospatial (numpy, geocoder, pygal, Postgres)
  • A Taste of Arcpy
  • A Data Analytics Project with Python (for this, we will geocode addresses using Python, perform analysis with open source GIS, take the results into Arcpy to do more GIS analysis, compute statistical results with Python calling Excel, and then create charts and graphs of the results for use on the Internet—without ever opening up a single GIS product.)

If you want to learn more about how to use GIS technology, check out the 9 courses at gisadvisor.com.  

A Typical Class Project at Salisbury University: Evaluating Geocoding Accuracies

I’ve always been proud of our Salisbury University GIS students, and love to push them as far as their little minds can handle it.  You may recall that last Spring I had my Advanced GIS students perform independent GIS projects and present those projects as posters at an Undergraduate Research Conference.  Well, this Fall I am teaching GIS Programming, and have 7 awesome students (pictures and bios to follow).  We started the year off learning spatial SQL with Postgres and PostGIS.  We have now moved into Python, which includes Arcpy as well as other Python packages.

The semester was going so well, and the students were so responsive to anything I asked, I said what the heck, let’s try something crazy.  So, I showed the students how to use two Python geocoding packages (geocoder and censusgeocode) and then said:

why don’t we conduct a research project over the next week to test the match rates and positional accuracies of the Google API and the United States Census Bureau API.  

So yeah, I gave them a week to put this together: design, code, analyze, and write.  And, like most of my students at this level, they didn’t disappoint me.  This meant they had to integrate a lot of what they have learned over the years (programming, GIS, statistics, etc.).

I just uploaded their work onto researchgate:

 Click for ResearchGate Article  

I was surprised by how little there is out there in terms of quantitative assessment of geocoding accuracies.  I hope you have a chance to click on the link and check out the working paper (we will submit it to a journal sometime soon).  Also, I included a short abstract below so that you can see the results of our work (note: our paper includes the original data and the source code for performing the geocoding):

Undergraduate Research in Action: Evaluating the positional differences between the Google Maps and the United States Census Bureau geocoding APIs

Abstract:  As part of a class assignment in GIS Programming at Salisbury University, students evaluated 106 geographically known addresses to determine the match rate and positional accuracy obtained using the Google and the United States Census Bureau geocoding application programming interfaces (APIs). The results showed that 96.2% of the addresses supplied to the Google API were successfully geocoded, while 84% of the addresses supplied to the Census Bureau API were successfully geocoded.  Further, the Google API matched 90% of the addresses with a ROOFTOP designation.  The average positional accuracy of the Google-derived addresses was 80m overall, and 65m for those geocoded with the ROOFTOP designation, while the Census Bureau positional accuracy was 271.09m.

So yeah, this is what you can do with 7 GIS undergraduates at Salisbury University: they work hard, fast, and are a very creative bunch.


QGIS GPU Processing Update

Today my students got to present some of their work at the Maryland Geographic Information Committee Quarterly meeting.  On Monday we head to the University of Maryland at College Park to present our work with the Computer Science Department.

If you recall, I outlined our objective for the summer here, and provided some updated information here.  Also, recall that we’ve posted the code on Github here.  We’ve made good progress this week, including modifying the code to utilize the GDAL libraries for reading the raster DEM, and also have things working as a valid QGIS plug-in as well as a command line function.  Stop by the github site and have a look at the code, or try running it on your own data.

If you recall, my concern was that GIS processes are big on data volume, but light on calculations per data element. With that concern in the back of my mind, we decided to generate a hillshade, as it appears to be more computationally expensive than slope.

Well, some early results shed some light on what is going on.   As you can see from Table 1, the CUDA version of our slope computation is around 1.5x faster than the serial version.  That is good, but nothing to write home about.  You’ll notice that the Total Time is less than the sum of the parts because we are not only using the GPU to perform parallel processing for the slope computation, but are also using multi-threading techniques to simultaneously read in data while we are ALSO performing the computations and writing out the data – so, this QGIS plug-in is doing massive GPU parallel processing and multi-core processing!

Parallel Processing with QGIS

Once again, I am continuing my role as a mentor in a National Science Foundation (NSF) Research Experiences for Undergraduates program.  This year we’ve decided to build a QGIS plug-in for terrain analysis, as it is embarrassingly parallel (slope, aspect, etc.).   We are doing four things as we generate slope for different-sized digital elevation models:

  1. A pure python implementation (for an easy plug-in)
  2. A serial-based C++ implementation (as a baseline for a non-parallel solution)
  3. A pyCUDA implementation (using the GPU for parallel processing)
  4. A C++ based parallel solution using the GPU

We plan to put our results on a shared GitHub site (we are in the process of cleaning up the code) so that people can start experimenting with it, and use our example to begin generating more parallel solutions for QGIS (or GDAL for that matter).

Here are some early results:

Advanced Editing Capabilities with PostGIS

Today I want to introduce you to another one of my really great undergraduate students: Jeff Scarmazzi.  Jeff participated in the SUSRC conference that I posted about previously, but unfortunately, his photo and poster got deleted from my camera, so I never posted information about his project – I may eventually get around to posting Jeff’s poster, but for now it will have to wait.


Jeff is an exceptional student, and at this point can run circles around me with PostGIS, Leaflet, and Postgres – at the moment, I am still reserving the right to say that my spatial SQL is better than his, but that may not be true.  And, now that Jeff secured a prestigious internship with NASA over the summer, once the Fall rolls around it definitely won’t be true!

What I wanted to show you was a short presentation that Jeff gave on his work with Postgres and PostGIS in my Advanced GIS class.  For this project I had students create a geodatabase with ArcGIS that includes domains, subtypes, topology rules, etc.  Jeff decided to extend his work on this assignment and recreate the same kind of features using Postgres and PostGIS.  This presentation is just a small piece of what Jeff and his fellow students (Austin Barefoot, Carl Flint, and Jordan Roose) created, and focuses specifically on writing triggers and creating roles to automate and quality check digitizing procedures using Postgres.

The presentation is only about 15 minutes long, and you’ll be amazed to see what you can do with PostGIS to create robust editing rules.
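
To give you a feel for the kind of thing the presentation covers, here is a minimal sketch of my own (not Jeff’s actual code) of a quality-check trigger and an editing role in Postgres, assuming a hypothetical parcels table with an id and a geom column:

-- Reject any digitized feature whose geometry is not valid.
CREATE OR REPLACE FUNCTION reject_invalid_geom() RETURNS trigger AS $$
BEGIN
    IF NOT ST_IsValid(NEW.geom) THEN
        RAISE EXCEPTION 'Invalid geometry rejected for feature %', NEW.id;
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER check_geom_before_edit
    BEFORE INSERT OR UPDATE ON parcels
    FOR EACH ROW EXECUTE PROCEDURE reject_invalid_geom();

-- A role that can digitize (insert/update) but cannot drop or alter the table.
CREATE ROLE digitizers;
GRANT SELECT, INSERT, UPDATE ON parcels TO digitizers;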

Warning – shameless plug: in case you want to learn how to implement PostGIS in an enterprise setting, check out my online course here.  Also, I have another course on how to write spatial SQL with PostGIS here.