Big Data Results

I wanted to revisit the taxi data example that I previously blogged about.  I had a 6GB file of 16 million taxi pickup locations and 263 taxi zones.  I wanted to determine the number of pickups in each zone, along with the sum of all the fares.  Below is a more in-depth review of what was done, but for those of you not wanting to read ahead, here are the result highlights:

Platform                                                   | Command                              | Time
ArcGIS 10.4                                                | AddJoin_management                   | Out of memory
ArcGIS Pro                                                 | Summarize Within                     | 1h 27m*
ArcGIS Server Big Data GeoAnalytics (Big Data File Share)  | Summarize Within / Aggregate Points  | ~2m
Manifold 9                                                 | GeomOverlayContainedPar              | 3m 27s
Postgres/PostGIS                                           | ST_Contains                          | 10m 27s
Postgres/PostGIS (optimized)                               | ST_Contains                          | 1m 40s
*I’m happy ArcGIS Pro ran at this speed, but I think it can do better.  This is a geodatabase straight out of the box; I think we can fiddle with the indexes and even restructure the data to get things to run faster.  That is something I’ll work on next week.

I was sufficiently impressed with how some of the newer approaches were able to improve the performance.  Let’s dive in:

The Data and Computer

The data was obtained from the NYC Taxi and Limousine Commission for October 2012.  The approximately 16 million taxi pickup locations and 263 taxi zone polygons required around 6GB of storage.  I have the data in a geodatabase here.  You can see below that this is a lot of data:

[Image: taxis]

I used my CyberPower gaming PC: Windows 10, a 3.0GHz i7 processor (4 cores), 12GB of RAM, and a solid-state drive.  So, pretty much what every teenager has in their bedroom.

The Question

The question I wanted to answer was: how many taxi pickups were there in each zone, and what was the total amount of the fares?  A fair question (no pun intended!).  So, I decided to try to answer it with ArcGIS, Manifold, and Postgres.

ArcGIS 10.4

As most of you know, ArcGIS 10.4 is a 32-bit application, so I wondered how well it could tackle this problem.  I attempted to perform a spatial table join (AddJoin_management) between the taxi pickup locations and the taxi zones.  To give ArcGIS a fighting chance, I moved the data into a geodatabase (that way, the layers would have spatial indexes).  After running the join for a few hours, ArcGIS 10.4 reported an Out of Memory error.

ArcGIS Pro

Next, I moved on to ArcGIS Pro, which is a true 64-bit application.  ArcGIS Pro also has a number of tools to do exactly what I want; one of them is Summarize Within.  ESRI makes it really easy to ask these sorts of questions in ArcGIS Pro.  So, I ran the tool and got a resulting table in 1h 27m.  At this point in my experiment, I was fairly pleased – at least I got an answer, and it is something I could do over a lunch break.

ArcGIS Server with GeoAnalytics Server

I knew that ESRI was touting their new GeoAnalytics Server, so I wanted to give that a try.  Unfortunately, I do not own GeoAnalytics Server.  Fortunately, a friend owns it and was able to test it out on his computer.  To my amazement, he ran the query in about 2m.  I was astounded – hats off to ESRI.  This product is designed for big data, for sure.  If you have an ArcServer license, this is something worth checking out for big data processing.  Nothing cryptic like Hadoop – the same ArcGIS Pro interface is there to run the data under GeoAnalytics Server.

Manifold 9

As most of you know, I am a big fan of Manifold GIS, and have often discussed my work with the product.  Manifold 9 is designed for big data analytics, and its query engine makes use of parallel processing.  The function I used was GeomOverlayContainedPar.  Manifold wraps it in a GUI, but I bypassed that and just wrote a straight-up SQL query, which is a bit more flexible:

SELECT s_mfd_id AS [mfd_id], sgeom AS [Geom], sumfare, avgfare, s_zone, numrides
INTO sumtable
FROM (
  SELECT s_mfd_id, count(o_mfd_id) AS numrides, avg([o_fare_amount]) AS avgfare,
         sum([o_fare_amount]) AS sumfare, first(s_geom) AS sgeom, first(s_zone) AS s_zone
  FROM (
    SELECT s_zone, o_fare_amount, s_mfd_id, s_geom, o_mfd_id
    FROM CALL GeomOverlayContainedPar(
      [taxi_zones] ([mfd_id], [zone], [Geom]),
      [pickup Drawing] ([pu_geom], [mfd_id], [fare_amount]),
      0, ThreadConfig(SystemCpuCount()))
  )
  GROUP BY s_mfd_id
)

I won’t go into detail on the query, but in this case, I was using all 4 cores (actually 8, when you consider the hyperthreading) to process the data.  The query ran and returned the table in 3m 27s.  Again, I was impressed, given that Manifold 9 sells for around $425.
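For what it’s worth, the ThreadConfig(SystemCpuCount()) argument at the end of the call is the knob that controls the parallelism.  Here is the relevant fragment with a hand-picked thread count instead – a hypothetical variant for illustration, not what I actually ran:

-- Hypothetical variant: pin the overlay to exactly 8 threads
-- instead of letting SystemCpuCount() decide
FROM CALL GeomOverlayContainedPar(
  [taxi_zones] ([mfd_id], [zone], [Geom]),
  [pickup Drawing] ([pu_geom], [mfd_id], [fare_amount]),
  0, ThreadConfig(8))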

I like to needle my friends at Manifold, so I sent them the data and the results.  Stay tuned – I’d be willing to bet we see them get under 3 minutes fairly soon.

Postgres/PostGIS

It’s no secret that I’m also a fan of FOSS4G software like Postgres, and I teach a number of courses on its use.  So, I wanted to see how this would run in Postgres with PostGIS.  The first thing I did was write a straight-up SQL statement:

SELECT count(*) AS totrides, taxizones.zone, sum(taxisandy.fare_amount)
FROM taxizones, taxisandy
WHERE ST_Contains(taxizones."Geom", taxisandy.pu_geom)
GROUP BY zone

Good grief, it doesn’t get much simpler than that.  This query ran in 10m 27s.  I was pleased with this result – I mean, after all, it’s free!  And that query is super simple to write.  But I wasn’t done yet; I knew there were some ways to optimize things.
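Before optimizing, a quick sanity check one can do (I’m not timing this, just checking the plan) is to confirm the planner is actually using the spatial index.  A minimal sketch with standard Postgres EXPLAIN:

-- Show the actual plan and timings; look for an index scan on the GiST index
EXPLAIN ANALYZE
SELECT count(*) AS totrides, taxizones.zone, sum(taxisandy.fare_amount)
FROM taxizones, taxisandy
WHERE ST_Contains(taxizones."Geom", taxisandy.pu_geom)
GROUP BY zone;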

Postgres/PostGIS optimized

I had already created a spatial index on each table, so that was good (a sketch of what that looks like is below).  But there were two more things I was hoping to do: vacuum the tables and cluster the data.
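For completeness, here is roughly what creating those GiST indexes looks like – a minimal sketch, assuming the index names that appear in the CLUSTER commands below:

-- GiST spatial indexes; the names match the ones CLUSTER uses later
CREATE INDEX pugeom_idx ON taxisandy USING GIST (pu_geom);
CREATE INDEX "Geom_x" ON taxizones USING GIST ("Geom");

So, what do the VACUUM and CLUSTER commands actually do?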

VACUUM reclaims storage occupied by dead tuples. In normal PostgreSQL operation, tuples that are deleted or obsoleted by an update are not physically removed from their table; they remain present until a VACUUM is done.

CLUSTER physically reorders the data on the disk so that data that should be near one another in the database are actually near one another on the disk.  In other words, points in Brooklyn are now physically stored on the disk near other points in Brooklyn, and the same is true for all the other boroughs.  I wasn’t sure if this would do anything, since I already had a solid-state drive.  A friend of mine in the Computer Science Department told me that it would.  I would tell you what he said, but quite frankly his explanation was too technical for me!

So, how did I do this?  First, I vacuumed and clustered the data:

-- Reclaim dead tuples and refresh the planner’s statistics
VACUUM ANALYZE taxizones ("Geom");
VACUUM ANALYZE taxisandy (pu_geom);
-- Physically reorder each table on disk along its spatial index
CLUSTER taxisandy USING pugeom_idx;
CLUSTER taxizones USING "Geom_x";
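One housekeeping note: CLUSTER rewrites the table, and the PostgreSQL documentation recommends running ANALYZE again afterward so the planner has fresh statistics – a small follow-up:

-- Refresh planner statistics after the physical reorder
ANALYZE taxisandy;
ANALYZE taxizones;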

Now, running the cluster on the pickup locations did in fact take time – 18 minutes.  But that is a one-time expense; once it’s paid, we can run whatever query we want, over and over again.  The query is a little more involved than the previous one because I wanted to write the results to a new table, so I had to rejoin the summary with the zones:

SELECT taxizones."Geom", sumfare, a.zone
INTO sumtable
FROM taxizones,
  (SELECT taxizones.zone, sum(taxisandy.fare_amount) AS sumfare
   FROM taxizones
   JOIN taxisandy
     ON ST_Contains("Geom", pu_geom)
   GROUP BY zone) AS a
WHERE taxizones.zone = a.zone

Drum roll, please. The query completed in 1m 40s.  Wow!  Of course, with PostGIS you have to factor in the cost: $0.  I guess you get what you pay for?

So, what is the takeaway?  Well, GIS products are evolving, and are now positioned to handle really large data sets in ways we hadn’t been able to do before.  I’m impressed with each of these products.

Two final notes:

If you live in the Denver area, please come visit with me as I teach two workshops on FOSS4G, big data geoanalytics, Python, and SQL: one at Colorado State University in Fort Collins on October 25, and one at Metropolitan State University of Denver on October 26.  I’d love to see you there!

And as always, all my online video courses are available at www.gisadvisor.com.  You might find Spatial SQL with PostGIS and Big Data Analytics with GIS to be two very useful courses for pursuing this kind of work further.

21 thoughts on “Big Data Results”

    • That is a great comparison to make. It is out of my wheelhouse. I’d be happy to cut some data for you to look at. BTW, your blog looks excellent. I enjoyed reading through some of what you are doing.

      • Hi Art, this is awesome. I’d like to expand on this challenge and give this a try in Cosmos DB, as well as in Geopandas. Can you send me the data so that we can do an apples to apples comparison?

    • I get your point, but in actuality, I am comparing what I typically use to solve GIS problems. For years I’ve used each product. This gives some insight into how I will approach these things. There are good things to say about each product (although, 10.4 seemed to be a dead end, which leads to my next point).

      I have been hesitant to move our lab to ArcGIS Pro. But, I think after these tests, it is clear that I’m going to move our schedule up a little and get Pro working in all our labs – it is just a better option for an ESRI solution over 10.4.

      So, it provides a nice tradeoff – the GUI tools in the ESRI products are right there for the taking. I’m still not done with that – I think we can squeeze a little more water out of that rock. I’m going to do some fiddling with the data to try and get ArcGIS Pro to yield faster results. I’m certain that it will. The 1h 27m is simply taking too long for my liking, and I think there might be some better ways to deal with it (I’ve got about 2 or 3 modifications I plan to make next week).

      While Postgres proved to be fastest, there isn’t a GUI, so some people may choose to avoid that solution.

      But again, our GIS community are using each of these products regularly. And, Postgres, while a database, is almost treated like a desktop application by most people – the install has become relatively easy. That reminds me, I probably should try this in SQLite with Spatialite.

      • Hi Art – both ArcMap and ArcGIS Pro can run on the same computer, so you could have both in your lab :)

        The GeoAnalytics team here at Esri have enjoyed reading this article. We’d be thrilled to see the comparison expanded from one month of data, to a year or five years of data.

        Randall Whitman, GeoAnalytics developer

      • Randall – that would be awesome. I actually have ArcMap and ArcGIS Pro on the same computer; that enabled me to get the processing done in ArcPro. I tried using a polygon layer derived from ST_SubDivide to get ArcGIS Pro to run faster – it was the same speed. I suppose that is good, meaning your spatial index is already doing all it can with the bounding boxes. The trick will be seeing what other strategies we can use to get ArcPro to lower those numbers.

        I totally want to move to 1 or 5 years of data – but, that means we’ll be talking about 200GB of data or more. I think we can work on that over Christmas break. I’ve been really impressed with the GeoAnalytics videos I’ve seen. You guys are tearing through this NYC data really nicely. That is why I mentioned that for people with an ArcServer license, they really need to check out the GeoAnalytics Server. So, what I don’t know is if these other products I’ve been playing with (Manifold and Postgres) can handle 5 years of the data. Your videos prove that GA server can do it. Bottom line: if the software is free, but can’t do the job, then it’s not the right software to use :-)

        Also, feel free to reach me offline. We just built a high performance computing lab at Salisbury University with the Computer Science Department. I’d love to get a GeoAnalytics Server license so that we can experiment with stuff like this.

        Obviously, leveraging GeoAnalytics Server is the way to go for this really enormous stuff. I like that the tools like Summarize Within or Aggregate, etc., are all accessible right from a desktop, hitting against the BDFS. We implemented Hadoop once, and I don’t want to do that again. Plus, we were only able to answer the one question that the system was designed to do. So, I like the flexibility that ArcPro, accessing BDFS offers. However, a quick Google search of me will tell you that I am very partial to the even greater flexibility that SQL offers – I’d like to see ESRI develop a SQL engine.

        Thanks again to your team for the really great videos.

    • Not really. That I think is some of what Christos was getting at above – desktop vs. database. Postgres, as a database, has that capability. I think that ArcServer with the BDFS is doing just that – hence, why it was so fast.

      I think Manifold will be able to do the same thing if they want. Within the .map file they can probably reorder the data to have them near one another.

      But, you bring up a fascinating point. If Postgres can reorder the location based on the index, what would happen if I read the data from Postgres into ArcGIS? Would it do the same? Also, ArcGIS has a lot more flexibility with the creation of the spatial index, allowing different grid sizes. What happens when we attempt to leverage that? It is certainly something to check on. As I told Christos, I’m pretty certain we can fiddle with things to get it to run faster.

  1. When using the ArcGIS Server, is the program sending the data into the cloud to use cloud computing resources, or is it just being calculated locally on your machine?

    • Hi Simon,
      As indicated in my post, I don’t have ArcGIS Server. But, the person who did this for me indicated that all the data was on their local computer. Also, they read the .csv file from the Taxi database directly into a BDFS.

  2. I used a similar machine to ask the same question in QGIS. I converted the data to GeoPackage and then used the QGIS 3 Join attributes by location (summary) tool, counting and summing the total_amount field. It ran in 20 minutes.

    To have something to compare, I tested on PostgreSQL using vacuum and clustering. My machine runs it in 3 minutes. So maybe QGIS on your machine can make the 12-minute mark or something.

    QGIS does not allow parallel processing like Manifold (yet), but the job runs in a separate thread, allowing me to keep working on other stuff.

    My computer is a Kubuntu 18.04 machine with 16GB of RAM, an SSD, and a 2.6GHz processor (8 CPUs).
