Big Data Results

I wanted to revisit the taxi data example that I previously blogged about.  I had a 6GB file of 16 million taxi pickup locations and 260 taxi zones.  I wanted to determine the number of pickups in each zone, along with the sum of all the fares.  Below is a more in-depth review of what was done, but for those of you not wanting to read ahead, here are the result highlights:

Platform | Command | Time
ArcGIS 10.4 | AddJoinManagement | Out of memory
ArcGIS Pro | Summarize Within | 1h 27m*
ArcGIS Server Big Data GeoAnalytics with Big Data File Share | Summarize Within / Aggregate Points |
Manifold 9 | GeomOverlayContained | 3m 27s
Postgres/PostGIS | ST_Contains | 10m 30s
Postgres/PostGIS (optimized) | ST_Contains | 1m 40s

*I’m happy ArcGIS Pro ran at this speed, but I think it can do better.  This is a geodatabase straight out of the box. I think we can fiddle with indexes and even structuring the data to get things to run faster.  That is something I’ll work on next week.
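
For readers who want to see the shape of the problem, here is a minimal Python sketch of the "count the pickups and sum the fares per zone" operation. It is not what any of the platforms above actually run. The function names, the two square zones, and the sample fares are all made up for illustration, and a real 16-million-point run would need a proper spatial index rather than this loop over bounding boxes.

```python
from collections import defaultdict

def point_in_ring(x, y, ring):
    """Ray-casting test: True if (x, y) falls inside the polygon ring."""
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # Does a horizontal ray starting at (x, y) cross this edge?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def summarize_within(zones, pickups):
    """Return {zone_id: (pickup_count, fare_sum)} for pickups inside each zone."""
    # Precompute bounding boxes so most zones are rejected cheaply.
    boxes = {}
    for zid, ring in zones.items():
        xs = [p[0] for p in ring]
        ys = [p[1] for p in ring]
        boxes[zid] = (min(xs), min(ys), max(xs), max(ys))

    totals = defaultdict(lambda: [0, 0.0])
    for x, y, fare in pickups:
        for zid, ring in zones.items():
            xmin, ymin, xmax, ymax = boxes[zid]
            if xmin <= x <= xmax and ymin <= y <= ymax and point_in_ring(x, y, ring):
                totals[zid][0] += 1
                totals[zid][1] += fare
                break  # assumption: zones do not overlap
    return {zid: (count, fares) for zid, (count, fares) in totals.items()}

# Hypothetical data: two square zones and four pickups given as (x, y, fare).
zones = {
    "A": [(0, 0), (10, 0), (10, 10), (0, 10)],
    "B": [(10, 0), (20, 0), (20, 10), (10, 10)],
}
pickups = [(2, 3, 7.5), (5, 5, 12.0), (15, 4, 9.25), (30, 30, 4.0)]
result = summarize_within(zones, pickups)
print(result)
```

The last pickup falls outside both zones, so it is simply left out of the totals.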

When More is Less…. lessons from processing large data files

My good friend Stuart Hamilton gave me a fun conundrum to try out. He has a file of province boundaries (400 areas) and lidar-derived mangrove locations (37 million points – 2.2GB in size). He wants to find the number of mangroves that are contained in each area.  He also wants to know which country each mangrove location is in.  An overview of the area is here:


but, as you zoom in, you can see that there are a tremendous number of points:


The problem

You would think that overlaying 37 million points with 400 polygons wouldn’t be too much trouble – but, it was.  Big time trouble.  In fact, after running for days in ArcGIS, Manifold GIS, Postgres/PostGIS, and spatial Hadoop, it simply would not complete.

Another Radian Test – Finding the distance between lines and areas

Following up on my previous post with ArcGIS and the Near Table, I created an SQL query in Manifold 8 to do both the near distances and group them by the number of points within specific distances (I grouped them every 50 km.).  The entire process took 47 seconds (or about 9 times faster than ArcGIS 10.1).

But, to keep things on the same playing field, I just computed the NEAR part of the query, and it ran in 40 seconds.  So, Manifold 8 was way faster than ArcGIS 10.1, but 3x slower than ArcGIS Pro.

I then wrote the following query in the Radian engine:

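-- Assumed: [L Table] is joined to [P Table], and rows are grouped by 50 km zone.
SELECT count(*) AS CNT,
       first(floor(GeomDistance([L Table].[Geom (I)],
       [P Table].[Geom (I)], 1)/50000)*50000+50000) AS DistZone
INTO bobo
FROM [L Table] INNER JOIN [P Table]
ON GeomWithin([L Table].[Geom (I)],[P Table].[Geom (I)], 500000,1)
GROUP BY floor(GeomDistance([L Table].[Geom (I)],
       [P Table].[Geom (I)], 1)/50000)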

This query took 30 seconds, down from 47 seconds in Manifold 8 (about 36% less time).
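
The zone arithmetic in that query is easy to misread, so here is a small Python sketch of just the binning step. It assumes the distances are in meters (which is what dividing by 50000 for 50 km zones implies). The function names and sample distances are hypothetical.

```python
import math
from collections import Counter

ZONE_WIDTH = 50_000      # 50 km, assuming distances are in meters
SEARCH_RADIUS = 500_000  # same cutoff passed to GeomWithin in the query

def dist_zone(distance_m):
    """Upper bound of the 50 km zone that contains distance_m."""
    return math.floor(distance_m / ZONE_WIDTH) * ZONE_WIDTH + ZONE_WIDTH

def count_by_zone(distances):
    """Count distances per zone, ignoring anything beyond the search radius."""
    return Counter(dist_zone(d) for d in distances if d <= SEARCH_RADIUS)

# Hypothetical line-to-polygon distances in meters.
distances = [1_200.0, 49_999.0, 50_000.0, 120_500.0, 499_000.0, 730_000.0]
zone_counts = count_by_zone(distances)
for zone, cnt in sorted(zone_counts.items()):
    print(zone, cnt)
```

Note that a distance of exactly 50,000 m lands in the 100,000 zone, because `floor` puts each boundary value into the next zone up.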

 Once again, to level the playing field, I created a query to just run the NEAR aspect:

-- Assumed: [L Table] is joined to [P Table] on the GeomWithin condition.
SELECT GeomDistance([L Table].[Geom (I)], [P Table].[Geom (I)], 1) AS DistZone, [UNIQUE_ID]
INTO bobo2
FROM [L Table] INNER JOIN [P Table]
ON GeomWithin([L Table].[Geom (I)],[P Table].[Geom (I)], 500000,1)

This ran in 20 seconds.  In this case, ArcGIS Pro ran slightly faster than Manifold 9 – but remember, I am still working with an alpha/beta release of Radian, and not all of the optimizations have been turned on. I can’t wait to see what the next beta will reveal.

Again, the simplicity of SQL in conjunction with the parallel nature of the Radian engine provides some very interesting opportunities for working with complex processes and large amounts of data.

A quick look at ArcGIS Pro

One of my undergraduates was interested in incorporating a little more sophistication into an analysis he was conducting in Southeast Asia.  The long and the short of it is that he has polygons and lines, and he wants to quantify the number of lines within certain distances of each polygon (e.g. 5km, 10km, 20km).
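
As a rough illustration of that counting step, here is a short Python sketch. It assumes the polygon-to-line distances have already been computed and are in kilometers. The function name, IDs, and distances are hypothetical, and the counts are cumulative (a line 2 km away counts toward 5km, 10km, and 20km).

```python
from collections import defaultdict

THRESHOLDS_KM = [5, 10, 20]

def lines_within(pairs, thresholds_km=THRESHOLDS_KM):
    """pairs: iterable of (polygon_id, line_id, distance_km).

    Returns {polygon_id: {threshold_km: number_of_lines}}, cumulative per threshold.
    """
    counts = defaultdict(lambda: {t: 0 for t in thresholds_km})
    for polygon_id, _line_id, dist_km in pairs:
        row = counts[polygon_id]  # creates a zeroed row the first time a polygon is seen
        for t in thresholds_km:
            if dist_km <= t:
                row[t] += 1
    return dict(counts)

# Hypothetical precomputed distances.
pairs = [
    ("P1", "L1", 2.0),
    ("P1", "L2", 7.5),
    ("P1", "L3", 19.0),
    ("P2", "L1", 12.0),
    ("P2", "L3", 40.0),
]
summary = lines_within(pairs)
print(summary)
```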

An example dataset looks like this:


This was rather small – 11,000 lines, and 120 polygons.  There were a bunch of steps to perform, and I won’t take time to talk about it here.  The most important step was to determine the distance between each line and each polygon that were within 5km of one another.

So, the first thing I did was to run the Generate Near Table tool to find the distance of every polygon to every line.  In ArcGIS 10.1, this took 420 seconds – I was disappointed because we have much bigger datasets to worry about.
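
To make the idea of a near table concrete, here is a brute-force Python sketch that computes the distance from every polygon to every line and keeps the pairs within a search radius. This is not how the Generate Near Table tool works internally. All of the function names and the sample geometry are made up, and as a simplification it treats a polygon as its boundary only, so a line lying wholly inside a polygon would not report a distance of zero.

```python
import math

def point_segment_distance(p, a, b):
    """Shortest distance from point p to segment a-b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:
        return math.hypot(px - ax, py - ay)
    # Project p onto the segment and clamp to its endpoints.
    t = ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))

def segments_cross(a, b, c, d):
    """True if segments a-b and c-d properly cross each other."""
    def orient(p, q, r):
        return (q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0])
    return (orient(a, b, c) * orient(a, b, d) < 0 and
            orient(c, d, a) * orient(c, d, b) < 0)

def segment_distance(a, b, c, d):
    """Distance between two segments: 0 if they cross, else the closest endpoint."""
    if segments_cross(a, b, c, d):
        return 0.0
    return min(point_segment_distance(a, c, d), point_segment_distance(b, c, d),
               point_segment_distance(c, a, b), point_segment_distance(d, a, b))

def edges(vertices, closed):
    """Consecutive vertex pairs; closed=True adds the edge back to the start."""
    pts = vertices + vertices[:1] if closed else vertices
    return list(zip(pts, pts[1:]))

def near_table(polygons, lines, search_radius):
    """Rows of (polygon_id, line_id, distance) for pairs within search_radius."""
    rows = []
    for pid, ring in polygons.items():
        for lid, path in lines.items():
            d = min(segment_distance(a, b, c, e)
                    for a, b in edges(ring, closed=True)
                    for c, e in edges(path, closed=False))
            if d <= search_radius:
                rows.append((pid, lid, d))
    return rows

# Hypothetical data in arbitrary planar units.
polygons = {"P1": [(0, 0), (4, 0), (4, 4), (0, 4)]}
lines = {
    "near": [(6, 1), (6, 3)],
    "far": [(20, 0), (20, 5)],
    "crosses": [(2, 2), (6, 2)],
}
rows = near_table(polygons, lines, search_radius=5)
print(rows)
```

Every polygon is compared against every line, so the work grows with the product of the two feature counts. That is fine for a toy example but is exactly the cost a spatial index is meant to avoid.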

So, I decided to give it a shot in ArcGIS Pro (here is a screen shot):


The Generate Near Table tool in ArcGIS Pro ran in 17 seconds!  That is 24x faster than ArcGIS 10.1.  I’m going to be running some more tests on this over the summer and will report on what I find.  Next fall, in my GIS Programming class, our undergraduates are going to write ArcPy scripts in both 10.1 and ArcGIS Pro, and we will show you the results.