Big Data Results

I wanted to revisit the taxi data example that I previously blogged about.  I had a 6GB file of 16 million taxi pickup locations and 263 taxi zones.  I wanted to determine the number of pickups in each zone, along with the sum of all the fares.  Below is a more in-depth review of what was done, but for those of you not wanting to read ahead, here are the result highlights:

Platform | Command | Time
ArcGIS 10.4 | AddJoin_management | Out of memory
ArcGIS Pro | Summarize Within | 1h 27m*
ArcGIS Server GeoAnalytics (with a Big Data File Share) | Summarize Within / Aggregate Points | ~2m
Manifold 9 | GeomOverlayContainedPar | 3m 27s
Postgres/PostGIS | ST_Contains | 10m 27s
Postgres/PostGIS (optimized) | ST_Contains | 1m 40s
*I’m happy ArcGIS Pro ran at this speed, but I think it can do better.  This is a geodatabase straight out of the box; I suspect we can fiddle with indexes and even restructure the data to get things to run faster.  That is something I’ll work on next week.

I was sufficiently impressed with how some of the newer approaches were able to improve the performance.  Let’s dive in:

The Data and Computer

The data was obtained from the NYC Taxi and Limousine Commission for October 2012.  The approximately 16 million taxi pickup locations and 263 taxi zone polygons required around 6GB of storage.  I have the data in a geodatabase here.  You can see below that this is a lot of data:

[Image: map of the 16 million taxi pickup points]

I used my Cyberpower gaming PC, which runs Windows 10 and has an i7 processor (4 cores at 3.0GHz), a solid-state drive, and 12GB of RAM.  So, pretty much what every teenager has in their bedroom.

The Question

The question I wanted to answer was: how many taxi pickups were there in each zone, and what was the total amount of the fares?  Fair question (no pun intended!).  So, I decided to try to answer it with ArcGIS, Manifold, and Postgres.

ArcGIS 10.4

As most of you know, ArcGIS 10.4 is a 32-bit application, so I wondered how well it could tackle this problem.  I attempted to perform a spatial table join (AddJoin_management) between the taxi pickup locations and the taxi zones.  To give ArcGIS a fighting chance, I moved the data into a geodatabase (that way, the layers would have spatial indexes).  After running the join for a few hours, ArcGIS 10.4 reported an Out of Memory error.

ArcGIS Pro

Next, I moved on to ArcGIS Pro, which is a true 64-bit application.  ArcGIS Pro also has a number of tools that do exactly what I want; one is Summarize Within.  ESRI makes it really easy to ask these sorts of questions in ArcGIS Pro.  So, I ran the tool and got a resulting table in 1h 27m.  At this point in my experiment, I was fairly pleased – at least I got an answer, and it is something I could do over a lunch break.

ArcGIS Server with GeoAnalytics Server

I knew that ESRI was touting their new GeoAnalytics Server, so I wanted to give that a try.  Unfortunately, I do not own GeoAnalytics Server.  Fortunately, a friend owns it and was able to test it out on his computer.  To my amazement, he ran the query in about 2m.  I was astounded – hats off to ESRI.  This product is designed for big data, for sure.  If you have an ArcGIS Server license, this is something worth checking out for big data processing.  Nothing cryptic like Hadoop – the same ArcGIS Pro interface is there to run the data under GeoAnalytics Server.

Manifold 9

As most of you know, I am a big fan of Manifold GIS, and have often discussed my work with the product.  Manifold 9 is designed for big data analytics, with a query engine that makes use of parallel processing.  The function I used was GeomOverlayContainedPar.  It is also available through the GUI, but I bypassed that and wrote a straight-up SQL query, which is a bit more flexible:

SELECT s_mfd_id AS [mfd_id], sgeom AS [Geom], sumfare, avgfare, s_zone, numrides
INTO sumtable
FROM (
  SELECT s_mfd_id, count(o_mfd_id) AS numrides,
         avg([o_fare_amount]) AS avgfare, sum([o_fare_amount]) AS sumfare,
         first(s_geom) AS sgeom, first(s_zone) AS s_zone
  FROM (
    SELECT s_zone, o_fare_amount, s_mfd_id, s_geom, o_mfd_id
    FROM CALL GeomOverlayContainedPar(
      [taxi_zones] ([mfd_id], [zone], [Geom]),
      [pickup Drawing] ([pu_geom], [mfd_id], [fare_amount]), 0,
      ThreadConfig(SystemCpuCount()))
  )
  GROUP BY s_mfd_id
)

I won’t go into detail on the query, but in this case, I was using all 4 cores (actually 8, when you consider the hyperthreading) to process the data.  The query ran and returned the table in 3m 27s.  Again, I was sufficiently impressed, given that Manifold 9 sells for around $425.

I like to needle my friends at Manifold, so I sent them the data and the results.  Stay tuned – I’d be willing to bet that we see them get under 3 minutes fairly soon.

Postgres/PostGIS

It’s no secret that I’m also a fan of FOSS4G software like Postgres, and I teach a number of courses in its use.  So, I wanted to see how this would run in Postgres with PostGIS.  The first thing I did was write a straight-up SQL statement:

SELECT count(*) AS totrides, taxizones.zone, sum(taxisandy.fare_amount) AS sumfare
FROM taxizones, taxisandy
WHERE ST_Contains(taxizones."Geom", taxisandy.pu_geom)
GROUP BY zone

Good grief, it doesn’t get much simpler than that.  This query ran in 10m 27s.  I was pleased with this result – I mean, after all, it’s free!  And that query is super simple to write.  But I wasn’t done yet.  I knew there were some ways to optimize things.
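As an aside, if you want to confirm that the planner will actually use the spatial index before committing to a long run, plain EXPLAIN shows the plan without executing the query (standard Postgres; this was not part of the timings above) – look for an index scan on the GiST index in the output:

EXPLAIN
SELECT count(*)
FROM taxizones, taxisandy
WHERE ST_Contains(taxizones."Geom", taxisandy.pu_geom);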

Postgres/PostGIS optimized

I had already created a spatial index, so that was good.  But there were two more things I was hoping to do: vacuum the table, and cluster the data.  So, what do these commands do?

VACUUM reclaims storage occupied by dead tuples. In normal PostgreSQL operation, tuples that are deleted or obsoleted by an update are not physically removed from their table; they remain present until a VACUUM is done.

CLUSTER physically reorders the data on the disk so that data that should be near one another in the database are actually near one another on the disk.  In other words, points in Brooklyn are now physically stored on the disk near other points in Brooklyn, and the same is true for all the other boroughs.  I wasn’t sure if this would do anything, since I already had a solid-state drive.  A friend of mine in the Computer Science Department told me that it would.  I would tell you what he said, but quite frankly his explanation was too technical for me!

So, how did I do this?  First, I vacuumed and clustered the data:

VACUUM ANALYZE taxizones ("Geom"); 
VACUUM ANALYZE taxisandy (pu_geom);
CLUSTER taxisandy USING pugeom_idx; 
CLUSTER taxizones USING "Geom_x";
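For context, CLUSTER reorders a table based on an existing index, so the commands above presuppose the GiST spatial indexes.  They would have been created along these lines (a sketch; the index names match the ones referenced in the CLUSTER commands):

-- GiST spatial indexes on the pickup points and the zone polygons.
CREATE INDEX pugeom_idx ON taxisandy USING GIST (pu_geom);
CREATE INDEX "Geom_x" ON taxizones USING GIST ("Geom");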

Now, running the cluster on the pickup locations did in fact take time – 18 minutes.  But that is a one-time expense; after that, we can run whatever query we want, over and over again.  The query is a little more involved than the previous one because I wanted to write the results to a new table, so I had to rejoin the results with the zones:

SELECT taxizones."Geom", sumfare, a.zone
INTO sumtable
FROM taxizones,
     (SELECT taxizones.zone, sum(taxisandy.fare_amount) AS sumfare
      FROM taxizones
      JOIN taxisandy
        ON ST_Contains("Geom", pu_geom)
      GROUP BY zone) AS a
WHERE taxizones.zone = a.zone

Drum roll, please. The query completed in 1m 40s.  Wow!  Of course, with PostGIS you have to factor in the cost: $0.  I guess you get what you pay for?

So, what is the takeaway?  Well, GIS products are evolving, and are now positioned to handle really large data sets in ways we hadn’t been able to do before.  I’m impressed with each of these products.

Two final notes:

If you live in the Denver area, please come visit with me as I teach two workshops on FOSS4G, big data geoanalytics, Python, and SQL: one at Colorado State University in Fort Collins on October 25, and one at Metropolitan State University of Denver on October 26.  I’d love to see you there!

And as always, all my online video courses are available at www.gisadvisor.com.  For this kind of work, you might find Spatial SQL with PostGIS and Big Data Analytics with GIS to be two very useful courses.


GIS Analysis of Overlapping Layers

My friend is attempting to quantify the area of different landuse values for different areas that are upstream from her sample points.  This means she needs sample points, landuse, and upstream areas (i.e., sub-watersheds).  The problem is, her watersheds overlap, the buffer distances around the sample points overlap themselves AND the watersheds, and she then needs to summarize the results.  It’s actually a tricky problem due to the overlaps: GIS software doesn’t really like it when features within a single layer overlap one another.  And if a buffer for a sample point overlaps two different watersheds, that becomes tricky too.

Sure, you can solve it with a few for loops, inserting the results into a new table, but that really is a hassle.  Also, I have to do it for different distances and different land cover types.

So, I once again turned to SQL – remember what I keep telling you: spatial is not special.  It’s just another data type.  This video steps you through performing a multi-ring buffer on overlapping objects from 3 different layers: sample points, watersheds, and land use.  As we step through the SQL, you’ll see how easy it is to put the query together.  And, at the end, you’ll see how flexible the query is should you want to change your objectives.  And, for good measure, we’ll throw in a little bit of parallel processing.
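To give a flavor of the approach, here is a minimal PostGIS-style sketch of my own (the table and column names are hypothetical; the video builds its own version): buffer each sample point at several distances, overlay the buffers with the watersheds and land use, and sum the intersection areas.  Because every buffer/watershed/land-use combination becomes its own row, overlapping features never interfere with one another.

-- Area of each land cover class within each buffer distance,
-- per sample point and per watershed.
SELECT s.sample_id, w.watershed_id, l.landcover, d.dist,
       sum(ST_Area(ST_Intersection(
             ST_Intersection(ST_Buffer(s.geom, d.dist), w.geom),
             l.geom))) AS area
FROM samples s
CROSS JOIN (VALUES (100), (500), (1000)) AS d(dist)   -- buffer distances
JOIN watersheds w ON ST_Intersects(ST_Buffer(s.geom, d.dist), w.geom)
JOIN landuse l    ON ST_Intersects(ST_Buffer(s.geom, d.dist), l.geom)
GROUP BY s.sample_id, w.watershed_id, l.landcover, d.dist;

Changing the objectives – different distances, different land cover types – is just a matter of editing the VALUES list or adding a WHERE clause.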

Work smarter – not larger

When you were in Statistics 101 and the professor said, “OK, we are now going to learn about the Central Limit Theorem,” did you tune out?  Did you sarcastically say, “When is someone going to grab me and order me to tell them about the Central Limit Theorem?”  Come on, admit it, you did.  Well, so did I – I was 18 years old and couldn’t care less.

Well, you know what?  Understanding the Central Limit Theorem has really big implications for big data analytics.  Check out this 20-minute video, and you’ll see that by applying the Central Limit Theorem and some basic statistical theory, you can approximate the results of an expensive multi-server implementation when interrogating really large databases.

I’ll show you how you can obtain very precise estimates on really large databases by simply applying some basic statistics you should have learned freshman year (but you were too busy partying, weren’t you?).
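To make that concrete, here is a small sketch of the idea (my own illustration, assuming the taxisandy table from the earlier post and Postgres 9.5+ for TABLESAMPLE): read roughly a 1% sample and let the Central Limit Theorem put a confidence interval around the estimate.

-- Estimate the mean fare from ~1% of the table's pages.
-- By the CLT, the sample mean is approximately normal, so a 95%
-- confidence interval is mean ± 1.96 * stddev / sqrt(n).
SELECT avg(fare_amount)                            AS est_mean_fare,
       1.96 * stddev(fare_amount) / sqrt(count(*)) AS ci95_halfwidth,
       count(*)                                    AS sample_size
FROM taxisandy TABLESAMPLE SYSTEM (1);

Scanning 1% of the data takes a small fraction of the time of a full pass, and the interval tells you exactly how much precision you traded away.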


Stay tuned – I’ll be coming out with a big data analytics class in the new year.  If you want to learn more about SQL, programming, open source GIS, or Manifold, check out my courses at www.gisadvisor.com.

Big data geo-analytics with SQL

I’m getting ready to create a course in big data analytics with GIS.  I have lots of ideas as to what to do, but one thing I know is that I will be using spatial databases and SQL.  I’ll also be using Manifold Future.

ESRI has recently introduced their ArcGIS GeoAnalytics Server, which will introduce many GIS professionals to big data analytics with GIS.  They have some interesting scenarios and example data using NYC taxi cabs.  I think these will be really good case studies.

This video (just shy of 20 minutes) uses SQL and Manifold to try to address these big data problems.

Keep an eye on my blog as I will be rolling out new ideas as I prepare my course for the Spring.

If you like the video and want to learn more about improving your spatial database skills, check out my videos at www.gisadvisor.com.

k-nearest neighbor with SQL in Radian Studio

I wanted to give you another look at some features that Radian Studio will offer. I’ve shown how we can use SQL to replicate the ARC/INFO NEAR function, and how to perform Nearest Neighbor Analysis. But another useful tool is the ability to identify k-nearest neighbors. That is, rather than just identifying the nearest neighbor, you might want to identify the two, three, or k nearest neighbors.

Radian will allow that functionality by using the COLLECT aggregate clause. The COLLECT aggregate collects values from a subgroup into a table, returning a table with one or more fields.

It is like a SELECT which runs on a group: COLLECT takes a table and returns a table without requiring us to write a FROM section as we would with a SELECT. This is the kind of thing the real grown-up databases like Oracle use, and Manifold is going to give it to us as part of Radian Studio.


SELECT park_no1,
       SPLIT (COLLECT park_no2, dist ORDER BY dist ASC FETCH 3)
FROM (
  SELECT a.name AS park_no1, b.name AS park_no2,
         GeomDistance(a.[geom (i)], b.[geom (i)], 0) AS dist
  FROM [parks Table] AS a, [parks Table] AS b
  WHERE a.name <> b.name
)
GROUP BY park_no1
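For comparison – my own aside, not from the Radian post – the same k-nearest-neighbors pattern in PostGIS is typically written with a LATERAL join (a sketch assuming a parks table with name and geom columns):

-- For each park, find the 3 nearest other parks and their distances.
SELECT a.name AS park_no1, b.name AS park_no2, b.dist
FROM parks a
CROSS JOIN LATERAL (
  SELECT p.name, ST_Distance(a.geom, p.geom) AS dist
  FROM parks p
  WHERE p.name <> a.name
  ORDER BY p.geom <-> a.geom   -- KNN operator; can use the spatial index
  LIMIT 3
) b;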


Doing a GROUP BY in ArcGIS

As most of you know, I am a big fan of spatial SQL.  It is my go-to tool whenever I’m working with GIS.  But I have seen too many people using ArcGIS get tripped up trying to summarize the results of a spatial operation, because ArcGIS does not support SQL.  So today, to spare my ArcGIS friends the trouble of writing large “for” loops in Arcpy that take hours to run, I want to show you two lines of Arcpy that very quickly replicate the GROUP BY function in SQL:

Using my favorite GIS data set in Tompkins County, NY, assume we have two layers: parcels2007 and watersheds:

[Image: the parcels2007 and watersheds layers]
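For anyone rusty on the SQL side, the pattern being replicated is the classic GROUP BY summary – something along these lines (a hypothetical query; the column names are invented for illustration):

-- Count the parcels in each watershed and total their acreage.
SELECT w.wshed_name, count(*) AS n_parcels, sum(p.acres) AS tot_acres
FROM parcels2007 p
JOIN watersheds w
  ON ST_Contains(w.geom, ST_Centroid(p.geom))
GROUP BY w.wshed_name;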


When More is Less…. lessons from processing large data files

My good friend Stuart Hamilton gave me a fun conundrum to try out. He has a file of province boundaries (400 areas) and lidar-derived mangrove locations (37 million points – 2.2GB in size). He wants to find the number of mangrove points contained in each area.  He also wants to know which country each mangrove location is in.  An overview of the area is here:

[Image: overview of the province boundaries and mangrove points]

But, as you zoom in, you can see that there are a tremendous number of points:

[Image: zoomed-in view showing the density of the mangrove points]

The problem

You would think that overlaying 37 million points with 400 polygons wouldn’t be too much trouble – but it was.  Big-time trouble.  In fact, after running for days in ArcGIS, Manifold GIS, Postgres/PostGIS, and Spatial Hadoop, it simply would not complete.
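For reference, the straightforward form of the count-per-province query is nothing exotic – a sketch with hypothetical table names, and exactly the kind of statement that would not finish at this scale:

-- Count the mangrove points falling inside each province.
SELECT pr.province_id, count(*) AS n_mangroves
FROM provinces pr
JOIN mangrove_pts m
  ON ST_Contains(pr.geom, m.geom)
GROUP BY pr.province_id;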

New Books – How do I do that in PostGIS, How do I do that in Manifold SQL

I have two new books out – How do I do that in PostGIS and How do I do that in Manifold SQL.

From the back cover of How do I do that in PostGIS:

For those who are unsure if SQL is a sufficient language for performing GIS tasks, this book is for you. This guide follows the topic headings from the book How do I do that in ArcGIS/Manifold, as a way to illustrate the capabilities of the PostGIS SQL engine for accomplishing classic GIS tasks. With this book as a resource, users will be able to perform many classic GIS functions using nothing but SQL.


Another Radian Test – Finding the distance between lines and areas

Following up on my previous post with ArcGIS and the Near Table, I created a SQL query in Manifold 8 to compute the near distances and group the points into distance bands (I grouped them every 50 km).  The entire process took 47 seconds (or about 9 times faster than ArcGIS 10.1).

But, to keep things on the same playing field, I just computed the NEAR part of the query, and it ran in 40 seconds.  So, Manifold 8 was way faster than ArcGIS 10.1, but 3x slower than ArcGIS Pro.

I then wrote the following query in the Radian engine:

SELECT count(*) AS CNT, 
       first(floor(GeomDistance([L Table].[Geom (I)], 
       [P Table].[Geom (I)], 1)/50000)*50000+50000) AS DistZone, 
       [UNIQUE_ID] 
INTO bobo 
FROM [P Table] 
RIGHT JOIN [L Table] 
ON GeomWithin([L Table].[Geom (I)],[P Table].[Geom (I)], 500000,1) 
GROUP BY [UNIQUE_ID] 
THREADS 4

This query took 30 seconds (versus 47 seconds in Manifold 8 – about a third faster).

Once again, to level the playing field, I created a query to run just the NEAR aspect:

SELECT GeomDistance([L Table].[Geom (I)], [P Table].[Geom (I)], 1) AS DistZone, [UNIQUE_ID]
INTO bobo2
FROM [P Table]
RIGHT JOIN [L Table] ON 
GeomWithin([L Table].[Geom (I)],[P Table].[Geom (I)], 500000,1)
THREADS 4
BATCH 64

This ran in 20 seconds.  In this case, ArcGIS Pro ran slightly faster than Manifold 9 – but remember, I am still working with an alpha/beta release of Radian, and not all of the optimizations have been turned on. I can’t wait to see what the next beta will reveal.

Again, the simplicity of SQL in conjunction with the parallel nature of the Radian engine provides some very interesting opportunities for working with complex processes and large amounts of data.