Follow up to my big data test – improving PostGIS performance

Just a quick follow-up to my big data test.  If you remember, I was able to determine the number of taxi pickups and the sum of the fares for each zone using Postgres and PostGIS in 1m 40s.  Some of the taxi zones are quite large, so the containment query can take longer: the large bounding boxes stored in the spatial index match many pickup points that the actual polygon does not contain, and each false candidate forces a full containment test.  To get around that, I used ST_Subdivide to break the larger taxi zones into smaller polygons:
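To see what ST_Subdivide does on its own, here is a minimal, self-contained sketch (the geometry and the vertex limit are made up purely for illustration): buffering a point produces a many-vertex polygon, and ST_Subdivide splits it into pieces that each stay under the vertex limit:

```sql
-- Illustrative only: ST_Buffer of a point yields a circle-like polygon
-- with 33 vertices by default; subdividing with a limit of 10 vertices
-- returns several smaller polygons covering the same area.
SELECT count(*) AS pieces
FROM ST_Subdivide(ST_Buffer(ST_MakePoint(0, 0), 100), 10) AS piece;
```

Each resulting piece has a much tighter bounding box than the original polygon, which is exactly why the spatial index gets faster.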

[Figure: the taxi zones after running ST_Subdivide]

This meant that my taxi zone polygons went from 263 to 4,666.  Now, on the face of it, what idiot would do an overlay with 4,666 polygons when 263 is fewer?  This idiot!  To understand why, you should read my blog post When More is Less – you’ll see there is good logic behind the madness.  Anyway, that’s what I did, and the query time went from 1m 40s down to 1m 3s.

For those of you interested, I broke the zones up as follows:

SELECT ST_Subdivide("Geom", 50) AS geom, zone
INTO taxisub
FROM taxizones;

-- CLUSTER needs an index to order by, so build the GiST index first
CREATE INDEX geom_idx ON taxisub USING GIST (geom);
CLUSTER taxisub USING geom_idx;

and yes, CLUSTER once again made a big difference.

I guess I should explain the SQL this time around, as it lets us do some clever things.  Remember, taxisub has 4,666 polygons because it subdivided the 263 polygons in taxizones.

SELECT taxizones."Geom" AS geom, a.numrides, a.sumfare, a.zone
INTO sumtable
FROM taxizones,
   (SELECT taxisub.zone,
           count(taxisandy.id) AS numrides,
           sum(taxisandy.fare_amount) AS sumfare
    FROM taxisub
    JOIN taxisandy
      ON ST_Contains(taxisub.geom, taxisandy.pu_geom)
    GROUP BY taxisub.zone) AS a
WHERE taxizones.zone = a.zone;

In the above query, the inner portion is the straight-up SQL that counts the rides and sums the fares for each polygon in taxisub.  However, taxisub has over 4,000 polygons – we don’t want 4,000+ rows in the output.  So the outer portion of the query joins back to the original taxizones table (the one with only 263 polygons) on the zone name, and writes the aggregated results out to a final table.
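If the nested subquery is hard to read, the same logic can be written with a CTE – this is just an equivalent sketch of the query above against the same tables and columns, not a different method:

```sql
WITH per_zone AS (
    -- aggregate pickups against the subdivided polygons,
    -- grouping by the original zone name
    SELECT taxisub.zone,
           count(taxisandy.id)        AS numrides,
           sum(taxisandy.fare_amount) AS sumfare
    FROM taxisub
    JOIN taxisandy
      ON ST_Contains(taxisub.geom, taxisandy.pu_geom)
    GROUP BY taxisub.zone
)
SELECT z."Geom" AS geom, p.numrides, p.sumfare, p.zone
INTO sumtable
FROM taxizones AS z
JOIN per_zone AS p ON z.zone = p.zone;
```

Because grouping happens on the zone name rather than the subdivided geometry, the pieces of each zone collapse back into a single row per zone.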

Again, if you want to learn how to do more spatial SQL like this, check out my courses at http://www.gisadvisor.com.  

 

3 thoughts on “Follow up to my big data test – improving PostGIS performance”

  1. May I ask you to retry:

    – with PostGIS 2.5, which has different ST_Subdivide method that leaves less extra points in subdivision contour,

    – don’t do CLUSTER, but instead add ORDER BY geom into ST_Subdivide generating query? That should yield better spatial clustering.

