Follow up to my big data test – improving PostGIS performance

Just a quick follow-up to my big data test. If you remember, I was able to determine the number of taxi pickups and the sum of the fares for each zone using Postgres and PostGIS in 1m 40s. Some of the taxi zones are fairly large, so the containment query can take longer than it should: big polygons mean big bounding boxes, and the spatial index has to pass more candidate points through to the exact containment test. To get around that, I used ST_Subdivide to break the larger taxi zones into smaller polygons.

This meant that my taxi zone polygons went from 263 to 4,666. Now, on the face of it, what idiot would do an overlay with 4,666 polygons when 263 is smaller? This idiot! To understand why, you should read my blog post on When More is Less, and you'll see there is good logic behind the madness. Anyway, that's what I did, and the query went from 1m 40s down to 1m 3s.

For those of you interested, I broke the zones up as follows:

SELECT ST_Subdivide("Geom", 50) AS geom, zone
INTO taxisub
FROM taxizones;

CREATE INDEX geom_idx ON taxisub USING GIST (geom);
CLUSTER taxisub USING geom_idx;

and yes, CLUSTER once again made a big difference.
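If you want to sanity-check the split, a quick count (using the same table names as above) confirms the before-and-after polygon totals:

```sql
-- Compare polygon counts before and after ST_Subdivide.
SELECT (SELECT count(*) FROM taxizones) AS original_zones,   -- 263
       (SELECT count(*) FROM taxisub)   AS subdivided_polys; -- 4,666
```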

I guess I should explain the SQL this time around, as it enables us to do some clever things. Remember, taxisub has 4,666 polygons because it subdivides the 263 polygons in taxizones.

SELECT taxizones."Geom" AS geom, a.numrides, a.sumfare, a.zone
INTO sumtable
FROM taxizones,
    (SELECT taxisub.zone,
            count(taxisandy.id) AS numrides,
            sum(taxisandy.fare_amount) AS sumfare
     FROM taxisub
     JOIN taxisandy
       ON ST_Contains(taxisub.geom, taxisandy.pu_geom)
     GROUP BY taxisub.zone) AS a
WHERE taxizones.zone = a.zone;

In the above query, the inner portion is the straight-up SQL to determine the total rides and the sum of the fares for each subdivided polygon, grouped back up by zone. However, taxisub has over 4,000 polygons, and we don't want that many rows in our output. So the outer portion of the query joins the result back to the original taxizones table (the one with only 263 polygons) and writes it out to a final table.
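If you want to confirm the join is actually using the spatial index rather than falling back to a sequential scan, running EXPLAIN ANALYZE on the inner query is the quickest check (a sketch; your plan output will vary):

```sql
-- Look for an index scan on geom_idx in the plan output; a sequential
-- scan on taxisub here would mean the GiST index is not being used.
EXPLAIN ANALYZE
SELECT taxisub.zone, count(taxisandy.id) AS numrides
FROM taxisub
JOIN taxisandy ON ST_Contains(taxisub.geom, taxisandy.pu_geom)
GROUP BY taxisub.zone;
```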

Again, if you want to learn how to do more spatial SQL like this, check out my courses at http://www.gisadvisor.com.

6 thoughts on “Follow up to my big data test – improving PostGIS performance”

1. May I ask you to retry:

– with PostGIS 2.5, which has a different ST_Subdivide method that leaves fewer extra points in the subdivision contour;

– without CLUSTER, instead adding ORDER BY on the geometry to the ST_Subdivide query? That should yield better spatial clustering.

2. Thanks. I’ll give it a try.
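For anyone curious, the suggested change would look something like this (a sketch I haven't benchmarked): since PostGIS 2.4, ORDER BY on a geometry column sorts along a space-filling curve, so the subdivided rows land on disk in spatial order without a separate CLUSTER step.

```sql
-- Write out the subdivided polygons in spatial order instead of
-- clustering afterwards.
SELECT ST_Subdivide("Geom", 50) AS geom, zone
INTO taxisub
FROM taxizones
ORDER BY geom;
```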
