How I used SQL to generate an accumulated sum of spatial data
My friend Tomas does work with business analytics, and wanted to find a way to perform trade area analysis. He had a bunch of stores, and a map of census areas with populations. What he wanted to figure out was:
how far do I have to radiate out from each store before I hit 5,000 customers
So for each store, he wanted to basically generate concentric buffers and sum up the population of the census areas before he hit 5,000. Once he found the census areas, he also wanted to generate a convex hull to show the area. ESRI has a really nice tool that performs this as part of their business analyst package, called threshold trade areas.
Check it out, it seems pretty cool.
Well, to help my friend I was thinking that I would determine the distance from every store to every census area, and then write a script with a giant for loop to iterate through each record, accumulating the results (of course, I would have to integrate some kind of do..while loop and an if…then to make sure I accumulated different counts for each store until I hit the threshold I wanted. At that point I began asking myself so, how good of friend is Tomas?
What I did instead was write an SQL script to do it. I’ve color coded it below to explain what I was doing….
SELECT ST_ConvexHull(ST_Collect(g)) as geometry, max(sumpop) as sumpop, nameINTO tradeareaFROM ( SELECT a.name, SUM(a.totpop) AS sumpop, ST_Collect(a.geometry) as g FROM (SELECT stores.name, censusunit.totpop, censusunit.geometry, ST_Distance(censusunit.geometry,stores.geometry) as dist FROM stores, censusunit ) AS a,
(SELECT stores.name, censusunit.totpop, censusunit.geometry, ST_Distance(censusunit.geometry,stores.geometry) as dist FROM stores, censusunit ) AS b WHERE a.name = b.name AND a.dist <= b.dist GROUP BY a.name, b.dist ) AS T1WHERE sumpop < 5000GROUP BY name
The middle portion in orange collects the names of the stores, the population of the census areas, the distance between each store and each census area, and the geometry of the census areas for each combination of stores and census areas. So, if you have 5 stores and 1,000 census areas, you would have a table with 5,000 rows:
Table 1. Result of orange query, sorted by store name and distance for clarity.
Now, I’ve issued the command twice because I wanted to create two tables of the same information. One table I am calling A, and the other table I am calling B.
The next part in green is a query where I select from tables A and B the store name, the sum of the populations for the census areas, and a collection of the geometries (the ST_Collect function basically creates a multi-part geometry of all the geometries that are selected). In the WHERE clause I make sure that the store names are the same and, the distance between the store in Table A is less than or equal to the distance to the store in Table B (this is a clever little trick that gets me all the areas in one table that are less than a particular distance from the other table). Finally, the GROUP BY statement allows me to obtain the sum of all the distances that are less than my target distance, and also the collection of geometries that meet the criteria.
Table 2. Each store now has the joined geometries and the sum of the population for each distance.
Notice what is going on here. For the store Max Liberia, each row shows a progressively increasing sumpop. Look back at Table 1, and you’ll notice that the population at 0m was 654 and the population at 204m was 419. So, 654 + 419 = 1074! This little trick actually created an accumulated sum. I’m still shaking my head over this!
We are now on to the red part of the query (getting ready to bring this thing home). At this point we now have Table 2 in memory, so lets see what the query is doing. We are taking all the geometries in the column g, and and creating a convex hull – now, we get one single polygon. We are also getting the maximum totpop value (remember, that value represents the accumulated population of all the census areas that are closer than that particular distance), and the name of the store. But, that doesn’t happen by magic. At the bottom of the query, we issue a WHERE clause that finds all the records where sumpop is less than 5,000 (our target population). The GROUP BY simply creates records for each store and allows us to perform those aggregate clauses of max and ST_ConvexHull. The resulting table looks like this:
Table 3. Resulting table of 3 convex hulls representing census areas that have an accumulated total population less than 5,000 people.
so the GROUP BY allowed for the creation of a table with 3 records (because there were 3 stores), and the geometry field is a convex hull of all the census areas that met the criteria for having a totpop less than 5,000.
I then brought the resulting table into Manifold and QGIS for display.
Table 4. Trade area zones for each store overlain on top of census areas. The trade areas are a convex hull of each stores territory that covers an accumulated population less than 5,000.
Final Thoughts
While this analysis runs in under 3 seconds, what is most impressive is that I conceived of, and wrote the SQL query in about 8-10 minutes. If you follow my logic, you’ll see why: I started with the orange part of the query – just a simple distance query, and then asked myself how can I get the accumulated sum of the populations, which led me to the green part of the query, which then made the final red part of the query a no-brainer.
If you want to learn how to create spatial SQL like this, please consider taking my SQL course – it’s comprehensive, and cheap! The course is Spatial SQL: A Language for Geographers, and this link gets you the course for under $30.
Comments