k-Nearest Neighbor with PostGRES

Well, even though you didn’t ask for it, I decided to create a k-nearest neighbor analysis in PostGRES/PostGIS.  This has lots of potential: find the 5 nearest ATMs to every fast food restuarant; find the 3 closest medical clinics to each school; or find the 10 closest bars to each college.

Now, the LIMIT clause does allow us to find the k closest features to a particular feature.  But in this case, I’m not interested in a particular feature, I want the results for all features.

To do this in Postgres, I’m going to show you two things that I haven’t used until today: rank() and partition. Continue reading

k-nearest neighbor with SQL in Radian Studio

I wanted to give you another look at some features that Radian Studio will offer. I’ve shown how we can use SQL to replicate the ARC/INFO NEAR function, and how to perform Nearest Neighbor Analysis. But, another useful took is the ability to identify k-nearest neighbors. That is, rather than just identifying the nearest neighbor, you might want to identify the two, three, or k nearest neighbors.

Radian will allow that functionality by using the COLLECT aggregate clause. The COLLECT aggregate collects values from a subgroup into a table, returning a table with one or more fields.

it is like a SELECT which runs on a group. COLLECT takes a table and returns a table without requiring us to write a FROM section as we would with a SELECT. This is stuff that the real grown up databases like Oracle use, and Manifold is going to give it to us as part of Radian Studio.


SELECT park_no1,
SPLIT(COLLECT park_no2, dist
ORDER BY dist ASC FETCH 3
)
FROM
(
SELECT a.name AS park_no1, b.name AS park_no2,
GeomDistance(a.[geom (i)], b.[geom (i)], 0) AS dist
FROM [parks Table] AS a , [parks Table] AS b
WHERE a.name <> b.name
)
GROUP BY park_no1

Continue reading

Trade Area Analysis in Postgres / PostGIS

How I used SQL to generate an accumulated sum of spatial data

My friend Tomas does work with business analytics, and wanted to find a way to perform trade area analysis.  He had a bunch of stores, and a map of census areas with populations.  What he wanted to figure out was:

how far do I have to radiate out from each store before I hit 5,000 customers

So for each store, he wanted to basically generate concentric buffers and sum up the population of the census areas before he hit 5,000.  Once he found the census areas, he also wanted to generate a convex hull to show the area.  ESRI has a really nice tool that performs this as part of their business analyst package, called threshold trade areas.

Check it out, it seems pretty cool.

Well, to help my friend I was thinking that I would determine the distance from every store to every census area, and then write a script with a giant for loop to iterate through each record, accumulating the results (of course, I would have to integrate some kind of do..while loop and an if…then to make sure I accumulated different counts for each store until I hit the threshold I wanted.  At that point I began asking myself so, how good of friend is Tomas?

What I did instead was write an SQL script to do it. I’ve color coded it below to explain what I was doing….

SELECT ST_ConvexHull(ST_Collect(g)) as geometry, max(sumpop) as sumpop, name
INTO tradearea
FROM
 (
  SELECT a.name, SUM(a.totpop) AS sumpop, ST_Collect(a.geometry) as g
  FROM 
     (SELECT stores.name, censusunit.totpop, censusunit.geometry,
      ST_Distance(censusunit.geometry,stores.geometry) as dist
      FROM stores, censusunit
      ) AS a,
     (SELECT stores.name, censusunit.totpop, censusunit.geometry,
      ST_Distance(censusunit.geometry,stores.geometry) as dist
      FROM stores, censusunit
      ) AS b
  WHERE a.name = b.name
  AND a.dist <= b.dist
  GROUP BY a.name, b.dist
  ) AS T1
WHERE sumpop < 5000
GROUP BY name

The middle portion in orange collects the names of the stores, the population of the census areas, the distance between each store and each census area, and the geometry of the census areas for each combination of stores and census areas.  So, if you have 5 stores and 1,000 census areas, you would have a table with 5,000 rows: Continue reading

SQL and ogr to create 4D lines

A friend recently asked me to help him generate polyline shapefiles with Z and M values that he could deliver to a customer.  The problem was, the software he was using supported the import of Z and M values, but did not support the export of those files.  The other problem was, he has zillions of data tables that he needed to export!

Fortunately, PostGIS allows us to create Polyline ZM data rather easily.  The next part was to figure out how to get it exported to a shapefile.  So, here goes:

Assume you have a table with 4 columns: lineID, Z, M, and geometry.  The geometry field is a point feature that represent vertices on a line.  The lineID separates all those points by which line they are part of, and the Z and M values are, well, Z and M.

Make the Polyline ZM

The SQL command to create a 4D line is:

SELECT  lineid, 
ST_MakeLine(ST_MakePoint(ST_X(geometry), 
                         ST_Y(geometry), z, m)
            ) AS g
FROM zpts
GROUP BY lineid;

In this case, I am making a line (ST_MakeLine) as a series of X (ST_X), Y (ST_Y), Z, and M values.  Grouping the results of the query by the lineid allows me to create a line for each set of points. Continue reading

Doing a GROUP BY in ArcGIS

As most of you know, I am a big fan of spatial SQL.  It is my go-to tool whenever working with GIS.  But, I have seen too many people using ArcGIS get tripped up with trying to summarize the results of a spatial operation because ArcGIS does not support SQL.  So today, to spare my ArcGIS friends the trouble of writing large “for” loops in Arcpy to populate data that takes hours to run, I want to show you two lines of Arcpy to very quickly replicate the GROUP BY function in SQL:

Using my favorite GIS data set in Tompkins County, NY, assume we have two layers: parcels2007 and watersheds:parwat

  Continue reading

Spatial is Not Special – Quadrat Analysis

In our book we illustrated the use of quadrat analysis for determining whether points were random, clustered, or distributed.  Figure 14.9 from the book showed a point sample of 2,500 points, and Table 14.4 showed the mathematical calculation for quadrat analysis.

Image

 

 

Image

The calculations look pretty daunting, don’t they?  But, in actuality, its basic arithmetic.  In this blog I am only going to illustrate how we obtained the correct variance to mean ratio using spatial SQL.  If you want to understand quadrat analysis, check out the book, or do a web search. Continue reading

Spatial is Not Special – Nearest Neighbor Index

 

It is nice to get back to the book, and start talking about Statistical Problem Solving in Geography again.  Today we are going to look at the Nearest Neighbor Index.  You can refer to chapter 14 where we illustrate the computation of the nearest neighbor index using a set of 7 points:

 nnfig

Then, for each point we determine the nearest neighbor, its distance, and ultimately the average nearest neighbor distance over all the points:

 nncalc

To develop the SQL, we will take it one step at a time.  First, we want to compute the distance from every point to every other point:

SELECT a.pt, b.pt,distance(a.[Geom (I)],b.[Geom (I)]) AS dist
FROM points AS a, points AS b
WHERE a.pt <> b.pt
ORDER BY dist

This query gives us a table of the distance from every point to every other point.  We also play that game again where we rename the table “points” as “a” and “b” so that SQL thinks we have two different tables.  We also have to put a WHERE clause in to make sure we aren’t measuring from one point to itself – because the distance will be 0. Continue reading

ARC/INFO Functions in SQL – IDENTITY

The ARC/INFO IDENTITY operation seen here is sometimes confusing because it takes ALL of the Input feature, and the part of the IDENTITY feature that intersects the input feature and merges it into a new feature class.  I recreated the features in the ESRI help documentation so that we will work with a feature class called “Circle”, and one called “Rectangles”.

identity

To show how this is done in SQL, we should simply focus on the geometry aspect first, then bring the attributes in later.

SELECT
      ClipIntersect
(circle.id,rectangles.id) g, int(circle.id) AS cid, int(rectangles.id) AS rid
FROM circle, rectangles
UNION ALL
SELECT
ClipSubtract
(rectangles.id,circle.id) AS g, int(circle.id) AS cid, int(rectangles.id) AS rid
FROM circle, rectangles

The IDENTITY operation takes two geometric operations.  First, we have to Intersect the two features to find the intersection area.  Then, we have to perform a ClipSubract to only include those areas of the input feature.  Now, the UNION ALL clause in SQL creates a new table, and sticks it under the previous table – therefore, we must have the same columns in order to do it.  The above SQL will create a geometric representation like this one:

Identityresult

So, this is what it should LOOK like, but the real power is that the attributes from the inputs are retained in the output. To do that, we take our above query, and just wrap it in a RIGHT JOIN clause:

SELECT * FROM
(
SELECT

      ClipIntersect
(circle.id,rectangles.id) g, int(circle.id) AS cid, int(rectangles.id) AS rid
FROM circle, rectangles
UNION ALL
SELECT
ClipSubtract
(rectangles.id,circle.id) AS g, int(circle.id) AS cid, int(rectangles.id) AS rid
FROM circle, rectangles
)
RIGHT JOIN [circle] ON circle.id = cid
RIGHT JOIN [rectangles] ON rectangles.id = rid
WHERE IsArea(g)

Go ahead, and give it a try with the example I have on my website.  Also, don’t forget our idea of spatial is not special.  This IDENTITY command isn’t some self contained function for which you have no control – you can always add other interesting clauses in the WHERE statement to select out certain features first, or some other interesting query.

Note:  A recent discussion on georeference.org illustrated that a lot of null values were returned.  The reason for this is because when we ask to return a ClipIntersect, if two objects don’t intersect, we’ve still asked the query engine to give us that result, and the result is in fact a null value.  So, I’ve added one line of code to only return the geometry if it is an area feature – I’ve written this in RED.  That is all you need, and it will work with more complicated features.

ARC/INFO Functions in SQL – INTERSECT

I promised you that I would address the attributes.  Remember, our previous query only performed the ClipSubract for the geometries, and did not add any additional attribute columns.  There are two ways to add the attributes.  The simplest way is to just call the columns individually or with a *:

SELECT ClipSubtract(square.[Geom (I)],box.[Geom (I)]) AS g, square.*
FROM square, box

That works fine for this example, but when we start merging attributes from multiple tables, things get a little trickier.  So, we can perform a RIGHT-JOIN on the resulting geometric table to stuff the attributes at the end of the table.  For example, to add the attributes, we simply write:

SELECT * FROM
(
SELECT ClipSubtract(square.[Geom (I)],box.[Geom (I)]) AS g,
square.id AS sid
FROM square, box
)
RIGHT JOIN [square] on square.id = sid

Here we are returning the geometric erase, but also the unique ID value and calling that sid.  The RIGHT JOIN clause takes the the attributes from the original [square] table and links it back in to the resulting geometric result in the inner portion of the query.

OK, now lets do the same thing with INTERSECT (here we will use the CIRCLE and RECTANGLE shapes that are like the ESRI help example:

SELECT * FROM
(
SELECT ClipIntersect(a.id,b.id) AS g, int(a.id) cid, int(b.id) AS rid
FROM circle AS a, rectangles AS b
)
RIGHT JOIN [circle] ON circle.id = cid
RIGHT JOIN [rectangles] ON rectangles.id = rid

Since we are dealing with more than one table, I decided to make things a little more generic and renamed the individual geometry tables a and b so to distinguish them from the original [circle] and [rectangles] tables.  Also, since we are dealing with more than one unique ID, I decided to call the [Circle] ID value cid, and the [Rectangles] ID value rid.