k-nearest neighbor with SQL in Radian Studio

I wanted to give you another look at some features that Radian Studio will offer. I’ve shown how we can use SQL to replicate the ARC/INFO NEAR function, and how to perform Nearest Neighbor Analysis. But, another useful took is the ability to identify k-nearest neighbors. That is, rather than just identifying the nearest neighbor, you might want to identify the two, three, or k nearest neighbors.

Radian will allow that functionality by using the COLLECT aggregate clause. The COLLECT aggregate collects values from a subgroup into a table, returning a table with one or more fields.

it is like a SELECT which runs on a group. COLLECT takes a table and returns a table without requiring us to write a FROM section as we would with a SELECT. This is stuff that the real grown up databases like Oracle use, and Manifold is going to give it to us as part of Radian Studio.


SELECT park_no1,
SPLIT(COLLECT park_no2, dist
ORDER BY dist ASC FETCH 3
)
FROM
(
SELECT a.name AS park_no1, b.name AS park_no2,
GeomDistance(a.[geom (i)], b.[geom (i)], 0) AS dist
FROM [parks Table] AS a , [parks Table] AS b
WHERE a.name <> b.name
)
GROUP BY park_no1


So what’s happening here? Let’s walk through it:

Selecting distances from each park to every other park

First, the inner query (lines 7-10) is getting us the distances from every point to every other point. In this case, I’m using the same layer (parks) and therefore playing the little trick I’ve shown you before – renaming the parks table as A and B so that the SQL engine thinks they are two different tables.  So, the result would look something like this:

innerq

Table 1. Result of a query to find the distance from every park to every other park

Collecting the Results

Remember that the COLLECT aggregate collects values from a subgroup into a table – we can’t actually see the table from the COLLECT aggregate – we have to split the table using the SPLIT function.  But you’ll notice that in the COLLECT portion (lines 1-3) we fetch 3 of the nearest neighbors for each park (notice that we are sorting the data in ASCENDING order so that the three closest parks are selected).  Notice the GROUP BY in the last line -that will quantify the results by each park number so that the resulting table looks like this:

qresult

Table 2.  Resulting query that selects the three nearest parks to every other park.

This is pretty powerful stuff. There are all kinds of ways to use it – finding the three closest ATMs to every pub in the city, or finding the 5 closest  fire stations to every factory. Try and return the 3 nearest neighbors for each feature in another GIS, and you’ll be scratching your head for awhile (there is a reason why I’ve demonstrated nearest neighbor in the past, but not demonstrated k-neighbors!).

And of course, the best part of SQL is you can simply wrap the entire query in parentheses, and add another line like:

SELECT avg(dist) FROM

and you have yourself a nearest neighbor calculation for a k-nearest neighbor.

One thought on “k-nearest neighbor with SQL in Radian Studio

  1. Thanks for this helpful code Art. Thought you might like to see how fetching the top N rows was tackled by Tim in Manifold 8 in case you missed it. http://www.georeference.org/forum/t122582.14
    As Adam states “The technique of joining a query with a copy of itself with an ordering condition, then grouping and using COUNT to establish ranking is very useful (and generic), indeed.”

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s