ARC/INFO Functions in SQL – IDENTITY

The ARC/INFO IDENTITY operation seen here is sometimes confusing because it takes ALL of the Input feature, and the part of the IDENTITY feature that intersects the input feature and merges it into a new feature class.  I recreated the features in the ESRI help documentation so that we will work with a feature class called “Circle”, and one called “Rectangles”.

identity

To show how this is done in SQL, we should simply focus on the geometry aspect first, then bring the attributes in later.

SELECT
      ClipIntersect
(circle.id,rectangles.id) g, int(circle.id) AS cid, int(rectangles.id) AS rid
FROM circle, rectangles
UNION ALL
SELECT
ClipSubtract
(rectangles.id,circle.id) AS g, int(circle.id) AS cid, int(rectangles.id) AS rid
FROM circle, rectangles

The IDENTITY operation takes two geometric operations.  First, we have to Intersect the two features to find the intersection area.  Then, we have to perform a ClipSubract to only include those areas of the input feature.  Now, the UNION ALL clause in SQL creates a new table, and sticks it under the previous table – therefore, we must have the same columns in order to do it.  The above SQL will create a geometric representation like this one:

Identityresult

So, this is what it should LOOK like, but the real power is that the attributes from the inputs are retained in the output. To do that, we take our above query, and just wrap it in a RIGHT JOIN clause:

SELECT * FROM
(
SELECT

      ClipIntersect
(circle.id,rectangles.id) g, int(circle.id) AS cid, int(rectangles.id) AS rid
FROM circle, rectangles
UNION ALL
SELECT
ClipSubtract
(rectangles.id,circle.id) AS g, int(circle.id) AS cid, int(rectangles.id) AS rid
FROM circle, rectangles
)
RIGHT JOIN [circle] ON circle.id = cid
RIGHT JOIN [rectangles] ON rectangles.id = rid
WHERE IsArea(g)

Go ahead, and give it a try with the example I have on my website.  Also, don’t forget our idea of spatial is not special.  This IDENTITY command isn’t some self contained function for which you have no control – you can always add other interesting clauses in the WHERE statement to select out certain features first, or some other interesting query.

Note:  A recent discussion on georeference.org illustrated that a lot of null values were returned.  The reason for this is because when we ask to return a ClipIntersect, if two objects don’t intersect, we’ve still asked the query engine to give us that result, and the result is in fact a null value.  So, I’ve added one line of code to only return the geometry if it is an area feature – I’ve written this in RED.  That is all you need, and it will work with more complicated features.

ARC/INFO Functions in SQL – INTERSECT

I promised you that I would address the attributes.  Remember, our previous query only performed the ClipSubract for the geometries, and did not add any additional attribute columns.  There are two ways to add the attributes.  The simplest way is to just call the columns individually or with a *:

SELECT ClipSubtract(square.[Geom (I)],box.[Geom (I)]) AS g, square.*
FROM square, box

That works fine for this example, but when we start merging attributes from multiple tables, things get a little trickier.  So, we can perform a RIGHT-JOIN on the resulting geometric table to stuff the attributes at the end of the table.  For example, to add the attributes, we simply write:

SELECT * FROM
(
SELECT ClipSubtract(square.[Geom (I)],box.[Geom (I)]) AS g,
square.id AS sid
FROM square, box
)
RIGHT JOIN [square] on square.id = sid

Here we are returning the geometric erase, but also the unique ID value and calling that sid.  The RIGHT JOIN clause takes the the attributes from the original [square] table and links it back in to the resulting geometric result in the inner portion of the query.

OK, now lets do the same thing with INTERSECT (here we will use the CIRCLE and RECTANGLE shapes that are like the ESRI help example:

SELECT * FROM
(
SELECT ClipIntersect(a.id,b.id) AS g, int(a.id) cid, int(b.id) AS rid
FROM circle AS a, rectangles AS b
)
RIGHT JOIN [circle] ON circle.id = cid
RIGHT JOIN [rectangles] ON rectangles.id = rid

Since we are dealing with more than one table, I decided to make things a little more generic and renamed the individual geometry tables a and b so to distinguish them from the original [circle] and [rectangles] tables.  Also, since we are dealing with more than one unique ID, I decided to call the [Circle] ID value cid, and the [Rectangles] ID value rid.

ARC/INFO Functions in SQL – ERASE, INTERSECT, IDENTITY, UNION

Today we are beginning another four-part session on the classical ESRI topological overlay functions of Erase, Intersect, Identity, and Union.  If you were an old ARC/INFO user, this is what made you get into GIS in the first place.  In fact, in the 1980s, if someone had a GIS the first thing you asked them was “can you do an INTERSECT?”  If you are not familiar with this, please, please, please visit the ESRI online help to learn about the functions here.

OK, a little name dropping here…..

In 1990 I was in Redlands, CA benchmarking ARC/INFO.  We had a really “difficult” task for them: overlay 25,000 hexagons with 25,000 triangles.  They were running it on a Sun  Sparcstation 1 – so, back then, this was very difficult.  They started the task before lunch, and it was done when we got back – very impressive.  After the two day benchmark was over, we were off to Ottawa to visit with GeoVision to run the benchmark there.  As we were leaving, Jack Dangermond grabbed my arm and said “you make them do the intersection test“.  I curtly nodded, and he pointed his finger at me and reiterated “no, I’m serious, you make them do it“.

Lo and behold, when the time came to run the test, the President of GeoVision acknowledged that based on their timing estimation, it would take around 3 days to complete the task – so they opted not to do it.  Back then, ARC/INFO was the only software that could do a task like that.

Now of course, other software products can do these tests, but I don’t think there is anything as fast as ARC/INFO doing overlays on a COVERAGE.

I have a Manifold .map file located here.  I’ve attempted to create the same shapes as is illustrated in the ESRI user manual.  Following the order in the ESRI manual, lets tackle ERASE first.

I did some heads-up digitizing of the box and the square used in the ESRI help file:

squarebox

The SQL for performing the ERASE is:

SELECT ClipSubtract(square.[Geom (I)],box.[Geom (I)]) AS g
FROM square, box

That’s it!  The result looks like this:

erase

Now, we haven’t done anything with the attributes, and this post has gotten a little long due to my earlier self-aggrandizing statements.  So I’ll tell you what – before we do INTERSECT in the next post, I will address the attributes.

Spatial is Not Special – Spatial Interaction completion

We’ve spent some time looking at how to create spatial interaction models using PIVOT tables and spatial constructs in SQL.  The one last thing I wanted to add was to create an interaction model for point data. In this case, we will use the cities file we looked at previously.  I’ve highlighted in red the only change we have to make.  You’ll notice, we took our previous SQL code and changed the adjacency clause to a Distance clause.  In this example, if two cities are within 50 miles, we consider them neighbors.

TRANSFORM COUNT(*)
SELECT [Cities].name
FROM [Cities], [Cities] AS [Cities2]
WHERE DistanceEarth(Cities.[Geom (I)],Cities2.[Geom (i)],”mi”) < 50
GROUP BY [Cities].[Name]
PIVOT [Cities].[Name]

which gives us a binary, symmetrical matrix like:

cityadj

 

You can see that based on our 50 mile distance, Binghamton is a neighbor of Elmira and Ithaca, but is not considered a neighbor of Auburn, Rochester, or Syracuse.

In all of these examples, we have created a spatial weight matrix of some form – whether it be distance or adjacency.  These matrices are often used in fields like transportation modeling, economic geography, environmental modeling, or spatial regression.  I hope you have an opportunity to explore some of the really cool models that are used in these fields, and if you do, you now know how to create a spatial interaction matrix to feed that beast!

 

 

Spatial Interaction (adjacency)

We recently looked at spatial interaction in the form of inverse distance weighting.  However, some spatial interaction matrices are binary in nature (0,1), with 0 meaning the two geographic objects are not neighbors, and 1 meaning they are.  A neighbor in this case could be an adjacent are (i.e. New York and Pennsylvania are neighbors, while New York and Virginia are not) or within a specified distance (i.e. with a specified distance of 100 miles, Rochester and Syracuse are neighbors, but Syracuse and Boston are not).

In our example, we have 6 New England States:

Image

The SQL for creating the binary adjacency matrix is:

TRANSFORM COUNT(*)
SELECT [States].name
FROM [States], [States] AS [States2]
WHERE Adjacent(states.[Geom (I)],states2.[Geom (i)])
GROUP BY [States].[Name]
PIVOT [States2].[Name]

The only differences is that instead of computing the Min(Distance…), we are just obtaining the aggregate COUNT of the adjacent States by using the WHERE clause for spatial adjacency.  This makes sense because Vermont for example is adjacent to New Hampshire and Massachusetts.   But, the count of the number of times Vermot is adjacent to New Hampshire is just once – the same holds true for the other states.

The result is an adjacenty table like the following:

Image

Spatial is Not Special – Spatial interaction

In my book Introduction to Statistical Problem Solving in Geography, we talk about relationships among geographic objects. These relationships are often used in a spatial adjacency matrix.  In the book we refer to the age-old question of who is my neighbor, to better understand what signifies a neighbor.  Well, neighbors could be next to each other (adjacent), they could be within a specific distance (near), or they could be contained within one another.  Also, once we have a neighbor we need to determine what kind of neighbor it is – is it a strong neighbor or a weak neighbor?  That is known as a spatial interaction model.  The next three posts are going to look at forming a spatial interaction model for distance by examining distance matrices.  If you are going to be a quantitative geographer, you are going to have to learn about distance matrices.  I am going to use a Manifold .map file located here.

Lets assume that we have a table of 6 cities called Cities, with a geometry column called ‘g’.

sixcities

The basic distance computation for a set of cities is:

SELECT int(DistanceEarth([Cities].g, [Cities2].g,”mi”)) AS [Miles],
[Cities].name, [Cities2].name
FROM [Cities], [Cities] AS [Cities2]

The DistanceEarth function is a spatial function that computes the ellipsoidal distance between two points, and in our example we are returning the distance as miles.  Because we want to compute the distance from each city to every other city, we select the data from the same table twice, but we call one of the tables [Cities2] so that SQL thinks we are working with a different table.  The result of this query is a table that has “from” city and the “to” city along with the distance in miles.

cityqry

If you are paying attention, you would say “wait, that isn’t a matrix”.  That’s right. To create a matrix, we have to turn the resulting table into a PIVOT table to present the data in a matrix form (we did this in an earlier post):

TRANSFORM Min(int(DistanceEarth([Cities].g, [Cities2].g, “mi”)))
SELECT [Cities].name
FROM [Cities], [Cities] AS [Cities2]
GROUP BY [Cities].[Name]
PIVOT [Cities2].[Name]

In this case, we grab the Min Distance for each pair of cities, and GROUP it by the City name.  This way, we have the minimum distance for each pair of cities.  The PIVOT function will pivot the table so that we create an n x n symmetrical matrix of the distances.  The result is a table that looks like this:

crossmatrix

In our next post, we will look at an inverse and inverse distance squared matrix.  Our final post will create an adjacency matrix both for adjacent geographic objects, and those within a specified lag distance.

Spatial is Not Special – Variograms with SQL

Every geographer knows Tobler’s Law.  Near things are more similar to things that are far away.  From this, we often begin our discussions of spatial autocorrelation.  I’m going to actually show you how to do join count and Moran’s I in SQL, but today I want to show how we can create a variogram using SQL.  But, before you get too excited, I want to present a different take on the variogram, and that is the perspective of the geographer and not the geostatistician.

Here is the problem, the variogram is a really cool tool to assess the spatial relationship that features have with one another.  But, usually only reserve the variogram for higher level geostatistics and most geographers run away screaming from it.  What if instead we simply used the variogram as a descriptive measure for how a geographic feature correlations over space.

In my book Introduction to Statistical Problem Solving in Geography, we introduce the variogram in this simplified descriptive way.  To do that, we illustrate the idea with the last spring frost dates in the Southeastern United States:

Image

We show the students how a simple variogram illustrates how the last spring frost dates are spatially correlated from a descriptive standpoint.  In our dataset, one can see that the data is spatially correlated up to around 400 miles, but after that, it becomes random.

Image

Surprisingly (although if you have been reading this blog, not surprisingly), it is relatively easy to generate a variogram in SQL.  The code is (assuming we have a set of point in a table called ‘p’ with an attribute field [lsf]):

SELECT dist,avg(abs(diff))/2 AS semivariance
FROM
   (SELECT Floor(DistanceEarth(P.id,P3.id,”mi”)/50) AS dist,
(p.[lsf] – p3.[lsf]) AS diff
   FROM p, p AS p3)
GROUP BY dist

Once again, we will dive into the sub-query.  We want to compare every point with every other point, so we need two tables.  Issuing the  FROM p, p AS p3  portion will treat the table [p] like it is another table called [p3] – sneaky.  We are also computing for each point the distance between itself and every other point in the file in miles.  In addition, we are looking at the differences between the last spring frost date and called that result diff.

As for the distances, we are determining the distance between each point, and dividing it by 50 and also using the Floor function.  This is actually going to truncate the points at 50 miles (this is how we trick SQL into giving us lag distance.

Now, all of this is finally wrapped into the aggregate clause that shows the differences in distance grouped by the lag (multiples of 50) and their average differences.  The query returns two columns which we use as a scatterplot to show our vairogram.

Spatial is Not Special – Modifiable Area Unit Problem (MAUP)

I wrote a paper in 2006 about using SQL for the Modifiable Area Unit Problem (MAUP).  The example was an agricultural experiment field, and I showed how SQL could be used to calculate the descriptive statistics rather easily, and we also used SQL to develp a trend surface map to better understand the underlying trends of the data and determine the best areal units to use for grouping the data.  So, we thought we would use that example in the book.  MAUP is a conundrum that geographers have been thinking about for a long time, and quite frankly we will think about in the future.  I won’t go into what MAUP is here, for that you can buy the book!  But, one of our figures showed the data for 3 different blocking patterns and the corresponding descriptive statistics:

maup

The key is to develop descriptive statistics for each point that is located within each block.  This is easily accomplished through the CONTAINS and the GROUP BY clauses as follows:

SELECT avg(AVG_P) AS AVG_P, avg(AVG_K) AS AVG_K, var(AVG_K) AS VAR_K, var(AVG_P) AS VAR_P
FROM
(
   SELECT grid.id, avg(p) AVG_P, avg(k) AVG_K
   FROM grid, samples
   WHERE contains(grid.id,samples.id)
  GROUP BY grid.id
)

The subquery within the parenthesis computes the average Phosphorous (P) and Potassium (K) values for the soils, and groups it by the grid the points are contained in.  Based on our figure above, that means we will have 6 values for K and P.  If you wanted to leave it at that, you could put those values into a scatter plot to view the correlation, or wrap or copy the data to Excel and compute the correlation coefficient.  But, for us, we will take the 6 results from the sub query and compute the averages and variances of P and K for the overall grouped data.

The ARC/INFO NEAR Function in SQL

Lets take another break from my textbook, and focus on a classic GIS operation.  If you are like me, you learned GIS with ARC/INFO.  I started with ARC/INFO 3.2 on a DEC VAX PDP 11.  Actually, my first ESRI product was AUTOMAP II on a Burroughs mainframe with the CANDE operating system – we even had to use a SYMAP ruler to plot our points for digitizing.

Anyway, for those of you who used ARC/INFO, you probably found that the NEAR function was one of your best friends.  I used it every week to transfer attributes from points to lines, and vice-verse.  This is pretty easy to do in SQL, but I will step through it for you to make it more understandable.  Lets assume we have two vector drawings: Points and Lines.  Obviously, the Points drawing has points in it, and the Lines drawing has lines in it.  The SQL to replicate the NEAR function is:

1. SELECT min(dist), first(lines.id),points.[ID] FROM
2.     (SELECT lines.id, points.id,distance(points.[Geom (I)],lines.[Geom (I)]) AS dist
3.       FROM points, lines
4.       ORDER BY dist)
5. GROUP BY points.id

Continue reading

Stream Sinuosity with Spatial SQL

In case you are getting tired of reading examples from my book (did I mention that I have a textbook you can buy from Amazon?), I thought I would throw a post in here and there that illustrates other aspects of SQL and GIS.  A friend recently sent me a file of stream segments, and wanted to determine the sinuosity of each stream segment.  You may recall the sinuosity formula as the length of the segment / distance between the start and endpoint of a line.  This is essentially a ratio of the theoretical straight line distance between the endpoints and the actual distance along the segment.

 Lets suppose we have a vector file named Stream_Sample. Sinuosity is easily calculated with SQL and a few spatial constructs as:

SELECT Length([Geom (I)]) / Distance(StartPoint([Geom (I)]),EndPoint([Geom (I)])) AS ss
FROM [Stream_Sample]

Continue reading