A New Online Course in Statistical Problem Solving in Geography

I am so happy to see how many universities are using my textbook An Introduction to Statistical Problem Solving in Geography. Not only are more schools adopting the book, but the feedback from faculty has been tremendous:

Just yesterday, a professor wrote:

I use your textbook for our required “Statistical Research Methods in Geography” course at [Name of College Withheld] and am so glad that you are continuing to update and support it as I find it the best book on the subject out there.

As you know, I also put together a corresponding workbook that includes laboratory exercises for each of the book chapters. A number of schools are using the workbook as well, and I provide answer keys to the professors. This has worked out well, allowing students to obtain a textbook and a laboratory workbook for under $70.

To support the faculty using my book, I provide them with my lecture notes. That way, they can get their classes spun up much more quickly. I just sent out an updated version of my lecture notes this week.

Workbook for Statistical Problem Solving in Geography

My workbook for Statistical Problem Solving in Geography (ISBN-10: 1502982293, ISBN-13: 978-1502982292) is now available for purchase through:

CreateSpace, Lulu, and Amazon.

The workbook contains the full semester of exercises I use in my course GEOG 204 – Statistical Problem Solving in Geography. The workbook follows the sequencing presented in An Introduction to Statistical Problem Solving in Geography, published by Waveland Press. And best of all, the workbook is only $19.99 – most workbooks for college courses are over $50!

The data for the exercises is located HERE.  All of the data is in Microsoft Excel format with tabs for each chapter and exercise.

If you purchase this book, do not pay for the more expensive shipping – regular mail, which takes about a week after the book is printed, costs around $3.00. To help you get started right away, I have provided the first three chapters of the workbook HERE for free. Therefore, even if you wait until the first day of class to purchase the workbook, you will be able to complete the first three exercises before your book arrives.

Spatial is Not Special – Quadrat Analysis

In our book we illustrated the use of quadrat analysis for determining whether points were random, clustered, or dispersed. Figure 14.9 from the book showed a point sample of 2,500 points, and Table 14.4 showed the mathematical calculation for quadrat analysis.





The calculations look pretty daunting, don't they? But in actuality, it's basic arithmetic. In this post I am only going to illustrate how we obtained the correct variance-to-mean ratio using spatial SQL. If you want to understand quadrat analysis, check out the book, or do a web search.
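If you want to sanity-check the arithmetic outside of SQL, here is a rough Python sketch of the variance-to-mean ratio computation. The quadrat counts below are invented for illustration – they are not the values from Table 14.4.

```python
# Variance-to-mean ratio (VMR) for quadrat analysis.
# The quadrat counts are made up for illustration; they are NOT the
# values from Table 14.4 in the book.
def variance_to_mean_ratio(counts):
    n = len(counts)
    mean = sum(counts) / n
    # sample variance (n - 1 denominator)
    variance = sum((c - mean) ** 2 for c in counts) / (n - 1)
    return variance / mean

counts = [3, 1, 0, 4, 2, 2, 5, 0, 1, 2]  # points counted in each quadrat
vmr = variance_to_mean_ratio(counts)
# A VMR near 1 suggests a random pattern, well above 1 clustered,
# well below 1 dispersed.
```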

Spatial is Not Special – Nearest Neighbor Index


It is nice to get back to the book, and start talking about Statistical Problem Solving in Geography again.  Today we are going to look at the Nearest Neighbor Index.  You can refer to chapter 14 where we illustrate the computation of the nearest neighbor index using a set of 7 points:


Then, for each point we determine the nearest neighbor, its distance, and ultimately the average nearest neighbor distance over all the points:


To develop the SQL, we will take it one step at a time.  First, we want to compute the distance from every point to every other point:

SELECT a.pt, b.pt, distance(a.[Geom (I)], b.[Geom (I)]) AS dist
FROM points AS a, points AS b
WHERE a.pt <> b.pt

This query gives us a table of the distance from every point to every other point. We also play that game again where we rename the table "points" as "a" and "b" so that SQL thinks we have two different tables. We also have to put a WHERE clause in to make sure we aren't measuring from a point to itself – that distance would be 0.
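If you want to see the same logic outside the database, here is a small Python sketch of the all-pairs-then-minimum idea. The coordinates are hypothetical planar values, not the 7 points from chapter 14.

```python
import math

# Hypothetical planar coordinates for a small point set
# (NOT the 7 points from chapter 14 -- just an illustration).
points = {1: (0, 0), 2: (3, 0), 3: (0, 4), 4: (5, 5)}

def nearest_neighbor_distances(pts):
    """For each point, the distance to its nearest neighbor --
    the minimum over the all-pairs distances the SQL produces."""
    nearest = {}
    for a, (ax, ay) in pts.items():
        nearest[a] = min(math.hypot(ax - bx, ay - by)
                         for b, (bx, by) in pts.items() if b != a)
    return nearest

nn = nearest_neighbor_distances(points)
avg_nn = sum(nn.values()) / len(nn)  # average nearest neighbor distance
```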

Spatial is Not Special – Spatial Interaction

In my book Introduction to Statistical Problem Solving in Geography, we talk about relationships among geographic objects. These relationships are often captured in a spatial adjacency matrix. In the book we refer to the age-old question of who is my neighbor to better understand what signifies a neighbor. Well, neighbors could be next to each other (adjacent), they could be within a specific distance (near), or they could be contained within one another. Also, once we have a neighbor we need to determine what kind of neighbor it is – is it a strong neighbor or a weak neighbor? That is where a spatial interaction model comes in. The next three posts are going to look at forming a spatial interaction model for distance by examining distance matrices. If you are going to be a quantitative geographer, you are going to have to learn about distance matrices. I am going to use a Manifold .map file located here.

Let's assume that we have a table of 6 cities called Cities, with a geometry column called 'g'.


The basic distance computation for a set of cities is:

SELECT int(DistanceEarth([Cities].g, [Cities2].g, "mi")) AS [Miles],
[Cities].name, [Cities2].name
FROM [Cities], [Cities] AS [Cities2]

The DistanceEarth function is a spatial function that computes the ellipsoidal distance between two points; in our example we are returning the distance in miles. Because we want to compute the distance from each city to every other city, we select the data from the same table twice, but we call one of the tables [Cities2] so that SQL thinks we are working with a different table. The result of this query is a table that has the "from" city and the "to" city along with the distance in miles.
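For readers without Manifold, here is a rough Python equivalent of that cross join. I'm using a spherical haversine distance as a stand-in for the ellipsoidal DistanceEarth function, so the numbers will differ slightly, and the city coordinates are made up:

```python
import math

# Spherical (haversine) distance as a stand-in for Manifold's
# ellipsoidal DistanceEarth; results will differ slightly.
def haversine_miles(lat1, lon1, lat2, lon2):
    r = 3958.8  # mean Earth radius in miles
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

# Hypothetical city coordinates (lat, lon) -- not the dataset from the post.
cities = {"A": (40.71, -74.01), "B": (41.88, -87.63), "C": (34.05, -118.24)}

# The same cross join the SQL performs: every city paired with every city.
pairs = [(f, t, int(haversine_miles(*cities[f], *cities[t])))
         for f in cities for t in cities]
```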


If you are paying attention, you would say “wait, that isn’t a matrix”.  That’s right. To create a matrix, we have to turn the resulting table into a PIVOT table to present the data in a matrix form (we did this in an earlier post):

TRANSFORM Min(int(DistanceEarth([Cities].g, [Cities2].g, "mi")))
SELECT [Cities].name
FROM [Cities], [Cities] AS [Cities2]
GROUP BY [Cities].[Name]
PIVOT [Cities2].[Name]

In this case, we grab the minimum distance for each pair of cities and GROUP it by the city name. The PIVOT clause then pivots the table so that we get an n x n symmetric matrix of the distances. The result is a table that looks like this:
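The pivot itself is just rearranging rows into rows-and-columns. Here is a plain Python sketch of the same idea, using a hypothetical from/to/distance list (the values are invented, not the query output):

```python
# Pivoting a (from, to, distance) list into an n x n matrix,
# mirroring what TRANSFORM ... PIVOT does. The pair list is
# hypothetical, not the actual query output from the post.
pairs = [
    ("A", "A", 0), ("A", "B", 713), ("A", "C", 2451),
    ("B", "A", 713), ("B", "B", 0), ("B", "C", 1745),
    ("C", "A", 2451), ("C", "B", 1745), ("C", "C", 0),
]

matrix = {}
for frm, to, dist in pairs:
    matrix.setdefault(frm, {})[to] = dist  # row = from-city, column = to-city
```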


In our next post, we will look at an inverse and inverse distance squared matrix.  Our final post will create an adjacency matrix both for adjacent geographic objects, and those within a specified lag distance.

Spatial is Not Special – Variograms with SQL

Every geographer knows Tobler's First Law: everything is related to everything else, but near things are more related than distant things. From this, we often begin our discussions of spatial autocorrelation. I'm going to eventually show you how to do join counts and Moran's I in SQL, but today I want to show how we can create a variogram using SQL. But before you get too excited, I want to present a different take on the variogram, and that is the perspective of the geographer and not the geostatistician.

Here is the problem: the variogram is a really cool tool for assessing the spatial relationship that features have with one another. But the variogram is usually reserved for higher-level geostatistics, and most geographers run away screaming from it. What if instead we simply used the variogram as a descriptive measure of how a geographic feature correlates over space?

In my book Introduction to Statistical Problem Solving in Geography, we introduce the variogram in this simplified descriptive way.  To do that, we illustrate the idea with the last spring frost dates in the Southeastern United States:


We show the students how a simple variogram illustrates how the last spring frost dates are spatially correlated from a descriptive standpoint.  In our dataset, one can see that the data is spatially correlated up to around 400 miles, but after that, it becomes random.


Surprisingly (although if you have been reading this blog, not surprisingly), it is relatively easy to generate a variogram in SQL. The code is (assuming we have a set of points in a table called 'p' with an attribute field [lsf]):

SELECT dist, avg(abs(diff))/2 AS semivariance
FROM (SELECT Floor(DistanceEarth(p.id, p3.id, "mi")/50) AS dist,
             (p.[lsf] - p3.[lsf]) AS diff
      FROM p, p AS p3)
GROUP BY dist

Once again, we will dive into the sub-query. We want to compare every point with every other point, so we need two tables. The FROM p, p AS p3 portion treats the table [p] like it is another table called [p3] – sneaky. For each point, we are computing the distance, in miles, between itself and every other point in the file. In addition, we are taking the difference between the last spring frost dates and calling that result diff.

As for the distances, we are determining the distance between each pair of points, dividing it by 50, and applying the Floor function. This bins the distances into 50-mile intervals (this is how we trick SQL into giving us a lag distance).

Now, all of this is finally wrapped into the aggregate query, which averages the differences grouped by the lag (multiples of 50 miles). The query returns two columns, which we plot as a scatterplot to show our variogram.
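To make the lag trick concrete, here is the same binning arithmetic in plain Python. The coordinates and last spring frost values are invented, and planar distance stands in for DistanceEarth:

```python
import math
from collections import defaultdict

# A descriptive variogram, following the SQL above: bin every pair of
# points into 50-mile lags, then take avg(abs(diff))/2 within each lag.
# Coordinates (in miles) and lsf values are invented for illustration,
# and planar distance stands in for DistanceEarth.
points = [(0, 0, 100), (30, 0, 105), (120, 0, 118), (200, 0, 90)]

def semivariogram(pts, lag_width=50):
    bins = defaultdict(list)
    for i, (x1, y1, v1) in enumerate(pts):
        for j, (x2, y2, v2) in enumerate(pts):
            if i != j:
                lag = math.floor(math.hypot(x1 - x2, y1 - y2) / lag_width)
                bins[lag].append(abs(v1 - v2))
    # average absolute difference, halved, per lag
    return {lag: sum(d) / len(d) / 2 for lag, d in bins.items()}

sv = semivariogram(points)  # {lag: semivariance}, ready to scatterplot
```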

Spatial is Not Special – Modifiable Areal Unit Problem (MAUP)

I wrote a paper in 2006 about using SQL for the Modifiable Areal Unit Problem (MAUP). The example was an agricultural experiment field, and I showed how SQL could be used to calculate the descriptive statistics rather easily; we also used SQL to develop a trend surface map to better understand the underlying trends in the data and determine the best areal units to use for grouping it. So, we thought we would use that example in the book. MAUP is a conundrum that geographers have been thinking about for a long time, and quite frankly will keep thinking about in the future. I won't go into what MAUP is here – for that you can buy the book! But one of our figures showed the data for 3 different blocking patterns and the corresponding descriptive statistics:


The key is to develop descriptive statistics for each point that is located within each block.  This is easily accomplished through the CONTAINS and the GROUP BY clauses as follows:

SELECT grid.id, avg(p) AS AVG_P, avg(k) AS AVG_K
FROM grid, samples
WHERE contains(grid.id, samples.id)
GROUP BY grid.id

The query computes the average phosphorus (P) and potassium (K) values for the soil samples, and groups them by the grid cell the points are contained in. Based on our figure above, that means we will have 6 values for K and P. If you wanted to leave it at that, you could put those values into a scatter plot to view the correlation, or copy the data to Excel and compute the correlation coefficient. But for us, we will take the 6 results from this query and compute the averages and variances of P and K for the overall grouped data.
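Outside the database, the same CONTAINS + GROUP BY idea looks like this in Python. The grid cells and soil samples are hypothetical, and a simple bounding-box test stands in for the spatial CONTAINS predicate:

```python
from collections import defaultdict

# Averaging sample values per containing grid cell, mirroring the
# CONTAINS + GROUP BY query. Cells and samples are hypothetical, and a
# bounding-box test stands in for the spatial CONTAINS predicate.
grid = {1: (0, 0, 10, 10), 2: (10, 0, 20, 10)}   # id: (xmin, ymin, xmax, ymax)
samples = [(2, 3, 40, 200), (8, 5, 60, 220),     # (x, y, P, K)
           (12, 4, 30, 180), (18, 9, 50, 160)]

totals = defaultdict(lambda: [0.0, 0.0, 0])      # cell id -> [sum P, sum K, n]
for x, y, p, k in samples:
    for cell, (xmin, ymin, xmax, ymax) in grid.items():
        if xmin <= x < xmax and ymin <= y < ymax:
            totals[cell][0] += p
            totals[cell][1] += k
            totals[cell][2] += 1

averages = {cell: (sp / n, sk / n) for cell, (sp, sk, n) in totals.items()}
```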

Spatial is Not Special – Weighted Standard Distance

This post is part of a series entitled Spatial is not Special, where I illustrate how spatial constructs in SQL provide a rich toolset for geographers doing spatial analysis. To illustrate these concepts, I will be showing examples from my book Statistical Problem Solving in Geography. Even though PostgreSQL, SQL Server, MySQL, SpatiaLite, and Oracle all support spatial SQL, these examples will be presented using Manifold GIS. The example dataset is a Manifold 8.0 .map file and can be found here.

As we continue to move through my book Statistical Problem Solving in Geography, we extend the computation of the standard distance to that of the weighted standard distance. Following our example from the book, we have an attribute column called f, which represents the frequency.  The computation is presented as:


The SQL for computing the weighted standard distance is:

SELECT sqrt(
       SUM(f*x^2)/(SELECT sum(f) FROM Points)
     - (SUM(x*f)/(SELECT sum(f) FROM Points))^2
     + SUM(f*y^2)/(SELECT sum(f) FROM Points)
     - (SUM(y*f)/(SELECT sum(f) FROM Points))^2)
FROM Points

This is almost identical to the standard distance computation, but instead of dividing by the number of features, we divide by the total frequency. We also subtract the weighted mean center coordinates instead of the simple mean center. And finally, we multiply each coordinate by the frequency. Once again, the proper placement of parentheses allows us to wrap the calculation inside the square root function. I've indented the query to help show which parentheses correspond to one another. This is what is so nice about having a worked example – putting this into SQL took me a little time because of the parenthesis issue.
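If you want to check the SQL against something simpler, here is the same arithmetic in plain Python. The coordinates and frequencies are made up for illustration, not the book's example data:

```python
import math

# Weighted standard distance, term for term the same arithmetic as the
# SQL above. Coordinates and frequencies are invented for illustration.
points = [(0, 0, 2), (10, 0, 1), (0, 10, 1)]   # (x, y, f)

def weighted_standard_distance(pts):
    total_f = sum(f for _, _, f in pts)
    mean_x = sum(x * f for x, _, f in pts) / total_f   # weighted mean center
    mean_y = sum(y * f for _, y, f in pts) / total_f
    # sqrt( sum(f*x^2)/sum(f) - mean_x^2 + sum(f*y^2)/sum(f) - mean_y^2 )
    return math.sqrt(sum(f * x * x for x, _, f in pts) / total_f - mean_x ** 2
                     + sum(f * y * y for _, y, f in pts) / total_f - mean_y ** 2)

wsd = weighted_standard_distance(points)
```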

So, this gets us the weighted standard distance. Like the previous example of the weighted mean center, we can wrap the query in the buffer function to create a geometry of the weighted standard distance centered on the weighted mean center.

SELECT buffer(
       newpoint(SUM(x*f)/(SELECT sum(f) FROM Points),
                SUM(y*f)/(SELECT sum(f) FROM Points)),
       sqrt(SUM(f*x^2)/(SELECT sum(f) FROM Points)
          - (SUM(x*f)/(SELECT sum(f) FROM Points))^2
          + SUM(f*y^2)/(SELECT sum(f) FROM Points)
          - (SUM(y*f)/(SELECT sum(f) FROM Points))^2))
FROM Points

Remember from earlier posts that the buffer function requires a geometry and a distance [buffer([geometry], distance)]. So, for the [geometry] argument we are passing our weighted mean center, and for the distance we are passing our weighted standard distance. Again, the hardest thing about this is getting the parentheses correct! Always work with a small example so you can make sure your answers are correct – this will scale just fine once you have it working.

Hopefully you can see how easy it is to take the result from the standard distance query, along with the mean center geometry, and just insert them into the buffer function.

Statistical Problem Solving in Geography – Table of Contents Available Online

Some people have been asking to see a Table of Contents for my book, Statistical Problem Solving in Geography. I've uploaded the file, and you can see it here:

Statistical Problem Solving in Geography Front Matter

In addition to the Table of Contents, I’ve included the Preface in order to give you an idea as to our approach in writing the book.  My favorite quote is:

What began as a fresh coat of paint on what we considered
a good book, ended with a complete rewrite over
a 3-year period. 

While I hoped the book would have been finished earlier, I think the results were well worth the wait.

Spatial is Not Special – Standard Distance

This post is part of a series entitled Spatial is not Special, where I illustrate how spatial constructs in SQL provide a rich toolset for geographers doing spatial analysis. To illustrate these concepts, I will be showing examples from my book Statistical Problem Solving in Geography. Even though PostgreSQL, SQL Server, MySQL, SpatiaLite, and Oracle all support spatial SQL, these examples will be presented using Manifold GIS. The example dataset is a Manifold 8.0 .map file and can be found here.

As we continue to move through my book Statistical Problem Solving in Geography, we come to another important descriptive spatial statistic called the standard distance. Following our example from the book, we see the computation presented as:


Continue reading