Big data geo-analytics with SQL

I’m getting ready to create a course in big data analytics with GIS.  I have lots of ideas as to what to do, but one thing I know is that I will be using spatial databases and SQL.  I’ll also be using Manifold Future.

ESRI has recently introduced their ArcGIS GeoAnalytics Server, which will introduce many GIS professionals to big data analytics with GIS.  They have some interesting scenarios and example data using NYC taxi cabs.  I think these will be really good case studies.

This video (just shy of 20 minutes) uses SQL and Manifold to try to address these big data problems.

Keep an eye on my blog as I will be rolling out new ideas as I prepare my course for the Spring.

If you like the video and want to learn more about how to improve your spatial database skills, check out my videos at www.gisadvisor.com.

k-nearest neighbor with SQL in Radian Studio

I wanted to give you another look at some features that Radian Studio will offer. I’ve shown how we can use SQL to replicate the ARC/INFO NEAR function, and how to perform Nearest Neighbor Analysis. But another useful tool is the ability to identify k-nearest neighbors. That is, rather than just identifying the nearest neighbor, you might want to identify the two, three, or k nearest neighbors.

Radian will allow that functionality by using the COLLECT aggregate clause. The COLLECT aggregate collects values from a subgroup into a table, returning a table with one or more fields.

It is like a SELECT which runs on a group. COLLECT takes a table and returns a table without requiring us to write a FROM section as we would with a SELECT. This is stuff that the real grown-up databases like Oracle use, and Manifold is going to give it to us as part of Radian Studio.


SELECT park_no1,
       SPLIT(COLLECT park_no2, dist
             ORDER BY dist ASC FETCH 3
       )
FROM
(
  SELECT a.name AS park_no1, b.name AS park_no2,
         GeomDistance(a.[geom (i)], b.[geom (i)], 0) AS dist
  FROM [parks Table] AS a, [parks Table] AS b
  WHERE a.name <> b.name
)
GROUP BY park_no1
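For readers who want to check the logic outside of Radian, here is a minimal Python sketch of the same idea: compute the distance for every pair of distinct parks, then keep the three closest rows per park. The park names and coordinates are made up for illustration, and plain Euclidean distance stands in for GeomDistance.

```python
import math
from collections import defaultdict

# Hypothetical park centroids (name -> (x, y)); not data from the post.
parks = {
    "A": (0.0, 0.0),
    "B": (1.0, 0.0),
    "C": (0.0, 2.0),
    "D": (3.0, 3.0),
    "E": (5.0, 0.0),
}


def k_nearest(points, k):
    """Return {name: [(neighbor, dist), ...]} holding the k closest neighbors."""
    pairs = defaultdict(list)
    # Inner query: one row for every ordered pair of distinct parks.
    for name_a, (xa, ya) in points.items():
        for name_b, (xb, yb) in points.items():
            if name_a != name_b:
                pairs[name_a].append((name_b, math.hypot(xa - xb, ya - yb)))
    # Outer query: within each group, ORDER BY dist ASC and keep the first k.
    return {name: sorted(rows, key=lambda r: r[1])[:k]
            for name, rows in pairs.items()}


neighbors = k_nearest(parks, 3)
for park, rows in neighbors.items():
    print(park, [(n, round(d, 2)) for n, d in rows])
```

Like the SQL, this compares every park with every other park, so it is only practical for small tables unless a spatial index narrows the candidates first.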


Doing a GROUP BY in ArcGIS

As most of you know, I am a big fan of spatial SQL.  It is my go-to tool whenever working with GIS.  But I have seen too many people using ArcGIS get tripped up trying to summarize the results of a spatial operation because ArcGIS does not support SQL.  So today, to spare my ArcGIS friends the trouble of writing large “for” loops in ArcPy that take hours to run just to populate data, I want to show you two lines of ArcPy that very quickly replicate the GROUP BY function in SQL:

Using my favorite GIS data set in Tompkins County, NY, assume we have two layers, parcels2007 and watersheds:

[Image: parwat]
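The ArcPy lines themselves are not reproduced in this excerpt. As a point of reference, here is a minimal sketch of what the SQL GROUP BY would compute once parcels have been overlaid on watersheds, using Python's built-in sqlite3 module. The table name overlay, its columns, and every value are made up for illustration; they are not the Tompkins County data.

```python
import sqlite3

# Hypothetical rows standing in for an overlay of parcels on watersheds:
# (parcel id, watershed name, parcel area). All values are invented.
rows = [
    (1, "W1", 2.5),
    (2, "W1", 1.5),
    (3, "W2", 4.0),
    (4, "W3", 3.0),
    (5, "W2", 1.0),
]

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE overlay (parcel_id INTEGER, watershed TEXT, area REAL)")
con.executemany("INSERT INTO overlay VALUES (?, ?, ?)", rows)

# One output row per watershed: parcel count and total parcel area.
summary = con.execute(
    "SELECT watershed, COUNT(*) AS n, SUM(area) AS total_area "
    "FROM overlay GROUP BY watershed ORDER BY watershed"
).fetchall()

for watershed, n, total_area in summary:
    print(watershed, n, total_area)
```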


When More is Less…. lessons from processing large data files

My good friend Stuart Hamilton gave me a fun conundrum to try out. He has a file of province boundaries (400 areas) and lidar-derived mangrove locations (37 million points – 2.2GB in size). He wants to find the number of mangroves that are contained in each area.  He also wants to know which country a mangrove location is in.  An overview of the area is here:

[Image: overview of the study area]

But as you zoom in, you can see that there are a tremendous number of points:

[Image: zoomed-in view of the mangrove points]

The problem

You would think that overlaying 37 million points with 400 polygons wouldn’t be too much trouble, but it was.  Big-time trouble.  In fact, after running for days in ArcGIS, Manifold GIS, PostgreSQL/PostGIS, and Spatial Hadoop, it simply would not complete.

New Books – How do I do that in PostGIS, How do I do that in Manifold SQL

I have two new books out – How do I do that in PostGIS and How do I do that in Manifold SQL.

From the back cover of How do I do that in PostGIS:

For those who are unsure if SQL is a sufficient language for performing GIS tasks, this book is for you. This guide follows the topic headings from the book How do I do that in ArcGIS/Manifold, as a way to illustrate the capabilities of the PostGIS SQL engine for accomplishing classic GIS tasks. With this book as a resource, users will be able to perform many classic GIS functions using nothing but SQL.


Another Radian Test – Finding the distance between lines and areas

Following up on my previous post with ArcGIS and the Near Table, I created an SQL query in Manifold 8 to do both the near distances and group them by the number of points within specific distances (I grouped them every 50 km.).  The entire process took 47 seconds (or about 9 times faster than ArcGIS 10.1).

But, to keep things on the same playing field, I just computed the NEAR part of the query, and it ran in 40 seconds.  So, Manifold 8 was way faster than ArcGIS 10.1, but 3x slower than ArcGIS Pro.

I then wrote the following query in the Radian engine:

SELECT count(*) AS CNT, 
       first(floor(GeomDistance([L Table].[Geom (I)], 
       [P Table].[Geom (I)], 1)/50000)*50000+50000) AS DistZone, 
       [UNIQUE_ID] 
INTO bobo 
FROM [P Table] 
RIGHT JOIN [L Table] 
ON GeomWithin([L Table].[Geom (I)],[P Table].[Geom (I)], 500000,1) 
GROUP BY [UNIQUE_ID] 
THREADS 4

This query took 30 seconds (about 36% less time than the 47 seconds Manifold 8 needed).

Once again, to level the playing field, I created a query to just run the NEAR aspect:

SELECT GeomDistance([L Table].[Geom (I)], [P Table].[Geom (I)], 1) AS DistZone, [UNIQUE_ID]
INTO bobo2
FROM [P Table]
RIGHT JOIN [L Table] ON 
GeomWithin([L Table].[Geom (I)],[P Table].[Geom (I)], 500000,1)
THREADS 4
BATCH 64

This ran in 20 seconds.  In this case, ArcGIS Pro ran slightly faster than Manifold 9, but remember, I am still working with an alpha/beta release of Radian, and not all of the optimizations have been turned on. I can’t wait to see what the next beta will reveal.

Again, the simplicity of SQL in conjunction with the parallel nature of the Radian engine provides some very interesting opportunities for working with complex processes and large amounts of data.
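The DistZone expression in the first Radian query above is plain arithmetic: floor the distance to a multiple of 50,000 and add 50,000, which labels each distance with the upper bound of its 50 km zone. A minimal Python sketch of that binning follows, assuming distances are in meters; the sample distances are made up.

```python
import math
from collections import Counter


def dist_zone(distance_m, width_m=50000):
    """Upper bound of the zone a distance falls in: floor(d / w) * w + w."""
    return math.floor(distance_m / width_m) * width_m + width_m


# Hypothetical near distances in meters (not results from the post).
distances = [0, 12000, 49999.9, 50000, 120000, 499000]

# Count how many distances land in each 50 km zone.
zone_counts = Counter(dist_zone(d) for d in distances)
for zone in sorted(zone_counts):
    print(zone, zone_counts[zone])
```

Note that a distance of exactly 50,000 falls into the 100,000 zone, because the lower edge of each zone is inclusive.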

Spatial is Not Special – Quadrat Analysis

In our book we illustrated the use of quadrat analysis for determining whether points were random, clustered, or distributed.  Figure 14.9 from the book showed a point sample of 2,500 points, and Table 14.4 showed the mathematical calculation for quadrat analysis.

[Image: Figure 14.9]

[Image: Table 14.4]

The calculations look pretty daunting, don’t they?  But, in actuality, it’s basic arithmetic.  In this blog I am only going to illustrate how we obtained the correct variance to mean ratio using spatial SQL.  If you want to understand quadrat analysis, check out the book, or do a web search.
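To show how little arithmetic is involved, here is a minimal Python sketch of a variance to mean ratio: count the points in each quadrat, then divide the variance of those counts by their mean. The grid layout, cell size, and points are made up, and the sketch assumes the sample variance (n - 1 in the denominator); the book's table may define the variance differently.

```python
import math
import statistics
from collections import Counter


def variance_mean_ratio(points, cell, cols, rows):
    """Variance to mean ratio of point counts over a cols x rows quadrat grid."""
    # Assign each point to the quadrat (column, row) that contains it.
    counts = Counter((math.floor(x / cell), math.floor(y / cell))
                     for x, y in points)
    # Include empty quadrats; points outside the grid are ignored.
    per_quadrat = [counts.get((c, r), 0)
                   for c in range(cols) for r in range(rows)]
    mean = statistics.mean(per_quadrat)
    var = statistics.variance(per_quadrat)  # assumption: sample variance
    return var / mean


# Hypothetical 2 x 2 grid of 10-unit quadrats.
uniform = [(5, 5), (15, 5), (5, 15), (15, 15)]   # one point per quadrat
clustered = [(1, 1), (2, 2), (3, 3), (4, 4)]     # all in one quadrat

vmr_uniform = variance_mean_ratio(uniform, 10, 2, 2)
vmr_clustered = variance_mean_ratio(clustered, 10, 2, 2)
print(vmr_uniform, vmr_clustered)
```

A ratio well below 1 suggests an evenly spaced pattern and a ratio well above 1 suggests clustering.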

Spatial is Not Special – Nearest Neighbor Index

 

It is nice to get back to the book, and start talking about Statistical Problem Solving in Geography again.  Today we are going to look at the Nearest Neighbor Index.  You can refer to chapter 14 where we illustrate the computation of the nearest neighbor index using a set of 7 points:

[Image: the set of 7 points]

Then, for each point we determine the nearest neighbor, its distance, and ultimately the average nearest neighbor distance over all the points:

[Image: nearest neighbor distance calculations]

To develop the SQL, we will take it one step at a time.  First, we want to compute the distance from every point to every other point:

SELECT a.pt, b.pt, distance(a.[Geom (I)], b.[Geom (I)]) AS dist
FROM points AS a, points AS b
WHERE a.pt <> b.pt
ORDER BY dist

This query gives us a table of the distance from every point to every other point.  We also play that game again where we rename the table “points” as “a” and “b” so that SQL thinks we have two different tables.  We also have to put a WHERE clause in to make sure we aren’t measuring from one point to itself – because the distance will be 0.
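The remaining steps (the minimum distance per point, then the average of those minimums) can be sketched in a few lines of Python. The three points and the study area size below are made up, not the 7 points from the book, and the index uses the standard expected distance for a random pattern, 0.5 / sqrt(n / area).

```python
import math


def mean_nearest_neighbor_distance(points):
    """Average, over all points, of the distance to the closest other point."""
    nearest = []
    for i, (xa, ya) in enumerate(points):
        # Distances from point i to every other point (skip i itself).
        dists = [math.hypot(xa - xb, ya - yb)
                 for j, (xb, yb) in enumerate(points) if i != j]
        nearest.append(min(dists))
    return sum(nearest) / len(nearest)


def nearest_neighbor_index(points, area):
    """Observed mean nearest neighbor distance over the random expectation."""
    observed = mean_nearest_neighbor_distance(points)
    expected = 0.5 / math.sqrt(len(points) / area)
    return observed / expected


# Hypothetical points and study area.
pts = [(0, 0), (3, 0), (3, 4)]
observed_mean = mean_nearest_neighbor_distance(pts)
nni = nearest_neighbor_index(pts, area=36)
print(round(observed_mean, 3), round(nni, 3))
```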

ARC/INFO Functions – Symmetrical Difference

I was recently asked by danb to illustrate the ARC/INFO function symmetrical difference. We basically want to find those areas in layer A that don’t intersect layer B, and also find those areas in layer B that don’t intersect layer A.  It’s pretty easy to do, as it is created in two parts: subtracting the first layer from the second layer, then subtracting the second layer from the first layer, and UNIONing the results together.  I’ve added the new layers here so you can test it out.

SELECT * FROM
(
SELECT a.id AS aid, b.id AS bid, clipsubtract(a.id,b.id) AS g
FROM a, b

UNION ALL

SELECT a.id AS aid,b.id AS bid, clipsubtract(b.id,a.id) AS g
FROM a, b
)
RIGHT JOIN B ON bid = b.id
RIGHT JOIN A ON aid = a.id
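The two-part construction is easy to verify on a toy example. In the Python sketch below, each area is represented as a set of unit grid cells (a stand-in for real polygon geometry, with made-up rectangles), so subtracting one layer from the other is just set difference and the final step is a set union.

```python
def rect_cells(x0, y0, x1, y1):
    """Unit grid cells covered by the rectangle [x0, x1) x [y0, y1)."""
    return {(x, y) for x in range(x0, x1) for y in range(y0, y1)}


# Hypothetical layers: two overlapping 4 x 4 squares.
a = rect_cells(0, 0, 4, 4)
b = rect_cells(2, 2, 6, 6)

a_minus_b = a - b   # part of A that does not intersect B
b_minus_a = b - a   # part of B that does not intersect A

# Symmetrical difference: the union of the two subtractions.
sym_diff = a_minus_b | b_minus_a

print(len(a_minus_b), len(b_minus_a), len(sym_diff))
```

The result matches Python's built-in symmetric difference operator (a ^ b), which is a handy cross-check on the two-part approach.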