When More is Less…. lessons from processing large data files

My good friend Stuart Hamilton gave me a fun conundrum to try out. He has a file of province boundaries (400 areas) and lidar derived mangrove locations (37 million points – 2.2GB in size). He wants to find the number of mangroves that are contained in each area.  He also want to know which country a mangrove location is in.  An overview of the area is here:

allstu

but, as you zoom in, you can see that there are a tremendous number of points:

stuzoom

The problem

You would think that overlaying 37 million points with 400 polygons wouldn’t be too much trouble – but, it was.  Big time trouble.  In fact, after running for days in ArcGIS, Manifold GIS, PostGRES/PostGIS, and spatial Hadoop, it simply would not complete. Continue reading

New Books – How do I do that in PostGIS, How do I do that in Manifold SQL

I have two new books out – How do I do that in PostGISand How do I do that in Manifold SQL.  

From the back cover of How do I do that in PostGIS:

For those who are unsure if SQL is a sufficient language for performing GIS tasks, this book is for you. This guide follows the topic headings from the book How do I do that in ArcGIS/Manifold, as a way to illustrate the capabilities of the PostGIS SQL engine for accomplishing classic GIS tasks. With this book as a resource, users will be able to perform many classic GIS functions using nothing but SQL.

Continue reading

A Poor Man’s Parallel Processor for GIS

In addition to SQL, I also am interested in processing large volumes of spatial data.  One of the newest rages in “big data” is Hadoop.  According to Wikipedia:

Apache Hadoop is an open-source software framework written in Java for distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware.

One way this is implemented is a programming model called MapReduce.  Don’t get too excited, it doesn’t have anything to do with maps or GIS – but, it is very clever and powerful for certain types of problems.  The concept is if you have a really large dataset, you divide and conquer that dataset in a number of steps.  For example, say we wanted to know all the people with the name “John” in the phonebook, and say we had 26 computers in a cluster – we might solve this by:

1.  Use each computer (1-26) to find all the “Johns” for the first letter in the last name (A-Z).  That way, you have effectively broken the problem into 26 smaller units.

2.  Once each computer has counted up the number of Johns, you have reduced the dataset (hence, MapReduce) to 26 variables.

3.  Now, count up the total of the 26 variables.

That is an oversimplified version of course, but it helps to illustrate what we want to do.  I understand that the University of Minnesota has created a set of functions called SpatialHadoop.  I want to test this over the summer, but for now I decided to create my own poor man’s version using PostGRES. Continue reading

Spatial is Not Special – Quadrat Analysis

In our book we illustrated the use of quadrat analysis for determining whether points were random, clustered, or distributed.  Figure 14.9 from the book showed a point sample of 2,500 points, and Table 14.4 showed the mathematical calculation for quadrat analysis.

Image

 

 

Image

The calculations look pretty daunting, don’t they?  But, in actuality, its basic arithmetic.  In this blog I am only going to illustrate how we obtained the correct variance to mean ratio using spatial SQL.  If you want to understand quadrat analysis, check out the book, or do a web search. Continue reading

Spatial is Not Special – Nearest Neighbor Index

 

It is nice to get back to the book, and start talking about Statistical Problem Solving in Geography again.  Today we are going to look at the Nearest Neighbor Index.  You can refer to chapter 14 where we illustrate the computation of the nearest neighbor index using a set of 7 points:

 nnfig

Then, for each point we determine the nearest neighbor, its distance, and ultimately the average nearest neighbor distance over all the points:

 nncalc

To develop the SQL, we will take it one step at a time.  First, we want to compute the distance from every point to every other point:

SELECT a.pt, b.pt,distance(a.[Geom (I)],b.[Geom (I)]) AS dist
FROM points AS a, points AS b
WHERE a.pt <> b.pt
ORDER BY dist

This query gives us a table of the distance from every point to every other point.  We also play that game again where we rename the table “points” as “a” and “b” so that SQL thinks we have two different tables.  We also have to put a WHERE clause in to make sure we aren’t measuring from one point to itself – because the distance will be 0. Continue reading

ARC/INFO Functions in SQL – Line Density

Dale posted a request to see Line Density recreated in SQL.  Please take a look at the ESRI help topic for line density.

I will warn you, the SQL for this is a total mess – I think I can simplify it, but I only spent about 15 minutes writing the query, and it’s finals week and I am tired :-)

So, I have the code below, but first let me outline the steps I need to follow – we assume a raster called rastersurface, and a vector layer called lines

Image

1.  Take a raster (here we call it rastersurface)

2.  For each pixel, we have to obtain a point (this is where we issue the NewPoint function)

3.  For each point, we have to buffer it (in this case, I am using the buffer function with 1000m)

4.  For each of the buffered points, I am performing a clipintersect with the lines.

5.  The clipintersect of the lines and the buffers are for each base pixel – so, we sum the area of each line, and group it by the unique centerX and centerY or the base pixel.

6.  We wrap that entire thing inside an UPDATE query and update the [Height (I)] feature for each pixel.  To update the correct pixel value, we have to join the table where the centerX and centerY of the pixel equals the centerX and centerY of the points from which the buffer and summation of the areas are computed. Continue reading

ARC/INFO Functions – Symmetrical Difference

I was recently asked by danb to illustrate the ARC/INFO function symmetrical difference. We basically want to find those areas in layer A that don’t intersect layer B, and also find those areas in layer B that don’t intersect layer A.  Its pretty easy to do, as it is created in two parts: subtracting the first layer from the second layer and then subtracting the second layer from the first layer and UNIONING them together.  I’ve added the new layers here so you can test it out.

SELECT * FROM
(
SELECT a.id AS aid, b.id AS bid, clipsubtract(a.id,b.id) AS g
FROM a, b

UNION ALL

SELECT a.id AS aid,b.id AS bid, clipsubtract(b.id,a.id) AS g
FROM a, b
)
RIGHT JOIN B ON bid = b.id
RIGHT JOIN A ON aid = a.id

ARC/INFO Functions – UNION

We are going to conclude our overlay posts with the ARC/INFO UNION command.  This should be a lot easier than it actually is.  I think the problem is that Manifold may have a bug in the ClipSubract command.  What I discovered was that the ClipSubract clause only seems to work on the first geometry in a layer.  With that error, it would appear that we are dead in the water.  However, if Manifold’s ClipSubract only works on the first geometry, well, we can manipulate the query to pass ClipSubract a single geometry.  To do that, we use the UnionAll command to take all the geometries in one layer and union them together into a single geometry.

So, the ARC/INFO UNION command in SQL needs three things:

1.  The intersection of the two layers (to get the intersection of the two layers)
2.  The clipsubtract of layer 1 and layer 2 (to get the part of layer 1 not intersecting layer 2)
3.  The clipsubract of layer 2 and layer 1 (to get the part of layer 2 not intersecting layer 1)

But, we are still not out of the woods yet.  We need to attach the attributes to the layers. To do that, we have to fake out Manifold a little to think that we actually have a unique ID for the the subtract layer.  The ClipSubract clause correctly clips the geometries, but it will not return the ID for the subtract layer since we UNIONEDed the entire thing into a single geometry.  So, to fake Manifold out, we return a value of 0 as the id field for the layer.

SELECT * FROM
(
  SELECT ClipIntersect(circle.id,rectangles.id) AS g, circle.id AS cid, rectangles.id AS rid
  FROM circle, rectangles
UNION ALL
 SELECT Clipsubtract(circle.[Geom (I)],(SELECT UnionAll(id) FROM rectangles)) AS g, circle.id AS cid, 0 AS rid
 FROM circle
UNION ALL
 SELECT ClipSubtract(rectangles.[Geom (I)],(SELECT unionall(id) FROM circle)) AS g, 0 AS cid, rectangles.id AS rid
 FROM rectangles
)
LEFT JOIN [circle] ON circle.id = cid
LEFT JOIN [rectangles] ON rectangles.id = rid