Big Data Geo-Analytics with Postgres workshop Evaluation

I had great fun at the Maryland State GIS conference (@tugis), where over 20 people attended my Big Data Geo-Analytics with Postgres workshop. And, like other workshops I’ve taught at Tugis (see here, here, and here), the evaluations were really strong. I should remind you that this was not your run-of-the-mill workshop. This was really advanced stuff: we discussed indexing, parallel processing, multi-processing, and the central limit theorem, and we processed gigabytes of data. I warned people beforehand that this was not an introductory workshop, and people responded perfectly. Everyone in attendance was prepared for the material, and that is what made the workshop go so well. The full workshop results are here. As for highlights, here are the main takeaways:
I love it when professionals taking my workshop feel as though it has value to their career. The reality is, why take a workshop that won’t help you in your career? I’m so happy that people see this as valuable to their career.
This response was great. Usually, I get about 30% of the people to give the workshop a 10/10 (I get other 9s and 8s, of course). But in this case, 50% of the people gave this a 10/10. That is really huge. So, I know that I am teaching this content in the right way. When asked what they liked best, some of my favorite positive quotes are:
  • Technical and Detailed. Great teacher explanation. Really good!
  • Practical advice, in-depth enough to really learn something useful (most one-day workshops do not provide as much useful info and advice as this one).
  • The optimization of the database and processing (parallel processing in particular).
  • Learning about Postgres and the ability to run sql queries rather than run step by step in ArcGIS
  • Using Postgres to utilize data organization and data manipulation was great insight. It showed me postgres is a great alternative to SQL server or Oracle
  • Art knows his material and keeps the class engaged. Lots of new information.
  • The discussions on multiprocessing, indexes, and using statistical estimation were most useful to me.
When asked to provide ways to make the workshop better, these were some of my favorite quotes:
  • Nothing – good job. Thanks
  • Slower pace of the lessons. It was like drinking water out of a firehose.
  • Make it two days, or a week long, or a full semester!
  • I thought it was pretty good as is, I can’t think of anything off hand.
  • Have the materials available ahead of time for review. Would be useful to go deeper into how to use this at work and what we need to get started.
So there you have it. Another successful workshop teaching GIS professionals about big data analytics. If you want to learn more about free and open source GIS, whether it’s QGIS, Postgres/PostGIS, GDAL, Geoserver, or Python and SQL, take a look at the courses I offer through gisadvisor.com. Finally, I want to start offering this big data analytics workshop with Postgres and PostGIS during the year. I would be happy to come to your city or GIS conference to teach the class. Just send me a note, and we can work out a way to get me to your area to introduce GIS professionals to more FOSS4G content.

Big Data Results

I wanted to revisit the taxi data example that I previously blogged about.  I had a 6GB file of 16 million taxi pickup locations and 260 taxi zones.  I wanted to determine the number of pickups in each zone, along with the sum of all the fares.  Below is a more in-depth review of what was done, but for those of you not wanting to read ahead, here are the result highlights:

Platform | Command | Time
ArcGIS 10.4 | AddJoinManagement | Out of memory
ArcGIS Pro | Summarize Within | 1h 27m*
ArcGIS Server Big Data GeoAnalytics with Big Data File Share | Summarize Within / Aggregate Points | ~2m
Manifold 9 | GeomOverlayContained | 3m 27s
Postgres/PostGIS | ST_Contains | 10m 30s
Postgres/PostGIS (optimized) | ST_Contains | 1m 40s
*I’m happy ArcGIS Pro ran at this speed, but I think it can do better.  This is a geodatabase straight out of the box. I think we can fiddle with indexes and even structuring the data to get things to run faster.  That is something I’ll work on next week.
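
For readers who want to see what "pickups per zone plus the sum of fares" means in code, here is a minimal Python sketch. It is not the query or tool used in any of the runs above: it swaps the spatial database's containment operator for a simple ray-casting test, uses no spatial index, and runs on a handful of made-up zones and pickups. The zone names, coordinates, and fares are all hypothetical.

```python
from collections import defaultdict


def point_in_polygon(x, y, polygon):
    """Ray-casting test: True if (x, y) is inside polygon, a list of (x, y) vertices."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can be crossed.
        if (y1 > y) != (y2 > y):
            # x-coordinate where the edge crosses that horizontal line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def summarize_pickups(zones, pickups):
    """Return {zone_id: (pickup_count, fare_total)} for every zone."""
    counts = defaultdict(int)
    fares = defaultdict(float)
    for x, y, fare in pickups:
        for zone_id, polygon in zones.items():
            if point_in_polygon(x, y, polygon):
                counts[zone_id] += 1
                fares[zone_id] += fare
                break  # assumption: zones do not overlap
    return {zone_id: (counts[zone_id], fares[zone_id]) for zone_id in zones}


# Hypothetical data: two square zones and four pickups (x, y, fare).
zones = {
    "A": [(0, 0), (2, 0), (2, 2), (0, 2)],
    "B": [(2, 0), (4, 0), (4, 2), (2, 2)],
}
pickups = [(1.0, 1.0, 10.0), (0.5, 1.5, 5.5), (3.0, 1.0, 7.0), (5.0, 5.0, 20.0)]
result = summarize_pickups(zones, pickups)
```

This brute-force loop tests every point against every zone, which is exactly the kind of work that indexing and parallel processing are meant to cut down when there are 16 million points.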

Continue reading

Big data analytics with GIS – the CSULB pilot

Well folks, it’s happening. I’m about to take one of my most adventurous steps into these training classes yet.

With the release of Manifold 9, I’m going to offer a big data analytics class that includes gigabytes of data, multi-databases, statistical processing, and parallel processing. And, it is something you will be able to participate in using only freely available software. Imagine that, a big data analytics class with free software.

Delivering 20GB of data at a bring-your-own-device (BYOD) training class is a challenge. Also, with this high-level work, it is a further challenge to decide what can fit into a one-day workshop.

Thankfully, the California State University at Long Beach provided me with an opportunity to teach my workshop to their students this week. It was a blast!

More importantly, I learned a lot about how to put together a deep-dive class like this. Eight hours is simply too short!!

The students loved the workshop, and I loved teaching it. Stay tuned, as a live workshop will be coming to a city near you, and an abbreviated online workshop will roll out in the next month.


Work smarter – not larger

When you were in Statistics 101 and the professor said, “OK, we are now going to learn about the Central Limit Theorem,” did you tune out? Did you sarcastically say, “When is someone going to grab me and order me to tell them about the Central Limit Theorem?” Come on, admit it, you did.  Well, so did I – I was 18 years old, and couldn’t care less.

Well, you know what? Understanding the Central Limit Theorem has really big implications for big data analytics. Check out this 20-minute video, and you’ll see that by applying the Central Limit Theorem and some statistical theory, you can approximate the results of an expensive multi-server implementation for interrogating really large databases.

I’ll show you how you can obtain very precise estimates on really large databases by simply applying some basic statistics you should have learned Freshman year (but you were too busy partying, weren’t you?)
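
To make the idea concrete, here is a small Python sketch of sample-based estimation. It is an illustration under stated assumptions, not the method from the video: the "population" is synthetic (exponentially distributed values standing in for something like fares), the sample size of 2,000 is arbitrary, and the 95% interval uses the usual normal approximation that the Central Limit Theorem justifies for large samples. The function name `estimate_mean` is made up for this example.

```python
import math
import random
import statistics


def estimate_mean(values, sample_size, rng, z=1.96):
    """Estimate the mean of values from a simple random sample.

    Returns (estimate, low, high), where low..high is an approximate
    95% confidence interval when z = 1.96.
    """
    sample = rng.sample(values, sample_size)
    mean = statistics.fmean(sample)
    # Standard error of the mean: sample standard deviation / sqrt(n)
    std_err = statistics.stdev(sample) / math.sqrt(sample_size)
    margin = z * std_err
    return mean, mean - margin, mean + margin


rng = random.Random(42)  # fixed seed so the run is repeatable
# Synthetic, skewed "population" of 200,000 values with a mean near 12.
population = [rng.expovariate(1 / 12.0) for _ in range(200_000)]

true_mean = statistics.fmean(population)                  # the expensive full scan
estimate, low, high = estimate_mean(population, 2_000, rng)  # reads only 1% of the rows

print(f"full scan: {true_mean:.2f}")
print(f"sample:    {estimate:.2f} (95% CI {low:.2f} to {high:.2f})")
```

The point of the sketch is the trade: the sample touches 1% of the rows, and the width of the interval shrinks with the square root of the sample size, not with the size of the full table.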

 

Stay tuned, I’ll be coming out with a big data analytics class in the New Year.  If you want to learn more about SQL, programming, open source GIS, or Manifold, check out courses at www.gisadvisor.com.

Big Data GeoAnalytics – adding data

Continuing my series on big data geoanalytics, I wanted to show how to bring in large data sets so that we can start working with them. The data set we’ll use is the NYC taxi data that includes information on pickups and drop-offs. There are about 13 million records in a 2.2GB .csv file. That is not insanely large, but it is large enough for us to start messing around with it (don’t worry, I have a few 20GB+ data sets that I am working with and will eventually show those to you as well).

The video below will walk you through the steps I took to load and prepare the NYC taxi data inside of Manifold Future. My next posts will begin to look at how we can begin interrogating the data source to find meaningful information.

I hope you enjoy the video. Please comment below – I’d love to hear what people think.

 

Big Data GeoAnalytics – Turning Points to Lines

In my last video, I gave a sort of mile-high view of how SQL can be used for big data geoanalytics.  I want to dive a little deeper, and explore the idea of creating linear features from a time-series of points.

Once again, using some basic SQL and spatial SQL, we can perform basic time-series analysis.
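
The video does this in SQL. As a rough Python equivalent of the same idea (group the points by vehicle, order each group by time, then string the coordinates together), here is a minimal sketch. The field layout `(track_id, timestamp, x, y)`, the cab identifiers, and the function name `points_to_lines` are assumptions made up for this example.

```python
from collections import defaultdict


def points_to_lines(points):
    """Build one WKT LINESTRING per track from (track_id, timestamp, x, y) tuples."""
    tracks = defaultdict(list)
    for track_id, timestamp, x, y in points:
        tracks[track_id].append((timestamp, x, y))

    lines = {}
    for track_id, fixes in tracks.items():
        if len(fixes) < 2:
            continue  # a line needs at least two points
        fixes.sort(key=lambda fix: fix[0])  # order the points by timestamp
        coords = ", ".join(f"{x} {y}" for _, x, y in fixes)
        lines[track_id] = f"LINESTRING ({coords})"
    return lines


# Hypothetical, deliberately unordered input.
points = [
    ("cab1", 3, 2.0, 2.0),
    ("cab1", 1, 0.0, 0.0),
    ("cab2", 1, 5.0, 5.0),
    ("cab1", 2, 1.0, 0.5),
    ("cab2", 2, 6.0, 5.5),
    ("cab3", 1, 9.0, 9.0),  # single point, so no line is produced
]
lines = points_to_lines(points)
```

In a spatial database the same three steps usually map onto a GROUP BY on the identifier, an ORDER BY on the timestamp, and an aggregate that assembles the line.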

I’m enjoying making these videos, as they are helping me put my course on big data and GIS together.  I hope you like them too.  Please comment down below so that I know this is something the user community enjoys and is learning from.

Also, if you are interested in learning more about how to perform spatial SQL in Microsoft SQL Server, Postgres, or Manifold, visit my other site, www.gisadvisor.com to sign up for my online video courses.

Big data geo-analytics with SQL

I’m getting ready to create a course in big data analytics with GIS.  I have lots of ideas as to what to do, but one thing I know is that I will be using spatial databases and SQL.  I’ll also be using Manifold Future.

ESRI has recently introduced their ArcGIS GeoAnalytics Server, which will introduce many GIS professionals to big data analytics with GIS.  They have some interesting scenarios and example data using NYC taxi cabs.  I think these will be really good case studies.

This video (just shy of 20 minutes) will use SQL and Manifold to try and address these big data problems.

Keep an eye on my blog as I will be rolling out new ideas as I prepare my course for the Spring.

If you like the video, and want to learn more about how to improve your spatial database skills, check out my videos at www.gisadvisor.com.

Parallel Processing with QGIS

Once again, I am continuing my role as a mentor in a National Science Foundation (NSF) Research Experiences for Undergraduates program.  This year we’ve decided to build a QGIS plug-in for terrain analysis, as it is embarrassingly parallel (slope, aspect, etc.).   We are doing four things as we generate slope for digital elevation models of different sizes:

  1. A pure Python implementation (for an easy plug-in)
  2. A serial-based C++ implementation (as a baseline for a non-parallel solution)
  3. A PyCUDA implementation (using the GPU for parallel processing)
  4. A C++ based parallel solution using the GPU
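
To show why slope is called embarrassingly parallel, here is a minimal pure-Python sketch. It is not the plug-in code: it assumes the DEM is a plain list of rows, uses simple central differences on interior cells (many GIS packages use a 3x3 weighted kernel instead), and the function names `slope_row` and `slope_grid` are made up for this example.

```python
import math


def slope_row(dem, row, cell_size):
    """Slope in degrees for the interior cells of one DEM row.

    Each output value depends only on a cell's four neighbours, never on
    another output value, which is what makes the problem easy to parallelize.
    """
    out = []
    for col in range(1, len(dem[row]) - 1):
        dz_dx = (dem[row][col + 1] - dem[row][col - 1]) / (2 * cell_size)
        dz_dy = (dem[row + 1][col] - dem[row - 1][col]) / (2 * cell_size)
        out.append(math.degrees(math.atan(math.hypot(dz_dx, dz_dy))))
    return out


def slope_grid(dem, cell_size=1.0):
    """Slope for all interior cells; edge cells are skipped in this sketch."""
    return [slope_row(dem, row, cell_size) for row in range(1, len(dem) - 1)]


# Hypothetical 4x4 DEMs: a ramp rising one unit per cell, and a flat surface.
ramp = [[float(col) for col in range(4)] for _ in range(4)]
flat = [[5.0] * 4 for _ in range(4)]

ramp_slope = slope_grid(ramp)  # every interior cell is a 45-degree slope
flat_slope = slope_grid(flat)  # every interior cell has zero slope
```

Because the rows are independent, the loop in `slope_grid` is the part a parallel version would hand out to separate workers or GPU threads.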

We plan to put our results on a shared GitHub site (we are in the process of cleaning up the code) so that people can start experimenting with it, and use our example to begin generating more parallel solutions for QGIS (or GDAL for that matter).

Here are some early results: Continue reading

When More is Less…. lessons from processing large data files

My good friend Stuart Hamilton gave me a fun conundrum to try out. He has a file of province boundaries (400 areas) and lidar-derived mangrove locations (37 million points – 2.2GB in size). He wants to find the number of mangroves that are contained in each area.  He also wants to know which country a mangrove location is in.  An overview of the area is here:

[Image: overview of the study area]

But as you zoom in, you can see that there are a tremendous number of points:

[Image: zoomed-in view of the mangrove points]

The problem

You would think that overlaying 37 million points with 400 polygons wouldn’t be too much trouble – but, it was.  Big-time trouble.  In fact, after running for days in ArcGIS, Manifold GIS, Postgres/PostGIS, and spatial Hadoop, it simply would not complete. Continue reading