GeospatialPython.com: spatial

Showing posts with label spatial. Show all posts

Tuesday, May 29, 2012

SBN Mystery - Solved!

Last October I asked you all for help in figuring out the spatial indexing algorithm used to create Esri sbn files. Using Pyshp, I had successfully decoded the file formats which I provided. However I could not figure out the algorithm used to create and populate the spatial bins within these files.

Today I'm pleased to announce this challenge has been answered. The GIS community now has access to both the sbn and sbx file format as well as the algorithm for grouping features in a shapefile into "spatial bins".

I'm glad I asked for help as this challenge turned out to be quite difficult. The brain behind this operation is Marc Pfister with some good insights from Si Parker. Marc worked tirelessly on this problem for months with a cross-country move and complete career change thrown in to make it interesting. Marc did all the heavy intellectual lifting with me playing an inquisitive, but usually short-sighted Watson to his Holmes. I generated endless series of shapefiles and one-off scripts to help Marc try and flush out the algorithm based only on subtle changes in the number of bins and features they contained as well as his past experience with spatial indexing.

When I figured out the file formats I had hoped I was just a Wikipedia search away from recognizing the spatial tree algorithm. But the solution turned out to be much more complex than that. Esri uses a sort of balanced tree that exhibits traits of several different algorithms. The system seems carefully designed but is by no means obvious. I will publish Marc's findings as soon as I can.

There are still a few shapefile cases which create puzzling but insignificant results. However we are at the 98% mark. The project goal of compatibility has been reached. There is no longer any reason to hold off on sharing the results. We are fairly certain that we are able to create sbn and sbx files which sufficiently fool ArcMap as well as other Esri packages so other software can read, use, and generate these indexes alongside the Esri suite. There is more testing to do but it seems we are out of the woods.

What we haven't done is nicely packaged all of this work up. But Marc posted a small set of Python scripts on github which demonstrate the algorithm and file handling needed to copy this capability. Over the coming months I will fold this code into Pyshp, produce better documentation on the algorithm, and provide posts on how to deal with these indexes. But for now here's what you've been waiting for:

https://github.com/drwelby/hasbeen

By the way, Marc does freelance programming. In my job, I get the opportunity to work with lots of really bright geospatial programmers and mathematicians and this guy is one of the very best I've ever seen. If you have a tough geospatial project and need some E=MC2 smarts definitely look him up.

Friday, November 4, 2011

Deleting Shapefile Features

Sometimes, usually as a server-based operation, you need to delete all of the features in a shapefile. All you want left is the shapefile type, the dbf schema, and maybe the overall bounding box. This shapefile stub can then be updated by other processes. Pyshp currently doesn't have an explicit "delete" method. But because pyshp converts everything to native Python data types (strings, lists, dicts, numbers) you can usually manipulate things fairly easily. The solution is very similar to merging shapefiles but instead you are writing back to the same file instead of a new copy. There's only one hitch in this operation resulting from a minor difference in the pyshp Reader and Writer objects. In the reader the "bbox" property returns a static array of [xmin, ymin, xmax, ymax]. The Writer also has a "bbox" property but it is a method that is called when you save the shapefile. The Writer calculates the bounding box on the fly by reading all of the shapes just before saving. But in this case there are no shapes so the method would throw an error. So what we do is just override that method with a lambda function to return whatever bbox we want whether it be the original bbox or a made up one.

import shapefile 
# Read the shapefile we want to clear out
r = shapefile.Reader("myshape") 
# Create a Writer of the same type to save out as a blank
w = shapefile.Writer(r.shapeType) 
# This line will give us the same dbf schema 
w.fields = r.fields 
# Use the original bounding box in the header 
w.bbox = lambda: r.bbox 
# Save the featureless, attribute-less shapefile
w.save("myshape")

Instead of using the original bounding box we could just populate it with 0's and let the update process repopulate it:

w.bbox = lambda: [0.0, 0.0, 0.0, 0.0]

Note coordinates in a shapefile must always be floating-point numbers. Sometimes you may not want to delete all of the features. You may want to select certain features by attribute or using a spatial operation.

Tuesday, October 4, 2011

Your Chance to Make GIS History

Update 6/13/12: We finally reverse engineered the sbn indexing algorithm to an Esri-compatible level! Please see this post: SBN Mystery Solved!

Update 3/2/12: Getting closer still... Marc has determined there are two binning algorithms at work to determine which spatial bin a feature ends up in when a node on the tree is split as well as when a split should happen. Also, another developer, named Francisco, has recently begun a C# implementation and seems to be making good progress. In the comments on this post Francisco and Marc have been discussing the two algorithms. It's hard to give a time estimate but Marc feels that this last binning/splitting case should complete the picture.

Update 11/3/11: Work continues on this issue. I hope to post soon on the remaining mysteries of the algorithm but we are pretty close. I hope to do a more technical post on this issue soon to hopefully generate another round of feedback. I will continue to post any breakthroughs here as well as create a new post when we are done.

Update 10/27/11: Marc Pfister responded first and has been working tirelessly on this challenge. He's made a good bit of progress in unlocking the algorithm. Si Parker joined us recently and brought some great insight that added some new direction to Marc's work. While those guys worked on the hard stuff I created an sbx/sbn reader and writer module that can copy and manipulate these files. I believe at this point we could fool ESRI software but we are systematically closing in on the complete indexing scheme. It's been fascinating to watch the techniques used which I'll share when it's done.

This post is an open challenge to clever geospatial programmers everywhere.

Background

A plot produced by sbn.py for a "world cities" shapefile spatial index

The tremendously successful shapefile format is generally considered an open format due to the fact that the shp, shx, and dbf files are documented. But the 1998 "ESRI Shapefile Technical Description" doesn't tell the whole story. Another part of the shapefile format is the spatial index.

Spatial indexes create groups of features based on a given spatial clustering algorithm and define these clusters using a bounding box, usually mapped to an integer grid to avoid using floating point numbers. When performing spatial operations you can eliminate irrelevant features by doing quick checks against rectangles before performing expensive point-in-polygon or nearest-neighbor operations.

The shapefile spatial index consists of two files: ".sbn" and ".sbx" - short for "spatial bins" and "spatial bin index". ESRI never documented these file formats. And whenever you edit a shapefile these indexes must be updated.  The open source community worked around this issue by creating an open spatial index for shapefiles called ".qix" - short for Quadtree index after the quadtree algorithm it uses. But ESRI doesn't use qix and 3rd party shapefile software can't use sbn files.   This perputal incompatibility is a pain for everyone involved and results in corrupt indexes or extra coding to remove the incompatible files.

The Challenge

I am very close to ending this spatial index stalemate between ESRI software and open source. But I have hit a wall. I have been able to completey reverse engineer both the sbx and sbn file formats. The problem is I don't fully understand a small portion of the sbn file.  I'm hoping the community at large will recognize the structure ESRI uses and open this format once and for all.

The Facts

The headers of both the sbx and sbn files are nearly identical to the shp and shx files.

The fixed-length record format of the sbx file is nearly identical to the documented shx file but references entries in the sbn file. Just like the shp file any length fields are 16-bit words which you must double to get bytes.

The sbn file, as you might have guessed by now, contains "bins" which contain bounding boxes of individual features followed by the corresponding record number of that feature in the shp file.  For the bounding boxes in these bins ESRI did something very smart. Most spatial index formats map floating point coordinates to an integer grid which is more efficient. The point of a spatial index is not precision but relative accuracy. However instead of using integers ESRI used chars allowing them to map coordinates to a 255x255 grid using only a single byte per coordinate instead of 4.   This trick stumped me for a while because I was looking for at least 4-byte ints.

After the header of the sbn file all of the bins are listed with a bin number and the number of records that bin contains. After this portion the bin number is listed followed by each features bounding box and record number. Fairly straight forward except for one thing. In the bin list ESRI inserts empty bin records of bin # -1 to 0 with 0 # of shapes. These "spacers" do not appear in the actual bin structure and seem to follow regular patterns of 1 to 14 empty bins in a row at different points. I considered that these empty bins were artifacts left over when an index is updated by ESRI software after editing. However if you copy a shapefile, delete the index, and regenerate it using ArcGIS the structure looks exactly the same. These blank bins are intentional.

Theories

I'm nearly certain ESRI did not use a quadtree for these bins because if you plot their extents they overlap which a quadtree avoids to my knowledge. It is possible it is some version of an RTree and that the empty bins, if counted, represent tree nodes. Because RTrees are built recursively and can contain hierarchial "empty" nodes this structure might make since. But I just don't know enough about quadtree and RTree implementations to know what I'm looking at. Once these zero bins are deciphered creating an ESRI-compatible reader and writer will not be difficult. I also suspect that if we understand the structure of the file that the clustering algorithm used won't matter as long as the resulting index format is compliant.

What You Need to Get Started

I have created a zip file with enough tools and information to get a good look inside sbn and sbx files.  Download the Spatial Index Kit from the "Downloads" section of the pyshp Google Code site. In this zip file you will find:

A very simple sbn format specification following the main shapefile spec conventions in pdf and excel format
A very simple sbx format specification following the main shapefile spec conventions in pdf and excel format
A directory full of test shapefiles
A script, sbn.py, which reads a directory with sbn files and dumps the information in each sbn file into a text file with the same name as the shapefile. It also plots the bounding box of each bin and optionally each features box into an image using PIL and named as the shapefile. Configuration is done using variables inside the script.
A script, sbx.py, which also translates the contents of each sbx file into text files.
A script, fmtDecode.py, which is a brute force script I used to cycle through all possible data types when stepping through a binary file. I'm vaguely aware of better tools for reverse engineering but this dumb script did the trick.

Good Luck! Let me know if you have any questions. I will update this post and create another post if anybody comes up with the answer.

Tuesday, August 23, 2011

Point in Polygon 2: Walking the line

Credit: SpikedMath.com

This post is a follow-up to my original article on testing if a point is inside a polygon. Reader Sebastian V. pointed out the ray-casting alogrithm I used does not test to see if the point is on the edge of the polygon or one of the verticies. He was even nice enough to send a PHP script he found which uses an indentical ray-casting method and includes a vertex and edge test as well.

These two checks are relatively simple however whether they are necessary is up to you and how you apply this test. There are some cases where a boundary point would not be considered for inclusion. Either way now you have an option. This function could even be modified to optionally check for boundary points.

# Improved point in polygon test which includes edge
# and vertex points

def point_in_poly(x,y,poly):

   # check if point is a vertex
   if (x,y) in poly: return "IN"

   # check if point is on a boundary
   for i in range(len(poly)):
      p1 = None
      p2 = None
      if i==0:
         p1 = poly[0]
         p2 = poly[1]
      else:
         p1 = poly[i-1]
         p2 = poly[i]
      if p1[1] == p2[1] and p1[1] == y and x > min(p1[0], p2[0]) and x < max(p1[0], p2[0]):
         return "IN"
      
   n = len(poly)
   inside = False

   p1x,p1y = poly[0]
   for i in range(n+1):
      p2x,p2y = poly[i % n]
      if y > min(p1y,p2y):
         if y <= max(p1y,p2y):
            if x <= max(p1x,p2x):
               if p1y != p2y:
                  xints = (y-p1y)*(p2x-p1x)/(p2y-p1y)+p1x
               if p1x == p2x or x <= xints:
                  inside = not inside
      p1x,p1y = p2x,p2y

   if inside: return "IN"
   else: return "OUT"

# Test a vertex for inclusion
poligono = [(-33.416032,-70.593016), (-33.415370,-70.589604),
(-33.417340,-70.589046), (-33.417949,-70.592351),
(-33.416032,-70.593016)]
lat= -33.416032
lon= -70.593016

print point_in_poly(lat, lon, poligono)

# test a boundary point for inclusion
poly2 = [(1,1), (5,1), (5,5), (1,5), (1,1)]
x = 3
y = 1
print point_in_poly(x, y, poly2)

You can download this script here.

Tuesday, August 16, 2011

Geospatial News Apps

The Tribune's "Englewood" module helps you create very
scalable dot-desnity maps and is named after a well-known
Chicago neighborhood.

My background started in the newspaper business so I was pleased to see "The Chicago Tribune" has its own developers who maintain a newspaper technology blog. Newspapers have always worked with census data to back up news stories but the Tribune staff takes it much further. Through their blog they document apps they have created and release them as open source. In an age when many newspapers have folded because of advances in technology this team is using it to take Journalism in interesting new directions.

In a recent post on the Tribune's "News App Blog", they published a module for creating elaborate dot-density maps named "Englewood". They referenced my post "Dot Density Maps with Python and OGR" and turned that sample into the Englewood module named after the beleaguered Chicago neighborhood which often appears in the news.

The Tribune team pulls in several other tools and goes through the details of going all the way from census data to online dot-density maps. In addition to the basic how-to of producing the data they cover how they made the production really fast and deployed it to a massively-scalable S3 Amazon server. The blog gives a lot of insight into how a newspaper uses technology to apply geospatial technology in support of the news. Way more info than you get from your typical code-snippet blog. Fascinating stuff.

Monday, March 21, 2011

Working with Elevation Values in Shapefiles

I've had several questions about working with elevation values in shapefiles. Awhile back I posted a brief example in the wiki on the google code project site and felt it should be included here so this information is more visible.

Elevation values, known as "Z" values allow you to create 3-dimensional shapefiles. Working with elevation values in the Python Shapefile Library is easy but requires a step which hasn't been well documented until now. Make sure you are working with at least version 1.0 (revision 24 in the repository) as previous versions have a bug that prevents elevation values from being written properly as well as another bug which occurs when creating certain polygons.

The shapefile specification allows for three types of 3-dimensional shapefiles:

PointZ - a 3D point shapefile, PolylineZ - a 3D line shapefile, and PolygonZ - a 3D polygon shapefile.

The most important thing to remember is to explicitly set the shape type of each record to the correct shape type or it will default to a 2-dimensional shape. Eventually the library will automatically detect the elevation type.

Here is a brief example of creating a PolygonZ shapefile:

import shapefile
w = shapefile.Writer(shapeType=shapefile.POLYGONZ)
# OR you can type
# w = shapefile.Writer(shapeType=15)
w.poly([[[-89.0, 33, 12], [-90, 31, 11], [-91, 30, 12]]], shapeType=15)
w.field("NAME")
w.record("PolyZTest")
w.save("MyPolyZ")

When you read 3D shapefiles the elevation values will be stored separately:

>>>import shapefile
>>>r = shapefile.Reader("MyPolyZ")
>>>r.shapes()[0].points
[[-89.0, 33.0], [-90.0, 31.0], [-91.0, 30.0]]
>>>r.shapes()[0].z
[12, 11, 12]

So remember: explicitly set the shapetype for each shape you add to ensure the z values are maintained. And know that the z values are stored in the "z" property of each shape instead of being integrated with the x,y value lists.

Thursday, July 3, 2008

The GeospatialPythonosphere

Google indexed and pageranked the Internet. Then Google bought the obscure program "Keyhole 2 Lt" and branded it "GoogleEarth". Next Google hired Plone developer Alexander Limi. And finally they hired Python's creator Guido von Rossum.

Of course other significant things happened too like ESRI making Python an official scripting language for their flaghship product, but Google caused the explosion of interest in geographic information technology and Python.

Despite this explosion if you are interested in the combination of Python and GIS/Remote Sensing there is surprisingly little discussion on this topic given the independent popularity of these two technologies.

One reason is the the combination of Python and GIS is still new and niche despite Python's success and the growth of GIS. Another reason is the few people out there deeply interested in the advancement of Python as a GIS technology are too busy writing the few pieces of software out there to spend a lot of time blogging and making presentations at mainstream conferences.

This fact struck me at the 2006 ESRI User Conference in San Diego. There was a lot of buzz and curiosity about ESRI adopting Python in ArcGIS. At the conference bookstore I eavesdropped on a lady complaining to the cashier that there were no books on using Python in ArcGIS. There were hundreds of ESRI Press books, case studies, and well-known GIS textbooks but the only Python related material was a handful of O'Reilly Python books.

The number of professionally published books about a technology are a good indicator of mainstream interest. And to my disappointment geospatial python hasn't even reached the bar needed to publish O'Reilly's "Perl for Bioinformatics" series. Tyler Mitchell's excellent "Web Mapping Illustrated" came close but I want to see Python in the title. But you'd think there'd be more GP blogs out there given the dead-simple medium of blogging.

This blog intends to tip that scale of general disinterest ever so slightly. Python provides the lightweight fits-your-brain agility needed to keep up with the booming geographic technology industry while GIS provides an endless realm of killer applications waiting to be developed.

If you're starving for geospatial python information diet this post provides the available rations out there for a steady diet.

The Blogs

One of the most prolific GP writers and developers out there is Sean Gilles who posts frequently to his blog "import cartography". Sean has been a leader in the open source GIS arena and happens to have an interest in Python. His contributions to GP are too numerous to list but include Python Mapscript, Cartographic Objects for Zope, the Python Cartographic Library, PrimaGIS, Shapely, Pleiades, and more all linked at gispython.org.

Howard Butler with his consulting blog hobu.biz is another must-have RSS feed. Hobu has been blogging about Python and GIS topics since the turn of the century on his original blog "Fanning Mill". In addition to his many open source GIS contributions Howard's GP contributions are also industry staples including Python bindings for software and libraries including ArcView 3.x, GDAL, shapelib, Proj4, and ArcSDE to name a few. It is rumored Howard's Python bindings to the old ArcView 3.x dlls which were published in ESRI's ArcUser magazine are one of the factors that led to the use of Python in ArcGIS.

Chris Tweedie publishes a steady stream of open source GIS topics on his blog "Chris GISmo's". Many of his posts are Python related but all of them are worth reading.

The fourth geospatial python news source is simply gis+python on del.icio.us as an rss feed. It will turn up a lot of links to interesting software and posts you wouldn't find otherwise.

The Feeds

And here are the feeds:

import cartography: http://zcologia.com/news/
hobu.biz: http://hobu.biz/feed/
Chirs GISmo's: http://chris.narx.net/feed/
Del.icio.us: http://del.icio.us/rss/tag/gis+python

Pages