Tuesday, October 13, 2009

Ed Parsons on Spatial Data Infrastructure

I recently attended Surveying & Spatial Sciences Institute Biennial International Conference in Adelaide and was privileged to see Ed Parsons’ presentation. For those who don’t know Ed, his bio describes him as “… the Geospatial Technologist of Google, with responsibility for evangelising Google's mission to organise the world's information using geography, and tools including Google Earth, Google Maps and Google Maps for Mobile.” He delivered a very enlightening and truly evangelistic presentation outlining his views on the best approach to building Spatial Data Infrastructures. The following paragraphs summarise the key, thought provoking points from the presentation – with some comments from my perspective.

The essence of Ed’s position is that the currently favoured approach of building highly structured and complex to the n-th degree “digital libraries” to manage spatial information is very inefficient and simply does not work. There is much better framework to use – the web – which is readily available and can deliver exactly what is needed by the community, and in a gradual and evolutionary fashion rather than as a pre-designed and rigid solution.

I could quote many examples of failed or less than optimal implementations of SDI initiatives in support of Ed’s views. There is no doubt that there are many problems with the current approach. New initiatives are continuously launched to overcome the limitations of previous attempts to catalogue collections of spatial information. And it is more than likely that none of the implementations is compatible with the others. The problem is that metadata standards are too complex and inflexible and data cataloguing software is not intelligent enough to work with less than perfectly categorised information. I recently had first hand experience with it. I tried to use approved metadata standards for my map catalogue project, hoping it will make the task easier and the application fully interoperable, but in the end, I reverted to adding my own “interpretations and extensions” (and proving, at least to myself, that “one-fit-all” approach is almost impossible). I will not even mention the software issues…

Ed argued that most SDI initiatives are public sector driven and since solution providers are primarily interested in “selling the product”, therefore by default it all centres on data management aspect of the projects. In other words, the focus is on producers rather than users, on Service Oriented Architecture (SOA) rather than on “discoverability” of relevant information. All in all, current SDI solutions are built on the classic concept of a library where information about the data (metadata) is separated from the actual data. Exactly as in a local library, where you have an electronic or card based catalogue with book titles and respective index numbers and rows of shelves with books organised according to those catalogue index numbers. For small, static collections of spatial data this approach may work, but not in the truly digital age, where new datasets are produced in terabytes, with myriad of versions (eg. temporal datasets), formats and derivations. And this is why most SDI initiatives do not deliver what is expected of them at the start of the project.

Ed made a point that it is much better to follow an evolutionary approach (similarly to how web developed over time) rather than strict, “documentation driven” process, as is the case with most current SDI projects. The simple reason is that you don’t have to understand everything up-front to build your SDI. The capabilities may evolve as needs expand. And you can adjust your “definitions” as you discover more and more about the data you deal with. In an evolutionary rather than prescriptive way. It is a very valid argument since it is very, very hard to categorise the data according to strict rules, especially if you cannot predict how the data will evolve over time.

[source: Ed Parsons, Google Geospatial Technoloist]

The above table contrasts the two approaches. On one side you have traditional SDIs with strict OGC/ISO metadata standards and web portals with search functionality - all built on Service Oriented Architecture (SOA) principles and with SOAP service (Simple Object Access Protocol) as the main conduit of information. Actually, the whole set up is much more complex as, in order to work properly, it requires formalised “discovery” module - a registry that follows Universal Description, Discovery and Integration (UDDI) protocol and a “common language” for describing available services (that is, Web Service Description Language or WSDL in short). And IF you can access the data (big “if” because most public access SDI projects do not go as far) it will most likely be in a “heavy duty” Geographic Markup Language (GML) format (conceived over a decade ago but still mostly misunderstood by software vendors as well as potential users). No wonder that building SDI based on such complex principles poses a major challenge. And even in this day and age the performance of such constructed SDI may not be up to scratch as it involves very inefficient processes (“live” multidimensional queries, multiple round trips of packets of data, etc).

On the other side you have the best of web, developed in an evolutionary fashion over the last 15 years: unstructured text search capabilities delivered by Google and other search engines (dynamically indexed and heavily optimised for performance), simple yet efficient RESTful service (according to Ed Parsons, not many are choosing to use SOAP these days) and simpler and lighter data delivery formats like KML, GeoRSS or GeoJSON (that have a major advantage – the content can be indexed by search engines and therefore making the datasets discoverable!). As this is much simpler setup it is gaining a widespread popularity amongst “lesser geeks”. US government portal data.gov is the best example of where this approach is proving its worth.

The key lesson is, if you want to get it working – keep it simple and do not separate metadata from your data to allow easy discovery of the information. And let the community of interest define what is important rather than prescribe upfront a rigid solution. The bottom line is that Google strength is in making sense of chaos that is in cyberspace so it should be no surprise that Ed is advocating similar approach to dealing with chaos of spatial data. But can the solution be really so simple?

The key issue is that most of us, especially scientists, would like to have a definite answer when we search for the right information. That is: “There are 3 data sets matching your search criteria” rather than: “There are 30,352 datasets found, first 100 closest matches are listed below…” (ie. the “Google way”). There is always that uncertainty, “Is there something better/ more appropriate out there or should I accept what Google is serving as the top search result?... What if I choose the incomplete or not the latest version of the dataset?”… So the need for highly structured approach to classify and manage spatial information is understandable but it comes at a heavy cost (both time and money) and in the end it can serve only the needs of a small and well defined group of users. “The web” approach can certainly bring quick results and open up otherwise inaccessible stores of spatial information to masses but I doubt it can easily address the issue of “the most authoritative source” that is so important with spatial information. In the end, the optimal solution will probably be a hybrid of the two approaches but one thing is certain, we will arrive at that optimal solution by evolution and not by design!