I have long been enamored with the idea of joining and reconciling distributed datasets. With the rise of machine learning in the mapping space, coupled with an ever-increasing trend toward crowdsourced mapping (individuals, NGOs, and governments), finding the thread that ties these datasets together is more important than ever. There is more open data in the mapping space than anyone knows what to do with, and it is being generated faster than centralized projects like OpenStreetMap are able to ingest it. For example, nearly a year ago, Bing released an ML-derived dataset of 125 million building footprints under an ODbL license, which is compatible with OpenStreetMap (UPDATE: another 12 million Canadian footprints were released moments after I published this piece). The data has yet to be imported due to a host of problems that arise when attempting to process so many changes at once. While maintenance burden is an oft-cited pain point, an even bigger problem is how to detect and resolve conflicts in the data. In other words, how do we come to consensus on our view of the world in the age of global-scale sensor networks and competing entities exerting soft power through the map? Is it even possible?
In traditional centralized mapping, deriving consensus is pretty straightforward. Whoever made the last change wins, and aside from the occasional edit war, this model is powerful in its lack of ambiguity. As a community, each contributor submits to the “last edit monarchy”, because there is so much to gain by having access to an internally consistent global street map of curated data. This model starts to break down, however, when the concepts of “time” and “edit” become fuzzy. The last edit rule only works when we operate on individual human-level scales for time and edit complexity, since the curation and reconciliation models require human review. When the supply-side dam breaks and hundreds of millions of detailed edits can be generated in a few minutes by a swarm of GPUs, the human reviewer model cannot scale to meet the volume. Whether it's buildings extracted with machine learning from aerial imagery at a rate of millions per minute, or massive street datasets generated by cars and phones collecting ambient GPS telemetry as users go about their day, the volume of useful data is exploding exponentially. How useful is the concept of “last edit” when the last edit included 10 million miles of street centerlines, and no one is quite sure how they overlap with the existing 50 million miles in OpenStreetMap meticulously mapped out over the last decade? What happens when that volume is regenerated daily?
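As a minimal sketch of the last-edit rule, assuming a stripped-down edit record that carries only a timestamp (the names here are illustrative, not OpenStreetMap's actual data model):

```python
from dataclasses import dataclass


@dataclass
class Edit:
    """Hypothetical, simplified map edit; fields are illustrative only."""
    feature_id: str
    tags: dict
    timestamp: float  # seconds since epoch


def last_edit_wins(edits):
    """Centralized consensus: the most recent edit is simply the truth."""
    return max(edits, key=lambda e: e.timestamp)
```

The rule is trivially cheap to apply, which is exactly why it works so well until the edits themselves stop being human-sized.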
Distributed editing of global-scale street maps is an unsolved problem, but we can look to similar domains for inspiration. One area where conflict management once resembled the present state of mapping is code authoring. Before git, most of us used source control systems that required “checking out” a file before editing, which put a global lock on the file, almost like checking out a book from the library. This provided a simple method for avoiding conflict, in many ways analogous to the last edit rule used in mapping today. The HOT Tasking Manager even uses the check-out model explicitly: an editor takes an exclusive lock on a specific tile while adding data in a dense region. Conflicts are impossible when there is an unambiguous process for an individual to become the arbiter of truth. While arguments can still be had over data accuracy, there is a single unambiguous state that cannot be disputed at any given point in time. Git flipped this model on its head, in large part by taking on an immense responsibility: detecting and fixing conflicts in competing branches of code.
Mapping follows a similar pattern to code authoring in that conflicts are dangerous, but not the norm. It is rare to have multiple coders working on the same lines of a file at the same time, just as it is rare, relative to the overall volume of edits, to have multiple editors working on the same object in OpenStreetMap. A system that can at least detect conflicts can wave most changes straight through, and this is precisely what makes git merge so powerful. This strategy is commonly used in bespoke OpenStreetMap imports, where code is written to identify imported features that are far away from existing features and therefore pose minimal conflict risk.
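A rough sketch of that screening step, assuming both datasets are reduced to lon/lat points and using a brute-force distance check (a real import would use a spatial index and handle lines and polygons; the 30 m buffer is an arbitrary illustration):

```python
import math


def haversine_m(lon1, lat1, lon2, lat2):
    """Great-circle distance in meters between two lon/lat points."""
    r = 6371000.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))


def split_import(candidates, existing, buffer_m=30.0):
    """Split imported point features into low-risk edits (far from anything
    already mapped) and potential conflicts that need a closer look."""
    safe, conflicts = [], []
    for lon, lat in candidates:
        near = any(haversine_m(lon, lat, elon, elat) < buffer_m
                   for elon, elat in existing)
        (conflicts if near else safe).append((lon, lat))
    return safe, conflicts
```

Everything in `safe` can be merged without review; only the `conflicts` bucket needs the slower resolution path.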
When a conflict is detected in git, a best-effort attempt is made to resolve the two branches. If this is not possible, the merge fails, and the user is asked to manually decide which branch wins. I have come to believe that these failure modes are inevitable in mapping as well, requiring human intervention. The goal of a resolution strategy should then be to:
1. Detect every potential conflict.
2. Automatically resolve as many of those conflicts as possible.
3. Escalate the remainder to human reviewers.
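A sketch of how those three phases might be wired together; `detect`, `auto_resolve`, and `human_queue` are hypothetical, pluggable pieces, not an existing OpenStreetMap or git API:

```python
def merge_changeset(changeset, base_map, detect, auto_resolve, human_queue):
    """Route an incoming changeset through detection, automated resolution,
    and, only as a last resort, a human review queue."""
    applied, escalated = [], []
    for edit in changeset:
        conflict = detect(edit, base_map)           # phase 1: detect
        if conflict is None:
            base_map.apply(edit)                    # clean merge, fast path
            applied.append(edit)
            continue
        resolution = auto_resolve(edit, conflict)   # phase 2: auto-resolve
        if resolution is not None:
            base_map.apply(resolution)
            applied.append(resolution)
        else:
            human_queue.put(edit)                   # phase 3: human review
            escalated.append(edit)
    return applied, escalated
```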
The fewer edits get pushed to a human reviewer, the more democratized our future maps will be. Evidence for this can be found in Wikipedia's use of automated vandalism detection, such as ClueBot NG, as well as its edit-quality scoring systems, such as ORES, which have drastically reduced the need for human review. While OpenStreetMap suffers less from vandalism than Wikipedia, handling bulk imports and automated import streams represents a similar existential resourcing challenge.
The problem with the strategy above is in phase 2, automated conflict resolution. Different organizations will have varying risk tolerances, quality requirements, and access to human editors with the expertise to resolve complex conflicts. This means it may be fundamentally impossible to resolve branching map datasets in a way that is satisfactory to everyone. The vigorous debate around bulk imports in the OpenStreetMap community illustrates how divergent different communities’ requirements can be when it comes to this problem.
Conflict-free replicated data types (CRDTs) are one potential solution to this set of issues, as suggested by the brilliant folks working on peermaps, a fully p2p global street map used in some impactful offline-heavy editing scenarios in the Amazon basin. The idea with the CRDT approach, more broadly, is to link together edits that could potentially conflict and to materialize a de-conflicted view of each feature on the fly. This materialization step would be separate from the primary data storage responsibilities of the map editing infrastructure, opening up the possibility of different resolution strategies depending on the use case. For example, an organization could weight its own editors' edits more heavily, or a cautious individual wary of ML-derived edits could filter based on source. Default strategies would likely be provided as well, similar to the golden-path git merge strategy.
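A toy sketch of that materialize-on-read idea, in the spirit of a last-writer-wins register with a pluggable resolver (this is my own illustration and says nothing about how peermaps actually implements it):

```python
from collections import defaultdict


class FeatureLog:
    """Append-only edit log per feature; the merged view is computed at
    read time by whatever resolution strategy the consumer chooses."""

    def __init__(self):
        self.edits = defaultdict(list)  # feature_id -> list of edit dicts

    def record(self, feature_id, edit):
        # Edits are only ever appended, so replicas can exchange logs and
        # converge regardless of the order in which edits arrive.
        self.edits[feature_id].append(edit)

    def materialize(self, feature_id, resolve):
        return resolve(self.edits[feature_id])


# Default strategy: last writer wins, with editor id as a deterministic
# tie-breaker so every replica picks the same winner.
def last_writer_wins(edits):
    return max(edits, key=lambda e: (e["timestamp"], e["editor"]))


# Alternative strategy: prefer trusted sources over ML-derived imports.
def prefer_sources(trusted):
    def resolve(edits):
        return max(edits, key=lambda e: (e["source"] in trusted,
                                         e["timestamp"], e["editor"]))
    return resolve
```

The storage layer never has to pick a winner; two consumers of the same log can disagree about the materialized map without forking the underlying data.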
The CRDT strategy, or something like it, is already common in certain niches of mapping, such as crowdsourced POIs or ETA modeling, where conflicting data is the norm, but significant research is still needed to make it a reality for features with strict topology requirements, such as street geometry or tightly packed buildings. While some may think of p2p mapping as a niche use case, decentralized mapping generally is bringing the issues potentially addressed by CRDTs to the forefront for anyone interested in large-scale mapping collaborations.
The conflation problem is an undertaking that no single entity is capable of solving alone. I recently joined SharedStreets, an independent nonprofit building an open street identifier protocol. I joined specifically to focus on the problem of linking branching street map data by common identifier via a fuzzy linear referencing system. I believe robust common reference systems are a critical first step toward building a distributed global base map (git got off easy on this step, since code has line numbers). In addition, I'm increasingly excited by possibilities involving ultra-flexible routing engines that can adapt to the rapidly accelerating velocity of data generation, either through smart merging/tasking or through advances in on-the-fly CRDT resolution.
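As a purely illustrative toy (not the SharedStreets reference specification), here is the flavor of deriving a shared identifier from geometry, so that two independently collected copies of roughly the same street segment can be linked instead of duplicated:

```python
import hashlib


def toy_segment_reference(coords, precision=5):
    """Hash the rounded endpoints of a street segment into a short, stable
    id. Real reference systems also encode bearing and length and tolerate
    fuzzier matches; this only illustrates the idea of a shared key."""
    start, end = coords[0], coords[-1]
    key = ",".join(f"{c:.{precision}f}" for pt in (start, end) for c in pt)
    return hashlib.sha1(key.encode()).hexdigest()[:16]


# Two traces of the same block from different providers land on one id,
# so their edits can be compared and merged rather than stacked on top of
# each other as duplicate geometry.
trace_a = [(-122.41941, 37.77492), (-122.41801, 37.77354)]
trace_b = [(-122.419412, 37.774918), (-122.418008, 37.773541)]
assert toy_segment_reference(trace_a) == toy_segment_reference(trace_b)
```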
Thanks for reading! Questions or comments can be emailed to [email protected] or tweeted to @morganherlocker.
03-07-19