By Luis J. Villanueva on Tue, 10/20/2020 - 09:56
Today the Global Biodiversity Information Facility (GBIF) announced that a project by the Digitization Program Office (DPO) won a Third Place prize in their 2020 challenge. GBIF selected the Mass Georeferencing Tool project among several entries in their 2020 Ebbe Nielsen Challenge, a competition for projects that "leverage biodiversity data and tools from the GBIF network to advance open science" (source).
The Mass Georeferencing Tool follows an approach similar to software as a service (SaaS), where the IT department manages the database, runs the queries, and provides the computing power. At the Smithsonian Institution, DPO, as part of the Office of the Chief Information Officer (OCIO), will provide this service. This will allow the collection staff to concentrate on selecting the best matches for the records and validating these matches. Our main goal is that this system, and the associated workflows, will allow the staff of the National Museum of Natural History (NMNH) to georeference records on a massive scale. The large collections of the NMNH, more than 146 million specimens, will require innovative approaches and tools. The Mass Georeferencing Tool is one of several projects that DPO is working on to create and enhance the digital records of the Smithsonian museums.
The tool groups biodiversity records based on criteria selected by the collection staff, for example by collection event or species. The system then runs a massively parallel search for similar localities across a large number of spatial databases. The result is a list of candidate localities, each with a score based on the similarity of the text, state or province match, distance to the known species distribution, and other factors. This workflow is similar to what the staff usually does by hand; the tool lets the computers do the tedious tasks.
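To make the grouping and scoring steps more concrete, here is a minimal sketch in Python. It is not the tool's actual implementation: the record fields, the `Candidate` class, the weights (0.6, 0.2, 0.2), the 500 km cutoff, and the use of `difflib` for text similarity are all assumptions chosen for illustration, and the species distribution is reduced to a single reference point.

```python
from collections import defaultdict
from dataclasses import dataclass
from difflib import SequenceMatcher
from math import asin, cos, radians, sin, sqrt


@dataclass
class Candidate:
    """A hypothetical candidate locality returned by a gazetteer search."""
    name: str
    state: str
    lat: float
    lon: float


def group_records(records, key):
    """Group record dicts by a staff-chosen field, e.g. 'species'."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec[key]].append(rec)
    return dict(groups)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two lat/lon points."""
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(min(1.0, a)))


def score_candidate(locality, state, cand, range_point, max_km=500.0):
    """Combine text similarity, state match, and proximity into a 0-1 score.

    The weights and max_km are illustrative assumptions, not the tool's values.
    """
    text = SequenceMatcher(None, locality.lower(), cand.name.lower()).ratio()
    state_match = 1.0 if state.lower() == cand.state.lower() else 0.0
    dist = haversine_km(cand.lat, cand.lon, range_point[0], range_point[1])
    proximity = max(0.0, 1.0 - dist / max_km)
    return 0.6 * text + 0.2 * state_match + 0.2 * proximity


def rank_candidates(locality, state, candidates, range_point):
    """Return (score, candidate) pairs sorted from best to worst."""
    scored = [(score_candidate(locality, state, c, range_point), c) for c in candidates]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)
```

In this sketch, the staff member would review the top of the list returned by `rank_candidates` and pick the best match; the score only orders the candidates and does not replace that judgment.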
The web application part of the tool displays these results in a browser. The web application uses Leaflet to map the spatial data from a PostGIS database over a variety of basemaps. In addition, the map can display the species distribution to provide more relevant information. This is a sample screenshot of an early test release of the application, where the collection staff can select the best candidate locality for the record in the collection:
The collection staff then downloads the results in a number of customized formats, according to their needs. All the computing power, including matching, data storage, spatial data mapping, and data formatting, is provided by resources at the data center managed by OCIO.
We recently finished a testing phase of the tool with NMNH staff, which resulted in valuable feedback that will help us continue developing the tool towards a stable release. Our goal is to deploy a large-scale test early next year and a full production system several months later.
The source code of the application is available on GitHub with an open license: https://github.com/Smithsonian/Mass-Georeferencing