Objects and Dashboard from Mass Digitization

Automation in Mass Digitization Projects

By Luis J. Villanueva on Mon, 09/28/2020 - 09:30

The Mass Digitization projects of the Digitization Program Office (DPO) created millions of images, many of which are now available under Open Access. At the scale of each project, tracking the progress of each file, finding problems, and fixing errors was time-consuming. In the last couple of years, we have moved to data-centric approaches for our workflows, in which automated tools collect detailed data from each project, track each file and its errors, and create reports.

Previously, the progress of each project was tracked in manually edited spreadsheets. We depended on the validation checks and reports from the Smithsonian Digital Asset Management System (DAMS), where all the images are stored. This made it easy to lose track of the progress made in some projects; when thousands of images are created every day, the data becomes outdated very quickly.

The images produced in each project are verified to make sure they meet the project requirements. Previously, these checks were only possible after the images had been ingested into the DAMS. Whenever there was an issue, it created more work for our colleagues in the DAMS team, and remediation could take a while to coordinate. For example, remediating objects that had already been moved back to storage added more complexity to the project.

To solve these issues, we developed a data-centric automated system, which we called Osprey, that tracks each file that the vendor creates. This system runs the integrity and validation checks the museum requires for the project as close to image capture as possible. These checks include: 

  • Both files of each pair are present (TIF and raw)
  • A JPG file is also present, when the project requires one
  • TIF and JPG files are formatted correctly according to JHOVE and ImageMagick 
  • File name meets requirements, which could be: 
    • Set prefix for the project 
    • File name present in lookup table of valid names 
  • TIF files are compressed using LZW lossless compression to save disk space 
  • No duplicate file names 
  • File MD5 hash matches the inventory list provided by the vendor
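
As a rough illustration of how some of these checks could be automated, the sketch below covers the file pair, the name prefix, duplicate names, and the MD5 comparison. It is not the Osprey code. The function names, the problem labels, the shape of the inventory (a mapping of file name to expected MD5), and the assumption that the raw file sits next to the TIF with the same stem are all hypothetical. The JHOVE, ImageMagick, and LZW compression checks are left out because they depend on external tools.

```python
import hashlib
from collections import Counter
from pathlib import Path


def md5_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_folder(folder, prefix, raw_ext, inventory):
    """Return {relative TIF path: [problem labels]} for every *.tif file.

    `inventory` maps file name -> expected MD5 (hypothetical format).
    `raw_ext` is the raw file extension, including the dot.
    """
    folder = Path(folder)
    tifs = sorted(folder.rglob("*.tif"))
    # Count stems so the same name in two subfolders is flagged.
    stem_counts = Counter(tif.stem for tif in tifs)
    results = {}
    for tif in tifs:
        problems = []
        # Assumption: the raw file is stored beside the TIF.
        if not tif.with_suffix(raw_ext).exists():
            problems.append("raw_pair_missing")
        if not tif.stem.startswith(prefix):
            problems.append("bad_prefix")
        if stem_counts[tif.stem] > 1:
            problems.append("duplicate_name")
        expected = inventory.get(tif.name)
        if expected is None:
            problems.append("not_in_inventory")
        elif md5_of(tif) != expected.lower():
            problems.append("md5_mismatch")
        results[tif.relative_to(folder).as_posix()] = problems
    return results
```

An empty list for a file means it passed every check in this sketch; any other value names the checks that failed, which is the kind of per-file result a dashboard could display.
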
Dashboard of the Paleobiology EPICC Mass Digitization project. The top row shows the project statistics, the column on the left lists the folders, the column in the center shows the test and validation results for each file, and the column on the right shows the details of the selected file.

In our recent projects, the vendor stores the images in a network share to which Osprey has read-only access. The system continuously scans the network share to read and validate each new file. A dashboard lets the project officer, the museum staff, and the vendor staff know of problems within minutes or hours of the moment each image was captured.
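
One simple way to picture a continuous scan of a read-only share is a polling pass that remembers what it has already seen. The sketch below is an assumption about how such a pass could look, not a description of how Osprey does it. The function name `scan_once` and the use of file size and modification time to detect changes are illustrative choices.

```python
from pathlib import Path


def scan_once(share, seen):
    """Return files that are new or changed since the previous scan.

    `seen` maps path -> (size, mtime_ns) and is updated in place.
    Only directory listings and file metadata are read, so read-only
    access to the share is enough.
    """
    changed = []
    for path in sorted(Path(share).rglob("*")):
        if not path.is_file():
            continue
        stat = path.stat()
        signature = (stat.st_size, stat.st_mtime_ns)
        if seen.get(path) != signature:
            seen[path] = signature
            changed.append(path)
    return changed
```

Calling `scan_once` in a loop with a pause between passes (for example, `time.sleep(60)`) would hand each new or modified file to the validation step shortly after the vendor writes it. A production system would likely also wait until a file has stopped growing before validating it.
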

We have also automated the process of creating progress reports for each project. Previously, the project officer needed to update the spreadsheets for each project by hand. This process was cumbersome, and the data could be outdated by weeks or months. In addition, creating custom reports, like the total number of images and objects digitized across all museums, was time-consuming because the data was stored in separate files. We automated the reports by using the Osprey dashboard to summarize the main project statistics. The system also writes spreadsheets with a full report of all files for each project, and these are shared in an online drive.
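
To make the reporting step concrete, here is a minimal sketch that turns per-file validation results into a spreadsheet-friendly file plus summary counts. It writes CSV rather than a native spreadsheet format to stay within the standard library. The function name, the column names, and the input shape (a mapping of file name to a list of problem labels, empty when the file passed) are assumptions for illustration only.

```python
import csv


def write_report(results, csv_path):
    """Write one row per file and return summary counts.

    `results` maps file name -> list of problem labels (hypothetical
    format); an empty list means the file passed all checks.
    """
    total = len(results)
    failed = sum(1 for problems in results.values() if problems)
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["file", "status", "problems"])
        for name in sorted(results):
            problems = results[name]
            status = "error" if problems else "ok"
            writer.writerow([name, status, ";".join(problems)])
    return {"total": total, "ok": total - failed, "error": failed}
```

The returned counts are the kind of figures a dashboard header could show, while the CSV keeps the full per-file detail for anyone who needs to dig into a specific problem.
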

In addition, we have a Mass Digitization Dashboard that is fed by our main database. One version of this dashboard is available to the public at https://shiny.si.edu/massdigi/. Visitors can see the main statistics of our projects, including plots and tables detailing the progress by day, the completion percentage of each project, and the total number of specimens digitized by DPO. This dashboard is updated every Monday morning and lets our staff, the Smithsonian leadership, and the public know the progress we are making.  

Automation allows DPO to keep track of digitized specimens (left; NMNH-Paleobiology) in a Mass Digitization Dashboard (right).

 

All our tools are available under an open source license on the Smithsonian Central GitHub: https://github.com/Smithsonian/.