Botany Specimens

DPO Mass Digitization at the Smithsonian Imaging Workflow (Part 2)

By Jessica Warner and Ken Rahaim, DPO on Wed, 12/28/2016 - 12:17

Once we’ve worked out the details for the Physical Workflow to ensure that the collections objects are moved safely and efficiently to and from the digitization work space, we turn our attention to the imaging workflow.

The imaging workflow design process is divided into four parts:

  1. Object Driven Image Fidelity (ODIF) Analysis
  2. ODIF Validation
  3. Image Processing
  4. Object Movement (pre- to post-digitization staging)

In Step 1 we measure the smallest physical detail of the physical object to be recorded as determined by DPO and museum staff.  These measurements set the imaging standard for the objects.  In Step 2 we validate the specifications determined in Step 1 and document them in the vendor’s contract (for more detail about these first two steps, see the previous blog post on the Imaging Workflow—Part 1).

In Step 3 we document the settings associated with the digital processing of the images and the file management for all the image files being generated.  And, finally, in Step 4 we choreograph and document the object handling to move the object from pre-digitization staging to the imaging station, in this case a conveyor-belt driven system, and then on to the post-digitization staging.

Okay, now let’s get into the nitty and the gritty of the Image Processing and Object Movement for our botanical specimen sheets digitization project in the Department of Botany at the National Museum of Natural History.

Step 3: Image Processing

1.    Input

a.    Color space for camera
b.    White balance: Vendor may choose how to establish white balancing (test above used Golden Thread Target balanced on the .75 density patch)
c.    If custom color ICC profile is used, vendor will document here (test above used custom camera profile created with basICColor Input v3.5.1)

2.    Output

a.    File naming:

i.    Based on code 39 barcode input

1.    Example = 00436248.jpg | tif | iiq

b.    Raw file format – Phase One medium format iiq

c.    Color space: 

i.    raw – camera color space
ii.    tif – wide color gamut such as ProPhoto or eciRGBv2
iii.    jpg – n/a as per TPC

d.    Bit depth:

i.    raw – 16 
ii.    tif – 8
iii.    jpg – n/a as per TPC

e.    Sharpening – None

f.    Tone curve – Capture One Linear Scientific or equivalent

g.    Derivative generation

i.    raw – uncropped 
ii.    tif – uncropped 
iii.    jpg – n/a as per TPC

h.    Output directory

i.    Root directory for deliverable images should be “Production-Botany”
ii.    Images will be saved in a new folder for each day’s production
iii.    Each day’s parent folder will have the format “unit-project-yyyymmdd” for its name

1.    unit= nmnh
2.    project = botany

iv.    Each day’s parent folder will have 2 sub-folders in which the tif and iiq files are stored. The sub-folders will be named “tifs” and “raws”
v.    Each day’s images will also have md5 checksums generated for all images.

1.    Two md5 checksum files will be generated, one for the tifs and one for the iiqs:

a.    Checksums for raw images will be stored in a file whose filename will have the format “unit-project-yyyymmdd-raws.md5”
b.    Checksums for tif images will be stored in a file whose filename will have the format “unit-project-yyyymmdd-tifs.md5”

2.    The checksum data in the file will have the format “checksum <space> filename.ext” for example:

595f44fec1e92a71d3e9e77456ba80d1 barcode1.tif
71f920fa27592a9b50fa4d4d41432a38 barcode2.tif

43c191bf6d23443d32349af0098c90d0 barcode1.iiq
983920a53e0098c8923d9908da83b08 barcode2.iiq

3.    The checksum files will be stored in the day’s respective raw & tif image directories

vi.    An example day’s directory will look like this:
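
As a rough illustration of the output-directory and checksum rules in item h, the following Python sketch builds one day's folders and writes a checksum file in the “checksum <space> filename.ext” format. The function names and the `base` argument are assumptions made for illustration; the specification does not say which tool the vendor uses to create folders or generate checksums.

```python
import hashlib
from datetime import date
from pathlib import Path

ROOT = "Production-Botany"         # root directory named in the spec
UNIT, PROJECT = "nmnh", "botany"   # unit and project values from the spec


def day_folder_name(day: date) -> str:
    """Return the day's parent folder name in 'unit-project-yyyymmdd' form."""
    return f"{UNIT}-{PROJECT}-{day:%Y%m%d}"


def build_day(base: Path, day: date) -> Path:
    """Create the day's parent folder with its 'tifs' and 'raws' sub-folders."""
    day_dir = base / ROOT / day_folder_name(day)
    for sub in ("tifs", "raws"):
        (day_dir / sub).mkdir(parents=True, exist_ok=True)
    return day_dir


def md5_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large image files are not read into memory at once."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_manifest(day_dir: Path, subfolder: str) -> Path:
    """Write 'unit-project-yyyymmdd-<subfolder>.md5' inside the given sub-folder."""
    folder = day_dir / subfolder
    manifest = folder / f"{day_dir.name}-{subfolder}.md5"
    images = sorted(
        p for p in folder.iterdir() if p.is_file() and p.suffix != ".md5"
    )
    # One line per image: checksum, a single space, then the filename.
    manifest.write_text(
        "".join(f"{md5_of_file(p)} {p.name}\n" for p in images),
        encoding="ascii",
    )
    return manifest
```

Calling `build_day(Path("."), date(2016, 12, 28))` followed by `write_manifest(day_dir, "tifs")` and `write_manifest(day_dir, "raws")` would produce the two checksum files described above, each stored next to the images it covers.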







3.    Data Generation

a.    File sizes:

i.    Raw = ~95MB
ii.    Tif color = ~140MB (LZW lossless compression on)
iii.    Jpg = n/a as per TPC
iv.    Total per image = ~340MB

b.    Throughput = 4,500 sheets per day for the short conveyor setup (6,000 sheets per day for the long conveyor setup, which we cannot use because of space constraints)

i.    ~1.53 TB of image data generated per day (8 hour day)
ii.    Ingest rate:

1.    ~191 GB per hour ingest rate required (based on 8 hour day)
2.    ~96 GB per hour ingest rate required (based on 16 hour day)
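
The data volume and ingest figures above follow from simple arithmetic on the throughput and the per-image total. The short Python sketch below reproduces them; it assumes decimal units (1 GB = 1,000 MB), and the function names are illustrative only.

```python
SHEETS_PER_DAY = 4500   # short conveyor setup
MB_PER_IMAGE = 340      # total per image, as listed above
MB_PER_GB = 1000        # decimal units (an assumption)


def daily_volume_gb(sheets: int = SHEETS_PER_DAY,
                    mb_per_image: int = MB_PER_IMAGE) -> float:
    """Image data generated per production day, in GB."""
    return sheets * mb_per_image / MB_PER_GB


def ingest_rate_gb_per_hour(hours: float) -> float:
    """Ingest rate needed to move one day's output within the given window."""
    return daily_volume_gb() / hours


if __name__ == "__main__":
    print(f"Daily volume: {daily_volume_gb() / 1000:.2f} TB")
    print(f"8 hour window: {ingest_rate_gb_per_hour(8):.0f} GB per hour")
    print(f"16 hour window: {ingest_rate_gb_per_hour(16):.0f} GB per hour")
```

Spreading ingest over 16 hours instead of 8 halves the required rate, which is why both figures are listed.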

4.    Metadata

a.    Barcodes:

i.    Format = code 39
ii.    Mapping:

1.    Specimen sheet barcode -> filename & iptc title

a.    NOTE: For parsing purposes, ALL specimen sheet barcodes will start with a 0 (zero)

2.    Taxonomic/IRN folder barcode -> iptc headline

a.    NOTE: For parsing purposes, NO taxonomic/IRN folder barcode will start with a 0 (zero)

b.    Boilerplate metadata:

i.      IPTC Creator: Contractor Name
ii.     IPTC Creator: Address: PO BOX 37012 MRC166
iii.    IPTC Creator: City: D. C.
iv.    IPTC Creator: Postal Code: 20013-7012
v.     IPTC Creator: Email(s): 
vi.    IPTC Creator: Website(s):
vii.   IPTC Keywords: specimen image; specimens; herbarium; U. S. National Herbarium; Smithsonian; Botany; (US)
viii.  IPTC Rights Usage Terms:
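
The leading-zero convention in the barcode mapping above is what lets software tell the two barcode types apart. A minimal Python sketch of that rule follows; the function names and dictionary keys are hypothetical, and the folder barcode used in the usage note is a made-up value.

```python
def classify_barcode(code: str) -> str:
    """Return 'specimen' for barcodes starting with 0, otherwise 'folder'."""
    if not code:
        raise ValueError("empty barcode")
    return "specimen" if code.startswith("0") else "folder"


def image_fields(specimen_code: str, folder_code: str) -> dict:
    """Map a specimen/folder barcode pair to the filename and IPTC fields."""
    if classify_barcode(specimen_code) != "specimen":
        raise ValueError(f"not a specimen barcode: {specimen_code!r}")
    if classify_barcode(folder_code) != "folder":
        raise ValueError(f"not a folder barcode: {folder_code!r}")
    return {
        "filename_stem": specimen_code,  # e.g. 00436248 -> 00436248.tif
        "iptc_title": specimen_code,
        "iptc_headline": folder_code,
    }
```

For example, `image_fields("00436248", "7001234")` pairs the specimen barcode from the file naming example with a hypothetical folder barcode.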


Step 4: Specimen Movement (pre- to post-digitization staging)

The steps below describe the precise movement of specimens from pre-digitization staging to post-digitization staging.  See workspace floorplan below.*

Digitization rate for the conveyor system will be 4,000 – 4,500 specimens per day.  In order to ensure continuous production throughout the day, at least 5,000 specimens per day should be ready at pre-digitization staging each morning prior to digitization starting at 8AM.






1.    SI Handler: As per the physical workflow design document, brings the folders for the day’s production from permanent storage to the pre-digitization staging area, using custom Viking rolling carts.

2.    Cnvyr Op 1: Picks up folders from pre-digitization staging and stacks them on the west counter to transfer them to the production area.

3.    Cnvyr Op 1: Removes folders from the west counter, places them on the conveyor belt table, and ensures there are enough black sheets to put below each object.

4.    Cnvyr Op 1: Double checks the bottom right front of each folder to ensure the taxonomic/IRN barcode is present.

5.    Cnvyr Op 1: Puts the taxonomic/IRN barcoded folder on the conveyor belt so the barcode can be read.

6.    Cnvyr Op 2: Folder advances down the conveyor to the barcode station, where the folder’s taxonomic/IRN barcode is scanned and saved for embedding into the IPTC Headline metadata field of each succeeding specimen image.

7.    Cnvyr Op 1: After the specimen’s folder is placed on the conveyor, one specimen after another is placed on the conveyor (optional: on 1 black background sheet each for cropping), positioned as straight as possible and oriented on the guidelines of the conveyor belt.

Note: for specimens that are enclosed within an envelope, the specimens shall remain inside the envelope attached to the specimen sheet and be photographed that way.

8.    Cnvyr Op 1: Supporting material such as literature, photographs, illustrations and reference material (anything that is not a specimen) will be placed on the conveyor but well outside the imaging area.

9.    Cnvyr Op 2: For specimen sheets that don’t have a specimen barcode (always starts with 0 [zero]), a new specimen barcode (always starts with 0 [zero]) is applied to the bottom middle of the specimen sheet.

10.    Cnvyr Op 2: Specimen advances down the conveyor belt to the barcode station, where the specimen barcode (always starts with 0 [zero]) is scanned and saved for naming the raw file (optional: and embedding into the image’s IPTC Title metadata field).

11.    Specimen advances to the imaging station, where:

a)     Specimen is automatically shot with the Golden Thread specimen target, and “US Herbarium, Smithsonian Institution” plus the SI logo are “burned into” the image

b)     All necessary post-processing, including cropping, is applied automatically

c)     Quality control analysis is automatically run against the Golden Thread specimen target for each specimen

d)     Stored specimen barcode (always starts with 0 [zero]) is written to the image’s filename (optional: and embedded into the image’s IPTC Title metadata). The stored folder barcode is written to the image’s IPTC Headline metadata. Both barcodes are written to the barcode text log for later delivery to TBD.

e)     tif derivative is automatically generated and written to the specified storage folders

12.    Cnvyr Op 3: Imaging results are displayed on a monitor and double checked by the operator.

13.    Cnvyr Op 1: Specimens continue to be imaged until all specimens from the folder are imaged. When the next folder arrives, it is placed on the conveyor as per step 5 and the process continues from there.

14.    Cnvyr Op 3: Empty folder reaches the end of the conveyor and is placed on the waiting cart. (Optional: folder is marked to indicate it has been digitized.)

15.    Cnvyr Op 3: As each specimen or piece of supporting material reaches the end of the conveyor, it is placed back in its folder on the waiting cart (optional: the black background sheet is placed in a basket for reuse in step 7).

16.    Cnvyr Op 3: Once the folder is filled, it is closed and set aside on a cart for return to post-digitization staging.

17.    Cnvyr Op 3: Once the cart is filled with folders, it is moved to the custom Viking rolling carts from step 1 and unloaded.

18.    SI Handler: Once the custom Viking rolling carts are filled, at the end of the day or at regular intervals, they are moved to permanent storage, where the folders are unloaded onto their shelves.

19.    Cnvyr Op 3: Once the day’s work is done, confirms that the text file of folder taxonomic/IRN and specimen barcodes has been generated and sent to TBD.
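
The barcode handling in the steps above is stateful: a folder barcode, once scanned, applies to every specimen that follows until the next folder is scanned. The Python sketch below illustrates that logic under stated assumptions. The class name, the tab-separated log format, the returned dictionary keys, and the error raised when a specimen arrives before any folder are all hypothetical; the source does not describe the conveyor system's actual software.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class BarcodeSession:
    """Tracks the current folder barcode and logs folder/specimen pairs."""

    current_folder: Optional[str] = None
    log_lines: List[str] = field(default_factory=list)

    def scan(self, code: str) -> Optional[dict]:
        """Process one scanned barcode; return image metadata for specimens."""
        if not code:
            raise ValueError("empty barcode")
        if not code.startswith("0"):
            # Folder barcodes never start with 0: remember it for the
            # specimens that follow.
            self.current_folder = code
            return None
        if self.current_folder is None:
            # Assumed handling: a specimen with no preceding folder is an error.
            raise RuntimeError("specimen scanned before any folder barcode")
        # Assumed log format: folder and specimen barcodes, tab-separated.
        self.log_lines.append(f"{self.current_folder}\t{code}")
        return {
            "raw_filename": f"{code}.iiq",
            "iptc_title": code,
            "iptc_headline": self.current_folder,
        }
```

Feeding the session a folder barcode followed by two specimen barcodes yields two metadata records that share the same IPTC Headline, plus two lines in the barcode log.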


The following steps describe exceptions to the process above and their solutions:






Cnvyr Op 1 & SI Handler

Problem: Taxonomic/IRN barcode is not on folder

Solution: Folder is removed from processing into “hold” bin. NMNH staff is notified. New taxonomic/IRN barcode is printed and applied by NMNH staff. All folders from “hold” bin are processed at end of day.


Cnvyr Op 3

Problem: Imaging system fails to take an acceptable photo

Solution: Conveyor operators initiate the “back” function, which reverses the conveyor to the problem specimen. Bad files are erased, the specimen is recaptured and post-processed, and the process continues.


Workspace Floorplan*













Having a precise plan for moving the specimens through the imaging process saves precious time.  And when you have a colossal and significant collection like this one with such deep research value, there's not a moment to waste making it available to scientists' screens around the world! 

Next up in our Mass Digitization series, The Virtual Workflow--coming soon!