Data Processing Steps

John Wieczorek edited this page Feb 1, 2016 · 6 revisions

Create Occurrences file from source

The result is a UTF-8-encoded TSV file with all non-printing characters removed, an internally unique processingID for each record, and a header of clean field names (no whitespace) in sorted order.

  • Encode as UTF-8
  • Get header and dialect
  • Try to read all rows
  • Check whether the field count is wrong for any row
  • Add field "dummytest" as first field with values "dummytest"
  • Remove non-printing characters (see PurgeNonprintingCharacters.sh and PurgeNuls.sh)
  • Format as TSV, add processingID field, remove dummy field, merge header to get clean, sorted header
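The steps above can be sketched in Python. This is a minimal illustration under stated assumptions, not the scripts named above: the function names are hypothetical, the delimiter is passed in rather than detected, the processingID is assumed to be a random UUID, and the "dummytest" field is left out because its purpose is not stated here.

```python
import csv
import io
import uuid


def strip_nonprinting(text):
    """Drop every character that str.isprintable() rejects (controls, tabs, newlines)."""
    return "".join(ch for ch in text if ch.isprintable())


def clean_field_name(name):
    """Remove non-printing characters and all whitespace from a field name."""
    return "".join(strip_nonprinting(name).split())


def build_occurrences_tsv(source_text, delimiter=","):
    """Return TSV text with a cleaned, sorted header and a processingID column."""
    # Remove NULs before parsing; older versions of the csv module reject them.
    source_text = source_text.replace("\x00", "")
    rows = list(csv.reader(io.StringIO(source_text), delimiter=delimiter))
    if not rows:
        return ""
    header = [clean_field_name(name) for name in rows[0]]

    # Report rows (1-based, header is row 1) whose field count differs from the header.
    bad_rows = [n for n, row in enumerate(rows[1:], start=2) if len(row) != len(header)]
    if bad_rows:
        raise ValueError(f"wrong field count in rows: {bad_rows}")

    out_header = sorted(header + ["processingID"])
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=out_header, delimiter="\t", lineterminator="\n")
    writer.writeheader()
    for row in rows[1:]:
        record = {name: strip_nonprinting(value) for name, value in zip(header, row)}
        record["processingID"] = str(uuid.uuid4())  # assumption: any unique ID will do
        writer.writerow(record)
    return out.getvalue()
```

The returned text would then be written to disk with `open(path, "w", encoding="utf-8", newline="")` to get the UTF-8 encoding.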

Create Darwin Cloud Occurrence file from Occurrences file

The result is a version of the Occurrences file with superfluous fields removed. Remaining fields are renamed to match Darwin Cloud terms wherever possible, and any fields that do not match a Darwin Cloud term are processed into dwc:dynamicProperties.

Map the fields to Darwin Cloud terms

Create a mapping file consisting of one key:value pair per line, where each Occurrences file field name is a key and its value is either "omit" or a Darwin Cloud term name. (Darwin Cloud terms are commonly used fields that can be unambiguously processed into Darwin Core terms.) For example:

loanID:omit
catNum:catalogNumber
datum:geodeticDatum
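A mapping file in this format could be read into a dictionary along the following lines. This is a sketch only: `parse_mapping` is a hypothetical name, and it assumes each line is split at its first colon, so source field names must not contain colons.

```python
def parse_mapping(text):
    """Parse 'key:value' lines into a dict, skipping blank lines."""
    mapping = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Assumption: the first colon separates the field name from its target.
        key, sep, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if not sep or not key or not value:
            raise ValueError(f"expected key:value, got {line!r}")
        mapping[key] = value
    return mapping
```

For the three example lines above, `parse_mapping` would return a dictionary mapping `loanID` to `omit`, `catNum` to `catalogNumber`, and `datum` to `geodeticDatum`.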

Process the mapping

Use the mapping file to create the Darwin Cloud Occurrence file by doing the following for every record:

  • omit all fields mapped to "omit"
  • for each field mapped to a Darwin Cloud term, set the value of the mapped-to field to the value of the mapped-from field, unless the mapped-to field already exists for the record, in which case append the value (Note: this does not account for situations such as preparations, where the field names are parts (Skin, Skull) and the values are True or False).
  • for each key in the Occurrences file that is not in the mapping file:
    • if the key is a Darwin Cloud term, set the value of that Darwin Cloud term
    • if the key is not a Darwin Cloud term, append a dynamicProperty whose key is the given key
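The per-record rules above can be sketched as a single function. Several details here are assumptions rather than part of the process as described: the name `apply_mapping`, the tiny illustrative set of Darwin Cloud terms, the " | " separator used when appending, and the choice to serialize dynamicProperties as a JSON object.

```python
import json

# Assumption: an illustrative subset only, not the real Darwin Cloud term list.
DARWIN_CLOUD_TERMS = {"catalogNumber", "geodeticDatum", "country"}


def apply_mapping(record, mapping, terms=DARWIN_CLOUD_TERMS, separator=" | "):
    """Return a new record with fields omitted, renamed, or moved to dynamicProperties."""
    result = {}
    extras = {}
    for key, value in record.items():
        target = mapping.get(key)
        if target == "omit":
            continue  # field explicitly dropped
        if target is None:
            if key not in terms:
                extras[key] = value  # unmapped, non-term field
                continue
            target = key  # unmapped field that is already a Darwin Cloud term
        if target in result:
            # The target already holds a value for this record, so append.
            result[target] = result[target] + separator + value
        else:
            result[target] = value
    if extras:
        # Assumption: dynamicProperties is stored as a JSON object string.
        result["dynamicProperties"] = json.dumps(extras, sort_keys=True)
    return result
```

With the example mapping, a record holding both `catNum` and `catalogNumber` ends up with one `catalogNumber` value containing both, joined by the separator. The sketch does not merge with a pre-existing dynamicProperties value, and it does not address the preparations case noted above.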

Darwin Cloud terms

  • Occurrence
  • Event
  • Identification
  • Taxon
  • Location
    • Depth
    • Elevation
    • Georeference