Some Simple Guidelines for Effective Data Management

From AcaWiki

Citation: Elizabeth T. Borer, Eric W. Seabloom, Matthew B. Jones, Mark Schildhauer (2009/05/01) Some Simple Guidelines for Effective Data Management. Ecological Society of America Bulletin
DOI (original publisher): 10.1890/0012-9623-90.2.205
Tagged: data management


  1. Use a scripted program for analysis.
    • NB: use a workflow management system to run your scripts. This balances speed of computation with reproducibility.
    • "GUI-driven" analysis is harder to scrutinize and reproduce.
  2. Store data in non-proprietary software formats (e.g., comma delimited text file, .csv).
    • NB: CSV is not space-efficient for large amounts of numeric data. HDF5, still non-proprietary, may be more appropriate in that case.
  3. Store data in non-proprietary hardware formats.
    • NB: Archival websites are even better than physical storage.
  4. Store an uncorrected copy of the data; make corrections within a scripted language.
    • Make it a read-only file.
    • This way, you can revert mistakes in the analysis.
  5. Use descriptive names for your data files.
    • NB: Other authors recommend not parsing filenames to get metadata. Often there is too much metadata to fit in a filename, and that metadata is not necessarily unique.
    • No spaces in filenames.
  6. Include a “header” line that describes the variables as the first line in the table.
    • NB: The first row becomes the column name in pandas, so if you use a short descriptive name there, put a one-sentence description of each column in the second row.
  7. Use plain ASCII text for your file names, variable names, and data values.
    • NB: I consider this a little out-of-date: UTF-8 is the new de facto standard, and it is backwards compatible with ASCII, so UTF-8-unaware programs will render most of the text properly and produce mojibake only for special characters.
  8. When you add data to a database, try not to add columns; rather, design your tables so that you add only rows.
  9. All cells within each column should contain only one type of information (e.g., text or numeric, not a mix).
  10. Record a single piece of data (unique measurement) only once.
  11. Record full information about taxonomic names.
  12. Record full dates, using standardized formats.
  13. Always maintain effective metadata.
    • NB: This metadata should be machine-readable, e.g. in YAML.
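
Several of these guidelines (4, 5, 6, 12, and 13) can be combined in one short script. The following is a minimal sketch, not the authors' method: the file names, column names, sample values, and the correction rule are all hypothetical, and JSON stands in for YAML only because it ships with the Python standard library.

```python
import csv
import json
import stat
import tempfile
from datetime import date
from pathlib import Path

# Hypothetical column descriptions (guideline 13: machine-readable metadata).
COLUMNS = {
    "sample_date": "Date the plot was sampled, ISO 8601 (YYYY-MM-DD)",
    "species": "Full binomial name of the sampled species",
    "biomass_g": "Aboveground dry biomass in grams",
}


def write_raw_csv(path: Path, rows: list) -> None:
    """Write the uncorrected data once, then mark the file read-only."""
    with path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=list(COLUMNS))
        writer.writeheader()  # guideline 6: first line names the variables
        writer.writerows(rows)
    # Guideline 4: clear all write bits so the raw file is not edited by accident.
    path.chmod(stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH)


def load_corrected(path: Path) -> list:
    """Read the raw file and apply corrections in code; the file is untouched."""
    with path.open(newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))
    for row in rows:
        # Guideline 12: full, standardized dates parse unambiguously.
        row["sample_date"] = date.fromisoformat(row["sample_date"])
        row["biomass_g"] = float(row["biomass_g"])
        # Hypothetical correction: treat a negative mass as a sign entry error.
        if row["biomass_g"] < 0:
            row["biomass_g"] = abs(row["biomass_g"])
    return rows


def run_example():
    """Run the whole sketch in a temporary directory and return the results."""
    raw_rows = [
        {"sample_date": "2008-06-14", "species": "Bromus hordeaceus", "biomass_g": "12.5"},
        {"sample_date": "2008-06-15", "species": "Bromus hordeaceus", "biomass_g": "-3.4"},
    ]
    with tempfile.TemporaryDirectory() as tmp:
        # Guideline 5: descriptive file name without spaces.
        data_path = Path(tmp) / "plot_biomass_2008.csv"
        meta_path = Path(tmp) / "plot_biomass_2008.metadata.json"
        write_raw_csv(data_path, raw_rows)
        meta_path.write_text(
            json.dumps({"data_file": data_path.name, "columns": COLUMNS}, indent=2),
            encoding="utf-8",
        )
        corrected = load_corrected(data_path)
        raw_text = data_path.read_text(encoding="utf-8")
        metadata = json.loads(meta_path.read_text(encoding="utf-8"))
        read_only = not (data_path.stat().st_mode & stat.S_IWUSR)
    return corrected, raw_text, metadata, read_only
```

Because the correction lives in `load_corrected` rather than in the file, a mistaken correction can be reverted by editing the script and rerunning it; the raw values on disk never change.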