Some Simple Guidelines for Effective Data Management
From AcaWiki
Citation: Elizabeth T. Borer, Eric W. Seabloom, Matthew B. Jones, Mark Schildhauer (2009/05/01) Some Simple Guidelines for Effective Data Management. Ecological Society of America Bulletin
DOI (original publisher): 10.1890/0012-9623-90.2.205
Download: https://esajournals.onlinelibrary.wiley.com/doi/full/10.1890/0012-9623-90.2.205
Tagged: data management
Summary
- Use a scripted program for analysis.
- NB: Use a workflow management system to run your scripts. This balances speed of computation with reproducibility.
- "GUI-driven" analysis is harder to scrutinize and reproduce.
- Store data in non-proprietary software formats (e.g., a comma-delimited text file, .csv).
- NB: CSV is not space-efficient for large amounts of numeric data. HDF5, still non-proprietary, may be more appropriate in that case.
- Store data in non-proprietary hardware formats.
- NB: Archival websites are even better than physical storage.
- Store the uncorrected data; make corrections in a scripted language.
- Make it a read-only file.
- This way, you can revert mistakes in the analysis.
- Use descriptive names for your data files.
- NB: Other authors recommend not parsing filenames to get metadata. Often there is too much metadata to fit in a filename, and that metadata is not necessarily unique.
- No spaces in filenames.
- Include a “header” line that describes the variables as the first line in the table.
- NB: The first row becomes the "name" of the column in Pandas, so if you use a descriptive short-name there, write a sentence describing each column in the second row.
- Use plain ASCII text for your file names, variable names, and data values.
- NB: I consider this a little out of date: UTF-8 is the new de facto standard and is backwards compatible with ASCII, so UTF-8-unaware programs will render most of the text properly and show mojibake only for non-ASCII characters.
- When you add data to a database, try not to add columns; rather, design your tables so that you add only rows.
- NB: I believe this is more commonly known as First Normal-Form.
- All cells within each column should contain only one type of information (e.g., text or numeric, not a mix).
- Record a single piece of data (unique measurement) only once.
- NB: This is Third Normal-Form.
- Record full information about taxonomic names.
- Record full dates, using standardized formats.
- Always maintain effective metadata.
- NB: This metadata should be machine-readable, e.g. in YAML.
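
Several of the guidelines above (non-proprietary CSV, a header line, a read-only uncorrected file, corrections made in a script, full dates in a standard format) can be combined in a short script. The following is only a minimal sketch using the Python standard library, not the paper's own method. The file name, column names, helper functions (write_raw_csv, load_corrected), and the two data rows are all made up for illustration.

```python
import csv
import os
import stat
import tempfile
from datetime import date
from pathlib import Path


def write_raw_csv(path, header, rows):
    """Write rows to a UTF-8 CSV file with a header line, then mark it read-only."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(header)  # first line describes the variables
        writer.writerows(rows)
    # Drop write permission so the uncorrected file is not edited by accident.
    os.chmod(path, stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH)


def load_corrected(path, corrections):
    """Read the raw file and apply corrections in memory; the file itself is untouched."""
    with open(path, newline="", encoding="utf-8") as f:
        records = list(csv.DictReader(f))
    for rec in records:
        key = (rec["plot_id"], rec["sample_date"])
        if key in corrections:
            rec.update(corrections[key])
        # Parse dates strictly as YYYY-MM-DD so malformed dates fail loudly.
        rec["sample_date"] = date.fromisoformat(rec["sample_date"])
        rec["biomass_g"] = float(rec["biomass_g"])
    return records


# Hypothetical example data: descriptive file name, no spaces.
workdir = Path(tempfile.mkdtemp())
raw_path = workdir / "grassland_biomass_2008_raw.csv"
write_raw_csv(
    raw_path,
    ["plot_id", "sample_date", "species", "biomass_g"],
    [
        ["P01", "2008-06-14", "Bromus hordeaceus", "12.4"],
        ["P02", "2008-06-14", "Bromus hordeaceus", "124.0"],  # suspected entry error
    ],
)

# Corrections live in the script, keyed by plot and date, not in the raw file.
corrections = {("P02", "2008-06-14"): {"biomass_g": "12.4"}}
records = load_corrected(raw_path, corrections)
print(records[1]["biomass_g"])
```

Because the corrections are a table in the script, deleting or changing an entry and rerunning is enough to revert a mistaken fix, while the raw file stays exactly as recorded.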