164 changes: 164 additions & 0 deletions chapters/02_primary_practices.md
@@ -229,6 +229,170 @@ toy_data/ Very (very!) small pieces of data for dev testing
If a directory contains many files or subdirectories, consider whether it's
clearer to write a separate manifest specifically for that directory.

### Document the Data

**Metadata**, or data that describes data, is critical to the research process.
It delineates how the data were collected, what assumptions were made, what
biases might be present, any ethical concerns, the overall structure, what each
observation means, what each feature means, and more. Good data
documentation guides researchers towards appropriate, responsible use of the
data for both the current and future studies.

Good metadata should answer the who, what, when, where, why, and how questions
about your data. How your metadata answers these questions will
depend on conventions in your field of research. If you are submitting your data
to a particular data repository, it will likely have a required metadata scheme
that you will need to follow. Otherwise, pick a metadata scheme that aligns with
other researchers in your field. If you have to submit a Data Management Plan,
it will specifically ask how you will apply and adhere to field-specific data
standards. For more information about Data Management Plans, see the UC Davis
Library's [Data Management Research Guide][lib-dmp] and the California
Digital Library's [DMPTool][].

If you aren't sure what the standard is in your field, there are several online
repositories to help you out. The [Metadata Standards Catalog][msc] has a fairly
exhaustive list of metadata schemes, which you can browse [by
subject][msc-subject]. [Fairsharing.org][fairshare] also stores metadata and
other documentation standards. By using an existing community standard metadata
scheme, you make it possible for future researchers (including you!) to compare
your data to data from other, heterogeneous, sources.

```{note}
Many metadata resources include a **[controlled
vocabulary][c-vocab]**. This is a list of specific values, each with a
predefined meaning. It is designed to provide consistency and uniqueness across
data sources. One common example of a controlled vocabulary is a list of
geographic names, like the [Thesaurus of Geographic Names (TGN)][tgn]. There are
many ways you can refer to New York City (NYC, the Big Apple, Manhattan, etc.).
But if you want to be able to group together all data about New York City, it is
helpful if everyone calls it the same thing.
```
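
As a minimal sketch of how a controlled vocabulary can be applied, the Python
snippet below maps the variant names from the note above onto one preferred
term. The `CONTROLLED_VOCAB` dictionary and the `build_lookup` and `normalize`
helpers are hypothetical names made up for this illustration; a real project
would more likely load its terms from a published vocabulary than hard-code
them.

```python
# Hypothetical controlled vocabulary: each preferred term maps to the
# variant spellings that should be replaced by it.
CONTROLLED_VOCAB = {
    "New York City": {"NYC", "the Big Apple", "Manhattan"},
}


def build_lookup(vocab):
    """Build a case-insensitive lookup from every variant to its preferred term."""
    lookup = {}
    for preferred, variants in vocab.items():
        lookup[preferred.casefold()] = preferred
        for variant in variants:
            lookup[variant.casefold()] = preferred
    return lookup


def normalize(value, lookup):
    """Return the preferred term, or None if the value is not in the vocabulary."""
    return lookup.get(value.strip().casefold())


LOOKUP = build_lookup(CONTROLLED_VOCAB)
raw_values = ["NYC", "the big apple", " New York City ", "Gotham"]
normalized = [normalize(value, LOOKUP) for value in raw_values]
```

Values that are not in the vocabulary come back as `None`, so they can be
flagged for review instead of being silently kept.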

Even if you don't know where your data will end up, documenting your data when
you collect it will help ensure your documentation doesn't have any gaps. Timely
documentation also maximizes the likelihood that your research can be
reproduced and reused by other researchers, increasing your overall
research impact. If your project uses data collected earlier or by someone else,
it's a good practice to fill gaps in the existing documentation with your own.
Thorough documentation isn't just beneficial to other researchers; it's also
beneficial to future you. Small details you notice and document can be important
later in the project.

```{figure} /images/michener_information_entropy.png
---
name: information-entropy
figwidth: 550px
align: center
alt:
---
Information Entropy (Figure 1) from [Michener et al. 1997][michener]. ©
1997 by the Ecological Society of America.
```

:::{note}
One of the simplest and most widely used metadata standards is the [Dublin
Core][dublin-core], a set of 15 metadata elements that originated at a 1995
workshop in Dublin, Ohio. The exact definitions of the Dublin Core elements can
be a bit technical, but the University College Dublin (Ireland) Library provides
simplified explanations and examples [here][dublin-ex].
:::
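
To make the Dublin Core concrete, here is a rough Python sketch that stores a
metadata record as a dictionary and checks it against the 15 element names. The
record contents and the `check_record` helper are hypothetical and only meant
as an illustration. All Dublin Core elements are optional, so the "unfilled"
list is a reminder, not an error report.

```python
# The 15 elements of the Dublin Core Metadata Element Set.
DUBLIN_CORE_ELEMENTS = frozenset({
    "contributor", "coverage", "creator", "date", "description",
    "format", "identifier", "language", "publisher", "relation",
    "rights", "source", "subject", "title", "type",
})


def check_record(record):
    """Return (unknown keys, unfilled elements) for a metadata record."""
    unknown = sorted(set(record) - DUBLIN_CORE_ELEMENTS)
    unfilled = sorted(e for e in DUBLIN_CORE_ELEMENTS if not record.get(e))
    return unknown, unfilled


# Hypothetical record for an imaginary data set.
record = {
    "title": "Example survey responses",
    "creator": "Example Research Group",
    "date": "2024-01-15",
    "format": "text/csv",
    "language": "en",
    "rights": "CC BY 4.0",
    "author": "Example Research Group",  # not a Dublin Core element
}

unknown, unfilled = check_record(record)
```

Here `unknown` contains `author`, which hints that the value belongs under
`creator` or `contributor` instead.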

If all of this seems overwhelming, that's okay. The Consortium of European
Social Science Data Archives (CESSDA) has a great introductory
[video][cessda-video] for those who have never documented data before. CESSDA
also provides detailed explanations of what information to document at both
the project and data levels in their [Data Management Expert Guide][cessda-guide].
This includes detailed information about documenting quantitative and
qualitative data. When reading the guide, make sure to expand all the collapsed
sections.


:::{seealso}
There are many resources available on documenting your data. Here is a
selection:
- [Metadata Standards Catalog][msc]
- [Fairsharing.org][fairshare]
- [README, Write Me! DataLab workshop reader][datalab-readme]
- [UC Davis Research Data Management LibGuide][lib-metadata]
- [CESSDA's Data Management Expert Guide][cessda-guide]
- [The Turing Way on Documentation and Metadata][turing-metadata]
- [MIT Metadata Info][mit-metadata]
- [Harvard Biomedical Documentation and Metadata][harvard-metadata]
- University College Dublin on [Metadata][dublin-ex] and [Documentation][ucd-doc]
:::


[lib-metadata]: https://guides.library.ucdavis.edu/data-management/documentation
[lib-dmp]: https://guides.library.ucdavis.edu/data-management/planning
[DMPTool]: https://dmptool.org/
[msc]: https://rdamsc.bath.ac.uk/
[msc-subject]: https://rdamsc.bath.ac.uk/subject-index
[fairshare]: https://fairsharing.org/
[michener]: https://esajournals.onlinelibrary.wiley.com/doi/10.1890/1051-0761%281997%29007%5B0330%3ANMFTES%5D2.0.CO%3B2
[mit-metadata]: https://libraries.mit.edu/data-management/store/documentation/
[dublin-core]: https://www.dublincore.org/specifications/dublin-core/dcmi-terms/#section-3
[dublin-ex]: https://libguides.ucd.ie/data/metadata
[cessda-video]: https://www.youtube.com/watch?v=cjGz-I0GgKk
[cessda-guide]: https://dmeg.cessda.eu/Data-Management-Expert-Guide/2.-Organise-Document/Documentation-and-metadata
[tgn]: https://www.getty.edu/research/tools/vocabularies/tgn/index.html
[turing-metadata]: https://book.the-turing-way.org/reproducible-research/rdm/rdm-metadata/
[harvard-metadata]: https://datamanagement.hms.harvard.edu/collect-analyze/documentation-metadata
[c-vocab]: https://rdf-vocabulary.ddialliance.org/
[ucd-doc]: https://libguides.ucd.ie/data/documentation

(create-data-dictionary)=
#### Create a Data Dictionary

A **data dictionary** is a document that explains what every field or element in
your dataset means, as well as any restrictions on its values. This includes
things like the data type (e.g., number, date, text, boolean) and whether the
field can be missing. The more information you include, the more helpful it will
be down the line (see [Captain Obvious][captain_o]). Data dictionaries are the
most efficient way to communicate the structure and content of your data to
other collaborators, including future you! A very basic one could look like
this:

|Field Name |Field Description |Data Type |Notes |
|-----------|------------------------------------------|------------|----------|
|person_id |autogenerated by database |integer | |
|name |legal full name (family name, given name) |string | |
|occupation |A person's job or vocation |string |Must come from the Bureau of Labor Statistics Occupation List |
|... |... |... |... |
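
A data dictionary can also be written in a machine-readable form, which lets
you check your data against it. The sketch below is one possible approach in
Python, not a standard tool: the `DATA_DICTIONARY` structure, the `required`
flag, the `validate_row` helper, and the sample rows are all hypothetical and
only loosely mirror the table above.

```python
# Hypothetical machine-readable version of the example data dictionary.
# The "required" flag is an assumption added for illustration.
DATA_DICTIONARY = {
    "person_id": {"type": int, "required": True},
    "name": {"type": str, "required": True},
    "occupation": {"type": str, "required": False},
}


def validate_row(row, dictionary):
    """Return a list of human-readable problems found in one row."""
    problems = []
    for field, rule in dictionary.items():
        value = row.get(field)
        if value is None:
            # Missing values are only a problem for required fields.
            if rule["required"]:
                problems.append(f"{field}: missing required value")
            continue
        if not isinstance(value, rule["type"]):
            problems.append(
                f"{field}: expected {rule['type'].__name__}, "
                f"got {type(value).__name__}"
            )
    # Fields that the data dictionary does not describe are also reported.
    for field in row:
        if field not in dictionary:
            problems.append(f"{field}: not described in the data dictionary")
    return problems


# Made-up rows: the second has a wrongly typed ID and an undocumented field.
rows = [
    {"person_id": 1, "name": "Doe, Jane", "occupation": "Statistician"},
    {"person_id": "2", "name": "Roe, Richard", "age": 41},
]
report = [validate_row(row, DATA_DICTIONARY) for row in rows]
```

A check like this does not cover restrictions such as "must come from the
Bureau of Labor Statistics Occupation List", but it could be extended with a
set of allowed values per field.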



If you aren't sure where to start with creating a data dictionary, DataLab has a
[template][datalab_dd_template] you can use as a jumping-off point. [Open
Science Framework][osf_dd] has resources on what details to add to your data
dictionary, and the [USGS][usgs_dd] provides many examples of data dictionaries
and how they are used in different contexts. If you are working with multiple
data sets, make sure to clarify which data dictionary to use with each data set.

If your dataset looks less like a series of rows and columns, and more like a
long list of files, consider creating a **data inventory** instead. A data
inventory should include the author or source, title, publication year (if
published), and file name for each file, but can include more file metadata as
necessary. A data inventory for a public-domain fiction data set might look
something like this:

|Author |Title |Year |Filename |
|--------------------|--------------------|-----|------------------------------------------|
|Bronte,Charlotte |JaneEyre |1847 |EN_1847_BronteCharlotte_JaneEyre.txt |
|Austen,Jane |SenseandSensibility |1811 |EN_1811_AustenJane_SenseandSensibility.txt|
|Wollstonecraft,Mary |Maria |1798 |EN_1798_WollstonecraftMary_Maria.txt |
|... |... |... |... |
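
If your file names follow a consistent pattern, you can generate a first draft
of the inventory automatically. The Python sketch below assumes the pattern
suggested by the example file names above (`LANGUAGE_YEAR_Author_Title.txt`)
and assumes each author name is made of simple capitalized parts. The
`inventory_row` helper is a hypothetical name, and names that don't fit these
assumptions would need to be fixed by hand.

```python
import csv
import io
import re
from pathlib import PurePath

# Assumed file name pattern: LANGUAGE_YEAR_Author_Title.txt
FILENAME_PATTERN = re.compile(
    r"^(?P<lang>[A-Z]{2})_(?P<year>\d{4})_(?P<author>[^_]+)_(?P<title>[^_]+)\.txt$"
)


def inventory_row(filename):
    """Parse one file name into an inventory row, or None if it doesn't match."""
    name = PurePath(filename).name
    match = FILENAME_PATTERN.match(name)
    if match is None:
        return None
    # Split "BronteCharlotte" into "Bronte,Charlotte" at each capital letter.
    author = ",".join(re.findall(r"[A-Z][a-z]+", match["author"]))
    return {
        "Author": author,
        "Title": match["title"],
        "Year": int(match["year"]),
        "Filename": name,
    }


filenames = [
    "EN_1847_BronteCharlotte_JaneEyre.txt",
    "EN_1811_AustenJane_SenseandSensibility.txt",
    "EN_1798_WollstonecraftMary_Maria.txt",
    "notes.md",  # does not match the pattern, so it is skipped
]
rows = [row for row in map(inventory_row, filenames) if row is not None]

# Write the inventory as CSV text; swap the buffer for a file to save it.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["Author", "Title", "Year", "Filename"])
writer.writeheader()
writer.writerows(rows)
inventory_csv = buffer.getvalue()
```

Treat the result as a draft: details such as the author or source of each file
should still be checked against the files themselves.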


If you also need to keep track of things like the provenance or license
associated with each file or data set, DataLab's
[data inventory template][datalab_di_template] provides a pretty comprehensive
starting point.

[osf_dd]: https://help.osf.io/article/217-how-to-make-a-data-dictionary
[usgs_dd]: https://www.usgs.gov/data-management/data-dictionaries
[captain_o]: https://dataedo.com/blog/captain-obivous-guide-to-column-descriptions-data-dictionary-best-practices
[datalab_dd_template]: https://docs.google.com/spreadsheets/d/12N0hKyeT0ndZnt7rVZsz7LTW--BHhbb6TOegXEKQoxE/edit?usp=sharing
[datalab_di_template]: https://docs.google.com/spreadsheets/d/1nUb-eu82Q7VplDpk0np5rYuaN52mYHLdql18pRD0i4Y/edit?usp=sharing


(workflows)=
#### Workflows
25 changes: 0 additions & 25 deletions chapters/03_secondary_practices.md
@@ -12,31 +12,6 @@ relevant, and we recommend you do too.
Documentation
-------------

### Document the Data

In a perfect world, every data set would come with detailed documentation or
**metadata** about how the data were collected, what assumptions were made,
what biases might be present, any ethical concerns, the overall structure, what
each observation means, what each feature means, and more. Good data
documentation guides researchers towards appropriate, responsible use of the
data.

Collecting data as part of a project gives you and your collaborators control
over how the data are documented, so you can ensure there are no gaps. If your
project uses data collected earlier or by someone else, it's a good practice to
fill gaps in the existing documentation with your own. Thorough documentation
isn't just beneficial to other researchers, it's also beneficial to future
you---small details you notice and document about features could be important
later in the project.

:::{seealso}
See DataLab's [README, Write Me! workshop reader][datalab-readme] for more
about how to document data.
:::

[datalab-readme]: https://ucdavisdatalab.github.io/workshop_how-to-data-documentation/


(document-every-experiment)=
### Document Every Experiment

Binary file added images/michener_information_entropy.png