88 changes: 88 additions & 0 deletions chapters/02_primary_practices.md
@@ -229,6 +229,94 @@ toy_data/ Very (very!) small pieces of data for dev testing
If a directory contains many files or subdirectories, consider whether it's
clearer to write a separate manifest specifically for that directory.

### Document the Data

In a perfect world, every data set would come with detailed documentation or
**metadata** about how the data were collected, what assumptions were made,
what biases might be present, any ethical concerns, the overall structure, what
each observation means, what each feature means, and more. Good data
documentation guides researchers towards appropriate, responsible use of the
data.

Collecting data as part of a project gives you and your collaborators control
over how the data are documented, so you can ensure there are no gaps. If your
project uses data collected earlier or by someone else, it's a good practice to
fill gaps in the existing documentation with your own. Thorough documentation
isn't just beneficial to other researchers; it's also beneficial to future
you. Small details you notice and document about features could be important
later in the project.

:::{seealso}
See DataLab's [README, Write Me! workshop reader][datalab-readme] for more
about how to document data.
:::

[datalab-readme]: https://ucdavisdatalab.github.io/workshop_how-to-data-documentation/

(create-data-dictionary)=
#### Create a Data Dictionary

A **data dictionary**, part of your metadata, is a document that explains what
every field or element in your dataset means, as well as any restrictions on
its values. This includes things like the data type (e.g., number, date, text,
boolean) and whether the field can be missing. The more information you
include, the more helpful it will be down the line (see [Captain
Obvious][captain_o]). A data dictionary is an efficient way to
communicate the structure and content of your data to collaborators,
including future you! A very basic one could look like this:

|Field Name |Field Description                         |Data Type |Notes |
|-----------|------------------------------------------|----------|------|
|person_id  |Autogenerated by database                 |integer   |      |
|name       |Legal full name (family name, given name) |string    |      |
|occupation |A person's job or vocation                |string    |Must come from the Bureau of Labor Statistics Occupation List |
|...        |...                                       |...       |...   |



If you aren't sure where to start with creating a data dictionary, DataLab has a
[template][datalab_dd_template] you can use as a jumping-off point. If you
prefer step-by-step instructions, Kristin Briney's [Create a Data Dictionary
exercise][create_dd] might be for you. [Open Science Framework (OSF)][osf_dd] has
resources on what details to add to your data dictionary, and the
[USGS][usgs_dd] provides many examples of data dictionaries and how they are
used in different contexts. If you are working with multiple data sets, make
sure to clarify which data dictionary to use with each data set.
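
A data dictionary can also be kept in a machine-readable form so that a script
can check a data set against it. The following is a minimal sketch in Python,
not a prescribed tool: the `DATA_DICTIONARY` structure, the `check_row`
function, and the choice of which fields are required are all assumptions made
for illustration, loosely based on the example table above.

```python
import csv
import io

# Hypothetical machine-readable version of the example data dictionary.
# Which fields may be missing is an assumption made for this sketch.
DATA_DICTIONARY = {
    "person_id": {"type": int, "required": True},
    "name": {"type": str, "required": True},
    "occupation": {"type": str, "required": False},
}


def check_row(row, dictionary):
    """Return a list of problems found in one row (a dict of strings)."""
    problems = []
    for field, rule in dictionary.items():
        value = row.get(field, "")
        if value is None or value == "":
            if rule["required"]:
                problems.append(f"{field}: missing required value")
            continue
        try:
            # Check that the text can be converted to the documented type.
            rule["type"](value)
        except ValueError:
            type_name = rule["type"].__name__
            problems.append(f"{field}: {value!r} is not a valid {type_name}")
    for field in row:
        if field not in dictionary:
            problems.append(f"{field}: not described in the data dictionary")
    return problems


# Made-up sample data; the second data row is deliberately invalid.
sample = io.StringIO(
    'person_id,name,occupation\n1,"Doe, Jane",Statistician\nabc,,\n'
)
for line_number, row in enumerate(csv.DictReader(sample), start=2):
    for problem in check_row(row, DATA_DICTIONARY):
        print(f"line {line_number}: {problem}")
```

A check like this only covers data types and missing values. Restrictions
recorded in the notes column, such as a controlled list of occupations, would
need their own rules.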

If your dataset looks less like a series of rows and columns, and more like a
long list of files, consider creating a **data inventory** instead. A data
inventory should include the author or source, title, publication year (if
published), and file name for each file, but can include more file metadata as
necessary. A data inventory for a public-domain fiction data set might look
something like this:

|Author               |Title                 |Year |Filename                                   |
|---------------------|----------------------|-----|-------------------------------------------|
|Bronte, Charlotte    |Jane Eyre             |1847 |EN_1847_BronteCharlotte_JaneEyre.txt       |
|Austen, Jane         |Sense and Sensibility |1811 |EN_1811_AustenJane_SenseandSensibility.txt |
|Wollstonecraft, Mary |Maria                 |1798 |EN_1798_WollstonecraftMary_Maria.txt       |
|...                  |...                   |...  |...                                        |


If you also need to keep track of things like the provenance or license
associated with each file or data set, DataLab's
[data inventory template][datalab_di_template] provides a comprehensive
starting point.
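
When file names follow a consistent pattern, a first draft of a data inventory
can be generated from the names themselves. The sketch below assumes the
pattern suggested by the example inventory above
(`LANGUAGE_YEAR_Author_Title.txt`); that pattern, and the `inventory_row` and
`build_inventory` helpers, are illustrative assumptions rather than an
established convention.

```python
from pathlib import Path


def inventory_row(filename):
    """Split a name like EN_1847_BronteCharlotte_JaneEyre.txt into fields.

    Assumes four underscore-separated parts: language, year, author, title.
    """
    path = Path(filename)
    parts = path.stem.split("_")
    if len(parts) != 4:
        raise ValueError(f"unexpected file name pattern: {path.name}")
    language, year, author, title = parts
    return {
        "language": language,
        "year": int(year),
        # Author and title are kept exactly as they appear in the file name.
        "author": author,
        "title": title,
        "filename": path.name,
    }


def build_inventory(directory):
    """Return one inventory row per .txt file in a directory, sorted by name."""
    return [inventory_row(p.name) for p in sorted(Path(directory).glob("*.txt"))]


row = inventory_row("EN_1847_BronteCharlotte_JaneEyre.txt")
print(row["author"], row["title"], row["year"])
```

The generated rows are only a starting point: readable author names and
titles, along with details such as provenance or license, still need to be
filled in by hand.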

[osf_dd]: https://help.osf.io/article/217-how-to-make-a-data-dictionary
[usgs_dd]: https://www.usgs.gov/data-management/data-dictionaries
[captain_o]: https://dataedo.com/blog/captain-obivous-guide-to-column-descriptions-data-dictionary-best-practices
[datalab_dd_template]: https://docs.google.com/spreadsheets/d/12N0hKyeT0ndZnt7rVZsz7LTW--BHhbb6TOegXEKQoxE/edit?usp=sharing
[datalab_di_template]: https://docs.google.com/spreadsheets/d/1nUb-eu82Q7VplDpk0np5rYuaN52mYHLdql18pRD0i4Y/edit?usp=sharing
[create_dd]: https://caltechlibrary.github.io/RDMworkbook/documentation.html#data-dictionary

:::{seealso}
See [OSF][osf_dd] and the [Research Data Management Workbook][create_dd] for how
to create a data dictionary, UC Davis DataLab for [data
dictionary][datalab_dd_template] and [data inventory][datalab_di_template]
templates, and [USGS][usgs_dd] for examples.
:::


(workflows)=
#### Workflows
25 changes: 0 additions & 25 deletions chapters/03_secondary_practices.md
@@ -12,31 +12,6 @@ relevant, and we recommend you do too.
Documentation
-------------

### Document the Data

In a perfect world, every data set would come with detailed documentation or
**metadata** about how the data were collected, what assumptions were made,
what biases might be present, any ethical concerns, the overall structure, what
each observation means, what each feature means, and more. Good data
documentation guides researchers towards appropriate, responsible use of the
data.

Collecting data as part of a project gives you and your collaborators control
over how the data are documented, so you can ensure there are no gaps. If your
project uses data collected earlier or by someone else, it's a good practice to
fill gaps in the existing documentation with your own. Thorough documentation
isn't just beneficial to other researchers; it's also beneficial to future
you. Small details you notice and document about features could be important
later in the project.

:::{seealso}
See DataLab's [README, Write Me! workshop reader][datalab-readme] for more
about how to document data.
:::

[datalab-readme]: https://ucdavisdatalab.github.io/workshop_how-to-data-documentation/


(document-every-experiment)=
### Document Every Experiment
