Standardizing attributions for display on scaife.perseus.org #2308

@jacobwegner

I've created this issue to track updates to the underlying attribution data that we're now extracting and displaying on scaife.perseus.org.

Overview

I've extracted the existing attributions (from respStmt elements) and exported them to a Google Spreadsheet, OGL - First1kGreek Attributions. I can grant access to the appropriate persons within OGL to perform bulk edits to the data.

Once the preferred edits have been made to the spreadsheet, I will use the spreadsheet to bulk update the underlying XML files with the new attribution information and open a pull request.

If this workflow works well, we can repeat it for other OGL repos (and ideally any other repos contributing texts to scaife.perseus.org).

Desired data model

Here are a few samples of what the updated respStmt elements will look like:

Thibault Clérice, Lead Developer (University of Leipzig) 2015 - 2017

From https://github.com/OpenGreekAndLatin/First1KGreek/blob/master/data/tlg0062/tlg001/tlg0062.tlg001.1st1K-grc1.xml#L28

to:

<respStmt>
  <resp from="2015" to="2017">Lead Developer</resp>
  <persName ref="https://orcid.org/0000-0003-1852-9204">Thibault Clérice</persName>
  <orgName>University of Leipzig</orgName>
</respStmt>

Notes:

  • We make use of from and to attrs to denote the timeframe of the resp.
  • We set a person's ORCID in persName.ref

Simona Stoyanova, Project Manager (University of Leipzig), 2015, Project Assistant (University of Leipzig), 2013-2014

From https://github.com/OpenGreekAndLatin/First1KGreek/blob/master/data/stoa0146d/stoa001/stoa0146d.stoa001.opp-grc1.xml#L47

to:

<respStmt>
  <resp when="2015">Project Manager</resp>
  <persName>Simona Stoyanova</persName>
  <orgName>University of Leipzig</orgName>
</respStmt>
<respStmt>
  <resp from="2013" to="2014">Project Assistant</resp>
  <persName>Simona Stoyanova</persName>
  <orgName>University of Leipzig</orgName>
</respStmt>

Notes:

  • We move from a single respStmt containing two resp elements to a 1:1 relationship between respStmt and resp
  • We use the when attr for a single year and the from and to attrs for a range to denote the timeframe of the resp
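
As an illustration of how the when and from|to attrs could be turned back into a display string, here is a minimal Python sketch. The function name format_timeframe and the "2015 - 2017" output style (borrowed from the first sample above) are assumptions, not part of the actual scaife.perseus.org code.

```python
import xml.etree.ElementTree as ET


def format_timeframe(resp):
    """Return a display string for a resp element's timeframe, or '' if none."""
    when = resp.get("when")
    if when:
        # A single point in time, e.g. when="2015".
        return when
    start, end = resp.get("from"), resp.get("to")
    if start and end:
        # A range, e.g. from="2013" to="2014".
        return f"{start} - {end}"
    # Fall back to whichever bound is present, or an empty string.
    return start or end or ""


resp_when = ET.fromstring('<resp when="2015">Project Manager</resp>')
resp_range = ET.fromstring('<resp from="2013" to="2014">Project Assistant</resp>')
print(format_timeframe(resp_when))
print(format_timeframe(resp_range))
```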

Gregory Crane, Leonard Muellner, Bruce Robertson, Published original versions of the electronic texts, Open Greek and Latin

From

to:

<respStmt>
  <resp>Published original versions of the electronic texts</resp>
  <persName role="principal">Gregory Crane</persName>
  <orgName ref="https://www.opengreekandlatin.org">Open Greek and Latin</orgName>
</respStmt>
<respStmt>
  <resp>Published original versions of the electronic texts</resp>
  <persName role="principal">Leonard Muellner</persName>
  <orgName ref="https://www.opengreekandlatin.org">Open Greek and Latin</orgName>
</respStmt>
<respStmt>
  <resp>Published original versions of the electronic texts</resp>
  <persName role="principal">Bruce Robertson</persName>
  <orgName ref="https://www.opengreekandlatin.org">Open Greek and Latin</orgName>
</respStmt>

Notes:

  • We move from a single respStmt containing multiple persName elements to a 1:1 relationship between respStmt and persName.
  • We also include orgName in each respStmt
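
The 1:1 split described in these notes could be sketched roughly as follows. This is only an illustration: the combined input element is a hypothetical reconstruction (the original source snippet is not shown in this issue), the function name split_resp_stmt is made up, and namespaces are ignored for brevity.

```python
import copy
import xml.etree.ElementTree as ET


def split_resp_stmt(stmt):
    """Return one new respStmt per persName, each repeating resp and orgName."""
    resps = stmt.findall("resp")
    orgs = stmt.findall("orgName")
    result = []
    for person in stmt.findall("persName"):
        new_stmt = ET.Element("respStmt")
        # Child order follows the samples above: resp, persName, orgName.
        for child in resps + [person] + orgs:
            new_stmt.append(copy.deepcopy(child))
        result.append(new_stmt)
    return result


# Hypothetical combined input; the real source element may look different.
source = ET.fromstring(
    "<respStmt>"
    "<resp>Published original versions of the electronic texts</resp>"
    "<persName>Gregory Crane</persName>"
    "<persName>Leonard Muellner</persName>"
    "<persName>Bruce Robertson</persName>"
    "<orgName>Open Greek and Latin</orgName>"
    "</respStmt>"
)
for stmt in split_resp_stmt(source):
    print(ET.tostring(stmt, encoding="unicode"))
```

A respStmt without any persName yields an empty list here, so a real script would need to decide how to handle organization-only statements.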

Implementation

Extraction process

Each row in the attributions-data worksheet corresponds to a set of URNs extracted from the underlying XML files.

There are "key" and "urn" fields which should not be modified and will be used to perform the bulk update.

Editing attribution data in the spreadsheet

I went through and made an initial pass to clean up the data. This involved fixing small typos in organization names, normalizing names (Mt. Allison vs Mount Allison, etc.) and restructuring data to fit the desired model (described under Desired data model above).

The unique-* worksheets show the unique values for resp, orgName and persName.

Ideally, we can standardize on "Proofreading" vs "proofreader" vs "Proofreading and CTS conversion" as appropriate. If proofreading and CTS conversion are two distinct responsibilities for a given text, I would suggest:

  1. Add an additional row beneath "Proofreading and CTS conversion"

  2. Edit the original resp to Proofreading

  3. Set the resp in the new row to CTS conversion

  4. Copy the other relevant fields (resp, orgName and persName) to the new row

  5. Leave a comment on the row so I can ensure that the urn and key fields are also populated.
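
For anyone who prefers to see the steps above as data, here is a rough Python sketch of the same split applied to one row represented as a dict. The function name split_combined_resp and the sample key and urn values are hypothetical. Copying key and urn automatically is also an assumption, since in the spreadsheet those fields are filled in after a comment is left on the row.

```python
def split_combined_resp(row, first="Proofreading", second="CTS conversion"):
    """Split one combined-responsibility row into two rows with distinct resp values."""
    # The original row keeps every field but gets the first resp.
    original = dict(row, resp=first)
    # The new row copies the other fields and gets the second resp.
    added = dict(row, resp=second)
    return [original, added]


# Placeholder values for illustration only.
row = {
    "key": "k1",
    "urn": "urn:example:text-1",
    "resp": "Proofreading and CTS conversion",
    "persName": "Example Person",
    "orgName": "Example University",
}
for new_row in split_combined_resp(row):
    print(new_row["resp"], "|", new_row["persName"], "|", new_row["orgName"])
```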

There are also several instances where slight variants of a person's name are used, or where resp contains data that may be better suited to orgName.

We should not delete any rows; if there are duplicate rows in the spreadsheet, we'll use the urn and key fields to de-duplicate data.
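
A minimal sketch of what that de-duplication might look like, assuming the worksheet has been exported to a list of dicts (for example via csv.DictReader). The function name dedupe_rows is hypothetical. Treating two rows as duplicates only when key, urn, resp, persName and orgName all match is also an assumption, chosen so that rows deliberately split into two responsibilities are not collapsed.

```python
def dedupe_rows(rows):
    """Drop repeated rows, keeping the first occurrence and the original order."""
    fields = ("key", "urn", "resp", "persName", "orgName")
    seen = set()
    unique = []
    for row in rows:
        # Normalize surrounding whitespace so " Proofreading" equals "Proofreading".
        identity = tuple((row.get(field) or "").strip() for field in fields)
        if identity not in seen:
            seen.add(identity)
            unique.append(row)
    return unique


# Placeholder values for illustration only.
rows = [
    {"key": "k1", "urn": "urn:example:text-1", "resp": "Proofreading",
     "persName": "Example Person", "orgName": "Example University"},
    {"key": "k1", "urn": "urn:example:text-1", "resp": "Proofreading ",
     "persName": "Example Person", "orgName": "Example University"},
    {"key": "k1", "urn": "urn:example:text-1", "resp": "CTS conversion",
     "persName": "Example Person", "orgName": "Example University"},
]
print(len(dedupe_rows(rows)))
```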

Bulk update process

Once edits have been finalized in the spreadsheet, I'll use the urn and key fields to map the edits back to the desired data model (see Desired data model above).
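
To make the mapping concrete, here is a rough sketch of building a respStmt element from one spreadsheet row. It is not the actual update script: the function name build_resp_stmt and the when, from, to and persName_ref columns are assumptions, and locating the target file from urn and key is left out.

```python
import xml.etree.ElementTree as ET


def build_resp_stmt(row):
    """Build a respStmt element in the desired data model from one row dict."""
    stmt = ET.Element("respStmt")
    resp = ET.SubElement(stmt, "resp")
    resp.text = row["resp"]
    # Timeframe columns are assumed; only non-empty values become attributes.
    for attr in ("when", "from", "to"):
        if row.get(attr):
            resp.set(attr, row[attr])
    if row.get("persName"):
        person = ET.SubElement(stmt, "persName")
        person.text = row["persName"]
        if row.get("persName_ref"):  # hypothetical column holding an ORCID URL
            person.set("ref", row["persName_ref"])
    if row.get("orgName"):
        org = ET.SubElement(stmt, "orgName")
        org.text = row["orgName"]
    return stmt


row = {
    "resp": "Lead Developer",
    "from": "2015",
    "to": "2017",
    "persName": "Thibault Clérice",
    "persName_ref": "https://orcid.org/0000-0003-1852-9204",
    "orgName": "University of Leipzig",
}
print(ET.tostring(build_resp_stmt(row), encoding="unicode"))
```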

I will also reorder the "proofreading / conversion" role(s) so that they are listed before any other roles.
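
One possible way to do that reordering is a stable sort that moves matching roles to the front and leaves everything else in its existing order. The matching terms in PRIORITY_TERMS and the function name order_attributions are assumptions for illustration.

```python
# Assumed markers for the "proofreading / conversion" roles.
PRIORITY_TERMS = ("proofread", "conversion")


def order_attributions(attributions):
    """Return attributions with proofreading/conversion roles first, order otherwise kept."""
    def rank(attribution):
        resp = attribution["resp"].lower()
        return 0 if any(term in resp for term in PRIORITY_TERMS) else 1

    # sorted() is stable, so rows with the same rank keep their relative order.
    return sorted(attributions, key=rank)


attributions = [
    {"resp": "Lead Developer"},
    {"resp": "Proofreading"},
    {"resp": "Project Manager"},
    {"resp": "CTS conversion"},
]
print([a["resp"] for a in order_attributions(attributions)])
```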

I'll open up a PR and link it back to this issue. The PR can be merged and then the updated attributions will be made available on scaife.perseus.org

Closing thoughts

I'm not sure if there is a "template" for future XML files, but I would also be happy to take the examples in Desired data model above and integrate them into that template.

As long as the XML files have respStmt with resp and one of persName or orgName, we can extract attributions for display on scaife.perseus.org.
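
That rule can be sketched in a few lines of Python. This is only an illustration of the condition, not the actual extraction code: the function name extract_attributions and the output dict shape are assumptions, and the `{*}` wildcard (Python 3.8+) is used so the lookup works with or without the TEI namespace.

```python
import xml.etree.ElementTree as ET


def extract_attributions(xml_text):
    """Collect attributions from respStmt elements that have resp plus a name."""
    root = ET.fromstring(xml_text)
    attributions = []
    for stmt in root.findall(".//{*}respStmt"):
        resp = stmt.find("{*}resp")
        person = stmt.find("{*}persName")
        org = stmt.find("{*}orgName")
        # Skip statements without resp, or without both persName and orgName.
        if resp is None or (person is None and org is None):
            continue
        attributions.append({
            "resp": (resp.text or "").strip(),
            "persName": (person.text or "").strip() if person is not None else None,
            "orgName": (org.text or "").strip() if org is not None else None,
        })
    return attributions


sample = (
    '<titleStmt xmlns="http://www.tei-c.org/ns/1.0">'
    "<respStmt>"
    '<resp from="2015" to="2017">Lead Developer</resp>'
    "<persName>Thibault Clérice</persName>"
    "<orgName>University of Leipzig</orgName>"
    "</respStmt>"
    "<respStmt><resp>Proofreading</resp></respStmt>"
    "</titleStmt>"
)
for attribution in extract_attributions(sample):
    print(attribution)
```

The second respStmt in the sample has no persName or orgName, so it is skipped.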
