Standardizing attributions for display on scaife.perseus.org #2308
Description
I've created this issue to track updates to the underlying attribution data that we're now extracting / displaying on scaife.perseus.org
Overview
I've extracted the existing attributions (from respStmt elements) and exported them to a Google Spreadsheet, OGL - First1kGreek Attributions. I can grant access to the appropriate persons within OGL to perform bulk edits to the data.
Once the preferred edits have been made to the spreadsheet, I will use the spreadsheet to bulk update the underlying XML files with the new attribution information and open a pull request.
If this workflow works well, we can repeat it for other OGL repos (and ideally any other repos contributing texts to scaife.perseus.org).
Desired data model
Here are a few samples of what the updated respStmt elements will look like:
Thibault Clérice, Lead Developer (University of Leipzig) 2015 - 2017
to:
<respStmt>
<resp from="2015" to="2017">Lead Developer</resp>
<persName ref="https://orcid.org/0000-0003-1852-9204">Thibault Clérice</persName>
<orgName>University of Leipzig</orgName>
</respStmt>
Notes:
- We make use of from and to attrs to denote the timeframe of the resp.
- We set a person's ORCID in persName.ref
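As a rough illustration of this first sample, the element could be assembled with Python's standard `xml.etree.ElementTree`. This is only a sketch: the helper name `build_resp_stmt` and its parameters are hypothetical, and the TEI namespace that real files normally declare is left out for brevity.

```python
import xml.etree.ElementTree as ET


def build_resp_stmt(resp, person=None, org=None, date_attrs=None, orcid=None):
    """Build one respStmt holding a single resp (hypothetical helper)."""
    stmt = ET.Element("respStmt")
    # date_attrs carries the timeframe, e.g. {"from": "2015", "to": "2017"}
    resp_el = ET.SubElement(stmt, "resp", dict(date_attrs or {}))
    resp_el.text = resp
    if person:
        # The ORCID URL, when known, goes in persName's ref attribute
        attrs = {"ref": orcid} if orcid else {}
        ET.SubElement(stmt, "persName", attrs).text = person
    if org:
        ET.SubElement(stmt, "orgName").text = org
    return stmt


stmt = build_resp_stmt(
    "Lead Developer",
    person="Thibault Clérice",
    org="University of Leipzig",
    date_attrs={"from": "2015", "to": "2017"},
    orcid="https://orcid.org/0000-0003-1852-9204",
)
print(ET.tostring(stmt, encoding="unicode"))
```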
Simona Stoyanova, Project Manager (University of Leipzig), 2015, Project Assistant (University of Leipzig), 2013-2014
to:
<respStmt>
<resp when="2015">Project Manager</resp>
<persName>Simona Stoyanova</persName>
<orgName>University of Leipzig</orgName>
</respStmt>
<respStmt>
<resp from="2013" to="2014">Project Assistant</resp>
<persName>Simona Stoyanova</persName>
<orgName>University of Leipzig</orgName>
</respStmt>
Notes:
- We move from a single respStmt containing two resp elements to a 1:1 relationship between respStmt and resp
- when and from|to attrs denote the resp. timeframe
Gregory Crane, Leonard Muellner, Bruce Robertson, Published original versions of the electronic texts, Open Greek and Latin
From
| <respStmt> |
to:
<respStmt>
<resp>Published original versions of the electronic texts</resp>
<persName role="principal">Gregory Crane</persName>
<orgName ref="https://www.opengreekandlatin.org">Open Greek and Latin</orgName>
</respStmt>
<respStmt>
<resp>Published original versions of the electronic texts</resp>
<persName role="principal">Leonard Muellner</persName>
<orgName ref="https://www.opengreekandlatin.org">Open Greek and Latin</orgName>
</respStmt>
<respStmt>
<resp>Published original versions of the electronic texts</resp>
<persName role="principal">Bruce Robertson</persName>
<orgName ref="https://www.opengreekandlatin.org">Open Greek and Latin</orgName>
</respStmt>
Notes:
- We move from a single respStmt containing multiple persName elements to a 1:1 relationship between respStmt and persName.
- We also include orgName in each respStmt
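The 1:1 restructuring in the last two samples could be sketched as follows. The function name `normalize_resp_stmt` and the input shape are assumptions, as is the choice to pair every resp with every persName; a statement whose people hold different roles would need to be split by hand. The sketch does not add the role or ref attributes shown above, and it ignores the TEI namespace.

```python
import copy
import itertools
import xml.etree.ElementTree as ET


def normalize_resp_stmt(stmt):
    """Split a respStmt so each result has one resp and at most one persName."""
    resps = stmt.findall("resp")
    persons = stmt.findall("persName") or [None]  # allow org-only statements
    orgs = stmt.findall("orgName")
    result = []
    # Assumption: every resp applies to every person in the statement
    for resp, person in itertools.product(resps, persons):
        new_stmt = ET.Element("respStmt")
        new_stmt.append(copy.deepcopy(resp))
        if person is not None:
            new_stmt.append(copy.deepcopy(person))
        for org in orgs:  # orgName is repeated in each new respStmt
            new_stmt.append(copy.deepcopy(org))
        result.append(new_stmt)
    return result


# Hypothetical input: one respStmt listing three people
combined = ET.fromstring(
    "<respStmt>"
    "<persName>Gregory Crane</persName>"
    "<persName>Leonard Muellner</persName>"
    "<persName>Bruce Robertson</persName>"
    "<resp>Published original versions of the electronic texts</resp>"
    "<orgName>Open Greek and Latin</orgName>"
    "</respStmt>"
)
normalized = normalize_resp_stmt(combined)
for item in normalized:
    print(ET.tostring(item, encoding="unicode"))
```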
Implementation
Extraction process
Each row in the attributions-data worksheet corresponds to a set of URNs extracted from the underlying XML files.
There are "key" and "urn" fields which should not be modified and will be used to perform the bulk update.
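To make the row structure concrete, here is one way such an extraction might look. Everything specific in it is an assumption rather than the actual script: the function name `extract_rows`, the placeholder URNs and names, the running number used as "key", and the choice to group identical attributions so that each row carries a set of URNs. The TEI namespace is again omitted.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def texts(stmt, tag):
    """Collect the stripped text of every child element with the given tag."""
    return tuple((el.text or "").strip() for el in stmt.findall(tag))


def extract_rows(files):
    """Group identical attributions; `files` maps a URN to that file's XML text."""
    urns_by_attribution = defaultdict(list)
    for urn, xml_text in files.items():
        root = ET.fromstring(xml_text)
        for stmt in root.iter("respStmt"):
            attribution = (
                texts(stmt, "resp"),
                texts(stmt, "persName"),
                texts(stmt, "orgName"),
            )
            if urn not in urns_by_attribution[attribution]:
                urns_by_attribution[attribution].append(urn)
    rows = []
    # Assumption: "key" is a simple running number per distinct attribution
    for key, (attribution, urns) in enumerate(urns_by_attribution.items(), start=1):
        resp, pers_name, org_name = attribution
        rows.append({
            "key": key,
            "urn": ", ".join(urns),
            "resp": "; ".join(resp),
            "persName": "; ".join(pers_name),
            "orgName": "; ".join(org_name),
        })
    return rows


# Placeholder data, not taken from the repository
shared = (
    "<respStmt><resp>Proofreading</resp>"
    "<persName>Person A</persName><orgName>Org X</orgName></respStmt>"
)
extra = "<respStmt><resp>CTS conversion</resp><persName>Person B</persName></respStmt>"
sample_files = {
    "urn:cts:example:work1.ed1": "<TEI>" + shared + "</TEI>",
    "urn:cts:example:work2.ed1": "<TEI>" + shared + extra + "</TEI>",
}
rows = extract_rows(sample_files)
for row in rows:
    print(row)
```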
Editing attribution data in the spreadsheet
I went through and made an initial pass to clean up the data. This involved fixing small typos in organization names, normalizing names (Mt. Allison vs. Mount Allison, etc.) and restructuring data to fit the desired model (described above).
The unique-* worksheets show unique values for resp, orgName and persName.
Ideally, we can standardize on "Proofreading" vs "proofreader" vs "Proofreading and CTS conversion" as appropriate. If proofreading and CTS conversion are two distinct responsibilities for a given text, I would suggest:
- Add an additional row beneath "Proofreading and CTS conversion"
- Edit the original resp to Proofreading
- Set the resp in the new row to CTS conversion
- Copy the other relevant fields (resp, orgName and persName) to the new row
- Leave a comment on the row so I can ensure that the urn and key fields are also populated.
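The steps above are meant to be done by hand in the spreadsheet, but their effect on the data can be sketched in a few lines. The column names ("comment" in particular), the function name `split_combined_row`, and the placeholder values are all hypothetical.

```python
def split_combined_row(row, first_resp, second_resp):
    """Mirror the manual steps: keep the original row, add a second one below it."""
    updated = dict(row, resp=first_resp)  # original row, resp narrowed
    new_row = {
        # urn and key are left blank on purpose; the comment flags them
        "key": "",
        "urn": "",
        "resp": second_resp,
        "persName": row["persName"],
        "orgName": row["orgName"],
        "comment": "Split from the row above; urn and key need to be populated",
    }
    return [updated, new_row]


# Placeholder row, not taken from the spreadsheet
combined_row = {
    "key": "1",
    "urn": "urn:cts:example:work1.ed1",
    "resp": "Proofreading and CTS conversion",
    "persName": "Person A",
    "orgName": "Org X",
}
split_rows = split_combined_row(combined_row, "Proofreading", "CTS conversion")
for row in split_rows:
    print(row)
```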
There are also several instances where slight variants of a person's name are used, or where resp possibly contains data better suited for orgName.
We should not delete any rows; if there are duplicate rows in the spreadsheet, we'll use the urn and key fields to de-duplicate data.
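One possible reading of that de-duplication step is sketched below. It assumes that two rows count as duplicates only when urn, key and the edited attribution fields all match, so that rows deliberately split from one another survive; the actual rule may differ. The column names and the function name `deduplicate` are hypothetical.

```python
def deduplicate(rows):
    """Drop repeated rows while keeping the first occurrence and the row order."""
    seen = set()
    unique = []
    for row in rows:
        # Assumption: identity is urn + key plus the attribution values
        identity = (row["urn"], row["key"], row["resp"], row["persName"], row["orgName"])
        if identity in seen:
            continue
        seen.add(identity)
        unique.append(row)
    return unique


# Placeholder rows, not taken from the spreadsheet
base = {"urn": "urn:cts:example:work1.ed1", "key": "1", "persName": "Person A", "orgName": "Org X"}
sheet_rows = [
    dict(base, resp="Proofreading"),
    dict(base, resp="Proofreading"),    # exact duplicate
    dict(base, resp="CTS conversion"),  # same urn/key, different resp
]
deduped = deduplicate(sheet_rows)
print(len(deduped))
```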
Bulk update process
Once edits have been finalized in the spreadsheet, I'll use the urn and key fields to map the edits back to the desired data model (see above).
I will also perform a reordering of the desired "proofreading / conversion" role(s) so that they are weighted before any other roles.
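A minimal sketch of that reordering, assuming the priority roles can be recognized by keyword in the resp text. The keyword list and the function name `prioritize` are guesses for illustration. Because Python's `sorted` is stable, roles within each group keep their original relative order.

```python
# Assumed keywords identifying the "proofreading / conversion" roles
PRIORITY_KEYWORDS = ("proofreading", "cts conversion")


def prioritize(attributions):
    """Return attributions with proofreading/conversion roles listed first."""
    def rank(attribution):
        resp = attribution["resp"].lower()
        return 0 if any(keyword in resp for keyword in PRIORITY_KEYWORDS) else 1
    # sorted() is stable, so ties keep their existing order
    return sorted(attributions, key=rank)


roles = [
    {"resp": "Lead Developer"},
    {"resp": "Proofreading"},
    {"resp": "Project Manager"},
    {"resp": "CTS conversion"},
]
ordered = prioritize(roles)
print([item["resp"] for item in ordered])
```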
I'll open up a PR and link it back to this issue. The PR can be merged and then the updated attributions will be made available on scaife.perseus.org
Closing thoughts
I'm not sure if there is a "template" for future XML files, but I would also be happy to take the examples in Desired data model above and integrate them into that template.
As long as the XML files have respStmt with resp and one of persName or orgName, we can extract attributions for display on scaife.perseus.org.
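That minimum requirement could be checked with something like the sketch below. It reads "one of persName or orgName" as "at least one of the two", which is an interpretation, and the function name `is_extractable` is hypothetical. Real TEI files normally declare a namespace, so the tag names would need to be qualified there.

```python
import xml.etree.ElementTree as ET


def is_extractable(stmt):
    """True if the respStmt has a resp plus at least one persName or orgName."""
    has_resp = stmt.find("resp") is not None
    has_name = (
        stmt.find("persName") is not None
        or stmt.find("orgName") is not None
    )
    return has_resp and has_name


complete = ET.fromstring(
    "<respStmt><resp>Proofreading</resp><orgName>Open Greek and Latin</orgName></respStmt>"
)
missing_name = ET.fromstring("<respStmt><resp>Proofreading</resp></respStmt>")
print(is_extractable(complete), is_extractable(missing_name))
```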