[GSoC Project Proposal]: NOAA trawl survey database #74

kellijohnson-NOAA · 2025-02-10T14:11:00Z

Project Description

The primary objective of this project is to create an accessible international database of transboundary marine survey data across the Northeast Pacific Ocean. Initial work for this project cleaned and joined haul-level data from several surveys operating along the west coast of North America, spanning two countries, into a single data frame (3.3 million observations of 55 species). Extending this work into a database, rather than a data frame, would (1) allow for joining more data, such as life-history information like the age of fish in the haul, and (2) allow for a larger number of species to be included (because of current file-size limits, we are providing data from only 55 of more than 1000 species). Most of the data are publicly available in independent regional databases, but no international database exists and the independent databases are not standardized, which significantly inhibits the use of the data for research and in assessments of the status of marine resources. This international database would help strengthen our understanding of climate-driven shifts in groundfish distribution in the North Pacific Ocean through data sharing between Fisheries and Oceans Canada (DFO) and NOAA Fisheries (NMFS). It also has the potential to improve the assessment and management of those species, serving as a proof-of-concept and foundation for a proposed North America-wide effort to join survey data.

Expected Outcomes

We expect at a minimum that the data already compiled into the joined data frame would be enhanced by being moved to a queryable database. Second, more data that require the relational structure of a database, e.g., age- and length-composition data, could be added to the database as time allows. Including such data will allow for the database to be a one-stop shop for survey data, which would drastically reduce the time needed to compile the data for use in both research and management-related tasks.

Skills Required

SQL, R

Additional Background/Issues

The R package and code to join some of the data exist on GitHub within the surveyjoin repository.
Potential steps for constructing the database: Convert to relational database DFO-NOAA-Pacific/surveyjoin#57

Mentor(s)

Eric Ward (@ericward-noaa), Sean Anderson (@seananderson), Kelli Johnson (@kellijohnson-NOAA), Derek Bolser (@dgbolser)

Mentor Contact Email(s)

[email protected]

Expected Project Size

175 hours

Project Difficulty

Intermediate

w-nityammm · 2025-02-23T18:56:58Z

I'm interested in working on this. Are there any preferred db systems? I'm thinking of using postgres, since this is obviously huge and going to be pretty query-heavy.

kellijohnson-NOAA · 2025-02-23T19:04:36Z

Thank you @w-nityammm for your interest. To the best of my knowledge we do not have a preferred db system, but we can ask around to see if something is preferred within NOAA and get back to you. @dgbolser might have more information when he returns to the office this week.

dgbolser-NOAA · 2025-02-24T15:21:57Z

Yes, thanks for your interest @w-nityammm! Postgres seems like a good option and I don't see any obvious incompatibilities with other databases we maintain. I am checking with the database manager here at HQ to see if there's a preferred system.

dgbolser-NOAA · 2025-02-24T17:02:15Z

@w-nityammm Oracle has been our default, but I confirmed that there won't be issues if we go with postgres. Our folks see the advantages and support going in that direction if it is best for the project.

w-nityammm · 2025-02-24T17:48:44Z

Alright sounds good. Will start looking into it :) . Thank you for the response! @dgbolser

7yl4r · 2025-02-24T22:34:29Z

Is it possible to use OBIS as the unified database? Data would feed into OBIS and then be queried back out, similar to using SQL. ROBIS is an R library for fetching OBIS data.

ericward-noaa · 2025-02-24T22:42:17Z

Interesting thought @7yl4r -- I'd defer to someone who knows more about OBIS, but my initial reaction is that we might be constrained by size. OBIS claims to have 136,000,000 records. I don't know exactly how many records we'll be dealing with -- but at least an order of magnitude more than the 3.3 million aggregate records we already have (more if we include more than 55 species). I can do some initial summaries of samples / species to get an idea.

In an ideal world, we'd be able to serve up data for any species, including some of the corals, sponges, and other invertebrates. There's a lot of 0s for rarer species, and we wouldn't need to store those -- but it would be useful to include data on individual samples.
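If zero catches are not stored, they can still be recovered at query time by joining the full list of hauls against the catch records. The following is only a sketch of that idea: the `haul` and `catch` tables, their columns, and the sample rows are hypothetical, and Python's built-in `sqlite3` stands in for whatever database is eventually chosen.

```python
import sqlite3


def catch_with_zeros(con, species):
    """Return (haul_id, catch_kg) for every haul, using 0.0 where no catch row is stored."""
    return con.execute(
        """
        SELECT h.haul_id, COALESCE(c.catch_kg, 0.0) AS catch_kg
        FROM haul AS h
        LEFT JOIN catch AS c
          ON c.haul_id = h.haul_id AND c.species = ?
        ORDER BY h.haul_id
        """,
        (species,),
    ).fetchall()


# Hypothetical demo data: three hauls, and only positive catches are stored.
con = sqlite3.connect(":memory:")
con.executescript(
    """
    CREATE TABLE haul (haul_id INTEGER PRIMARY KEY);
    CREATE TABLE catch (haul_id INTEGER, species TEXT, catch_kg REAL);
    INSERT INTO haul VALUES (1), (2), (3);
    INSERT INTO catch VALUES (1, 'sablefish', 12.5), (3, 'sablefish', 4.0);
    """
)

print(catch_with_zeros(con, "sablefish"))
```

The species filter sits in the join condition rather than in a `WHERE` clause so that hauls without a matching catch row are kept and filled with zero.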

dgbolser-NOAA · 2025-02-26T17:10:03Z

@7yl4r Thanks for the great suggestion. I'd like to see us integrate with OBIS, but I think our first-line data users (e.g., stock assessment scientists) may need data in formats that might not be directly compatible with OBIS standards. And as @ericward-noaa said, we may not need to store a lot of the zeros for our purposes. The Humboldt extension of Darwin Core is probably most relevant to what we would use, but I have not done a side-by-side comparison between data in that format and how data are presented in surveyjoin.

Another thing to consider is that we may need to have some flexibility to update our formats and metadata standards to meet changing research and assessment goals. This is harder to do if we serve data through an outside entity. But I think being able to translate our format/standards to theirs is a good goal.

7yl4r · 2025-02-26T19:52:40Z

> may need data in formats that might not be directly compatible with OBIS standards

OBIS uses Darwin Core, which is a star-schema that (in theory) allows for any taxa occurrence (and associated meta) data to be included. Sometimes extensions are needed, but a lot can be done with just event-core and the measurementOrFact extension.

> we may not need to store a lot of the zeros for our purposes

Not required but it's worth noting that OBIS encourages sharing of these as occurrenceStatus=absent.
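As a rough illustration of how a haul-level catch row might translate, here is a sketch that maps one row to an occurrence record plus a measurementOrFact-style record. The keys are Darwin Core term names; the input arguments, the `occurrenceID` format, and the `"catch weight"` label are assumptions made for the example, not OBIS requirements or the surveyjoin format.

```python
def to_occurrence(haul_id, scientific_name, catch_kg):
    """Map one haul/species row to Darwin Core-style occurrence and measurement dicts."""
    # Hypothetical identifier scheme: haul id and name joined by a colon.
    occurrence_id = f"{haul_id}:{scientific_name}"
    occurrence = {
        "eventID": str(haul_id),
        "occurrenceID": occurrence_id,
        "scientificName": scientific_name,
        # A zero catch is shared as an absence rather than dropped.
        "occurrenceStatus": "present" if catch_kg > 0 else "absent",
    }
    measurement = {
        "occurrenceID": occurrence_id,
        "measurementType": "catch weight",  # assumed label
        "measurementValue": catch_kg,
        "measurementUnit": "kg",
    }
    return occurrence, measurement


occ, mof = to_occurrence(17, "Anoplopoma fimbria", 0.0)
print(occ["occurrenceStatus"])
```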

Anyway... I'm interested to see how this project progresses and I'm here to answer any questions related to OBIS || Darwin Core and associated toolings.

dgbolser-NOAA · 2025-03-03T16:52:06Z

@7yl4r Good to know and thanks! We are certainly interested in integrating with OBIS -- especially if we can do so and meet the needs of our first-line data users. Will dig into this further.

w-nityammm · 2025-03-03T18:33:09Z

I've been following this discussion about OBIS integration and was wondering—are we still moving forward with the relational db project using postgres? Just want to make sure since I'm working on my GSoC proposal and don’t want to go in the wrong direction. Also, if OBIS integration is something planned for the future, I'll see if I can design the db in a way that makes it compatible with it.

dgbolser-NOAA · 2025-03-03T18:54:22Z

@w-nityammm Thanks for checking in; it's great to hear that you're working on the proposal. We are still moving forward with the relational db using postgres, so no worries there. I think we need to do some more research to see what compatibility with OBIS looks like in the near-term before committing to it as a project goal. It would be great to hear your thoughts on that as specific plans are ironed out. Either way, we can connect you with OBIS folks if this looks realistic and is something you're interested in.

w-nityammm · 2025-03-03T21:04:01Z

@dgbolser Thanks for confirming! I’ll focus on the relational DB for now, but I’m definitely interested in understanding how OBIS integration might work down the line. Happy to adapt things as needed if there are any specific directions to consider.

Aliyasuv · 2025-03-16T18:22:42Z

hi, I am particularly interested in how the relational database structure integrates diverse datasets efficiently and what challenges arise in maintaining data consistency across regions. Could you share insights on scaling this database to include additional species or regions, and whether there’s a roadmap for integrating new data sources in the future?

w-nityammm · 2025-03-17T10:50:21Z

Hey everyone, so I have an almost-complete proposal with a few details left to finalize, like the exact dates of my end-term exams and some changes to the deliverables and timeline. I'd really appreciate if anyone can give it a review and provide any suggestions or feedback. Thanks!

kellijohnson-NOAA · 2025-03-17T12:26:00Z

@Aliyasuv thank you for your interest in this project. Regarding the inclusion of additional species, the current structure of using an R package and uploading to GitHub limits our ability to include additional species. So, moving to a proper database format will allow for the inclusion of more species. The main potential roadblock that I see in adding more species will be ensuring that species names are the same across regions. For example, a species can change scientific names over time, and thus, we will want to make sure that nuances like this are recorded and that species that should be linked across time and across regions are accounted for. We can come up with a set of rules that can be coded, though. Regarding the inclusion of additional areas, code will need to be written to convert existing data to the format and naming convention used within the database. Essentially, we have at least mental roadmaps for how this should be done because we had to do it when we made the original database. We can help make the process more formal though. Please let us know if you have any additional questions.
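To make "rules that can be coded" a little more concrete, here is one possible shape for them: normalize the reported name, check it against the accepted names, then against a synonym table, and flag anything left over for documentation. This is only a sketch; the names in `ACCEPTED` and `SYNONYMS` are placeholders rather than real taxa, and the function name and status labels are made up for the example.

```python
# Placeholder names only; a real table would come from a taxonomic authority.
ACCEPTED = {"Genus newname", "Genus other"}
SYNONYMS = {"genus oldname": "Genus newname"}  # lowercased reported name -> accepted name


def resolve_name(reported):
    """Return (accepted_name, status) for a reported species name."""
    # Collapse repeated whitespace and ignore case before matching.
    key = " ".join(reported.split()).lower()
    for name in ACCEPTED:
        if name.lower() == key:
            return name, "accepted"
    if key in SYNONYMS:
        return SYNONYMS[key], "synonym"
    # Unmatched names are flagged so they can be documented, not silently dropped.
    return None, "unmatched"


for raw in ["  genus   OLDNAME ", "Genus other", "unidentified octopus"]:
    print(raw.strip(), "->", resolve_name(raw))
```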

kellijohnson-NOAA · 2025-03-17T12:27:04Z

@w-nityammm please feel free to forward your proposal to the emails listed above and at least one of us will have time to review it.

w-nityammm · 2025-03-17T14:11:26Z

There's only this email listed above - [email protected]. Forwarded it there.

Aliyasuv · 2025-03-17T15:31:46Z

@kellijohnson-NOAA Thank you for the guidance. I just wanted to ask one more question: given the potential roadblocks with scientific name changes and regional differences in species classification, have you considered integrating an external authority database (such as ITIS or WoRMS) for consistent species identification and tracking over time? If so, what challenges might arise in maintaining compatibility with such databases? Also, where should I send my proposal for you to review? I am really interested in this project and want to contribute to it.

ericward-noaa · 2025-03-17T15:37:58Z

Hi @Aliyasuv -- thanks for the interest in the project! @kellijohnson-NOAA may have more to add to this -- right now, we are using ITIS as a species classifier, and it works well for the commonly encountered species (like the 55 currently provided). When we expand the database to include all species / observations, there will be more challenges for rare or unclassified things. Examples: some species may be only recorded to genus, others might be entered as "unidentified octopus", and others might be entered as species complexes (where biologists can't differentiate based on visual differences alone). Feel free to send your proposal to me at [email protected] and I'll pass it along to our group.

Aliyasuv · 2025-03-17T15:59:08Z

@ericward-noaa Thank you. I will forward it there.

kellijohnson-NOAA · 2025-03-17T16:21:43Z

@Aliyasuv I do not think that it would be within the scope of this project to have the Google Summer of Code proposal provide a solution to all species-name mismatches but rather just document things as they come up so that things can be fixed in their primary regional databases or we can, in the future, write code to join some species together. This will be an ongoing process where it will probably take years to find all of the species that are being referenced with "incorrect" names. We can rely on down-stream users to help inform us. But, your proposal could include "creating a GitHub Issue form for users to document where they believe species name mismatches exist and a protocol for developers to rectify the mismatches".

Aliyasuv · 2025-03-17T16:37:12Z

@kellijohnson-NOAA Thank You for the clarification. I will revise my proposal to focus on documenting species-name mismatches as they arise, rather than providing a comprehensive solution. I'll also include a plan for creating a GitHub Issue form and a protocol for developers to address these mismatches over time. Please let me know if you have any additional suggestions.

7yl4r · 2025-03-18T20:37:24Z

> @kellijohnson-NOAA Thank You for the clarification. I will revise my proposal to focus on documenting species-name mismatches as they arise, rather than providing a comprehensive solution. I'll also include a plan for creating a GitHub Issue form and a protocol for developers to address these mismatches over time. Please let me know if you have any additional suggestions.

For standardization of species names the SMBD recommends using WoRMS and AphiaIDs. There is an R library wrapping the API and a web GUI.

MathewBiddle · 2025-03-19T13:16:32Z

I will second the use of WoRMS and AphiaIDs as the authority for taxonomic information, since it's a requirement for OBIS. More details on matching data to WoRMS can be found at https://ioos.github.io/bio_mobilization_workshop/03-data-cleaning.html#matching-your-scientific-names-to-worms and https://manual.obis.org/name_matching.html. While ITIS is acceptable, some of the data might not be accessible when pulled out of OBIS if it doesn't appropriately match to WoRMS. So, using WoRMS from the start would be good.

More details about OBIS, Darwin Core, and taxonomy can be found at https://manual.obis.org/darwin_core.html#taxonomy-and-identification

Mirandazhu02 · 2025-03-26T23:21:49Z

Hi Eric Ward (@ericward-noaa), Sean Anderson (@seananderson), Kelli Johnson (@kellijohnson-NOAA), Derek Bolser (@dgbolser),

I'm Miranda Zhu, a data science student at UC Berkeley with experience in PostgreSQL, R, and database migration projects. After reviewing the surveyjoin package and project description, I'm very interested in contributing to the GSoC project to convert the current SQLite database to a PostgreSQL implementation.

I understand the current system already handles standardizing marine survey data across different regions in the Northeast Pacific Ocean, but is limited by the SQLite backend in terms of species coverage and additional data types. I'd like to better understand the project requirements with a couple of questions:

After reviewing the surveyjoin package, I understand it currently uses SQLite to store standardized trawl survey data. As I design the PostgreSQL database structure, what are the most common types of analyses or data retrievals that researchers perform with this data? For example, are they primarily looking at species distribution changes across regions, temporal trends, or relationships with environmental factors? This would help me ensure the database design prioritizes the right performance optimizations.

For the biological sample data (age, length, maturity) that you'd like to incorporate, what is the relationship between these samples and the current catch data? Are samples taken for individual specimens within a catch, or are they aggregated at some level?

Thank you for your consideration. I'm excited about the opportunity to help develop this valuable research tool.

Best,
Miranda

kellijohnson-NOAA · 2025-03-27T03:58:30Z

@Mirandazhu02 thank you for your interest in our project. Below are some responses to your questions. Please feel free to ask additional questions, we are happy to answer them.

> What are the most common types of analyses or data retrievals that researchers perform with this data? For example, are they primarily looking at species distribution changes across regions, temporal trends, or relationships with environmental factors? This would help me ensure the database design prioritizes the right performance optimizations.

For management purposes, the catch information is used in species distribution models to predict an index of abundance (mt per year) that can be used in a model that predicts the status of the species. Additionally, many people use the data for research questions such as how species change their distribution during marine heat waves.

> For the biological sample data (age, length, maturity) that you'd like to incorporate, what is the relationship between these samples and the current catch data? Are samples taken for individual specimens within a catch, or are they aggregated at some level?

The biological data come from individuals that were within the catch in a haul/tow, but not every individual is sampled. Often the crew will sample a certain number of each species. So, for less commonly caught species it might be every individual that was caught, but more commonly it is a certain number, e.g., 20, of specimens from a given species that was brought on board. The individual specimen shares a haul id with the haul.
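The haul/catch/specimen relationship described here could be sketched as three tables keyed on a shared haul id. Everything below is illustrative: the table and column names and the sample rows are hypothetical rather than the surveyjoin schema, and Python's built-in `sqlite3` stands in for postgres.

```python
import sqlite3

# Hypothetical schema: one row per haul, per species caught in a haul,
# and per individually sampled specimen.
SCHEMA = """
CREATE TABLE haul (
    haul_id INTEGER PRIMARY KEY,
    region  TEXT NOT NULL,
    year    INTEGER NOT NULL
);
CREATE TABLE catch (
    haul_id  INTEGER NOT NULL REFERENCES haul(haul_id),
    species  TEXT NOT NULL,
    catch_kg REAL NOT NULL,
    PRIMARY KEY (haul_id, species)
);
CREATE TABLE specimen (
    specimen_id INTEGER PRIMARY KEY,
    haul_id     INTEGER NOT NULL REFERENCES haul(haul_id),
    species     TEXT NOT NULL,
    length_cm   REAL,
    age_years   INTEGER
);
"""


def build_demo():
    """Create an in-memory database with one haul, one catch row, two specimens."""
    con = sqlite3.connect(":memory:")
    con.execute("PRAGMA foreign_keys = ON")
    con.executescript(SCHEMA)
    con.execute("INSERT INTO haul VALUES (1, 'region_a', 2019)")
    con.execute("INSERT INTO catch VALUES (1, 'sablefish', 42.0)")
    con.executemany(
        "INSERT INTO specimen (haul_id, species, length_cm, age_years) "
        "VALUES (?, ?, ?, ?)",
        [(1, "sablefish", 55.0, 6), (1, "sablefish", 61.5, None)],
    )
    return con


def specimens_per_catch(con):
    """Count sampled specimens for each catch row via the shared haul id."""
    return con.execute(
        """
        SELECT c.haul_id, c.species, c.catch_kg, COUNT(s.specimen_id) AS n_sampled
        FROM catch AS c
        LEFT JOIN specimen AS s
          ON s.haul_id = c.haul_id AND s.species = c.species
        GROUP BY c.haul_id, c.species, c.catch_kg
        ORDER BY c.haul_id, c.species
        """
    ).fetchall()


print(specimens_per_catch(build_demo()))
```

Keeping specimens in their own table means a catch row can have zero, a few, or many sampled individuals without repeating the haul-level fields.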

Mirandazhu02 · 2025-03-27T18:34:25Z

@kellijohnson-NOAA Thank you so much for answering! I have sent my initial proposal draft to the email and am looking forward to hearing back from you. I'd be very happy to contribute if there's any small task that I can work on right now.

apekshatej · 2025-03-28T02:13:32Z

Hello team,

I'm Apeksha Tejwani, and I’m enthusiastic about contributing to the NOAA trawl survey database project through Google Summer of Code 2025. After exploring the surveyjoin repository and project description in depth, I’ve drafted a proposal outlining my approach to building a scalable relational database structure based on the suggested implementation. I’ve sent the proposal draft to the email address mentioned and would be grateful for any feedback that the team has.

Regards,
Apeksha

kellijohnson-NOAA · 2025-05-08T19:25:00Z

I would like to thank everyone for their applications to this project and congratulate 🥳 @w-nityammm 🥳 for being chosen as a top competitor and the selected applicant for this proposal. Welcome to the team!

w-nityammm · 2025-05-08T20:16:27Z

Thanks a lot!! @kellijohnson-NOAA

kellijohnson-NOAA added the GSoC25 and project idea labels Feb 10, 2025

kellijohnson-NOAA changed the title ~~[GSoC Project Proposal]:~~ [GSoC Project Proposal]: NOAA trawl survey database Feb 10, 2025
