diff --git a/faqs/galaxy/datasets_import_from_data_library.md b/faqs/galaxy/datasets_import_from_data_library.md index 16652c5f289b6..5371af18c5263 100644 --- a/faqs/galaxy/datasets_import_from_data_library.md +++ b/faqs/galaxy/datasets_import_from_data_library.md @@ -9,11 +9,17 @@ contributors: [bebatut,shiltemann,nsoranzo,hexylena,wm75] As an alternative to uploading the data from a URL or your computer, the files may also have been made available from a *shared data library*: -* Go into **Shared data** (top panel) then **Data libraries** -* Navigate to -{% if include.path %} - *{{ include.path }}* or {% endif %} the correct folder as indicated by your instructor -* Select the desired files -* Click on the **To History** button near the top and select **{{ include.astype | default: "as Datasets" }}** from the dropdown menu -* In the pop-up window, select the history you want to import the files to (or create a new one) -* Click on **Import** +1. Go into **Shared data** (top panel) then **Data libraries** +2. Navigate to{% if include.path %}: *{{ include.path }}* or {% endif %} the correct folder as indicated by your instructor +3. Select the desired files +4. Click on **Add to History** {% icon galaxy-dropdown %} near the top and select **{{ include.astype | default: "as Datasets" }}** from the dropdown menu +5. In the pop-up window, choose + {% if include.collection_type %} + * *"Collection type"*: **{{ include.collection_type }}** + {% endif %} + * *"Select history"*: {% if include.tohistory %}{{ include.tohistory }}{% else %}the history you want to import the data to (or create a new one){% endif %} +6. Click on {% if include.collection_type %}**Continue**{% else %}**Import**{% endif %} +{% if include.collection_type %} +7. In the next dialog, give a suitable **Name**{% if include.collection_name %}, like `{{ include.collection_name }}`,{% endif %} to the new collection +8. Click on **Create collection** +{% endif %} diff --git a/faqs/galaxy/workflows_import_search.md b/faqs/galaxy/workflows_import_search.md index 10e737e1dcab0..d1db6af174665 100644 --- a/faqs/galaxy/workflows_import_search.md +++ b/faqs/galaxy/workflows_import_search.md @@ -3,27 +3,17 @@ title: Importing a workflow using the search area: workflows box_type: tip layout: faq -contributors: [bebatut] +contributors: [bebatut,wm75] --- -- Click on *Workflow* on the top menu bar of Galaxy. You will see a list of all your workflows. -- Click on the {% icon galaxy-upload %} **Import** icon at the top-right of the screen -- Click on **search form** in **Import a Workflow from Configured GA4GH Tool Registry Servers (e.g. Dockstore)** -{% if include.trs_server %} -- Select *"TRS Server"*: `{{ include.trs_server }}` -{% else %} -- Select the relevant TRS Server -{% endif %} -{% if include.search_query %} -- Type `{{ include.search_query }}` in the search query -{% else %} -- Type the query -{% endif %} -{% if include.workflow_name %} -- Expand the workflow named `{{ include.workflow_name }}` -{% else %} -- Expand the correct workflow -{% endif %} -- Click on the wanted version +1. Click on *Workflow* in the top menu bar of Galaxy. You will see a list of all your workflows. +2. Click on the {% icon galaxy-upload %} **Import** icon at the top-right of the screen +3. On the new page, select the **GA4GH servers** tab, and configure the **GA4GH Tool Registry Server (TRS) Workflow Search** interface as follows: + 1. 
*"TRS Server"*: {% if include.trs_server %}**{{ include.trs_server }}**{% else %}the TRS Server you want to search on (Dockstore or Workflowhub){% endif %} + 2. {% if include.search_query %}*"search query"*: `{{ include.search_query }}` + {% else %}Type in the *search query*{% endif %} + 3. {% if include.workflow_name %}Expand the workflow named `{{ include.workflow_name }}` + {% else %}Expand the correct workflow{% endif %} by clicking on it + 4. Select the version you would like to {% icon galaxy-upload %} import - The workflow will be imported in your workflows +The workflow will be imported to your list of workflows. Note that it will also carry a little green check mark next to its name, which indicates that this is an original workflow version imported from a TRS server. If you ever modify the workflow with Galaxy's workflow editor, it will lose this indicator. diff --git a/topics/variant-analysis/images/sars-cov-2-variant-discovery/ncov_clades.png b/topics/variant-analysis/images/sars-cov-2-variant-discovery/ncov_clades.png deleted file mode 100644 index 35e702c2d09be..0000000000000 Binary files a/topics/variant-analysis/images/sars-cov-2-variant-discovery/ncov_clades.png and /dev/null differ diff --git a/topics/variant-analysis/images/sars-cov-2-variant-discovery/ncov_clades.svg b/topics/variant-analysis/images/sars-cov-2-variant-discovery/ncov_clades.svg new file mode 100644 index 0000000000000..323d31cdd55ec --- /dev/null +++ b/topics/variant-analysis/images/sars-cov-2-variant-discovery/ncov_clades.svg @@ -0,0 +1,97 @@ +19A (B)20A (B.1)19B (A)20B (B.1.1)21A (Delta, B.1.617.2)20C21H (Mu, B.1.621)21D (Eta, B.1.525)21B (Kappa, B.1.617.1)20E (EU1, B.1.177)21M (Omicron, B.1.1.529)21E (Theta, P.3)20J (Gamma, P.1)20I (Alpha, B.1.1.7)20F (D.2)20D (B.1.1.1)21I (Delta)21J (Delta)21F (Iota, B.1.526)21C (Epsilon, B.1.427/429)20H (Beta, B.1.351)20G (B.1.2)21K (Omicron, BA.1)21L (Omicron, ~BA.2)21G (Lambda, C.37)22F (Omicron, XBB)22D (Omicron, BA.2.75)22C (Omicron, BA.2.12.1)22B (Omicron, BA.5)22A (Omicron, BA.4)23A (Omicron, XBB.1.5)23B (Omicron, XBB.1.16)22E (Omicron, BQ.1) \ No newline at end of file diff --git a/topics/variant-analysis/images/sars-cov-2-variant-discovery/schema.png b/topics/variant-analysis/images/sars-cov-2-variant-discovery/schema.png index 38a455dd3577d..eb7a3607715a8 100644 Binary files a/topics/variant-analysis/images/sars-cov-2-variant-discovery/schema.png and b/topics/variant-analysis/images/sars-cov-2-variant-discovery/schema.png differ diff --git a/topics/variant-analysis/images/sars-cov-2-variant-discovery/variant-frequency.svg b/topics/variant-analysis/images/sars-cov-2-variant-discovery/variant-frequency.svg index 6a1617d4a5bc6..759ee9c7ac4f8 100644 --- a/topics/variant-analysis/images/sars-cov-2-variant-discovery/variant-frequency.svg +++ b/topics/variant-analysis/images/sars-cov-2-variant-discovery/variant-frequency.svg @@ -1,102 +1,102 @@ - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + @@ -153,6547 +153,4503 @@ - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + + + + - + - + - + - + - + - + - + - + - + - + - + - - - - + - + - + - + - + - + - + - + - - + - + - + + - + - - + - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + 
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + 
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + diff --git a/topics/variant-analysis/tutorials/sars-cov-2-variant-discovery/data-library.yaml 
b/topics/variant-analysis/tutorials/sars-cov-2-variant-discovery/data-library.yaml index 1c86dbed577eb..d56ce12113ff7 100644 --- a/topics/variant-analysis/tutorials/sars-cov-2-variant-discovery/data-library.yaml +++ b/topics/variant-analysis/tutorials/sars-cov-2-variant-discovery/data-library.yaml @@ -10,166 +10,153 @@ items: items: - name: Mutation calling, viral genome reconstruction and lineage/clade assignment from SARS-CoV-2 sequencing data items: - - name: 'DOI: 10.5281/zenodo.5036687' + - name: 'DOI: 10.5281/zenodo.4555734' description: latest items: - - url: https://zenodo.org/api/files/e3380e5c-6916-4490-b9df-59b88a2223c9/ARTIC_amplicon_info_v3.tabular + - url: https://zenodo.org/api/files/ce9faef4-1387-4bea-ac9f-67558c85e9e1/NC_045512.2_reference.fasta + src: url + ext: fasta + info: https://zenodo.org/record/5888324 + - url: https://zenodo.org/api/files/ce9faef4-1387-4bea-ac9f-67558c85e9e1/NC_045512.2_feature_mapping.tsv src: url ext: tabular info: https://zenodo.org/record/5036687 - - url: https://zenodo.org/api/files/e3380e5c-6916-4490-b9df-59b88a2223c9/ARTIC_nCoV-2019_v3.bed6 + - url: https://zenodo.org/api/files/ce9faef4-1387-4bea-ac9f-67558c85e9e1/ARTIC_nCoV-2019_v4.bed src: url ext: bed6 - info: https://zenodo.org/record/5036687 - - url: https://zenodo.org/api/files/e3380e5c-6916-4490-b9df-59b88a2223c9/ERR5931005_1.fastqsanger.gz + info: https://zenodo.org/record/5888324 + - url: https://zenodo.org/api/files/ce9faef4-1387-4bea-ac9f-67558c85e9e1/ARTIC_amplicon_info_v4.tsv src: url - ext: fastqsanger.gz - info: https://zenodo.org/record/5036687 - - url: https://zenodo.org/api/files/e3380e5c-6916-4490-b9df-59b88a2223c9/ERR5931005_2.fastqsanger.gz + ext: tabular + info: https://zenodo.org/record/5888324 + - name: 'Bioproject: PRJNA784038 - SARS-CoV-2 Omicron Sequencing in South Africa' + description: latest + items: + - url: ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR170/002/SRR17054502/SRR17054502_1.fastq.gz src: url ext: fastqsanger.gz - info: https://zenodo.org/record/5036687 - - url: https://zenodo.org/api/files/e3380e5c-6916-4490-b9df-59b88a2223c9/ERR5931006_1.fastqsanger.gz + info: https://www.ebi.ac.uk/ena/browser/view/PRJNA784038 + - url: ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR170/002/SRR17054502/SRR17054502_2.fastq.gz src: url ext: fastqsanger.gz - info: https://zenodo.org/record/5036687 - - url: https://zenodo.org/api/files/e3380e5c-6916-4490-b9df-59b88a2223c9/ERR5931006_2.fastqsanger.gz + info: https://www.ebi.ac.uk/ena/browser/view/PRJNA784038 + - url: ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR170/003/SRR17054503/SRR17054503_1.fastq.gz src: url ext: fastqsanger.gz - info: https://zenodo.org/record/5036687 - - url: https://zenodo.org/api/files/e3380e5c-6916-4490-b9df-59b88a2223c9/ERR5931007_1.fastqsanger.gz + info: https://www.ebi.ac.uk/ena/browser/view/PRJNA784038 + - url: ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR170/003/SRR17054503/SRR17054503_2.fastq.gz src: url ext: fastqsanger.gz - info: https://zenodo.org/record/5036687 - - url: https://zenodo.org/api/files/e3380e5c-6916-4490-b9df-59b88a2223c9/ERR5931007_2.fastqsanger.gz + info: https://www.ebi.ac.uk/ena/browser/view/PRJNA784038 + - url: ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR170/004/SRR17054504/SRR17054504_1.fastq.gz src: url ext: fastqsanger.gz - info: https://zenodo.org/record/5036687 - - url: https://zenodo.org/api/files/e3380e5c-6916-4490-b9df-59b88a2223c9/ERR5931008_1.fastqsanger.gz + info: https://www.ebi.ac.uk/ena/browser/view/PRJNA784038 + - url: 
ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR170/004/SRR17054504/SRR17054504_2.fastq.gz src: url ext: fastqsanger.gz - info: https://zenodo.org/record/5036687 - - url: https://zenodo.org/api/files/e3380e5c-6916-4490-b9df-59b88a2223c9/ERR5931008_2.fastqsanger.gz + info: https://www.ebi.ac.uk/ena/browser/view/PRJNA784038 + - url: ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR170/005/SRR17054505/SRR17054505_1.fastq.gz src: url ext: fastqsanger.gz - info: https://zenodo.org/record/5036687 - - url: https://zenodo.org/api/files/e3380e5c-6916-4490-b9df-59b88a2223c9/ERR5949456_1.fastqsanger.gz + info: https://www.ebi.ac.uk/ena/browser/view/PRJNA784038 + - url: ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR170/005/SRR17054505/SRR17054505_2.fastq.gz src: url ext: fastqsanger.gz - info: https://zenodo.org/record/5036687 - - url: https://zenodo.org/api/files/e3380e5c-6916-4490-b9df-59b88a2223c9/ERR5949456_2.fastqsanger.gz + info: https://www.ebi.ac.uk/ena/browser/view/PRJNA784038 + - url: ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR170/006/SRR17054506/SRR17054506_1.fastq.gz src: url ext: fastqsanger.gz - info: https://zenodo.org/record/5036687 - - url: https://zenodo.org/api/files/e3380e5c-6916-4490-b9df-59b88a2223c9/ERR5949457_1.fastqsanger.gz + info: https://www.ebi.ac.uk/ena/browser/view/PRJNA784038 + - url: ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR170/006/SRR17054506/SRR17054506_2.fastq.gz src: url ext: fastqsanger.gz - info: https://zenodo.org/record/5036687 - - url: https://zenodo.org/api/files/e3380e5c-6916-4490-b9df-59b88a2223c9/ERR5949457_2.fastqsanger.gz + info: https://www.ebi.ac.uk/ena/browser/view/PRJNA784038 + - url: ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR170/007/SRR17054507/SRR17054507_1.fastq.gz src: url ext: fastqsanger.gz - info: https://zenodo.org/record/5036687 - - url: https://zenodo.org/api/files/e3380e5c-6916-4490-b9df-59b88a2223c9/ERR5949458_1.fastqsanger.gz + info: https://www.ebi.ac.uk/ena/browser/view/PRJNA784038 + - url: ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR170/007/SRR17054507/SRR17054507_2.fastq.gz src: url ext: fastqsanger.gz - info: https://zenodo.org/record/5036687 - - url: https://zenodo.org/api/files/e3380e5c-6916-4490-b9df-59b88a2223c9/ERR5949458_2.fastqsanger.gz + info: https://www.ebi.ac.uk/ena/browser/view/PRJNA784038 + - url: ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR170/008/SRR17054508/SRR17054508_1.fastq.gz src: url ext: fastqsanger.gz - info: https://zenodo.org/record/5036687 - - url: https://zenodo.org/api/files/e3380e5c-6916-4490-b9df-59b88a2223c9/ERR5949459_1.fastqsanger.gz + info: https://www.ebi.ac.uk/ena/browser/view/PRJNA784038 + - url: ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR170/008/SRR17054508/SRR17054508_2.fastq.gz src: url ext: fastqsanger.gz - info: https://zenodo.org/record/5036687 - - url: https://zenodo.org/api/files/e3380e5c-6916-4490-b9df-59b88a2223c9/ERR5949459_2.fastqsanger.gz + info: https://www.ebi.ac.uk/ena/browser/view/PRJNA784038 + - url: ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR170/009/SRR17054509/SRR17054509_1.fastq.gz src: url ext: fastqsanger.gz - info: https://zenodo.org/record/5036687 - - url: https://zenodo.org/api/files/e3380e5c-6916-4490-b9df-59b88a2223c9/ERR5949460_1.fastqsanger.gz + info: https://www.ebi.ac.uk/ena/browser/view/PRJNA784038 + - url: ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR170/009/SRR17054509/SRR17054509_2.fastq.gz src: url ext: fastqsanger.gz - info: https://zenodo.org/record/5036687 - - url: https://zenodo.org/api/files/e3380e5c-6916-4490-b9df-59b88a2223c9/ERR5949460_2.fastqsanger.gz + info: https://www.ebi.ac.uk/ena/browser/view/PRJNA784038 + - url: 
ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR170/010/SRR17054510/SRR17054510_1.fastq.gz src: url ext: fastqsanger.gz - info: https://zenodo.org/record/5036687 - - url: https://zenodo.org/api/files/e3380e5c-6916-4490-b9df-59b88a2223c9/ERR5949461_1.fastqsanger.gz + info: https://www.ebi.ac.uk/ena/browser/view/PRJNA784038 + - url: ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR170/010/SRR17054510/SRR17054510_2.fastq.gz src: url ext: fastqsanger.gz - info: https://zenodo.org/record/5036687 - - url: https://zenodo.org/api/files/e3380e5c-6916-4490-b9df-59b88a2223c9/ERR5949461_2.fastqsanger.gz + info: https://www.ebi.ac.uk/ena/browser/view/PRJNA784038 + - url: ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR170/033/SRR17051933/SRR17051933_1.fastq.gz src: url ext: fastqsanger.gz - info: https://zenodo.org/record/5036687 - - url: https://zenodo.org/api/files/e3380e5c-6916-4490-b9df-59b88a2223c9/ERR5949462_1.fastqsanger.gz + info: https://www.ebi.ac.uk/ena/browser/view/PRJNA784038 + - url: ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR170/033/SRR17051933/SRR17051933_2.fastq.gz src: url ext: fastqsanger.gz - info: https://zenodo.org/record/5036687 - - url: https://zenodo.org/api/files/e3380e5c-6916-4490-b9df-59b88a2223c9/ERR5949462_2.fastqsanger.gz + info: https://www.ebi.ac.uk/ena/browser/view/PRJNA784038 + - url: ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR170/034/SRR17051934/SRR17051934_1.fastq.gz src: url ext: fastqsanger.gz - info: https://zenodo.org/record/5036687 - - url: https://zenodo.org/api/files/e3380e5c-6916-4490-b9df-59b88a2223c9/ERR5949463_1.fastqsanger.gz + info: https://www.ebi.ac.uk/ena/browser/view/PRJNA784038 + - url: ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR170/034/SRR17051934/SRR17051934_2.fastq.gz src: url ext: fastqsanger.gz - info: https://zenodo.org/record/5036687 - - url: https://zenodo.org/api/files/e3380e5c-6916-4490-b9df-59b88a2223c9/ERR5949463_2.fastqsanger.gz + info: https://www.ebi.ac.uk/ena/browser/view/PRJNA784038 + - url: ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR170/035/SRR17051935/SRR17051935_1.fastq.gz src: url ext: fastqsanger.gz - info: https://zenodo.org/record/5036687 - - url: https://zenodo.org/api/files/e3380e5c-6916-4490-b9df-59b88a2223c9/ERR5949464_1.fastqsanger.gz + info: https://www.ebi.ac.uk/ena/browser/view/PRJNA784038 + - url: ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR170/035/SRR17051935/SRR17051935_2.fastq.gz src: url ext: fastqsanger.gz - info: https://zenodo.org/record/5036687 - - url: https://zenodo.org/api/files/e3380e5c-6916-4490-b9df-59b88a2223c9/ERR5949464_2.fastqsanger.gz - src: url - ext: fastqsanger.gz - info: https://zenodo.org/record/5036687 - - url: https://zenodo.org/api/files/e3380e5c-6916-4490-b9df-59b88a2223c9/ERR5949465_1.fastqsanger.gz + info: https://www.ebi.ac.uk/ena/browser/view/PRJNA784038 + - url: ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR170/036/SRR17051936/SRR17051936_1.fastq.gz src: url ext: fastqsanger.gz - info: https://zenodo.org/record/5036687 - - url: https://zenodo.org/api/files/e3380e5c-6916-4490-b9df-59b88a2223c9/ERR5949465_2.fastqsanger.gz + info: https://www.ebi.ac.uk/ena/browser/view/PRJNA784038 + - url: ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR170/036/SRR17051936/SRR17051936_2.fastq.gz src: url ext: fastqsanger.gz - info: https://zenodo.org/record/5036687 - - url: https://zenodo.org/api/files/e3380e5c-6916-4490-b9df-59b88a2223c9/ERR5949466_1.fastqsanger.gz + info: https://www.ebi.ac.uk/ena/browser/view/PRJNA784038 + - url: ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR170/037/SRR17051937/SRR17051937_1.fastq.gz src: url ext: fastqsanger.gz - info: 
https://zenodo.org/record/5036687 - - url: https://zenodo.org/api/files/e3380e5c-6916-4490-b9df-59b88a2223c9/ERR5949466_2.fastqsanger.gz + info: https://www.ebi.ac.uk/ena/browser/view/PRJNA784038 + - url: ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR170/037/SRR17051937/SRR17051937_2.fastq.gz src: url ext: fastqsanger.gz - info: https://zenodo.org/record/5036687 - - url: https://zenodo.org/api/files/e3380e5c-6916-4490-b9df-59b88a2223c9/ERR5949467_1.fastqsanger.gz + info: https://www.ebi.ac.uk/ena/browser/view/PRJNA784038 + - url: ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR170/038/SRR17051938/SRR17051938_1.fastq.gz src: url ext: fastqsanger.gz - info: https://zenodo.org/record/5036687 - - url: https://zenodo.org/api/files/e3380e5c-6916-4490-b9df-59b88a2223c9/ERR5949467_2.fastqsanger.gz + info: https://www.ebi.ac.uk/ena/browser/view/PRJNA784038 + - url: ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR170/038/SRR17051938/SRR17051938_2.fastq.gz src: url ext: fastqsanger.gz - info: https://zenodo.org/record/5036687 - - url: https://zenodo.org/api/files/e3380e5c-6916-4490-b9df-59b88a2223c9/ERR5949468_1.fastqsanger.gz + info: https://www.ebi.ac.uk/ena/browser/view/PRJNA784038 + - url: ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR170/039/SRR17051939/SRR17051939_1.fastq.gz src: url ext: fastqsanger.gz - info: https://zenodo.org/record/5036687 - - url: https://zenodo.org/api/files/e3380e5c-6916-4490-b9df-59b88a2223c9/ERR5949468_2.fastqsanger.gz + info: https://www.ebi.ac.uk/ena/browser/view/PRJNA784038 + - url: ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR170/039/SRR17051939/SRR17051939_2.fastq.gz src: url ext: fastqsanger.gz - info: https://zenodo.org/record/5036687 - - url: https://zenodo.org/api/files/e3380e5c-6916-4490-b9df-59b88a2223c9/ERR5949469_1.fastqsanger.gz - src: url - ext: fastqsanger.gz - info: https://zenodo.org/record/5036687 - - url: https://zenodo.org/api/files/e3380e5c-6916-4490-b9df-59b88a2223c9/ERR5949469_2.fastqsanger.gz - src: url - ext: fastqsanger.gz - info: https://zenodo.org/record/5036687 - - url: https://zenodo.org/api/files/e3380e5c-6916-4490-b9df-59b88a2223c9/NC_045512.2_feature_mapping.tabular - src: url - ext: tabular - info: https://zenodo.org/record/5036687 - - url: https://zenodo.org/api/files/e3380e5c-6916-4490-b9df-59b88a2223c9/NC_045512.2_reference_sequence.fasta - src: url - ext: fasta - info: https://zenodo.org/record/5036687 + info: https://www.ebi.ac.uk/ena/browser/view/PRJNA784038 diff --git a/topics/variant-analysis/tutorials/sars-cov-2-variant-discovery/tutorial.md b/topics/variant-analysis/tutorials/sars-cov-2-variant-discovery/tutorial.md index 03ab768f158e7..bf8f379d3ee4a 100644 --- a/topics/variant-analysis/tutorials/sars-cov-2-variant-discovery/tutorial.md +++ b/topics/variant-analysis/tutorials/sars-cov-2-variant-discovery/tutorial.md @@ -6,52 +6,83 @@ subtopic: one-health level: Intermediate zenodo_link: "https://zenodo.org/record/5036687" questions: -- How can we extract annotated allelic variants in SARS-Cov-2 sequences in Galaxy? -- Which tools and workflows can we use to identify SARS-CoV-2 lineages in Galaxy? +- How can a complete analysis, including viral consensus sequence reconstruction and lineage assignment be performed? +- How can such an analysis be kept manageable for lots of samples, yet flexible enough to handle different types of input data? +- What are key results beyond consensus genomes and lineage assignments that need to be understood to avoid inappropriate conclusions about samples? 
+- How can the needs for high-throughput data analysis in an ongoing infectious disease outbreak/pandemic and the need for proper quality control and data inspection be balanced? objectives: -- Repeat SARS-CoV-2 data preparation -- Select and run workflow to extract annotated allelic variants from FASTQ files -- Run workflow to summarize and generate report for previously called allelic variants -- Interpret summaries for annotated allelic variants -- Run workflow to extract consensus sequences -- Select and run tools to assign clades/lineages +- Discover and obtain recommended Galaxy workflows for SARS-CoV-2 sequence data analysis through public workflow registries +- Choose and run a workflow to discover mutations in a batch of viral samples from sequencing data obtained through a range of different protocols and platforms +- Run a workflow to summarize and visualize the mutation discovery results for a batch of samples +- Run a workflow to construct viral consensus sequences for the samples in a batch +- Know different SARS-CoV-2 lineage classification systems, and use pangolin and Nextclade to assign samples to predefined lineages +- Combine information from different analysis steps to be able to draw appropriate conclusions about individual samples and batches of viral data time_estimation: 3H key_points: -- 4 specialized, best-practice variant calling workflows are available for the identification of annotated allelic variants from raw sequencing data depending on the exact type of input -- Data from batches of samples can be processed in parallel using collections -- Annotated allelic variants can be used to build consensus sequences for and assign each sample to known viral clades/lineages +- The Galaxy Covid-19 project has developed a flexible set of workflows for SARS-CoV-2 genome surveillance, which is freely available through public workflow registries. +- The workflows enable processing of whole batches of samples with rather limited user interaction. +- They provide a high-throughput and flexible analysis solution without compromising on accuracy, nor on the possibility to explore intermediate steps and outputs in detail. + +requirements: + - + type: "internal" + topic_name: galaxy-interface + tutorials: + - collections + - + type: "internal" + topic_name: variant-analysis + tutorials: + - sars-cov-2 contributors: - wm75 - bebatut tags: - covid19 - virology +- one-health --- # Introduction +Sequence-based monitoring of global infectious disease crises, such as the COVID-19 pandemic, requires capacity to generate and analyze large volumes of sequencing data in near real time. These data have proven essential for surveilling the emergence and spread of new viral variants, and for understanding the evolutionary dynamics of the virus. -Effectively monitoring global infectious disease crises, such as the COVID-19 pandemic, requires capacity to generate and analyze large volumes of sequencing data in near real time. These data have proven essential for monitoring the emergence and spread of new variants, and for understanding the evolutionary dynamics of the virus. +The tutorial [SARS-CoV-2 sequencing data analysis]({% link topics/variant-analysis/tutorials/sars-cov-2/tutorial.md %}) shows in detail how you can identify mutations in SARS-CoV-2 samples from paired-end whole-genome sequencing data generated on the Illumina platform. 
-Two sequencing platforms (Illumina and Oxford Nanopore) in combination with several established library preparation (Ampliconic and metatranscriptomic) strategies are predominantly used to generate SARS-CoV-2 sequence data. However, data alone do not equal knowledge: they need to be analyzed. The Galaxy community has developed high-quality analysis workflows to support

-- sensitive identification of SARS-CoV-2 allelic variants (AVs) starting with allele frequencies as low as 5% from deep sequencing reads
-- generation of user-friendly reports for batches of results
-- reliable and configurable consensus genome generation from called variants
+For versatile and efficient genome surveillance, however, you would want to:
+
+- be able to analyze data of different origin
+
+  Besides WGS paired-end Illumina data, different labs are also generating single-end Illumina and ONT data, and are combining these platforms with various tiled-amplicon approaches upstream of sequencing.
+
+- go beyond per-sample mutation calls in variant call format (VCF)
+
+  To keep track of large numbers of samples sequenced in batches (as routinely produced nowadays by SARS-CoV-2 genome surveillance initiatives across the globe), you need concise reports and visualizations of results at the sample and batch level.
+
+- use sample mutation patterns to construct sample consensus genomes
+
+- use the consensus genomes to assign the samples to SARS-CoV-2 lineages as defined by major lineage classification systems ([Nextstrain clades](https://nextstrain.org/blog/2022-04-29-SARS-CoV-2-clade-naming-2022) and [PANGO](https://www.pango.network/))

-> Further reading
-> More information about the workflows, including benchmarking, can be found
-> - on the Galaxy Covid-19 effort website: [covid19.galaxyproject.org](https://covid19.galaxyproject.org/)
-> - as a BioRxiv preprint: [Global platform for SARS-CoV-2 analysis](https://www.biorxiv.org/content/10.1101/2021.03.25.437046v1)
-{: .details}
+- decrease hands-on time and data manipulation errors by combining analysis steps into workflows for automated execution

-This tutorial will teach you how to obtain, run and combine these workflows appropriately for different types of input data, be it:
+The purpose of this tutorial is to demonstrate how a set of workflows developed by the [Galaxy Covid-19 project](https://galaxyproject.org/projects/covid19/) can be combined and used together with a handful of additional tools to achieve all of the above. Specifically, we will cover the analysis flow presented in figure 1.
+
+![Analysis flow in the tutorial](../../images/sars-cov-2-variant-discovery/schema.png "Analysis flow in the tutorial")

-- Single-end data derived from Illumina-based RNAseq experiments
-- Paired-end data derived from Illumina-based RNAseq experiments
-- Paired-end data generated with Illumina-based Ampliconic (ARTIC) protocols, or
-- ONT FASTQ files generated with Oxford nanopore (ONT)-based Ampliconic (ARTIC) protocols
+Depending on the type of sequencing data, **one of four variation analysis workflows** can be run to discover mutations in a batch of input samples.
Outputs of any of these workflows can then be processed further with two additional workflows: the **variation reporting workflow** generates a per-sample report of mutations, but also batch-level reports and visualizations, while the **consensus construction workflow** reconstructs the full viral genomes of all samples in the batch by modifying the SARS-CoV-2 reference genome with each sample's set of mutations.
+
+A few highlights of these workflows are:
+
+- All the variation analysis workflows are more sensitive than they need to be for consensus sequence generation, i.e. they can not only be used to capture fixed or majority alleles, but can also be used on their own to address less routine questions such as co-infections with two viral lineages or shifting intrahost allele-frequencies in, for example, immunocompromised, long-term infected patients.
+- The reporting workflow produces a batch-level overview plot of mutations and their observed allele-frequencies that enables spotting of batch effects like sample cross-contamination and outlier samples that are different from the rest of the batch.
+- The consensus workflow can express uncertainty about any base position in the generated consensus sequence by N-masking the position according to user-defined thresholds.
+- All of the workflows are openly developed and available in the form of defined releases through major public workflow registries.
+
+In this tutorial you will learn to:
+- obtain releases of the workflows from public registries
+- set up input data for different types of sequencing protocols
+- run and combine the workflows
+- understand the various outputs produced by the workflows and extract insight about viral samples from them

>
>
@@ -62,8 +93,27 @@ This tutorial will teach you how to obtain, run and combine these workflows appr
> {: .agenda}

+
# Prepare Galaxy and data

+The suggested input for this tutorial is a batch of data of particular interest because it represents a turning point in the COVID-19 pandemic.
+It is a subset (16 samples) of the first sequencing data reported from South Africa at the end of November 2021 for the then novel, fast-spreading SARS-CoV-2 variant that would later be named Omicron.
+This data has been Illumina paired-end sequenced after amplification with the ARTIC v4 set of tiled-amplicon primers.
+
+> Bringing your own data
+> Alternatively, you can also follow this tutorial using your own SARS-CoV-2 sequencing data (you need at least two samples) as long as it is of one of the following types:
+>
+> - Single-end data derived from Illumina-based whole-genome sequencing experiments
+> - Paired-end data derived from Illumina-based whole-genome sequencing experiments
+> - Paired-end data generated with Illumina-based tiled-amplicon (e.g. ARTIC) protocols
+> - ONT FASTQ files generated with Oxford nanopore (ONT)-based tiled-amplicon (e.g. ARTIC) protocols
+>
+> {% icon warning %} If you are using your own *tiled-amplicon* data, you are also expected to know the primer scheme used at the amplification step.
+>
+{: .comment}
+
+## Prepare a new Galaxy history
+
 Any analysis should get its own Galaxy history. So let's start by creating a new one:

> Prepare the Galaxy history
@@ -80,607 +130,655 @@ Any analysis should get its own Galaxy history. So let's start by creating a new

## Get sequencing data

-Before we can begin any Galaxy analysis, we need to upload the input data: FASTQ files with the sequenced viral RNA from different patients infected with SARS-CoV-2.
Several types of data are possible:
-
-- Single-end data derived from Illumina-based RNAseq experiments
-- Paired-end data derived from Illumina-based RNAseq experiments
-- Paired-end data generated with Illumina-based Ampliconic (ARTIC) protocols
-- ONT FASTQ files generated with Oxford nanopore (ONT)-based Ampliconic (ARTIC) protocols
+> Importing your own data
+> If you are going to use your own sequencing data, there are several possibilities to upload the data depending on how many datasets you have and what their origin is:
+>
+> - You can import data
+>
+>   - from your local file system,
+>   - from a given URL or
+>   - from a shared data library on the Galaxy server you are working on
+>
+>   In all of these cases you will also have to organize the imported data into a dataset collection as explained in detail for the suggested example data.
+>
+>   > Data logistics
+>   >
+>   > A detailed explanation of all of the above-mentioned options for getting your data into Galaxy and organizing it in your history is beyond the scope of this tutorial.
+>   > If you are struggling to get your own data set up as shown for the example data in this section, please:
+>   > - Option 1: Browse some of the material on [Using Galaxy and Managing your Data]({% link topics/galaxy-interface %})
+>   > - Option 2: Consult the FAQs on [uploading data]({% link faqs/galaxy/index.md %}#data%20upload) and on [collections]({% link faqs/galaxy/index.md %}#collections)
+>   > - Option 3: Watch some of the related brief videos from the [{% icon video %} Galactic introductions](https://www.youtube.com/playlist?list=PLNFLKDpdM3B9UaxWEXgziHXO3k-003FzE) playlist.
+>   >
+>   {: .details}
+>
+> - Alternatively, if your data is available from [NCBI's Sequence Read Archive (SRA)](https://www.ncbi.nlm.nih.gov/sra), you can import it with the help of a dedicated tool, which will organize the data into collections for you.
+>
+>   > Getting data from SRA
+>   >
+>   > The simpler [SARS-CoV-2 sequencing data analysis tutorial]({% link topics/variant-analysis/tutorials/sars-cov-2/tutorial.md %}) uses and explains this alternative way of importing.
+>   >
+>   {: .details}
+{: .comment}

-We encourage you to use your own data here (with at least 2 samples). If you do not have any datasets available, we provide some example datasets (paired-end data generated with Illumina-based Ampliconic (ARTIC) protocols) from [COG-UK](https://www.cogconsortium.uk/), the COVID-19 Genomics UK Consortium.
+We recommend downloading the suggested batch of early Omicron data via URLs from the [European Nucleotide Archive (ENA)](https://www.ebi.ac.uk/ena/browser/home). If your Galaxy server offers the same data through a shared data library, importing from there is faster (the data is already on the server), so we provide instructions for that scenario as well.
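The download links used in the hands-on box below follow ENA's regular FASTQ mirror layout. As an editorial illustration (not part of the tutorial's workflows), the sketch below reproduces that layout programmatically for 11-character run accessions like the ones used here; double-check any generated link against the ENA browser before relying on it:

```python
# Sketch: build ENA FASTQ download URLs for paired-end SRA run accessions.
# Assumes ENA's mirror layout for 11-character accessions (e.g. SRR17054502):
# <base>/<first 6 chars>/0<last 2 digits>/<accession>/<accession>_<1|2>.fastq.gz

ENA_FASTQ_BASE = "ftp://ftp.sra.ebi.ac.uk/vol1/fastq"

def ena_fastq_urls(accession: str) -> list[str]:
    """Return the forward/reverse FASTQ URLs for one run accession."""
    prefix = accession[:6]           # e.g. "SRR170"
    subdir = "0" + accession[-2:]    # e.g. "002" for SRR17054502
    return [
        f"{ENA_FASTQ_BASE}/{prefix}/{subdir}/{accession}/{accession}_{mate}.fastq.gz"
        for mate in (1, 2)
    ]

for acc in ["SRR17054502", "SRR17051933"]:
    print("\n".join(ena_fastq_urls(acc)))
```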
+ +> Import the sequencing data +> +> - Option 1: Import from the ENA +> +> ``` +> ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR170/002/SRR17054502/SRR17054502_1.fastq.gz +> ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR170/002/SRR17054502/SRR17054502_2.fastq.gz +> ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR170/003/SRR17054503/SRR17054503_1.fastq.gz +> ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR170/003/SRR17054503/SRR17054503_2.fastq.gz +> ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR170/004/SRR17054504/SRR17054504_1.fastq.gz +> ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR170/004/SRR17054504/SRR17054504_2.fastq.gz +> ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR170/005/SRR17054505/SRR17054505_1.fastq.gz +> ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR170/005/SRR17054505/SRR17054505_2.fastq.gz +> ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR170/006/SRR17054506/SRR17054506_1.fastq.gz +> ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR170/006/SRR17054506/SRR17054506_2.fastq.gz +> ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR170/007/SRR17054507/SRR17054507_1.fastq.gz +> ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR170/007/SRR17054507/SRR17054507_2.fastq.gz +> ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR170/008/SRR17054508/SRR17054508_1.fastq.gz +> ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR170/008/SRR17054508/SRR17054508_2.fastq.gz +> ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR170/009/SRR17054509/SRR17054509_1.fastq.gz +> ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR170/009/SRR17054509/SRR17054509_2.fastq.gz +> ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR170/010/SRR17054510/SRR17054510_1.fastq.gz +> ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR170/010/SRR17054510/SRR17054510_2.fastq.gz +> ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR170/033/SRR17051933/SRR17051933_1.fastq.gz +> ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR170/033/SRR17051933/SRR17051933_2.fastq.gz +> ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR170/034/SRR17051934/SRR17051934_1.fastq.gz +> ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR170/034/SRR17051934/SRR17051934_2.fastq.gz +> ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR170/035/SRR17051935/SRR17051935_1.fastq.gz +> ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR170/035/SRR17051935/SRR17051935_2.fastq.gz +> ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR170/036/SRR17051936/SRR17051936_1.fastq.gz +> ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR170/036/SRR17051936/SRR17051936_2.fastq.gz +> ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR170/037/SRR17051937/SRR17051937_1.fastq.gz +> ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR170/037/SRR17051937/SRR17051937_2.fastq.gz +> ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR170/038/SRR17051938/SRR17051938_1.fastq.gz +> ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR170/038/SRR17051938/SRR17051938_2.fastq.gz +> ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR170/039/SRR17051939/SRR17051939_1.fastq.gz +> ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR170/039/SRR17051939/SRR17051939_2.fastq.gz +> ``` +> +> 1. Copy the links above +> 2. Open the {% tool [Upload](upload1) %} Manager +> 3. In the top row of tabs select **Collection** +> 4. Configure the drop-down select boxes on that tab like this: +> - *"Collection Type"*: `List of Pairs` +> - *"File Type"*: `fastqsanger.gz` +> 5. Click on **Paste/Fetch data** and paste the links you copied into the empty text box +> 6. Press **Start** +> 7. Wait for the **Build** button to become enabled, then click it +> 8. In the lower half of the next dialogue, Galaxy already suggests a mostly reasonable pairing of the inputs. +> +> As you can see, however, this auto-pairing would retain a *.fastq* suffix attached to each pair of forward and reverse reads. To correct this +> 1. 
Click **Unpair all** above the suggested pairings to undo all of them.
+>    2. Change the following default values in the upper half of the window:
+>       - *"unpaired forward"*: `_1.fastq.gz` (instead of *_1*)
+>       - *"unpaired reverse"*: `_2.fastq.gz` (instead of *_2*)
+>    3. Click **Auto-pair**
+> 9. At the bottom of the window, enter a suitable **Name**, like `Sequencing data`, for the new collection
+> 10. Click on **Create collection**
+>
+> - Option 2: Import from a shared data library
+>
+>   {% snippet faqs/galaxy/datasets_import_from_data_library.md astype="as a Collection" collection_type="List of Pairs" collection_name="Sequencing data" tohistory="the history you created for this tutorial" path="GTN - Material / Variant analysis / Mutation calling, viral genome reconstruction and lineage/clade assignment from SARS-CoV-2 sequencing data / DOI: 10.5281/zenodo.5036686" box_type="none" %}
+>
{: .hands_on}

## Import reference sequence and auxiliary datasets

Besides the sequenced reads data, we need at least two additional datasets for calling variants and annotating them:

- the SARS-CoV-2 reference sequence [NC_045512.2](https://www.ncbi.nlm.nih.gov/nuccore/NC_045512.2?report=fasta) to align and compare our sequencing data against

- a tabular dataset defining aliases for viral gene product names, which will let us translate NCBI RefSeq Protein identifiers (used by the SnpEff annotation tool) to the commonly used names of coronavirus proteins and cleavage products.

> Get reference sequence and feature mappings
>
> 1. Get the SARS-CoV-2 reference sequence
>
>    A convenient public download link for this sequence is best obtained from the ENA again, where the sequence is known under its [INSDC](https://www.insdc.org/) alias [MN908947.3](https://www.ebi.ac.uk/ena/browser/view/MN908947.3):
>    ```
>    https://www.ebi.ac.uk/ena/browser/api/fasta/MN908947.3?download=true
>    ```
>
>    1. {% tool [Upload](upload1) %} the reference to your history via the link above and make sure the dataset format is set to `fasta`.
>
>       {% snippet faqs/galaxy/datasets_import_via_link.md format="fasta" %}
>    2. {% tool [Replace Text in entire line](toolshed.g2.bx.psu.edu/repos/bgruening/text_processing/tp_replace_in_line/1.1.2) %} to simplify the reference sequence name
>       - {% icon param-file %} *"File to process"*: the uploaded reference sequence from the ENA
>       - In {% icon param-repeat %} *"1. Replacement"*:
>         - *"Find pattern"*: `^>.+`
>         - *"Replace with"*: `>NC_045512.2`
>
>       While the identifiers `MN908947.3` and `NC_045512.2` really refer to the same sequence, one of the tools we are going to use during the analysis (SnpEff) requires the NCBI RefSeq identifier.
>    3. When the Replace Text tool run is finished, **rename** the output dataset to make it clear that this is the SARS-CoV-2 reference dataset to use in the analysis.
+>
+>    {% snippet faqs/galaxy/datasets_rename.md name="SARS-CoV-2 reference" format="fasta" %}
+> 2. {% tool [Upload](upload1) %} the mapping for translation product identifiers in tabular format
+>
+>    This mapping really consists of just a few lines of text. Each line lists the NCBI Protein RefSeq identifier of a SARS-CoV-2 translation product (which the SnpEff tool knows about and will use for annotating mutation effects), followed by a more commonly used name for that product (which we would like to see in final mutation reports). The last line specifies a "mapping" for the **.** annotation, which SnpEff uses for mutations that do not affect any viral open-reading frame. We do not have a better name for it, so we specify that we want to retain this annotation unaltered in reports.
+>
+>    ```
+>    YP_009725297.1	leader
+>    YP_009725298.1	nsp2
+>    YP_009725299.1	nsp3
+>    YP_009725300.1	nsp4
+>    YP_009725301.1	3Cpro
+>    YP_009725302.1	nsp6
+>    YP_009725303.1	nsp7
+>    YP_009725304.1	nsp8
+>    YP_009725305.1	nsp9
+>    YP_009725306.1	nsp10
+>    YP_009725307.1	RdRp
+>    YP_009725308.1	helicase
+>    YP_009725309.1	ExoN
+>    YP_009725310.1	endoR
+>    YP_009725311.1	MethTr
+>    YP_009725312.1	nsp11
+>    GU280_gp02	S
+>    GU280_gp03	orf3a
+>    GU280_gp04	E
+>    GU280_gp05	M
+>    GU280_gp06	orf6
+>    GU280_gp07	orf7a
+>    GU280_gp08	orf7b
+>    GU280_gp09	orf8
+>    GU280_gp10	N
+>    GU280_gp11	orf10
+>    .	.
+>    ```
+>
+>    Two remarks on this content:
+>    - Since the feature mapping dataset is expected to be in tabular format, but the above display uses spaces to separate columns, please make sure you have **Convert spaces to tabs** checked when creating the dataset from the copied content!
+>    - If you prefer other names for certain translation products than the ones defined above (*e.g.* you might be used to calling the first peptide *nsp1* instead of *leader*), you are, of course, free to change those names in the pasted content before uploading it to Galaxy. Only the RefSeq identifiers in the first column need to be kept as they are, since these are fixed by the SnpEff tool.
+>
+>    {% snippet faqs/galaxy/datasets_create_new_file.md name="SARS-CoV-2 feature mapping" format="tabular" convertspaces="true" %}
+>
{: .hands_on}

Another two datasets are needed only for the analysis of ampliconic, e.g. ARTIC-amplified, input data:

- a BED file specifying the primers used during amplification and their binding sites on the viral genome
- a custom tabular file describing the amplicon grouping of the primers (currently NOT used for tiled-amplicon ONT data)

> Using your own tiled-amplicon data? Provide the correct primer scheme and amplicon info.
>
> The instructions below assume that you are going to analyze viral samples amplified using **version 4 of the ARTIC network's SARS-CoV-2 set of primers**, which is the case for the suggested Omicron batch of data.
> -> - Option 2 [{% icon video %}](https://youtu.be/hC8KSuT_OP8): Your own local data using **FTP** (recommended for >10 datasets) +> If you have decided to analyze your own tiled-amplicon sequencing data in this tutorial, and if your samples have been amplified with a **different** set of primers, you are supposed, at the following step, to upload a primer scheme file and corresponding amplicon information that describes this set of primers. > -> {% snippet faqs/galaxy/datasets_upload_ftp.md %} +> The Galaxy Project maintains a [collection of such files for some commonly used sets of primers](https://doi.org/10.5281/zenodo.4555734). +> If the files describing your set of primers are part of this collection, you can simply upload them using their Zenodo download URLs (analogous to what is shown for the ARTIC v4 primers below). > -> - Option 3: From the shared data library +> For sets of primers *not* included in the collection, you will have to create those files yourself (or obtain them from other sources). +> The expected format for the primer scheme file is 6-column BED format, while the amplicon info file is a simple tabular format that lists all primer names (which must be identical to the ones used in the primer scheme file) that contribute to formation of the same amplicon on a single tab-separated line. +> If in doubt, download the ARTIC v4 files through the URLs provided below, and use them as a template for your own custom files. > -> {% snippet faqs/galaxy/datasets_import_from_data_library.md path="GTN - Material / Variant analysis / Mutation calling, viral genome reconstruction and lineage/clade assignment from SARS-CoV-2 sequencing data / DOI: 10.5281/zenodo.5036686" %} -> -> - Option 4: From an external server via URL -> -> {% snippet faqs/galaxy/datasets_import_via_link.md %} -> -> For our example datasets, the datasets are stored on [Zenodo]({{ page.zenodo_link }}) and can be retrieved using the following URLs: -> -> ``` -> {{ page.zenodo_link }}/files/ERR5931005_1.fastqsanger.gz -> {{ page.zenodo_link }}/files/ERR5931005_2.fastqsanger.gz -> {{ page.zenodo_link }}/files/ERR5931006_1.fastqsanger.gz -> {{ page.zenodo_link }}/files/ERR5931006_2.fastqsanger.gz -> {{ page.zenodo_link }}/files/ERR5931007_1.fastqsanger.gz -> {{ page.zenodo_link }}/files/ERR5931007_2.fastqsanger.gz -> {{ page.zenodo_link }}/files/ERR5931008_1.fastqsanger.gz -> {{ page.zenodo_link }}/files/ERR5931008_2.fastqsanger.gz -> {{ page.zenodo_link }}/files/ERR5949456_1.fastqsanger.gz -> {{ page.zenodo_link }}/files/ERR5949456_2.fastqsanger.gz -> {{ page.zenodo_link }}/files/ERR5949457_1.fastqsanger.gz -> {{ page.zenodo_link }}/files/ERR5949457_2.fastqsanger.gz -> {{ page.zenodo_link }}/files/ERR5949458_1.fastqsanger.gz -> {{ page.zenodo_link }}/files/ERR5949458_2.fastqsanger.gz -> {{ page.zenodo_link }}/files/ERR5949459_1.fastqsanger.gz -> {{ page.zenodo_link }}/files/ERR5949459_2.fastqsanger.gz -> {{ page.zenodo_link }}/files/ERR5949460_1.fastqsanger.gz -> {{ page.zenodo_link }}/files/ERR5949460_2.fastqsanger.gz -> {{ page.zenodo_link }}/files/ERR5949461_1.fastqsanger.gz -> {{ page.zenodo_link }}/files/ERR5949461_2.fastqsanger.gz -> {{ page.zenodo_link }}/files/ERR5949462_1.fastqsanger.gz -> {{ page.zenodo_link }}/files/ERR5949462_2.fastqsanger.gz -> {{ page.zenodo_link }}/files/ERR5949463_1.fastqsanger.gz -> {{ page.zenodo_link }}/files/ERR5949463_2.fastqsanger.gz -> {{ page.zenodo_link }}/files/ERR5949464_1.fastqsanger.gz -> {{ page.zenodo_link }}/files/ERR5949464_2.fastqsanger.gz -> {{ page.zenodo_link 
}}/files/ERR5949465_1.fastqsanger.gz -> {{ page.zenodo_link }}/files/ERR5949465_2.fastqsanger.gz -> {{ page.zenodo_link }}/files/ERR5949466_1.fastqsanger.gz -> {{ page.zenodo_link }}/files/ERR5949466_2.fastqsanger.gz -> {{ page.zenodo_link }}/files/ERR5949467_1.fastqsanger.gz -> {{ page.zenodo_link }}/files/ERR5949467_2.fastqsanger.gz -> {{ page.zenodo_link }}/files/ERR5949468_1.fastqsanger.gz -> {{ page.zenodo_link }}/files/ERR5949468_2.fastqsanger.gz -> {{ page.zenodo_link }}/files/ERR5949469_1.fastqsanger.gz -> {{ page.zenodo_link }}/files/ERR5949469_2.fastqsanger.gz -> ``` -> -> 2. Create a collection to organize the data -> -> - Option 1 [{% icon video %}](https://youtu.be/6ZU9hFjnRDo): Single-end data (Illumina or ONT data) -> -> {% snippet faqs/galaxy/collections_build_list.md %} +{: .details} + +> Get primer scheme and amplicon info > -> - Option 2 [{% icon video %}](https://youtu.be/6toVj35q1r0): Paired-end data (Illumina data) +> 1. Get the ARTIC v4 primer scheme file from > -> {% snippet faqs/galaxy/collections_build_list_paired.md %} +> ``` +> https://zenodo.org/record/5888324/files/ARTIC_nCoV-2019_v4.bed +> ``` > -> For the example datasets: -> - Since the datasets carry `_1` and `_2` in their names, Galaxy may already have detected a possible pairing scheme for the data, in which case the datasets will appear in green in the lower half (the paired section) of the dialog. +> and upload it to Galaxy as a dataset of type `bed`. > -> You could accept this default pairing, but as shown in the middle column of the paired section, this would include the `.fastqsanger` suffix in the pair names (even with `Remove file extensions?` checked Galaxy would only remove the last suffix, `.gz`, from the dataset names. +> {% snippet faqs/galaxy/datasets_import_via_link.md format="bed" %} > -> It is better to undo the default pairing and specify exactly what we want: -> - at the top of the *paired section*: click `Unpair all` +> 2. Get the ARTIC v4 amplicon info file from > -> This will move all input datasets into the *unpaired section* in the upper half of the dialog. -> - set the text of *unpaired forward* to: `_1.fastqsanger.gz` -> - set the text of *unpaired reverse* to: `_2.fastqsanger.gz` -> - click: `Auto-pair` +> ``` +> https://zenodo.org/record/5888324/files/ARTIC_amplicon_info_v4.tsv +> ``` > -> All datasets should be moved to the *paired section* again, but the middle column should now show that only the sample accession numbers will be used as the pair names. +> and upload it to Galaxy as a dataset of type `tabular`. > -> - Make sure *Hide original elements* is checked to obtain a cleaned-up history after building the collection. -> - Click *Create Collection* +> {% snippet faqs/galaxy/datasets_import_via_link.md format="tabular" %} > {: .hands_on} -> Learning to build collections automatically -> -> It is possible to build collections from tabular data containing URLs, sample sheets, list of accessions or identifiers, etc., directly during upload of the data. [A dedicated tutorial is available to explain the different possibilities]({% link topics/galaxy-interface/tutorials/upload-rules/tutorial.md %}). -> -{: .comment} +At this point you should have the following items in your history: -## Import auxiliary datasets +1. A collection of sequenced reads for all samples to be analyzed + - This collection should be organized as a list with one element per sample that you are planning to analyze. 
-Besides the sequenced reads data, we need at least two additional datasets for calling variants and annotating them: + If you are going to analyze the suggested batch of Omicron data, the list should have 16 elements. -- the SARS-CoV-2 reference sequence [NC_045512.2](https://www.ncbi.nlm.nih.gov/nuccore/NC_045512.2?report=fasta) to align and compare our sequencing data against + - For paired-end sequenced data, like the suggested batch, each of the list elements should itself be a paired collection of a *forward* and a *reverse* reads dataset, each in *fastqsanger.gz* format. -- a tabular dataset defining aliases for viral gene product names, which will let us translate NCBI RefSeq Protein identifiers (used by the SnpEff annotation tool) to the commonly used names of coronavirus proteins and cleavage products. +2. The SARS-CoV-2 reference as a single *fasta* dataset +3. The SARS-CoV-2 feature mappings as a single *tabular* dataset +4. Only required if you are analyzing *tiled-amplicon* data (which is the case for the suggested batch): + - a primer scheme as a single *bed* or *bed6* dataset + - amplicon information as a single *tabular* dataset (*not* required for ONT data) -Another two datasets are needed only for the analysis of ampliconic, e.g. ARTIC-amplified, input data: +If any of these items are still missing, head back to the corresponding section(s) and upload them now. -- a BED file specifying the primers used during amplification and their binding sites on the viral genome -- a custom tabular file describing the amplicon grouping of the primers +If any of these items have not been assigned the correct format (expand the view of each dataset to reveal the format Galaxy has recorded for it), please fix them now. +{% snippet faqs/galaxy/datasets_change_datatype.md %} -> Import auxiliary datasets -> -> 1. Import the auxiliary datasets: -> - the SARS-CoV-2 reference (`NC_045512.2_reference.fasta`) -> - gene product name aliases (`NC_045512.2_feature_mapping.tsv`) -> - ARTIC v3 primer scheme (`ARTIC_nCoV-2019_v3.bed`) -> - ARTIC v3 primer amplicon grouping info (`ARTIC_amplicon_info_v3.tsv`) -> -> > Not using ARTIC v3 amplified sequencing data? -> > -> > The instructions here assume you will be analyzing the example samples -> > suggested above, which have been amplified using version 3 of the ARTIC -> > network's SARS-CoV-2 primer set. If you have decided to work through -> > this tutorial using your own samples of interest, and if those samples -> > have been amplified with a different primer set, you will have to upload -> > your own datasets with primer and amplicon information at this point. -> > If the primer set is from the ARTIC network, just not version 3, you -> > should be able to obtain the primer BED file from -> > [their SARS-CoV-2 github repo](https://github.com/artic-network/artic-ncov2019/tree/master/primer_schemes/nCoV-2019). -> > Look for a 6-column BED file structured like the version 3 one we suggest below. -> > For the tabular amplicon info file, you only need to combine all primer names from -> > the BED file that contribute to the same amplicon on a single tab-separated line. -> > The result should look similar to the ARTIC v3 amplicon grouping info file we -> >suggest to upload. -> {: .details} -> -> Several options exist to import these datasets: -> -> - Option 1: From the shared data library +If everything looks fine, you are ready to start the actual data analysis. 
+
+
# Analysis

As shown in figure 1, the analysis will consist mainly of running three workflows: one for calling the mutations in each input sample, one for reporting the findings, and one for generating sample consensus sequences based on the identified mutations.

You will also run a handful of individual tools, in particular for generating and exploring lineage assignments for the input samples.

Along the way you will interpret key outputs of each workflow and tool used, to get a complete picture of what can be learnt about the genomes of the input samples.

## From sequencing data to annotated mutations per sample

To identify mutations from sequencing data, the first workflow performs steps including

- sequencing data quality control with filtering and trimming of low-quality reads
- mapping
- filtering and further processing of mapped reads (including primer-trimming in case of tiled-amplicon data)
- calling of mutations
- annotation of identified mutations with their effects on open-reading frames (ORFs) and proteins

Four flavors of this workflow are available, each optimized for a different type of input data, so your first task is to obtain the correct workflow for your type of input data.

> Import the workflow for your data into Galaxy
>
> All workflows developed as part of the Galaxy Covid-19 project can be retrieved via either of the two popular workflow registries, [Dockstore](https://dockstore.org/) and [WorkflowHub](https://workflowhub.eu/); the choice between them is up to you, and Galaxy makes this process really easy.
>
> {% snippet faqs/galaxy/workflows_import_search.md search_query='organization:"iwc" name:"sars-cov-2"' box_type="none" %}
>
> *IWC* (the Intergalactic Workflow Commission) is the organization that the Galaxy Covid-19 project uses to publish its workflows, and the name restriction makes sure we only get workflows from that organization that deal with SARS-CoV-2 data analysis.
>
> Depending on your input data you will need to select the appropriate workflow from the list of hits returned by the workflow registry server.
> This would be:
> - **sars-cov-2-pe-illumina-artic-variant-calling/COVID-19-PE-ARTIC-ILLUMINA**
>
>   if you are working with the suggested batch of samples for this tutorial, or if your own data is tiled-amplicon data sequenced on the Illumina platform in paired-end mode
>
>   Once imported into Galaxy, this workflow will appear under the name **COVID-19: variation analysis on ARTIC PE data**.
+> - **sars-cov-2-ont-artic-variant-calling/COVID-19-ARTIC-ONT**
+>
+>   if you are working with tiled-amplicon data sequenced on the ONT platform
+>
+>   Once imported into Galaxy, this workflow will appear under the name **COVID-19: variation analysis of ARTIC ONT data**.
+> - **sars-cov-2-pe-illumina-wgs-variant-calling/COVID-19-PE-WGS-ILLUMINA**
+>
+>   if you are working with WGS (i.e. non-ampliconic) data obtained on the Illumina platform in paired-end mode
+>
+>   Once imported into Galaxy, this workflow will appear under the name **COVID-19: variation analysis on WGS PE data**.
+> - **sars-cov-2-se-illumina-wgs-variant-calling/COVID-19-SE-WGS-ILLUMINA**
+>
+>   if you are working with WGS data obtained on the Illumina platform in single-end mode
+>
+>   Once imported into Galaxy, this workflow will appear under the name **COVID-19: variation analysis on WGS SE data**.
+>
+> In all cases, the latest version of the workflow should be fine to use in this tutorial.
+>
{: .hands_on}
-
-# From FASTQ to annotated allelic variants
-
-To identify the SARS-CoV-2 allelic variants (AVs), a first workflow converts the FASTQ files to annotated AVs through a series of steps that include quality control, trimming, mapping, deduplication, AV calling, and filtering.
-
-Four versions of this workflow are available with their tools and parameters optimized for different types of input data as outlined in the following table:
-
-Workflow version | Input data | Read aligner | Variant caller
---- | --- | --- | ---
-Illumina RNAseq SE | Single-end data derived from RNAseq experiments | **bowtie2** {% cite langmead_2012 %} | **lofreq** {% cite wilm_2012 %}
-Illumina RNAseq PE | Paired-end data derived from RNAseq experiments | **bwa-mem** {% cite li_2010 %} | **lofreq** {% cite wilm_2012 %}
-Illumina ARTIC | Paired-end data generated with Illumina-based Ampliconic (ARTIC) protocols | **bwa-mem** {% cite li_2010 %} | **lofreq** {% cite wilm_2012 %}
-ONT ARTIC | ONT FASTQ files generated with Oxford nanopore (ONT)-based Ampliconic (ARTIC) protocols | **minimap2** {% cite li_2018 %} | **medaka**

> About the workflows
>
-> - The two Illumina RNASeq workflows (Illumina RNAseq SE and Illumina RNAseq PE) perform read mapping with **bwa-mem** and **bowtie2**, respectively, followed by sensitive allelic-variant (AV) calling across a wide range of AFs with **lofreq**.
-> - The workflow for Illumina-based ARTIC data (Illumina ARTIC) builds on the RNASeq workflow for paired-end data using the same steps for mapping (**bwa-mem**) and AV calling (**lofreq**), but adds extra logic operators for trimming ARTIC primer sequences off reads with the **ivar** package. In addition, this workflow uses **ivar** also to identify amplicons affected by ARTIC primer-binding site mutations and excludes reads derived from such “tainted” amplicons when calculating alternative allele frequences (AFs) of other AVs.
-> - The workflow for ONT-sequenced ARTIC data (ONT ARTIC) is modeled after the alignment/AV-calling steps of the [ARTIC pipeline](https://artic.readthedocs.io/). It performs, essentially, the same steps as that pipeline’s minion command, i.e. read mapping with **minimap2** and AV calling with **medaka**. Like the Illumina ARTIC workflow it uses **ivar** for primer trimming. Since ONT-sequenced reads have a much higher error rate than Illumina-sequenced reads and are therefore plagued more by false-positive AV calls, this workflow makes no attempt to handle amplicons affected by potential primer-binding site mutations.
+> The four *variation analysis* workflow flavors differ as follows:
+> - All workflows for the analysis of Illumina data use a core analysis flow through **bwa-mem** (or **bowtie2** for single-end data) as the read mapper and **lofreq** as a sensitive variant caller (one that is capable of identifying non-majority variant alleles, too), while the workflow for the analysis of ONT data uses **minimap2** and **medaka** for read mapping and variant calling, respectively.
+> - Workflows for the analysis of tiled-amplicon data differ from the WGS data analysis ones in that they do not perform deduplication of mapped reads (because apparent duplicates are a side-effect of sequencing defined amplicons instead of random genome fragments).
+>   In addition, the tiled-amplicon data analysis workflows handle primer trimming and amplicon-based filtering through the **ivar** suite of tools and additional processing logic.
>
-> All four workflows use **SnpEff**, specifically its 4.5covid19 version, for AV annotation.
+> All four workflows use **fastp** for quality control, raw reads filtering and trimming, and **SnpEff**, specifically its 4.5covid19 version, for annotation of mutation effects.
> -> Workflows default to requiring an AF ≥ 0.05 and AV-supporting reads of ≥ 10 (these and all other parameters can be easily changed by the user). For an AV to be listed in the reports, it must surpass these thresholds in at least one sample of the respective dataset. We estimate that for AV calls with an AF ≥ 0.05, our analyses have a false-positive rate of < 15% for both Illumina RNAseq and Illumina ARTIC data, while the true-positive rate of calling such low-frequency AVs is ~80% and approaches 100% for AVs with an AF ≥ 0.15. This estimate is based on an initial application of the Illumina RNAseq and Illumina ARTIC workflows to two samples for which data of both types had been obtained at the virology department of the University of Freiburg and the assumption that AVs supported by both sets of sequencing data are true AVs. The second threshold of 10 AV-supporting reads is applied to ensure that calculated AFs are sufficiently precise for all AVs. -> -> More details about the workflows, including benchmarking of the tools, can be found on [covid19.galaxyproject.org](https://covid19.galaxyproject.org/genomics/global_platform/#methods) {: .details} -> From FASTQ to annotated AVs +{% include _includes/cyoa-choices.html option1="tiled-amplicon Illumina paired-end" option2="tiled-amplicon ONT" option3="WGS Illumina paired-end" option4="WGS Illumina single-end" default="tiledamplicon-Illumina-pairedend" text="Now that you have imported the data and the corresponding workflow of your choice, please select the type of your input data so that we can adjust a few parts of this tutorial that are dependent on the nature of your data:" %} + +> From sequencing data to annotated mutations > -> 1. **Get the workflow** for your data into Galaxy +>
>
+> 1. Run the **COVID-19: variation analysis on ARTIC PE data** {% icon workflow %} workflow using the following parameters:
+>
+>    - {% icon param-collection %} *"Paired Collection"*: your paired collection of input sequencing data
+>    - {% icon param-file %} *"NC_045512.2 FASTA sequence of SARS-CoV-2"*: the `SARS-CoV-2 reference` sequence
+>    - {% icon param-file %} *"ARTIC primer BED"*: the uploaded primer scheme in **bed** format
+>    - {% icon param-file %} *"ARTIC primers to amplicon assignments"*: the uploaded amplicon info in **tabular** format
+>
+>    A common mistake here is to mix up the last two datasets: the *primer BED* dataset is the one with the positions of the primer binding sites listed in it. The *amplicon assignments* dataset contains only (grouped) names of primers.
+>
+>    The additional workflow parameters *"Read removal minimum AF"*, *"Read removal maximum AF"*, *"Minimum DP required after amplicon bias correction"* and *"Minimum DP_ALT required after amplicon bias correction"* can all be left at their default values. The tiled-amplicon Illumina workflow goes to great lengths to calculate accurate allele frequencies even in cases of viral mutations that affect the binding of some of the tiled-amplicon primers. The four settings above are for fine-tuning this behavior, but their defaults have been carefully chosen and you should modify them only if you know exactly what you are doing.
+>
+>
> -> - Option 2: Import the workflow from [Dockstore](https://dockstore.org/) using Galaxy's workflow search +> 1. Run the **COVID-19: variation analysis of ARTIC ONT data** {% icon workflow %} workflow using the following parameters: > -> {% snippet faqs/galaxy/workflows_import_search.md trs_server="Dockstore" search_query='organization:"iwc-workflows"' %} +> - {% icon param-collection %} *"ONT-sequenced reads"*: your collection of input sequencing data +> - {% icon param-file %} *"NC_045512.2 FASTA sequence of SARS-CoV-2"*: the `SARS-CoV-2 reference` sequence +> - {% icon param-file %} *"Primer binding sites info in BED format"*: the uploaded primer scheme in **bed** format > -> For the example dataset: `sars-cov-2-pe-illumina-artic-variant-calling/COVID-19-PE-ARTIC-ILLUMINA` +> The optional workflow parameters *"Minimum read length"* and *"Maximum read length"* should be chosen according to the tiled-amplicon primer scheme's amplicon sizes. +> The [ARTIC network's recommendations for excluding obviously chimeric reads](https://artic.network/ncov-2019/ncov2019-bioinformatics-sop.html) (see the section "Read filtering" on that page) are a good starting point. > -> - Option 3: Import the workflow via its github link +> The workflow defaults are appropriate for ARTIC network primers, but you may have to modify them if your sample material has been amplified with another primer scheme. As suggested on the above page: try to set *"Minimum read length"* to the size of the smallest amplicon in your primer scheme, and *"Maximum read length"* to the size of the largest amplicon plus 200 nts. > -> - Open the GitHub repository of your workflow -> - [Illumina ARTIC PE](https://github.com/iwc-workflows/sars-cov-2-pe-illumina-artic-variant-calling) - The one to use for example datasets -> - [Illumina RNAseq SE](https://github.com/iwc-workflows/sars-cov-2-se-illumina-wgs-variant-calling) -> - [Illumina RNAseq PE](https://github.com/iwc-workflows/sars-cov-2-pe-illumina-wgs-variant-calling) -> - [ONT ARTIC](https://github.com/iwc-workflows/sars-cov-2-ont-artic-variant-calling) -> - Open the `.ga` file -> - Click on `Raw` at the top right of the file view -> - Save the file or Copy the URL of the file -> - Import the workflow to Galaxy +>
+>
> -> {% snippet faqs/galaxy/workflows_import.md %} +> 1. Run the **COVID-19: variation analysis on WGS PE data** {% icon workflow %} workflow using the following parameters: > -> 2. Run **COVID-19: variation analysis on ...** {% icon workflow %} using the following parameters: +> - {% icon param-collection %} *"Paired Collection"*: your paired collection of input sequencing data +> - {% icon param-file %} *"NC_045512.2 FASTA sequence of SARS-CoV-2"*: the `SARS-CoV-2 reference` sequence > -> {% snippet faqs/galaxy/workflows_run.md %} +>
+>
> -> - *"Send results to a new history"*: `No` +> 1. Run the **COVID-19: variation analysis on WGS SE data** {% icon workflow %} workflow using the following parameters: > -> - For **Illumina ARTIC PE** workflow (named **COVID-19: variation analysis on ARTIC PE data**), *to use for example datasets* -> - {% icon param-file %} *"1: ARTIC primers to amplicon assignments"*: `ARTIC_amplicon_info_v3.tsv` or `ARTIC amplicon info v3` -> - {% icon param-file %} *"2: ARTIC primer BED"*: `ARTIC_nCoV-2019_v3.bed` or `ARTIC nCoV-2019 v3` -> - {% icon param-file %} *"3: FASTA sequence of SARS-CoV-2"*: `NC_045512.2_reference.fasta` or `NC_045512.2 reference sequence` -> - {% icon param-collection %} *"4: Paired Collection (fastqsanger) - A paired collection of fastq datasets to call variant from"*: paired collection created for the input datasets +> - {% icon param-collection %} *"Single End Collection"*: your collection of input sequencing data +> - {% icon param-file %} *"NC_045512.2 FASTA sequence of SARS-CoV-2"*: the `SARS-CoV-2 reference` sequence > -> - For **Illumina RNAseq PE** workflow (named **COVID-19: variation analysis on WGS PE data**) -> - {% icon param-collection %} *"1: Paired Collection (fastqsanger)"*: paired collection created for the input datasets -> - {% icon param-file %} *"2: NC_045512.2 FASTA sequence of SARS-CoV-2"*: `NC_045512.2_reference.fasta` or `NC_045512.2 reference sequence` +>
> -> - For **Illumina RNAseq SE** workflow (named **COVID-19: variation analysis on WGS SE data**) -> - {% icon param-collection %} *"1: Input dataset collection"*: dataset collection created for the input datasets -> - {% icon param-file %} *"2: NC_045512.2 FASTA sequence of SARS-CoV-2"*: `NC_045512.2_reference.fasta` or `NC_045512.2 reference sequence` +> > Running a workflow +> > +> > {% snippet faqs/galaxy/workflows_run.md box_type="none" %} +> > +> > Note: the {% icon galaxy-gear %} icon next to **Run Workflow** offers the option to *Send results to a new history*. +> > This is very useful if you are planning to analyze the data in your current history in multiple different ways, and you would like to have each analysis end up in its own dedicated history. +> > Here, however, we only want to do one analysis of our batch of data so we are fine with results of the workflow run getting added to the current history. +> {: .tip} > -> - For **ONT ARTIC** workflow (named **COVID-19: variation analysis of ARTIC ONT data**) -> - {% icon param-file %} *"1: ARTIC primer BED"*: `ARTIC_nCoV-2019_v3.bed` or `ARTIC nCoV-2019 v3` -> - {% icon param-file %} *"2: FASTA sequence of SARS-CoV-2"*: `NC_045512.2_reference.fasta` or `NC_045512.2 reference sequence` -> - {% icon param-collection %} *"3: Collection of ONT-sequenced reads"*: dataset collection created for the input datasets {: .hands_on} -The execution of the workflow takes some time. It is possible to launch the next step even if it is not done, as long as all steps are successfully scheduled. +Scheduling of the workflow will take a while. One of the last datasets that will be added to your history will be called **Preprocessing and mapping reports**. It is the first overview report that you will obtain in this tutorial, and the only one produced by the variation analysis workflow. -# From annotated AVs per sample to AV summary +Once this dataset is ready take a moment to explore its content. It contains potentially valuable information about coverage of reads mapped to the reference genome and quality statistics for these reads. Technical issues with any samples in a batch are typically visible in this report already and spotting them early can often prevent overinterpretation of later reports. -Once the jobs of previous workflows are done, we identified AVs for each sample. We can run a "Reporting workflow" on them to generate a final AV summary. +If you are following along with the suggested batch of samples, we also have prepared a simple question for you. -This workflow takes the collection of called (with lofreq) and annotated (with SnpEff) variants (one VCF dataset per input sample) that got generated as one of the outputs of any of the four variation analysis workflows above, and generates two tabular reports and an overview plot summarizing all the variant information for your batch of samples. - -> Use the right collection of annotated variants! -> The variation analysis workflow should have generated *two* collections of annotated variants - one called `Final (SnpEff-) annotated variants`, the other one called `Final (SnpEff-) annotated variants with strand-bias soft filter applied`. -> -> If you have analyzed ampliconic data with any of the **variation analysis of ARTIC** data workflows, then please consider the strand-bias soft-filtered collection experimental and proceed with the `Final (SnpEff-) annotated variants` collection as input to the next workflow. 
+> > -> If you are working with WGS data using either the **variation analysis on WGS PE data** or the **variation analysis on WGS SE data** workflow, then (and only then) you should continue with the `Final (SnpEff-) annotated variants with strand-bias soft filter applied` collection to eliminate some likely false-postive variant calls. +> 1. There are at least three problematic samples in the batch. What are their identifiers and what is the issue with them? > -{: .warning} +> > +> > +> > 1. The three samples are **SRR17054505**, **SRR17054506** and **SRR17054508**. +> > +> > The problem with all three is poor coverage as can be seen from the "General Statistics" section. +> > When you inspect the "Cumulative genome coverage" plot, you can see that a fourth sample, **SRR17054502**, is not that much better than two of these three. +> > +> > Coverage of the worst sample, **SRR17054505**, is critically low and we cannot expect very usable mutation calls from it. +> > For the other three we may hope for some results, but we should not expect those to be perfect. +> > +> {: .solution} +{: .question} -> From annotated AVs per sample to AV summary -> -> 1. **Get the workflow** into Galaxy -> -> - Option 1: Find workflows on the [WorkflowHub](https://workflowhub.eu) and run them directly on [usegalaxy.eu](https://usegalaxy.eu/) -> -> Please note that this option currently works *only* with usegalaxy.eu! -> -> - Open the [workflow page on WokflowHub](https://workflowhub.eu/workflows/109) -> - Click on `Run on usegalaxy.eu` on the top right of the page -> -> The browser will open a new tab with Galaxy's workflow invocation interface. -> -> - Option 2: Import the workflow from [Dockstore](https://dockstore.org/) using Galaxy's workflow search -> -> {% snippet faqs/galaxy/workflows_import_search.md trs_server="Dockstore" search_query='organization:"iwc-workflows"' workflow_name="sars-cov-2-variation-reporting/COVID-19-VARIATION-REPORTING" %} +## From mutations per sample to reports and visualizations + +The main output of the first workflow is a collection with annotated mutation calls per sample. This collection will serve as input to the *reporting* workflow. + +> Import the variant analysis reporting workflow into Galaxy > -> - Option 3: Import the workflow via its github repo link +> Just like the variation analysis workflows before, also the *reporting workflow* developed by the Galaxy Covid-19 project can be retrieved from *Dockstore* or *WorkflowHub*: > -> - Open the [workflow GitHub repository](https://github.com/iwc-workflows/sars-cov-2-variation-reporting) -> - Open the `.ga` file -> - Click on `Raw` on the top right of the file -> - Save the file or Copy the URL of the file -> - Import the workflow to Galaxy +> {% snippet faqs/galaxy/workflows_import_search.md search_query='organization:"iwc" name:"sars-cov-2"' workflow_name="sars-cov-2-variation-reporting/COVID-19-VARIATION-REPORTING" box_type="none" %} > -> {% snippet faqs/galaxy/workflows_import.md %} +> Again, you can just select the latest version of the workflow, and, once imported, it should appear in your list of workflows under the name: **COVID-19: variation analysis reporting**. > -> 2. Run **COVID-19: variation analysis reporting** {% icon workflow %} using the following parameters: +{: .hands-on} + +> From mutations per sample to reports and visualizations > -> {% snippet faqs/galaxy/workflows_run.md %} +> 1. 
Run the **COVID-19: variation analysis reporting** {% icon workflow %} workflow with the following parameters: > -> - *"Send results to a new history"*: `No` -> - *"1: AF Filter - Allele Frequency Filter"*: `0.05` -> -> This number is the minimum allele frequency required for variants to be included in the report. +>
> -> - *"2: DP Filer"*: `1` +> - *"Variation data to report"*: `Final (SnpEff-) annotated variants` +>
+>
> -> The minimum depth of all alignments required at a variant site; -> the suggested value will, effectively, deactivate filtering on overall DP and will result in the DP_ALT Filter to be used as the only coverage-based filter. +> - *"Variation data to report"*: `Final (SnpEff-) annotated variants` +>
+>
> -> - *"3: DP_ALT Filter"*: `10` +> - *"Variation data to report"*: `Final (SnpEff-) annotated variants with strand-bias soft filter applied` +>
+>
> -> The minimum depth of alignments at a site that need to support the respective variant allele +> - *"Variation data to report"*: `Final (SnpEff-) annotated variants with strand-bias soft filter applied` +>
> -> - *"4: Variation data to report"*: `Final (SnpEff-) annotated variants` +> The collection with variation data in VCF format; output of the previous workflow > -> The collection with variation data in VCF format: the output of the previous workflow +> > Use the right collection of annotated variants! +> > The variation analysis workflow should have generated *two* collections of annotated variants - one called `Final (SnpEff-) annotated variants`, the other one called `Final (SnpEff-) annotated variants with strand-bias soft filter applied`. +> > +> >
+> > For tiled-amplicon data, please consider the strand-bias filter experimental and proceed with the `Final (SnpEff-) annotated variants` collection as input here. +> >
+> >
+> > For tiled-amplicon data, please consider the strand-bias filter experimental and proceed with the `Final (SnpEff-) annotated variants` collection as input here. +> >
+> >
+> > For WGS (i.e. non-ampliconic) data, use the `Final (SnpEff-) annotated variants with strand-bias soft filter applied` collection as input here to eliminate some likely false-positive variant calls. +> >
+> >
+> > For WGS (i.e. non-ampliconic) data, use the `Final (SnpEff-) annotated variants with strand-bias soft filter applied` collection as input here to eliminate some likely false-positive variant calls. +> >
+> > +> {: .comment} > +> - *"gene products translations"*: the uploaded `SARS-CoV-2 feature mapping` dataset > -> - *"4: gene products translations"*: `NC_045512.2_feature_mapping.tsv` or `NC_045512.2 feature mapping` +> Remember, this mapping defines the gene names that will appear as affected by given mutations in the reports. +> - *"Number of Clusters"*: `3` > -> The custom tabular file mapping NCBI RefSeq Protein identifiers (as used by snpEff version 4.5covid19) to their commonly used names, part of the auxillary data; the names in the second column of this dataset are the ones that will appear in the reports generated by this workflow. +> The variant frequency plot generated by the workflow will separate the samples into this number of main clusters. > -> - *"5: Number of Clusters"*: `3` +> 3 is an appropriate value for the suggested example batch of 16 samples. +> If you are using your own data, and it consists of a lot less or more samples than the example batch, you might want to experiment with lower or higher values, respectively. +> Generating an optimal plot sometimes requires some experimenting with this value, but once the workflow run is complete, you can always just rerun the one job that generates the plot and adjust this setting for that rerun. > -> The variant frequency plot generated by the workflow will separate the samples into this number of clusters. +> The remaining workflow parameters *"AF Filter"*, *"DP Filter"*, and *"DP_ALT_FILTER"* define criteria for reporting mutations as reliable. They can all be left at their default values. > {: .hands_on} The three key results datasets produced by the Reporting workflow are: -1. **Combined Variant Report by Sample**: This table combines the key statistics for each AV call in each sample. Each line in the dataset represents one AV detected in one specific sample - - Column | Field | Meaning - --- | --- | --- - 1 | `Sample` | SRA run ID - 2 | `POS` | Position in [NC_045512.2](https://www.ncbi.nlm.nih.gov/nuccore/1798174254) - 3 | `FILTER` | `Filter` field from VCF - 4 | `REF` | Reference base - 5 | `ALT` | Alternative base - 6 | `DP` | Sequencing depth - 7 | `AF` | Alternative allele frequency - 8 | `SB` | Strand bias P-value from Fisher's exact test calculated by [`lofreq`](https://csb5.github.io/lofreq/) - 9 |`DP4` | Depth for Forward Ref Counts, Reverse Ref Counts, Forward Alt Counts, Reverse Alt Counts - 10 |`IMPACT` | Functional impact (from SNPEff) - 11 |`FUNCLASS` | Funclass for change (from SNPEff) - 12 |`EFFECT` | Effect of change (from SNPEff) - 13 |`GENE` | Gene name - 14 |`CODON` | Codon - 15 |`AA` | Amino acid - 16 |`TRID` | Short name for the gene - 17 |`min(AF)` | Minimum Alternative Allele Freq across all samples containing this change - 18 |`max(AF)` | Maximum Alternative Allele Freq across all samples containing this change - 19 |`countunique(change)` | Number of distinct types of changes at this site across all samples - 20 |`countunique(FUNCLASS)` | Number of distinct FUNCLASS values at this site across all samples - 21 |`change` | Change at this site in this sample - - > - > - > 1. How many AVs are found for all samples? - > 2. How many AVs are found for the first sample in the document? - > 3. How many AVs are found for each sample? - > - > > - > > - > > 1. By expanding the dataset in the history, we have the number of lines in the file. 868 lines for the example datasets. The first line is the header of the table. Then 867 AVs. - > > - > > 2. 
We can filter the table to get only the AVs for the first sample {% tool [Filter data on any column using simple expressions](Filter1) %} with the following parameters: - > > - {% icon param-file %} *"Filter*": `Combined Variant Report by Sample` - > > - *"With following condition*": `c1=='ERR5931005'` (to adapt with the sample name) - > > - *"Number of header lines to skip*": `1` - > > - > > We got then only the AVs for the selected sample (48 for ERR5931005). - > > - > > 3. To get the number of AVs for each sample, we can run {% tool [Group data](Grouping1) %} with the following parameters: - > > - {% icon param-file %} *"Select data"*: `Combined Variant Report by Sample` - > > - *"Group by column"*: `Column: 1` - > > - In *"Operation"*: - > > - In *"1: Operation"*: - > > - *"Type"*: `Count` - > > - *"On column"*: `Column: 2` - > > - > > With our example datasets, it seems that samples have between 42 and 56 AVs. - > {: .solution} - {: .question} - -2. **Combined Variant Report by Variant**: This table combines the information about each AV *across* samples. - - Column | Field | Meaning - --- | --- | --- - 1 | `POS` | Position in [NC_045512.2](https://www.ncbi.nlm.nih.gov/nuccore/1798174254) - 2 | `REF` | Reference base - 3 | `ALT` | Alternative base - 4 | `IMPACT` | Functional impact (from SnpEff) - 5 | `FUNCLASS` | Funclass for change (from SnpEff) - 6 | `EFFECT` | Effect of change (from SnpEff) - 7 | `GENE` | Gene - 8 | `CODON` | Codon - 9 | `AA` | Amino acid - 10 |`TRID` | Short name for the gene (from the feature mapping dataset) - 11 |`countunique(Sample)` | Number of distinct samples containing this change - 12 |`min(AF)` | Minimum Alternative Allele Freq across all samples containing this change - 13 |`max(AF)` | Maximum Alternative Allele Freq across all samples containing this change - 14 |`SAMPLES(above-thresholds)` | List of distinct samples where this change has frequency abobe threshold (5%) - 15 |`SAMPLES(all)` | List of distinct samples containing this change at any frequency (including below threshold) - 16 |`AFs(all)` | List of all allele frequencies across all samples - 17 |`change` | Change - - > - > - > 1. How many AVs are found? - > 1. What are the different impacts of the AVs? - > 2. How many variants are found for each impact? - > 3. What are the different effects of HIGH impact? - > 4. Are there any AVs impacting all samples? - > - > > - > > - > > 1. By expanding the dataset in the history, we have the number of lines in the file. 184 lines for the example datasets. The first line is the header of the table. Then 183 AVs. - > > - > > 2. The different impacts of the AVs are HIGH, MODERATE and LOW. - > > - > > 2. To get the number of AVs for each impact levels, we can run {% tool [Group data](Grouping1) %} with the following parameters: - > > - {% icon param-file %} *"Select data"*: `Combined Variant Report by Variant` - > > - *"Group by column"*: `Column: 4` - > > - In *"Operation"*: - > > - In *"1: Operation"*: - > > - *"Type"*: `Count` - > > - *"On column"*: `Column: 1` - > > - > > With our example datasets, we find: - > > - 11 AVs with no predicted impact - > > - 52 LOW AVs - > > - 111 MODERATE AVs - > > - 9 HIGH AVs - > > - > > 3. 
We can filter the table to get only the AVs with HIGH impact by running {% tool [Filter data on any column using simple expressions](Filter1) %} with the following parameters:
   > > - {% icon param-file %} *"Filter*": `Combined Variant Report by Variant`
   > > - *"With following condition*": `c4=='HIGH'`
   > > - *"Number of header lines to skip*": `1`
   > >
   > > The different effects for the 9 HIGH AVs are STOP_GAINED and FRAME_SHIFT.
   > >
   > > 4. We can filter the table to get the AVs for which `countunique(Sample)` is equal the number of samples (18 in our example dataset): {% tool [Filter data on any column using simple expressions](Filter1) %} with the following parameters:
   > > - {% icon param-file %} *"Filter*": `Combined Variant Report by Variant`
   > > - *"With following condition*": `c11==18` (to adapt to the number of sample)
   > > - *"Number of header lines to skip*": `1`
   > >
   > > For our example datasets, 4 AVs are found in all samples
   > {: .solution}
   {: .question}

1. **Combined Variant Report by Sample**

   This table combines the key statistics for each mutation call in each individual sample. Each line in the dataset represents one mutation detected in one specific sample.

   This report is meant as a faster and more legible alternative to studying the collection of per-sample mutations in VCF format directly.

   > Structure of the by-sample mutation report
   >
   > Column | Field | Meaning
   > --- | --- | ---
   > 1 | `Sample` | Sample ID
   > 2 | `POS` | Position of site with regard to the reference genome
   > 3 | `FILTER` | Whether the variant passed the *AF*, *DP* and *DP_ALT* filters defined for the workflow run; variants that do not pass the filters in at least one sample are not reported, but a variant may fail some or all of the filters in *some* samples
   > 4 | `REF` | Reference base
   > 5 | `ALT` | Alternative base
   > 6 | `DP` | Sequencing depth at the variant site
   > 7 | `AF` | Allele frequency
   > 8 | `AFcaller` | Uncorrected allele frequency emitted by the variant caller; for Illumina data, may differ from `AF` due to a [bug](https://github.com/CSB5/lofreq/issues/80) in the lofreq variant caller
   > 9 | `SB` | Strand bias P-value from Fisher's exact test calculated by lofreq
   > 10 | `DP4` | Observed counts of 1. REF-supporting fw-strand reads, 2. REF-supporting rv-strand reads, 3. ALT-supporting fw-strand reads, 4. ALT-supporting rv-strand reads
   > 11 | `IMPACT` | Functional impact (from SnpEff annotation)
   > 12 | `FUNCLASS` | Funclass for change (from SnpEff annotation)
   > 13 | `EFFECT` | Effect of change (from SnpEff annotation)
   > 14 | `GENE` | Name of affected viral gene or ORF (from SnpEff annotation)
   > 15 | `CODON` | Affected codon (from SnpEff annotation)
   > 16 | `AA` | Amino acid change caused by the mutation (from SnpEff annotation)
   > 17 | `TRID` | Name of the affected viral protein (as defined in the feature mapping dataset)
   > 18 | `min(AF)` | Minimum alternate allele frequency across all samples containing this mutation
   > 19 | `max(AF)` | Maximum alternate allele frequency across all samples containing this mutation
   > 20 | `countunique(change)` | Number of distinct types of changes at this site across all samples; if > 1, other samples in the batch have the same site affected by different base changes
   > 21 | `countunique(FUNCLASS)` | Number of distinct FUNCLASS values at this site across all samples; if > 1, other samples in the batch have the same site affected by different base changes that have a different effect on the gene product
   > 22 | `change` | Nucleotide change at this site in this sample
   >
   {: .details}

2. **Combined Variant Report by Variant**

   This table aggregates the information about each mutation *across* samples. It is much more concise than the by-sample report, at the cost of only a small amount of information loss.

   > Structure of the across-samples mutation report
   >
   > Column | Field | Meaning
   > --- | --- | ---
   > 1 | `POS` | Position of site with regard to the reference genome
   > 2 | `REF` | Reference base
   > 3 | `ALT` | Alternative base
   > 4 | `IMPACT` | Functional impact (from SnpEff annotation)
   > 5 | `FUNCLASS` | Funclass for change (from SnpEff annotation)
   > 6 | `EFFECT` | Effect of change (from SnpEff annotation)
   > 7 | `GENE` | Name of affected viral gene or ORF (from SnpEff annotation)
   > 8 | `CODON` | Affected codon (from SnpEff annotation)
   > 9 | `AA` | Amino acid change caused by the mutation (from SnpEff annotation)
   > 10 | `TRID` | Name of the affected viral protein (as defined in the feature mapping dataset)
   > 11 | `countunique(Sample)` | Number of distinct samples containing this change
   > 12 | `min(AF)` | Minimum alternate allele frequency across all samples containing this mutation
   > 13 | `max(AF)` | Maximum alternate allele frequency across all samples containing this mutation
   > 14 | `SAMPLES(above-thresholds)` | List of samples in which this mutation was identified and in which the call passed the *AF*, *DP* and *DP_ALT* filters defined for the workflow run
   > 15 | `SAMPLES(all)` | List of all samples in which this mutation was identified (including samples in which the call did not pass all filters)
   > 16 | `AFs(all)` | List of allele frequencies at which the mutation was observed in the samples in SAMPLES(all)
   > 17 | `change` | Nucleotide change at this site
   >
   {: .details}

-3. **Variant frequency plot**
-
-   ![Variant frequency plot](../../images/sars-cov-2-variant-discovery/variant-frequency.svg)
-
-   This plot represents AFs (cell color) for the different AVs (columns) and the different samples (rows). The AVs are grouped by genes (different colors on the 1st row). Information about their effect is also represented on the 2nd row. The samples are clustered following the tree displayed on the left.
-
-   In the example datasets, the samples are clustered in 3 clusters (as we defined when running the workflow), that may represent different SARS-CoV-2 lineages as the AVs profiles are different.
-
-# From AVs to consensus sequences
-
-For the variant calls, we can now run a workflow which generates reliable consensus sequences according to transparent criteria that capture at least some of the complexity of variant calling:
-
-- Each consensus sequence is guaranteed to capture all called, filter-passing variants as defined in the VCF of its sample that reach a
3. **Variant frequency plot**

   ![Variant frequency plot](../../images/sars-cov-2-variant-discovery/variant-frequency.svg)

   This plot represents the allele frequencies (AFs, cell color) for the different mutations (columns) in the different samples (rows). Mutations are grouped by viral ORFs (different colors in the first row). Information about their impact on the translation of viral proteins is color-coded in the second row. Sample clustering is indicated by the tree displayed on the left.
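All of these tabular reports also lend themselves to scripted exploration outside of Galaxy. Purely as an illustration - assuming you have downloaded the by-variant report as a tab-separated file named `combined_report_by_variant.tsv` (a made-up name; adjust it to your download) - a few lines of Python with pandas suffice to ask ad-hoc questions of it:

```python
# A minimal sketch for ad-hoc exploration of a downloaded report outside
# Galaxy; the filename is just an assumption for illustration.
import pandas as pd

# The combined reports are tab-separated with a single header line.
by_variant = pd.read_csv("combined_report_by_variant.tsv", sep="\t")

# Example question: which mutations are seen in many samples at
# consistently high allele frequency?
widespread = by_variant[
    (by_variant["countunique(Sample)"] >= 10) & (by_variant["min(AF)"] >= 0.7)
]
print(widespread[["POS", "REF", "ALT", "GENE", "AA"]])
```

The column names used here are the ones documented in the report structure above; everything else in this sketch is an assumption for demonstration purposes.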
If you are working with your own data, we encourage you to just explore these three outputs on your own. If you are following along with the suggested batch of example data, then we have some guidance and questions prepared for you:

> Some results exploration and questions
>
> The variant frequency plot shows a large main cluster of samples (in the middle of the plot) flanked by two smaller ones.
> The small cluster at the bottom of the plot consists of only two samples, the one at the top of three samples.
>
> Questions
>
> 1. Do you recognize some of the sample identifiers in the two small clusters?
>
>    Use your browser's zoom function to magnify the plot enough for labels to become readable on your screen.
>
> 2. For **SRR17054505**, at least, do you have an idea why its row in the plot might look like it does?
>
> 3. What is unique about sample **SRR17051933**?
>
> > 1. The two samples forming the smallest cluster, **SRR17054502** and **SRR17054506**, and two of the samples in the three-sample cluster, **SRR17054505** and **SRR17054508**, are exactly the four samples we observed as having particularly low coverage in the **Preprocessing and mapping reports** before.
> > 2. Only a single mutation was identified for **SRR17054505** (i.e. it is *very* different from all other samples).
> >
> >    (By zooming far enough into the plot, or by looking up the sample in the **Combined Variant Report by Sample**, you can learn that this single variant is 27,807 C->T.)
> >
> >    From the **Preprocessing and mapping reports** we had learnt before that **SRR17054505** is the sample with critically low coverage. With hardly any sequence information available, even the best variant calling software will probably not be able to identify many mutations for this sample. Chances are that, at least for this sample, lots of mutations have simply been missed.
> >
> >    Knowing that the other three samples have somewhat higher coverage than **SRR17054505**, one might speculate that for them maybe only *some* mutations went unidentified, and that might explain why they end up outside the main cluster in the variant frequency plot. At the moment that is speculation, but we will revisit that idea later.
> > 3. What is striking about **SRR17051933** are all the red cells it has in the variant frequency plot.
> >
> >    Many of these overlap with black cells of other samples, but some of the darkest cells for **SRR17051933** do not.
> >
> >    Cell color indicates the allele frequency observed for a mutation in a given sample, and pitch-black corresponds to an allele frequency of ~ 1, i.e. to mutations that were found on nearly every sequenced read that overlaps the site of the mutation. The lighter colors for almost all mutations found for **SRR17051933** mean that for this sample some reads at mutation sites confirm the mutation, but typically many do not. We can also look at this sample's mutations in the **Combined Variant Report by Sample** and confirm that many of them have really low values in the `AF` column.
> >
> >    What about the mutations that seem, from the plot, to be unique to **SRR17051933**?
> >    We can obtain a nice tabular report of them by filtering the **Combined Variant Report by Variant**.
> >
> >    Run {% tool [Filter data on any column using simple expressions](Filter1) %} with
> >
> >    - {% icon param-file %} *"Filter"*: **Combined Variant Report by Variant**
> >    - *"With following condition"*: `c15 == "SRR17051933"`
> >    - *"Number of header lines to skip"*: `1`
> >
> >    Column 15 of **Combined Variant Report by Variant** is `SAMPLES(all)`, i.e. it lists all samples for which a given mutation in the report has been identified (at any allele frequency). The filtering above will retain those lines from the report that list **SRR17051933** as the *only* sample.
> >
> >    When you inspect the output, you can see that it lists 17 **SRR17051933**-specific variants, and you can read off the allele frequencies at which they were observed in the sample from column 13 (`max(AF)`). Most of these variants have an AF value between 0.68 and 0.8, i.e. they are confirmed by the majority of reads, but not by all of them.
> >
> >    We will come back to this observation and look for possible explanations near the end of the tutorial.
> >
> {: .solution}
{: .question}

## From mutations per sample to consensus sequences

We can now run one last workflow, which generates reliable consensus sequences according to transparent criteria that capture at least some of the complexity of variant calling:

- Each consensus sequence produced by the workflow is guaranteed to capture all called, filter-passing mutations as defined in the VCF of its sample that reach a user-defined consensus allele frequency threshold.
- Filter-failing mutations and mutations below a second user-defined minimal allele frequency threshold are ignored.
- Genomic positions of filter-passing mutations with an allele frequency in between the two thresholds are hard-masked (with N) in the consensus sequence of their sample.
- Genomic positions with a coverage (calculated from the aligned reads for the sample) below another user-defined threshold are hard-masked, too, unless they are sites of consensus alleles.

A small code sketch of this per-site decision logic follows below.

> Import the consensus construction workflow into Galaxy
>
> Like all workflows before, the *consensus construction workflow* developed by the Galaxy Covid-19 project can be retrieved from *Dockstore* or *WorkflowHub*:
>
> {% snippet faqs/galaxy/workflows_import_search.md search_query='organization:"iwc" name:"sars-cov-2"' workflow_name="sars-cov-2-consensus-from-variation/COVID-19-CONSENSUS-CONSTRUCTION" box_type="none" %}
>
> Again, you can just select the latest version of the workflow and, once imported, it should appear in your list of workflows under the name **COVID-19: consensus construction**.
{: .hands_on}

The workflow takes the collection of called variants (one VCF dataset per input sample, the same collection as used as input for the *reporting* workflow) and a collection of the corresponding aligned reads (for the purpose of calculating genome-wide coverage). Both collections have been generated by the *variation analysis* workflow.
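The four masking criteria listed above boil down to a simple per-site decision. The following sketch is purely illustrative: it assumes a single biallelic call per site and plain numeric thresholds, whereas the actual workflow implements this logic with dedicated Galaxy tools operating on VCF and BAM datasets. The numeric defaults shown (0.7, 0.25 and a depth of 5) match the values used in earlier versions of this tutorial; check the workflow run form for the current defaults.

```python
# Simplified sketch of the per-site consensus logic described above.
# Assumes one biallelic call per site; the real workflow operates on
# VCF/BAM datasets and handles more edge cases.

def consensus_base(ref, alt, af, filter_pass, depth,
                   min_af_consensus=0.7, min_af_failed=0.25, min_depth=5):
    """Return the consensus base for one genome position."""
    if filter_pass and af >= min_af_consensus:
        return alt          # consensus variant site, exempt from depth masking
    if filter_pass and af >= min_af_failed:
        return "N"          # questionable call -> hard-mask the position
    # The call is ignored; fall back to the reference allele,
    # unless coverage is too low to trust it.
    return ref if depth >= min_depth else "N"

print(consensus_base("C", "T", af=0.92, filter_pass=True, depth=120))  # T
print(consensus_base("C", "T", af=0.40, filter_pass=True, depth=120))  # N
print(consensus_base("C", "T", af=0.10, filter_pass=True, depth=3))    # N
```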
> From mutations per sample to consensus sequences
>
> 1. Run the **COVID-19: consensus construction** {% icon workflow %} workflow with these parameters:
>
>    - *"Variant calls"*: `Final (SnpEff-) annotated variants` (for tiled-amplicon data) or `Final (SnpEff-) annotated variants with strand-bias soft filter applied` (for WGS data)
>
>      The collection with variation data in VCF format; output of the first workflow
>
> > Use the right collection of annotated variants!
> > The variation analysis workflow should have generated *two* collections of annotated variants - one called `Final (SnpEff-) annotated variants`, the other one called `Final (SnpEff-) annotated variants with strand-bias soft filter applied`.
> >
> > For tiled-amplicon data, please consider the strand-bias filter experimental and proceed with the `Final (SnpEff-) annotated variants` collection as input here.
> >
> > For WGS (i.e. non-ampliconic) data, use the `Final (SnpEff-) annotated variants with strand-bias soft filter applied` collection as input here to eliminate some likely false-positive variant calls.
> {: .comment}
>
>    - *"aligned reads data for depth calculation"*: `Fully processed reads for variant calling`
>
>      Collection with fully processed BAMs generated by the first workflow.
>
>      For tiled-amplicon data, the BAMs should NOT have undergone processing with **ivar removereads**, so please take care to select the right collection!
>
>    - *"Reference genome"*: the `SARS-CoV-2 reference` sequence
>
>    The remaining workflow parameters: *"min-AF for consensus variant"*, *"min-AF for failed variants"*, and *"Depth-threshold for masking"* can all be left at their default values.
>
{: .hands_on}

The main outputs of the workflow are:

- A collection of viral consensus sequences.
- A multisample FASTA of all these sequences.

The first one is useful for quickly getting at the consensus sequence for one particular sample; the last one can be used as input for tools like **Pangolin** or **Nextclade**.

### Exploring consensus sequence quality

Unfortunately, not all consensus sequences are equal in terms of quality. As explained above, questionable mutations and low coverage can lead to N-masking of individual nucleotides or whole stretches of bases in any sample's consensus genome.

Since these Ns are hard to discover by just scrolling through a consensus sequence fasta dataset, it is a good idea to have their positions reported explicitly.

> Reporting masked positions in consensus sequences
>
> 1. {% tool [Fasta regular expression finder](toolshed.g2.bx.psu.edu/repos/mbernt/fasta_regex_finder/fasta_regex_finder/0.1.0) %} with the following parameters:
>    - {% icon param-collection %} *"Input"*: `Consensus sequence with masking` collection produced by the last workflow run
>    - *"Regular expression"*: `N+`
>
>      This looks for stretches of one or more Ns.
>    - In *"Specify advanced parameters"*:
>      - *"Do not search the reverse complement"*: `Yes`
>
>        We are only interested in the forward strand (and an N is an N on both strands anyway) so we can save some compute by turning on this option.
>      - *"Maximum length of the match to report"*: `1`
>
>        This causes stretches of Ns to be reported more concisely.
>
{: .hands_on}
If you are following along with the suggested batch of input sequences, you can now correlate our previous observations with this newly produced collection.

> Questions
>
> 1. We know from the quality report produced as part of the *variation analysis* workflow that samples **SRR17054505**, **SRR17054506**, **SRR17054508** and **SRR17054502** suffer from low-coverage sequencing data.
>
>    To what extent is this reflected in the generated consensus sequences for these samples?
> 2. The variant frequency plot (generated as part of the *reporting* workflow) showed all four of these samples as outliers outside of the main cluster of samples.
>
>    Since the plot does not show coverage at sites of mutations, it cannot be determined from the plot alone whether a mutation missing from particular samples did not get reported because the sample does not harbor it, or simply because there was insufficient coverage for detecting it. The report of consensus sequence N stretches, however, can help here.
>
>    Are the four samples really lacking all those mutations that are characteristic for the main cluster of samples, or may they have just gone undetected due to insufficient coverage?
>
> > 1. The N-masking reports reveal many more and larger stretches of Ns for the four problematic samples than for others.
> >
> >    **SRR17054505**, in particular, has a consensus sequence consisting almost entirely of Ns with just rather few resolved nucleotides in between them.
> > 2. Most mutations that look like they are missing from the four samples based on the plot alone turn out to be simply undeterminable, as they correspond to Ns (instead of to the reference allele) in the corresponding consensus sequences.
> >
> >    The clusters of missing S gene mutations in **SRR17054502** and **SRR17054506**, visible in the plot between 22,578 and 22,713, and between 23,403 and 24,130, for example, fall entirely into N-masked, i.e. low-coverage, regions of these samples.
> >    This means there is no objective basis for thinking that these samples represent a different viral lineage than those of the main cluster of samples.
> >
> {: .solution}
{: .question}

## From consensus sequences to lineage assignments

To assign lineages to the different samples from their consensus sequences, two tools are available: **Pangolin** and **Nextclade**.

### Lineage assignment with Pangolin

Pangolin (Phylogenetic Assignment of Named Global Outbreak LINeages) can be used to assign a SARS-CoV-2 genome sequence to its most likely lineage from the PANGO nomenclature system.

> From consensus sequences to clade assignments using Pangolin
>
> 1. {% tool [Pangolin](toolshed.g2.bx.psu.edu/repos/iuc/pangolin/pangolin/4.2+galaxy0) %} with the following parameters:
>    - {% icon param-file %} *"Input FASTA File(s)"*: `Multisample consensus FASTA`
>    - *"Include header line in output file"*: `Yes`
>
> 2. Inspect the generated output
{: .hands_on}

Pangolin generates tabular output, in which each line corresponds to one sample found in the input consensus FASTA file. The output columns are explained in the Galaxy tool's help section and in [this chapter of the pangolin documentation](https://cov-lineages.org/resources/pangolin/output.html).

With a larger number of samples, we might actually want to create a summary report of observed lineages. Let's see if you can generate such a report:

> Can you configure the tool {% tool [Datamash](toolshed.g2.bx.psu.edu/repos/iuc/datamash_ops/datamash_ops/1.8+galaxy0) %} to produce a three-column lineage summary from the pangolin output, in which each line lists:
> - one of the assigned lineages (from column 2 of the pangolin output) in the first column,
> - how many samples had that lineage assigned in the second column, and
> - comma-separated identifiers of these samples (from column 1 of the pangolin output) in the third column?
>
> > Configure {% tool [Datamash](toolshed.g2.bx.psu.edu/repos/iuc/datamash_ops/datamash_ops/1.8+galaxy0) %} like this:
> > - {% icon param-file %} *"Input tabular dataset"*: the output of pangolin
> > - *"Group by fields"*: `2`
> >
> >   We want to group by pangolin lineages.
> > - *"Sort input"*: `Yes`
> >
> >   Grouping only works as expected if the data is sorted by the values to group on, which isn't the case in the original pangolin output.
> > - *"Input file has a header line"*: `Yes`
> > - *"Print header line"*: `Yes`
> > - In *"Operation to perform on each group"*:
> >   - In {% icon param-repeat %} *"1. Operation to perform on each group"*:
> >     - *"Type"*: `count`
> >     - *"On column"*: `Column: 1`
> >   - {% icon param-repeat %} *"Insert Operation to perform on each group"*
> >     - *"Type"*: `Combine all values`
> >     - *"On column"*: `Column: 1`
> {: .solution}
{: .question}
If you are working with the suggested batch of data, here are some specific follow-up questions for you:

> Questions
>
> 1. Pangolin assigned most samples to one particular SARS-CoV-2 lineage and sub-lineages thereof.
>    What is that lineage?
>
> 2. Which three samples did not get assigned to that lineage?
>
> 3. Can you explain the results for these three samples?
>
> > 1. The lineage summary we have generated makes this easy to answer:
> >
> >    All samples except three got classified as BA.1 or sub-lineages thereof. BA.1 is the Omicron lineage that was dominant in South Africa when that new variant of concern was first discovered.
> > 2. Again, this is straightforward to answer from the lineage summary:
> >
> >    The three samples are **SRR17051933** (which got assigned to lineage B.1), and **SRR17054505** and **SRR17054508** (both of which pangolin left unassigned).
> > 3. We have encountered **SRR17054505** and **SRR17054508** previously as samples with low sequencing coverage.
> >
> >    When we ran pangolin, we used the default value for *"Maximum proportion of Ns allowed"* of `0.3`.
> >    If you look up the two samples in the original output of pangolin and scroll to the right, you should see that they have a `fail` in the **qc_status** column, and the **qc_notes** column explains that they had fractions of *Ambiguous content* (i.e. Ns) of 0.95 and 0.66, respectively.
> >    In other words, pangolin refused (and rightly so) to perform lineage assignment for these two samples based on the very limited sequencing information available.
> >
> >    **SRR17051933**, on the other hand, has passed pangolin's quality control with very low *Ambiguous content*, so we might tend to believe the assignment.
> >    However, if you remember earlier questions or re-inspect the variant frequency plot generated by the reporting workflow, you will find that this sample was the outlier that had more or less the same mutations called as the other samples in the batch, but nearly all of them at rather low allele frequencies.
> >    In fact, the only seemingly fixed mutations (with allele frequencies close to one) are ancestral mutations that have been present in SARS-CoV-2 isolates since the spring of 2020.
> >    Since only these mutations made it into the consensus sequence for that sample, pangolin based its assignment on them alone and inferred the ancestral lineage B.1 as a result.
> >
> >    Whether or not this is the correct assignment for this sample is hard to tell without additional information.
> >    Several possibilities exist:
> >    - the sample could indeed be from a pre-Omicron lineage and might have been lab-contaminated with DNA from one of the other (Omicron) samples processed together with it.
> >    - the sample might have been taken from a patient who was infected with Omicron and a pre-Omicron lineage simultaneously.
> >
> >    Remember the questions about the outputs of the *reporting* workflow. As part of one of them, we suggested generating a report of mutations seen *only* in **SRR17051933**. It turned out that these were mostly observed at an allele frequency of ~ 0.75. This makes the first possibility seem somewhat more plausible: a pre-Omicron sample has been contaminated with traces of Omicron sequence, but most reads still come from the original strain. For the co-infection scenario, on the other hand, a lot would depend on the relative timing of the two infections.
> >
> {: .solution}
{: .question}
### Lineage assignment with Nextclade

Nextclade assigns clades, calls mutations and performs sequence quality checks on SARS-CoV-2 genomes.

> From consensus sequences to clade assignments using Nextclade
>
> 1. {% tool [Nextclade](toolshed.g2.bx.psu.edu/repos/iuc/nextclade/nextclade/2.7.0+galaxy0) %} with the following parameters:
>    - {% icon param-file %} *"FASTA file with input sequences"*: `Multisample consensus FASTA`
>    - *"Version of database to use"*: `Download latest available database version from web`
>    - {% icon param-check %} *"Output options"*: `Tabular format report`
>    - *"Include header line in output file"*: `Yes`
>
> 2. Inspect the generated output
{: .hands_on}

Nextclade assigns lineages using an algorithm that differs from the one used by pangolin, and it uses a different "native" nomenclature system. A nice feature of Nextclade, however, is that it can translate between its own clade system, PANGO lineage names and the more coarse-grained WHO classification system.

The details of, and the rationale behind, Nextclade "clade" assignment are explained in this [chapter of the Nextclade documentation](https://docs.nextstrain.org/projects/nextclade/en/stable/user/algorithm/06-clade-assignment.html#clade-assignment).

The relationships between the clades (the ones known about in the spring of 2023) and their correspondence to pangolin lineages are shown in Figure 2.

![Illustration of phylogenetic relationship of clades, as used in Nextclade](../../images/sars-cov-2-variant-discovery/ncov_clades.svg "Illustration of phylogenetic relationship of clades, as used in Nextclade (Source: Nextstrain)")

Let's use **Datamash** again to obtain a lineage summary report from the Nextclade results, comparable to the one we created for pangolin, but including the additional lineage/clade identifiers available with Nextclade.

> Can you configure {% tool [Datamash](toolshed.g2.bx.psu.edu/repos/iuc/datamash_ops/datamash_ops/1.8+galaxy0) %} to produce a report summarizing the Nextclade results by `Nextclade_pango` (column 3 of Nextclade's output) for direct comparison to the pangolin results summary?
>
> This report should include the corresponding info from the `clade` column (column 2, which is Nextclade's native classification system) and from the `clade_who` column (column 6), together with the sample counts and identifiers as previously calculated from pangolin's output.
>
> > Configure {% tool [Datamash](toolshed.g2.bx.psu.edu/repos/iuc/datamash_ops/datamash_ops/1.8+galaxy0) %} like this:
> > - {% icon param-file %} *"Input tabular dataset"*: the output of Nextclade
> > - *"Group by fields"*: `3,2,6`
> >
> >   We want to group by the assigned pango lineages. By including columns 2 and 6, we make sure the values from these columns are kept in the summary report (and that additional groups would be formed if any samples with an identical assigned pango lineage had different Nextclade or WHO assignments, which, of course, shouldn't be the case).
> > - *"Sort input"*: `Yes`
> > - *"Input file has a header line"*: `Yes`
> > - *"Print header line"*: `Yes`
> > - In *"Operation to perform on each group"*:
> >   - In {% icon param-repeat %} *"1. Operation to perform on each group"*:
> >     - *"Type"*: `count`
> >     - *"On column"*: `Column: 1`
> >   - {% icon param-repeat %} *"Insert Operation to perform on each group"*
> >     - *"Type"*: `Combine all values`
> >     - *"On column"*: `Column: 1`
> {: .solution}
{: .question}
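The two summaries can also be joined programmatically for a side-by-side view. A sketch with pandas, under the assumption that both reports were downloaded as tab-separated files and that the grouped lineage columns are named `lineage` and `Nextclade_pango` (adjust all names to the actual headers of your downloads):

```python
# Side-by-side comparison of the pangolin- and Nextclade-based summaries.
# File names and column names are assumptions for illustration.
import pandas as pd

pango = pd.read_csv("pangolin_summary.tsv", sep="\t")
nextclade = pd.read_csv("nextclade_summary.tsv", sep="\t")

# An outer join keeps lineages reported by only one of the two tools
# visible in the comparison.
comparison = pango.merge(
    nextclade, how="outer",
    left_on="lineage", right_on="Nextclade_pango",
    suffixes=("_pangolin", "_nextclade"),
)
print(comparison)
```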
Now you can compare the two summary reports (the one based on pangolin and the one based on Nextclade).

Tip: Galaxy's {% icon galaxy-scratchbook %} Window Manager, which you can enable (and disable again) from the menu bar, can be very helpful for side-by-side comparisons like this one.

You should hopefully observe good (though not necessarily perfect) agreement between the reports.
If you are following along with the suggested batch of data, here's a question about the detailed differences.

> Which samples have been assigned differently by pangolin and Nextclade, and why?
>
> > **SRR17054508**, one of the two samples left unassigned by pangolin, is assigned to BA.1 by Nextclade.
> >
> > **SRR17054505**, the second sample left unassigned by pangolin, is assigned to the recombinant lineage XAA by Nextclade.
> > Discovering a recombinant lineage (with two different Omicron parents) this early during the emergence of Omicron would, of course, be a spectacular find, if it were real. However, we already know from the discussion of coverage and pangolin results for this sample that this assignment cannot make sense.
> >
> > Nextclade simply doesn't have pangolin's concept of leaving samples unassigned, and this is why it produced (highly unreliable) assignments for both **SRR17054508** and **SRR17054505**.
> > However, if you inspect the original output of Nextclade, you'll see that the tool classified both samples as `bad` in terms of their **qc.overallStatus**.
> > We could, thus, have used this column to filter out unreliable assignments to avoid the risk of overinterpreting the data.
> >
> {: .solution}
{: .question}

# Conclusion and outlook

## What we have covered

In this tutorial, we used a collection of Galaxy workflows for the detection and interpretation of SARS-CoV-2 sequence variants.

![Analysis flow in the tutorial](../../images/sars-cov-2-variant-discovery/schema.png "Analysis flow in the tutorial")

The workflows can be freely used and are immediately accessible through global Galaxy instances.

Combined, they enable rapid, reproducible, high-quality and flexible analysis of a range of different input data.

The combined analysis flow is compatible with a high throughput of samples, but still allows for detailed dives into effects seen only in particular samples, and it enables qualified users to draw valid conclusions about individual samples and whole batches of data at the same time.

## Further automation

If at this point you are convinced of the quality of the analysis, but think that manually triggering those sequential workflow runs through the Galaxy user interface is still a lot of work when scaling to many batches of sequencing data, you may want to start learning about leveraging Galaxy's API to automate and orchestrate workflow executions.
The GTN material has [a dedicated tutorial]({% link topics/galaxy-interface/tutorials/workflow-automation/tutorial.md %}) that explains triggering workflow runs from the command line via the API.

The workflows presented here and API-based orchestration scripts are also among the main building blocks used by the Galaxy Covid-19 project to build a truly high-throughput, automated and reusable SARS-CoV-2 genome surveillance system. That system has been used to analyze several hundred thousand public SARS-CoV-2 sequencing datasets over the course of the pandemic, and you can learn more about it on the corresponding [Infectious Diseases Toolkit page](https://www.infectious-diseases-toolkit.org/showcase/covid19-galaxy).
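As a first taste of what such API-based orchestration can look like, here is a minimal sketch using the BioBlend Python library. The instance URL, the API key, and all dataset/collection IDs are placeholders you would need to substitute, and the mapping of workflow input slots to IDs is an assumption for illustration; the dedicated tutorial linked above covers the details.

```python
# Minimal sketch of triggering a Galaxy workflow run via the API with
# BioBlend; the URL, API key and all IDs below are placeholders.
from bioblend.galaxy import GalaxyInstance

gi = GalaxyInstance(url="https://usegalaxy.eu", key="YOUR_API_KEY")

# Look up the imported workflow by its name.
workflow = gi.workflows.get_workflows(name="COVID-19: consensus construction")[0]

# Map each workflow input slot to an existing dataset ("hda") or
# dataset collection ("hdca") in one of your histories.
inputs = {
    "0": {"src": "hdca", "id": "ID_OF_VARIANT_CALLS_COLLECTION"},
    "1": {"src": "hdca", "id": "ID_OF_ALIGNED_READS_COLLECTION"},
    "2": {"src": "hda", "id": "ID_OF_REFERENCE_GENOME_DATASET"},
}

invocation = gi.workflows.invoke_workflow(
    workflow["id"],
    inputs=inputs,
    history_name="Consensus construction run",
)
print(invocation["id"])
```

Wrapped in a loop over batches of input collections, a script like this is the seed of the kind of automation the Galaxy Covid-19 project uses at scale.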