Commit 96d109b

Merge branch 'main' into devel
2 parents 164d9d2 + fcb2dea commit 96d109b

2 files changed

+299
-1
lines changed

README.md

+1-1
@@ -36,4 +36,4 @@ Ali Sina Booeshaghi, Xi Chen, Lior Pachter, A machine-readable specification for
 - [View example `seqspec` files: `https://www.sina.bio/seqspec-builder/assays.html`](https://www.sina.bio/seqspec-builder/assays.html)
 - [Contribute a `seqspec` : `docs/CONTRIBUTING.md`](docs/CONTRIBUTING.md)
 - [Watch a YouTube video about `seqspec`](https://youtu.be/NSj6Vpzy8tU)
-- [Read the manuscript that describes `seqspec`](https://doi.org/10.1093/bioinformatics/btae168)
+- [Read the manuscript that describes `seqspec`](https://doi.org/10.1093/bioinformatics/btae168)

docs/TUTORIAL_FROM_TEMPLATE.md

+298
@@ -0,0 +1,298 @@
# Writing a seqspec file for a set of FASTQs
A seqspec file describes the library and read structure of a set of genomics data. Understanding both structures is crucial for analyzing sequencing reads. Even if the library construction process is unknown, a seqspec file can still be generated for a set of FASTQ reads.
## Example Dataset
We will generate a seqspec file for the following dataset:

```json
{
  "observation_id": "GSM3587010",
  "doi": "https://doi.org/10.1084/jem.20191130",
  "species": "homo sapiens",
  "organ": "colon",
  "name": "age 55 years old",
  "description": "epithelial cells",
  "technology": "10xv2",
  "links": [
    {
      "accession": "GSM3587010",
      "filename": "SRR8513796_1.fastq.gz",
      "filetype": null,
      "filesize": null,
      "md5": null,
      "urltype": "ftp",
      "url": "ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR851/006/SRR8513796/SRR8513796_1.fastq.gz"
    },
    {
      "accession": "GSM3587010",
      "filename": "SRR8513796_2.fastq.gz",
      "filetype": null,
      "filesize": null,
      "md5": null,
      "urltype": "ftp",
      "url": "ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR851/006/SRR8513796/SRR8513796_2.fastq.gz"
    }
  ]
}
```

The technology used is 10xv2. The read structure for 10xv2 libraries is:

- **R1.fastq.gz** (SRR8513796_1.fastq.gz): 16 bp cell barcode followed by a 10 bp UMI.
- **R2.fastq.gz** (SRR8513796_2.fastq.gz): cDNA, user-defined length.
- **Cell barcode onlist**: [Cell barcode list](https://github.com/pachterlab/qcbc/raw/main/tests/10xRNAv2/737K-august-2016.txt.gz).

The library was sequenced on the Illumina HiSeq X Ten platform with 150 bp paired-end reads.

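As a rough illustration of the R1 layout described above, the sketch below slices a read into its 16 bp cell barcode and 10 bp UMI. The function name `split_r1` is hypothetical and not part of seqspec; the input string is the start of the first R1 read in SRR8513796_1.fastq.gz, truncated for display.

```python
# Lengths taken from the 10xv2 read structure: 16 bp barcode, then 10 bp UMI.
BARCODE_LEN = 16
UMI_LEN = 10


def split_r1(seq: str) -> tuple[str, str, str]:
    """Split an R1 sequence into (cell barcode, UMI, remainder).

    Hypothetical helper for illustration only; not a seqspec function.
    """
    barcode = seq[:BARCODE_LEN]
    umi = seq[BARCODE_LEN:BARCODE_LEN + UMI_LEN]
    remainder = seq[BARCODE_LEN + UMI_LEN:]
    return barcode, umi, remainder


# Start of the first R1 read in SRR8513796_1.fastq.gz (truncated).
r1_prefix = "NGATTTCGTTGCGTTAATCATCGTCCTTTTTTTTTT"
barcode, umi, remainder = split_r1(r1_prefix)
print(barcode)    # NGATTTCGTTGCGTTA
print(umi)        # ATCATCGTCC
print(remainder)  # TTTTTTTTTT
```

In this read, the bases after position 26 begin with a poly-T stretch, which is consistent with a 150 bp R1 that continues past the barcode and UMI.
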
## Download reads
First, download the FASTQ files and check the read lengths:
```bash
# download the two FASTQ files
$ wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR851/006/SRR8513796/SRR8513796_1.fastq.gz
$ wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR851/006/SRR8513796/SRR8513796_2.fastq.gz

# view the first read in R1
$ zcat SRR8513796_1.fastq.gz | head -2
@SRR8513796.1 1/1
NGATTTCGTTGCGTTAATCATCGTCCTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTAAGAACCACATTATTATTTTGATAGATATAGAAATAGTCAAACCTAATCTACAAAAGTGCAGTATCATGCGGGGNCTTCGCAGCGTAGGTGTT

# count the number of bases in the R1 read
$ echo -n "NGATTTCGTTGCGTTAATCATCGTCCTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTAAGAACCACATTATTATTTTGATAGATATAGAAATAGTCAAACCTAATCTACAAAAGTGCAGTATCATGCGGGGNCTTCGCAGCGTAGGTGTT" | wc -c
150

# view the first read in R2
$ zcat SRR8513796_2.fastq.gz | head -2
@SRR8513796.1 1/2
NGAGTCTCCCTTCACCATTTCCGACGGCATCTATGGCTCAACATTTTTTGTAGCCACAGGCTTCCGCGGACTTCACGTCATTATTGGCTCAACTTTCGTCACTATCTGCTTTATCAGCCAACTAATATTTCACTTTACATACAAACATCA

# count the number of bases in the R2 read
$ echo -n "NGAGTCTCCCTTCACCATTTCCGACGGCATCTATGGCTCAACATTTTTTGTAGCCACAGGCTTCCGCGGACTTCACGTCATTATTGGCTCAACTTTCGTCACTATCTGCTTTATCAGCCAACTAATATTTCACTTTACATACAAACATCA" | wc -c
150
```
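
Checking a single read only shows one length. As a sketch of a broader check, the Python function below tallies sequence lengths over the first records of a gzipped FASTQ, which is one way to choose `min_len` and `max_len`. The name `read_length_counts` and the `max_reads` default are illustrative assumptions, and the code assumes standard four-line FASTQ records with unwrapped sequences.

```python
import gzip
from collections import Counter
from itertools import islice


def read_length_counts(fastq_gz_path, max_reads=10000):
    """Count sequence lengths for the first `max_reads` records of a gzipped FASTQ.

    Assumes four lines per record (header, sequence, '+', qualities).
    """
    counts = Counter()
    with gzip.open(fastq_gz_path, "rt") as handle:
        # The sequence is the second line of each record: index 1, then every 4th line.
        seq_lines = islice(handle, 1, None, 4)
        for seq in islice(seq_lines, max_reads):
            counts[len(seq.rstrip("\n"))] += 1
    return counts
```

For example, `read_length_counts("SRR8513796_1.fastq.gz")` returns a `Counter` keyed by read length; taking `min()` and `max()` of its keys gives candidate values for `min_len` and `max_len`.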

## Download template spec
Start with a template seqspec. The 10xv2 template is available in the seqspec repository: [10xv2-template.yml](https://github.com/pachterlab/seqspec/blob/main/examples/specs/template/10xv2-template.yml). Save this file as `spec.yaml`.

```yaml
!Assay
seqspec_version: 0.2.0
assay_id: 10xv2
name: 10xv2
doi: https://doi.org/10.1126/science.aam8999
date: 15 March 2018
description: 10x Genomics v2 single-cell rnaseq
modalities:
- rna
lib_struct: https://teichlab.github.io/scg_lib_structs/methods_html/10xChromium3.html
sequence_protocol: Not-specified
sequence_kit: Not-specified
library_protocol: 10xv2 RNA
library_kit: Not-specified
sequence_spec:
- !Read
  read_id: R1.fastq.gz
  name: Read 1
  modality: rna
  primer_id: r1_primer
  min_len: 26
  max_len: 26
  strand: pos
- !Read
  read_id: R2.fastq.gz
  name: Read 2
  modality: rna
  primer_id: r2_primer
  min_len: 150
  max_len: 150
  strand: neg
library_spec:
- !Region
  parent_id: null
  region_id: rna
  region_type: null
  name: null
  sequence_type: null
  sequence: AAAAAAAAAAAAAAAANNNNNNNNNNNNNNNNNNNNNNNNNNXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXAAAAAAAAAAAAAAAA
  min_len: 59
  max_len: 208
  onlist: null
  regions:
  - !Region
    parent_id: rna
    region_id: r1_primer
    region_type: r1_primer
    name: r1_primer
    sequence_type: fixed
    sequence: AAAAAAAAAAAAAAAA
    min_len: 16
    max_len: 16
    onlist: null
    regions: null
  - !Region
    parent_id: rna
    region_id: barcode
    region_type: barcode
    name: barcode
    sequence_type: onlist
    sequence: NNNNNNNNNNNNNNNN
    min_len: 16
    max_len: 16
    onlist: !Onlist
      location: remote
      filename: https://github.com/pachterlab/qcbc/raw/main/tests/10xRNAv2/737K-august-2016.txt.gz
      md5: 72aa64fd865bcda142c47d0da8370168
    regions: null
  - !Region
    parent_id: rna
    region_id: umi
    region_type: umi
    name: umi
    sequence_type: fixed
    sequence: NNNNNNNNNN
    min_len: 10
    max_len: 10
    onlist: null
    regions: null
  - !Region
    parent_id: rna
    region_id: cdna
    region_type: cdna
    name: cdna
    sequence_type: fixed
    sequence: XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
    min_len: 1
    max_len: 150
    onlist: null
    regions: null
  - !Region
    parent_id: rna
    region_id: r2_primer
    region_type: r2_primer
    name: r2_primer
    sequence_type: fixed
    sequence: AAAAAAAAAAAAAAAA
    min_len: 16
    max_len: 16
    onlist: null
    regions: null
```
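
The `min_len: 59` and `max_len: 208` on the parent `rna` region in the template equal the sums of its sub-regions' lengths. The short sketch below reproduces that arithmetic; the tuple layout is an illustrative choice and not a seqspec data structure.

```python
# (region_id, min_len, max_len) for the five sub-regions of the template's `rna` region.
regions = [
    ("r1_primer", 16, 16),
    ("barcode", 16, 16),
    ("umi", 10, 10),
    ("cdna", 1, 150),
    ("r2_primer", 16, 16),
]

# The parent region's bounds are the sums of its children's bounds.
total_min = sum(lo for _, lo, _ in regions)
total_max = sum(hi for _, _, hi in regions)
print(total_min, total_max)  # 59 208
```

If a sub-region's length is edited, the parent's `min_len`, `max_len`, and `sequence` need to be updated to match.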

## Modify `sequence_spec`
A seqspec file requires the user to specify the primer used to generate each sequencing read. Since we may not know the exact primers used by the authors, we specify two generic primers: `r1_primer` generates R1 and `r2_primer` generates R2. We then fill in the `sequence_spec` using the FASTQ file names and the read lengths measured above: each `read_id` becomes the corresponding FASTQ filename, and `min_len` and `max_len` are set to 150 for both reads (the template assumes a 26 bp R1). Note that sequencing this library on a HiSeq generates R1 reads on the positive strand and R2 reads on the negative strand.

```yaml
sequence_spec:
- !Read
  read_id: SRR8513796_1.fastq.gz
  name: Read 1
  modality: rna
  primer_id: r1_primer
  min_len: 150
  max_len: 150
  strand: pos
- !Read
  read_id: SRR8513796_2.fastq.gz
  name: Read 2
  modality: rna
  primer_id: r2_primer
  min_len: 150
  max_len: 150
  strand: neg
```
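
Because this tutorial sets each `read_id` to a FASTQ filename, it is easy to duplicate or mistype one. The sketch below cross-checks the `read_id` values against the filenames listed in the dataset record. The function `check_read_ids` is a hypothetical helper written for this illustration, not part of the seqspec tool.

```python
def check_read_ids(read_ids, dataset_files):
    """Return a list of problems; an empty list means the two agree.

    Hypothetical helper for illustration only; not a seqspec function.
    """
    problems = []
    seen = set()
    for rid in read_ids:
        if rid in seen:
            problems.append(f"duplicate read_id: {rid}")
        seen.add(rid)
        if rid not in dataset_files:
            problems.append(f"read_id not in dataset: {rid}")
    # Any dataset file that no read entry refers to is also reported.
    for name in sorted(set(dataset_files) - seen):
        problems.append(f"file without a read entry: {name}")
    return problems


# Filenames from the dataset record and read_id values from the sequence_spec.
dataset_files = {"SRR8513796_1.fastq.gz", "SRR8513796_2.fastq.gz"}
read_ids = ["SRR8513796_1.fastq.gz", "SRR8513796_2.fastq.gz"]
print(check_read_ids(read_ids, dataset_files))  # []
```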

## Modify `library_spec`
Ensure that the `library_spec` matches the 10xv2 structure:
- r1 primer
- barcode (with barcode onlist)
- umi
- cdna
- r2 primer

```yaml
- !Region
  parent_id: rna
  region_id: r1_primer
  region_type: r1_primer
  name: r1_primer
  sequence_type: fixed
  sequence: AAAAAAAAAAAAAAAA
  min_len: 16
  max_len: 16
  onlist: null
  regions: null
- !Region
  parent_id: rna
  region_id: barcode
  region_type: barcode
  name: barcode
  sequence_type: onlist
  sequence: NNNNNNNNNNNNNNNN
  min_len: 16
  max_len: 16
  onlist: !Onlist
    location: remote
    filename: https://github.com/pachterlab/qcbc/raw/main/tests/10xRNAv2/737K-august-2016.txt.gz
    md5: 72aa64fd865bcda142c47d0da8370168
  regions: null
- !Region
  parent_id: rna
  region_id: umi
  region_type: umi
  name: umi
  sequence_type: fixed
  sequence: NNNNNNNNNN
  min_len: 10
  max_len: 10
  onlist: null
  regions: null
- !Region
  parent_id: rna
  region_id: cdna
  region_type: cdna
  name: cdna
  sequence_type: fixed
  sequence: XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
  min_len: 1
  max_len: 150
  onlist: null
  regions: null
- !Region
  parent_id: rna
  region_id: r2_primer
  region_type: r2_primer
  name: r2_primer
  sequence_type: fixed
  sequence: AAAAAAAAAAAAAAAA
  min_len: 16
  max_len: 16
  onlist: null
  regions: null
```
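
The barcode region's `!Onlist` entry records an `md5` for the onlist file. As a sketch of how such a checksum could be computed for a local copy, the function below hashes a file in chunks with Python's `hashlib`. The name `md5_of_file` is hypothetical, and whether the checksum in the spec was computed over the compressed `.gz` file or its decompressed contents is not stated here, so treat any comparison as an assumption to verify.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of the file at `path`, read in chunks.

    Hypothetical helper for illustration only; not a seqspec function.
    """
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

After downloading the onlist, the result of `md5_of_file("737K-august-2016.txt.gz")` can be compared with the `md5` value in the spec. The same function could be used to fill in the `md5` fields that are `null` in the dataset record.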

## Modify the metadata
Lastly, we update the metadata for the seqspec file, replacing the template's `doi` and `date` with those of the dataset's publication:
```yaml
!Assay
seqspec_version: 0.2.0
assay_id: 10xv2
name: 10xv2
doi: https://doi.org/10.1084/jem.20191130
date: 21 November 2019
description: 10x Genomics v2 single-cell rnaseq
modalities:
- rna
lib_struct: https://teichlab.github.io/scg_lib_structs/methods_html/10xChromium3.html
sequence_protocol: Not-specified
sequence_kit: Not-specified
library_protocol: 10xv2 RNA
library_kit: Not-specified
```

## Format, check, and print the spec
Use the `seqspec` command-line utility to format, check, and view the seqspec file:

```bash
$ seqspec format -o spec.yaml spec.yaml
$ seqspec check spec.yaml
$ seqspec print spec.yaml
```
