Question

Interpretation of SRA files

0

4.7 years ago

dominique.massau ▴ 10

I just started using the SRA toolkit but I am quite puzzled about what I have. I downloaded an SRA file with fasterq-dump. The data is indicated to be paired-end and I do get two fastq files. However, the headers of each read are not in the standard format but rather in SRA format, I suppose.

The first 2 headers of the first fastq file are shown here

@ERX2240357.19 SBS123:200:C3PFWACXX:6:1101:2133:1988 length=101

@ERX2240357.20 SBS123:200:C3PFWACXX:6:1101:2347:1955 length=101

And the first 2 headers of the second fastq file are shown here

@ERX2240357.1 SBS123:200:C3PFWACXX:6:1101:1462:1956 length=101

@ERX2240357.2 SBS123:200:C3PFWACXX:6:1101:1487:1970 length=101

Normally I can see the pairs by looking at /1 and /2 in the header, but now this is missing. How can tools like bwa mem recognize that two reads form a pair based on the headers in the above SRA format?

I looked through the documentation, but it did not make things any clearer for me.

sra fastq headers paired-end reads • 2.5k views

updated 4.7 years ago by wm ▴ 570 • written 4.7 years ago by dominique.massau ▴ 10

1

EBI-ENA has the fastq files clearly marked as R1 and R2. You can get the data from there.

ADD REPLY • link 4.7 years ago by GenoMax 147k

1

This comes from the newer version of the Illumina software.

To my understanding, the name line in FASTQ files produced by recent Illumina software has two parts separated by whitespace: the read name and the index information. The name part is identical for read1 and read2, and only this first part, which identifies the read, is used in downstream analysis.

The following is what I get from the HiSeq X Ten platform.

$ zcat demo_1.fq.gz | head -n 1
@ST-E00144:1057:H5L7WCCX2:8:1101:5690:1467 1:N:0:NTAGGCAT
$ zcat demo_2.fq.gz | head -n 1
@ST-E00144:1057:H5L7WCCX2:8:1101:5690:1467 2:N:0:NTAGGCAT

In my experience, these paired-end reads are compatible with aligners such as bwa and bowtie.
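As a rough illustration of the two-part name line described above, the following Python sketch splits a header into the read name and the comment field. The function name `parse_illumina_header` is hypothetical, and the field meanings (read number, filter flag, control number, index sequence) assume the Illumina CASAVA 1.8+ header layout.

```python
def parse_illumina_header(header):
    """Split an Illumina-style FASTQ header into its name and comment fields.

    Assumes the layout '@<name> <read>:<filtered>:<control>:<index>'.
    """
    # Drop the leading '@' and split once on whitespace: name, then comment.
    name, comment = header.lstrip("@").split(maxsplit=1)
    read, filtered, control, index = comment.strip().split(":")
    return {
        "name": name,                 # identical for both mates of a pair
        "read": int(read),            # 1 for read1, 2 for read2
        "filtered": filtered == "Y",  # 'Y' means the read failed the filter
        "control": int(control),
        "index": index,
    }


h1 = parse_illumina_header("@ST-E00144:1057:H5L7WCCX2:8:1101:5690:1467 1:N:0:NTAGGCAT")
h2 = parse_illumina_header("@ST-E00144:1057:H5L7WCCX2:8:1101:5690:1467 2:N:0:NTAGGCAT")
print(h1["name"] == h2["name"], h1["read"], h2["read"])
```

The two mates share the same name and differ only in the read number stored in the comment field, which is why the part before the whitespace is enough to pair them.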

Finally, your "first 2 headers" do not look like real output: the read names in read1 and read2 must be in the same order, otherwise the aligner will report an error and stop.

4.7 years ago by wm ▴ 570

score 2 · Accepted Answer · 2020-04-04

The reads are compatible with aligners such as BWA and bowtie2 (I tested this), even when the /1 and /2 suffixes are absent.

Be careful, though: make sure the first part of each read name is identical in read1 and read2, and that the reads are in the same order in both files.

What you pasted in the post does not look like the first two headers of each file; the order is not correct.
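One way to check that the names match and are in the same order is sketched below in Python. This is only a minimal sketch: the helper names (`read_id`, `fastq_ids`, `first_mismatch`) are hypothetical, it assumes well-formed four-line FASTQ records, and the sequence and quality lines in the sample data are placeholders rather than real reads.

```python
import io
from itertools import zip_longest


def read_id(header):
    """Return the pairing key: the text after '@' up to the first whitespace,
    with a trailing /1 or /2 removed if present."""
    token = header[1:].split()[0]
    if token.endswith(("/1", "/2")):
        token = token[:-2]
    return token


def fastq_ids(handle):
    """Yield the read ID of each record, assuming 4 lines per record."""
    for i, line in enumerate(handle):
        if i % 4 == 0:
            yield read_id(line)


def first_mismatch(handle1, handle2):
    """Return (record_number, id1, id2) for the first disagreement, else None."""
    pairs = zip_longest(fastq_ids(handle1), fastq_ids(handle2))
    for n, (a, b) in enumerate(pairs, start=1):
        if a != b:
            return n, a, b
    return None


# Headers taken from the thread; sequence/quality lines are placeholders.
good_r1 = "@ERR2184190.1 SBS123:200:C3PFWACXX:6:1101:1462:1956/1\nACGT\n+\nIIII\n"
good_r2 = "@ERR2184190.1 SBS123:200:C3PFWACXX:6:1101:1462:1956/2\nACGT\n+\nIIII\n"
bad_r1 = "@ERX2240357.19 SBS123:200:C3PFWACXX:6:1101:2133:1988 length=101\nACGT\n+\nIIII\n"
bad_r2 = "@ERX2240357.1 SBS123:200:C3PFWACXX:6:1101:1462:1956 length=101\nACGT\n+\nIIII\n"

print(first_mismatch(io.StringIO(good_r1), io.StringIO(good_r2)))
print(first_mismatch(io.StringIO(bad_r1), io.StringIO(bad_r2)))
```

For real files, the `io.StringIO` objects could be replaced with handles from `gzip.open(path, "rt")` or `open(path)`.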

I checked the first two read names in read1 and read2 of the FASTQ files from both NCBI-SRA and EBI-ENA.

EBI-ENA version

As @genomax pointed out, EBI-ENA provides ERR2184190_1.fastq.gz and ERR2184190_2.fastq.gz.

The read names end with a /1 or /2 suffix:

==> read1.fq <==
@ERR2184190.1 SBS123:200:C3PFWACXX:6:1101:1462:1956/1
@ERR2184190.2 SBS123:200:C3PFWACXX:6:1101:1487:1970/1

==> read2.fq <==
@ERR2184190.1 SBS123:200:C3PFWACXX:6:1101:1462:1956/2
@ERR2184190.2 SBS123:200:C3PFWACXX:6:1101:1487:1970/2

SRA-toolkit download version

No /1 or /2 suffix is present:

$ prefetch ERR2184190
$ fasterq-dump --threads 8 --split-3 ERR2184190.sra

==> read1.fq <==
@ERR2184190.sra.1 SBS123:200:C3PFWACXX:6:1101:1462:1956
@ERR2184190.sra.2 SBS123:200:C3PFWACXX:6:1101:1487:1970

==> read2.fq <==
@ERR2184190.sra.1 SBS123:200:C3PFWACXX:6:1101:1462:1956
@ERR2184190.sra.2 SBS123:200:C3PFWACXX:6:1101:1487:1970

Read name in alignment file

Note that each read name consists of two parts separated by whitespace. In general, only the first part of the read name (e.g. ERR2184190.sra.1) is stored in the alignment (BAM) file.

Here is an example for your reads. The first line is read1 and the second line is read2. The read names (column 1) are identical, and the two mates are distinguished by the FLAG field (column 2).

# align a 100-read subset, keep only properly paired reads (-f 2), then sort by coordinate
$ bwa mem Oaureus.fa read1.fq read2.fq | samtools view -Sub -f 2 - | samtools sort -o aln.bam -
$ samtools view aln.bam | head -n 2

ERR2184190.sra.37   99      VASH01007726.1  2202973 60      101M    =       2203373 501 ...
ERR2184190.sra.37   147     VASH01007726.1  2203373 60      101M    =       2202973 -501 ...
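To see how the FLAG column distinguishes the two mates, the values 99 and 147 can be decoded bit by bit. The Python sketch below uses the bit definitions from the SAM specification, with names following those printed by `samtools flags`; the function name `decode_flag` is just an illustrative helper.

```python
# Bit values from the SAM specification; names follow `samtools flags`.
SAM_FLAGS = {
    0x1: "PAIRED",
    0x2: "PROPER_PAIR",
    0x4: "UNMAP",
    0x8: "MUNMAP",
    0x10: "REVERSE",
    0x20: "MREVERSE",
    0x40: "READ1",
    0x80: "READ2",
    0x100: "SECONDARY",
    0x200: "QCFAIL",
    0x400: "DUP",
    0x800: "SUPPLEMENTARY",
}


def decode_flag(flag):
    """Return the names of all bits set in a SAM FLAG value."""
    return [name for bit, name in SAM_FLAGS.items() if flag & bit]


print(99, decode_flag(99))
# 99 ['PAIRED', 'PROPER_PAIR', 'MREVERSE', 'READ1']
print(147, decode_flag(147))
# 147 ['PAIRED', 'PROPER_PAIR', 'REVERSE', 'READ2']
```

So the line with FLAG 99 is the first read of the pair (forward strand, mate on the reverse strand), and the line with FLAG 147 is the second read (reverse strand), even though both carry the same read name.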