I just started using the SRA toolkit but I am quite puzzled about what I have. I downloaded an SRA file with fasterq-dump. The data is indicated to be paired-end and I do get two fastq files. However, the headers of each read are not in the standard format but rather in SRA format, I suppose.
The first 2 headers of the first fastq file are shown here:
@ERX2240357.19 SBS123:200:C3PFWACXX:6:1101:2133:1988 length=101
@ERX2240357.20 SBS123:200:C3PFWACXX:6:1101:2347:1955 length=101
And the first 2 headers of the second fastq file are shown here:
@ERX2240357.1 SBS123:200:C3PFWACXX:6:1101:1462:1956 length=101
@ERX2240357.2 SBS123:200:C3PFWACXX:6:1101:1487:1970 length=101
Normally I can see the pairs by looking at /1 and /2 in the header, but now this is missing. How can tools like bwa mem recognize that two reads form a pair based on the headers in the above SRA format?
I looked up the documentation, but it did not make things any clearer for me...
EBI-ENA has the fastq files clearly marked as R1 and R2. You can get the data from there.
To my understanding, the name line of a recent Illumina fastq file contains two parts separated by a whitespace: the read name and the index information. The name part is identical for read1 and read2. Only the first part, which identifies the read, is used in further analysis. The following is what I get from the HiSeq X Ten platform.
Lastly, your "first 2 headers" do not look like real output: the read names in read1 and read2 must be in the same order, or else the aligner will report an error and stop working.
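If you want to check the ordering yourself before aligning, here is a rough Python sketch. It is my own illustration, not what bwa does internally, and the function names read_name and first_mismatch are made up for this example. It assumes the pairing key is the first whitespace-delimited token of the header, with a trailing /1 or /2 stripped if present. The sample headers are the ones quoted in the question.

```python
import itertools


def read_name(header: str) -> str:
    """Return the pairing key of a FASTQ header line (assumed convention).

    Keeps only the first whitespace-delimited token, drops the leading '@',
    and strips a trailing /1 or /2 if present.
    """
    token = header.strip().split(None, 1)[0]
    if token.startswith("@"):
        token = token[1:]
    if token.endswith(("/1", "/2")):
        token = token[:-2]
    return token


def first_mismatch(headers_1, headers_2):
    """Return (index, name_1, name_2) for the first pair whose names differ.

    Returns None if all names agree. A header missing from one side
    (inputs of unequal length) is reported as None.
    """
    pairs = itertools.zip_longest(headers_1, headers_2)
    for i, (h1, h2) in enumerate(pairs):
        n1 = read_name(h1) if h1 is not None else None
        n2 = read_name(h2) if h2 is not None else None
        if n1 != n2:
            return i, n1, n2
    return None


# Headers quoted in the question
file_1 = [
    "@ERX2240357.19 SBS123:200:C3PFWACXX:6:1101:2133:1988 length=101",
    "@ERX2240357.20 SBS123:200:C3PFWACXX:6:1101:2347:1955 length=101",
]
file_2 = [
    "@ERX2240357.1 SBS123:200:C3PFWACXX:6:1101:1462:1956 length=101",
    "@ERX2240357.2 SBS123:200:C3PFWACXX:6:1101:1487:1970 length=101",
]

print(first_mismatch(file_1, file_2))
# -> (0, 'ERX2240357.19', 'ERX2240357.1')
```

On a real pair of files you would pass every fourth line (the header lines) of each fastq file instead of the short lists above. A result other than None means the two files are not in the same read order, at least under the naming assumption stated above.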