Hello Biostars!
In my project, I have to convert several SRA files to fastq files. These SRA files are paired end. I read a previous post about how to use fastq-dump to do so. However, I am still confused about the split step.
For example, after I ran fastq-dump ERR011087.sra, I got ERR011087.fastq, which contains paired-end reads of length 88. The first read looks like
@ERR011087.1 I330_1_FC30JM6AAXX:4:1:0:199 length=88
TTCANATATGGAAAAACAGGGAGCGGAAATCACGTTACTTGCGTATCATCGGAAAAGGCAGGCTGTCCATGCTCCAACCGGTTAATGA
+ERR011087.1 I330_1_FC30JM6AAXX:4:1:0:199 length=88
IIII"9I;III<*+<-45CI13;-=93+046/0<1:-06>4.2+4:I86III0.863;GA@7I:5./2$62110='0(2(0$+++&+(
After I ran fastq-dump --split-files ERR011087.sra, I did get 2 fastq files, ERR011087_1.fastq and ERR011087_2.fastq. The first read of ERR011087_1.fastq is
@ERR011087.1 I330_1_FC30JM6AAXX:4:1:0:199 length=44
TTCANATATGGAAAAACAGGGAGCGGAAATCACGTTACTTGCGT
+ERR011087.1 I330_1_FC30JM6AAXX:4:1:0:199 length=44
IIII"9I;III<*+<-45CI13;-=93+046/0<1:-06>4.2+
The first read of ERR011087_2.fastq is
@ERR011087.1 I330_1_FC30JM6AAXX:4:1:0:199 length=44
ATCATCGGAAAAGGCAGGCTGTCCATGCTCCAACCGGTTAATGA
+ERR011087.1 I330_1_FC30JM6AAXX:4:1:0:199 length=44
4:I86III0.863;GA@7I:5./2$62110='0(2(0$+++&+(
It seems that fastq-dump --split-files just splits each 88-base read in ERR011087.sra into two 44-base reads. Is splitting a read into its first half and its second half really the same as splitting a paired-end read into its two mates?
If so, it is very strange that ERR011087_1.fastq and ERR011087_2.fastq contain different numbers of reads. I ran grep "@ERR" ERR011087.fastq | wc -l and got 11640976, ran grep "@ERR" ERR011087_1.fastq | wc -l and got 11640674, and ran grep "@ERR" ERR011087_2.fastq | wc -l and got 11640358. I think these numbers represent the number of reads in each file. However, the three numbers are NOT the same. I am very confused, because if fastq-dump --split-files just splits each 88-base read in ERR011087.sra into two 44-base reads, then ERR011087_1.fastq and ERR011087_2.fastq should contain the same number of reads. There must be something wrong.
Could anyone explain that?
Thanks.
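On the counting step: grep "@ERR" is not anchored to the start of the line, so it counts every line that contains @ERR anywhere, and FASTQ quality strings are allowed to contain the @ character. That is one possible source of mismatched counts, though it is not confirmed here for this particular file. Below is a minimal Python sketch of a more robust count, assuming plain four-line FASTQ records (which is what fastq-dump writes by default). The function names and the tiny demo records are made up for illustration.

```python
import os
import tempfile


def count_fastq_records(path):
    """Count records, assuming every record spans exactly four lines."""
    with open(path) as handle:
        n_lines = sum(1 for _ in handle)
    if n_lines % 4:
        raise ValueError(f"{path}: {n_lines} lines is not a multiple of 4")
    return n_lines // 4


def count_lines_containing(path, pattern):
    """Mimic an unanchored grep: count lines that contain pattern anywhere."""
    with open(path) as handle:
        return sum(1 for line in handle if pattern in line)


# Made-up demo: two records, the first with '@ERR' inside its quality string.
demo = (
    "@ERR000000.1 demo length=8\n"
    "ACGTACGT\n"
    "+\n"
    "II@ERRII\n"
    "@ERR000000.2 demo length=8\n"
    "ACGTACGT\n"
    "+\n"
    "IIIIIIII\n"
)

with tempfile.NamedTemporaryFile("w", suffix=".fastq", delete=False) as tmp:
    tmp.write(demo)
    demo_path = tmp.name

record_count = count_fastq_records(demo_path)
naive_count = count_lines_containing(demo_path, "@ERR")
os.remove(demo_path)

print("records by line count:", record_count)
print("lines containing @ERR:", naive_count)
```

Running the same two functions on ERR011087_1.fastq and ERR011087_2.fastq would show whether the two files really hold different numbers of records or whether only the grep counts differ.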
Hello everyone, I'm replying to this old topic because I'm unfortunately still having a similar issue. I have already tried everything I found in this and other similar topics, but nothing works for me. I'm trying to download, using fasterq-dump, some SRR* fastq files that are all supposed to come as forward and reverse files (...fastq_1 and ....fastq_2), since they come from Illumina sequencing. Some of the files are fine and already split. Others come out as a single fastq file with all reads together (I can see from the read length that the forward and reverse reads are collapsed into one fastq file); I think something went wrong with the NCBI submission. Of course, this causes problems for the alignment pipelines. I tried running a script to manually split and rename the files; it works, but it takes forever because there are so many files. Any suggestions for tools to use? I have tried reformat.sh from bbmap and all the fastq-dump options already mentioned (even though, as far as I understand, the latest fasterq-dump should split the files automatically). Many thanks for the help.
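If the mates really are concatenated within each record and both mates have the same length, the manual split can be done in a single streaming pass per file. Here is a minimal Python sketch under those assumptions (four-line records, equal-length mates). split_concatenated_mates is a hypothetical helper, not part of sra-tools or bbmap, and the demo record is the first read quoted earlier in this thread.

```python
import os
import tempfile


def split_concatenated_mates(in_path, out1_path, out2_path):
    """Cut every read at its midpoint and write the halves to two files.

    Assumes four-line FASTQ records whose two mates have equal length.
    Returns the number of records written to each output file.
    """
    n_records = 0
    with open(in_path) as src, \
            open(out1_path, "w") as out1, \
            open(out2_path, "w") as out2:
        while True:
            header = src.readline().rstrip("\n")
            if not header:  # end of file (or an empty line) stops the loop
                break
            seq = src.readline().rstrip("\n")
            plus = src.readline().rstrip("\n")
            qual = src.readline().rstrip("\n")
            if len(seq) != len(qual) or len(seq) % 2:
                raise ValueError(f"cannot halve record {header!r}")
            half = len(seq) // 2
            # Headers are copied unchanged, so a 'length=' field is not updated.
            out1.write(f"{header}\n{seq[:half]}\n{plus}\n{qual[:half]}\n")
            out2.write(f"{header}\n{seq[half:]}\n{plus}\n{qual[half:]}\n")
            n_records += 1
    return n_records


# Demo with the first ERR011087 read quoted in the original question.
record = (
    "@ERR011087.1 I330_1_FC30JM6AAXX:4:1:0:199 length=88\n"
    "TTCANATATGGAAAAACAGGGAGCGGAAATCACGTTACTTGCGT"
    "ATCATCGGAAAAGGCAGGCTGTCCATGCTCCAACCGGTTAATGA\n"
    "+ERR011087.1 I330_1_FC30JM6AAXX:4:1:0:199 length=88\n"
    "IIII\"9I;III<*+<-45CI13;-=93+046/0<1:-06>4.2+"
    "4:I86III0.863;GA@7I:5./2$62110='0(2(0$+++&+(\n"
)

workdir = tempfile.mkdtemp()
combined_path = os.path.join(workdir, "combined.fastq")
mate1_path = os.path.join(workdir, "combined_1.fastq")
mate2_path = os.path.join(workdir, "combined_2.fastq")

with open(combined_path, "w") as handle:
    handle.write(record)

n_written = split_concatenated_mates(combined_path, mate1_path, mate2_path)

with open(mate1_path) as handle:
    mate1_lines = handle.read().splitlines()
with open(mate2_path) as handle:
    mate2_lines = handle.read().splitlines()

print(mate1_lines[1])
print(mate2_lines[1])
```

This only makes sense when every record holds exactly two mates of equal length. If the lengths vary, the midpoint is not the mate boundary and the split has to come from the archive's own read boundaries instead.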
Your best bet is to get the data directly as fastq from EBI-ENA. You can use sra-explorer to get the necessary command lines (C: sra-explorer : find SRA and FastQ download URLs in a couple of clicks).
Splitting the fastq three ways using --split-3 might work here. There may be some erroneous pairs in the set. See fastq-dump help: