How to split paired end SRA file into 2 correct fastq files
4
8
Entering edit mode
8.1 years ago
thustar ▴ 130

Hello Biostars!

In my project, I have to convert several SRA files to fastq files. These SRA files are paired end. I read a previous post about how to use fastq-dump to do so. However, I am still confused about the split step.

For example, after I ran fastq-dump ERR011087.sra, I got ERR011087.fastq which contains paired end reads with the length of 88. The first read looks like

@ERR011087.1 I330_1_FC30JM6AAXX:4:1:0:199 length=88
TTCANATATGGAAAAACAGGGAGCGGAAATCACGTTACTTGCGTATCATCGGAAAAGGCAGGCTGTCCATGCTCCAACCGGTTAATGA
+ERR011087.1 I330_1_FC30JM6AAXX:4:1:0:199 length=88
IIII"9I;III<*+<-45CI13;-=93+046/0<1:-06>4.2+4:I86III0.863;GA@7I:5./2$62110='0(2(0$+++&+(

After I ran fastq-dump --split-files ERR011087.sra, I did get 2 fastq files, ERR011087_1.fastq and ERR011087_2.fastq. The first read of ERR011087_1.fastq is

@ERR011087.1 I330_1_FC30JM6AAXX:4:1:0:199 length=44  
TTCANATATGGAAAAACAGGGAGCGGAAATCACGTTACTTGCGT   
+ERR011087.1 I330_1_FC30JM6AAXX:4:1:0:199 length=44  
IIII"9I;III<*+<-45CI13;-=93+046/0<1:-06>4.2+

The first read of ERR011087_2.fastq is

@ERR011087.1 I330_1_FC30JM6AAXX:4:1:0:199 length=44  
ATCATCGGAAAAGGCAGGCTGTCCATGCTCCAACCGGTTAATGA  
+ERR011087.1 I330_1_FC30JM6AAXX:4:1:0:199 length=44  
4:I86III0.863;GA@7I:5./2$62110='0(2(0$+++&+(

It seems that fastq-dump --split-files just splits each read whose length is 88 in ERR011087.sra into 2 reads whose length is 44. Is just spliting the first half and the last half of a read equal to spliting a paired end read into two fragments?

If so, it is very strange to find that the amount of reads in ERR011087_1.fastq and ERR011087_2.fastq is different. I ran grep "@ERR" ERR011087.fastq |wc -l and got 11640976, ran grep "@ERR" ERR011087_1.fastq |wc -l and got 11640674, ran grep "@ERR" ERR011087_2.fastq |wc -l and got 11640358. I think these numbers represent the amount of reads in each file. However, three numbers are NOT the same. I felt very confused because if fastq-dump --split-files just splits each read whose length is 88 in ERR011087.sra into 2 reads whose length is 44, then the amount of reads in ERR011087_1.fastq and ERR011087_2.fastq should be equal. There must be something wrong with it.

Could anyone explain that?

Thanks.

next-gen sra fastq • 56k views
ADD COMMENT
1
Entering edit mode

Hello everyone, I'm replying to this old topic because I'm unfortunately still having a similar issue. I already tried everything I found on this and other similar topics but nothing works for me. I'm trying to download, using fasterq-dump, some SRR* .fastq that are all supposed to be frw and rev (...fastq_1 and ....fastq_2) as they come from illumina sequencing technology. Anyway, some of the files are good, already splitted. Others figure as single fastq files with all reads together (I can see from the lenght that both frw and rev reads are collapsed in one fastq file); I think there was somethig wrong with the NCBI submission.. of course this causes problems for the alignment pipelines. I tried to run a script to manually split and rename the files, that works but takes forever because the files are too many. Any suggestions about some tools to use? I tried reformat.sh from bbmap and all the fastq-dump options already mentioned (even if, for what I understood, the last fasterq-dump now should split the file automatically). Many thanks for the help

ADD REPLY
1
Entering edit mode

Your best bet is to get the data directly as fastq from EBI-ENA. You can use sra-explorer to get the necessary command lines (C: sra-explorer : find SRA and FastQ download URLs in a couple of clicks ).

ADD REPLY
0
Entering edit mode

Splitting the fastq three ways using --split-e might work here. There may be some erroneous pairs in the set. See fastq-dump help:

 --split-files                 Write reads into separate files. Read
                                 number will be suffixed to the file
                                 name. NOTE! The `--split-3` option is
                                 recommended. In cases where not all
                                 spots have the same number of reads,
                                 this option will produce files that WILL
                                 CAUSE ERRORS in most programs which
                                 process split pair fastq files.
 --split-e                     3-way splitting for mate-pairs. For each
                                 spot, if there are two biological reads
                                 satisfying filter conditions, the first
                                 is placed in the `*_1.fastq` file, and
                                 the second is placed in the `*_2.fastq`
                                 file. If there is only one biological
                                 read satisfying the filter conditions,
                                 it is placed in the `*.fastq` file.All
                                 other reads in the spot are ignored.
ADD REPLY
11
Entering edit mode
8.1 years ago

--split-files is splitting things according to how the actual reads should be split. If the original dataset happened to be 2x44, then yes it'll just split things in half. The problem with SRA is that a fair number of uploaded datasets are simply crap, i.e., people uploaded poorly formatted or incorrect data. For all ERR* datasets, do not use SRA. Download the original fastq files from ENA. If those have different numbers of reads then that's what was uploaded.

ADD COMMENT
0
Entering edit mode

Thanks for your quick reply.

One more question. In your recommended website, there are many options like Fastq files (ftp) and Submitted files (ftp). Is there any difference? Which file should I download?

ADD REPLY
0
Entering edit mode

Go for the submitted files, noting that they'll have different file names from the accession ID.

ADD REPLY
10
Entering edit mode
8.1 years ago

In addition... Once you get the correct sra files, try to use fastq-dump with the legacy --split-3 command, as it happens in some cases that the paired-end files are not synchronized which is what many programs are expecting

With that command, I got sometimes a third fastq file corresponding to those sequences that are not paired or lack its mate for one reason or another

ADD COMMENT
1
Entering edit mode

What are the exact contents of 3 files?

ADD REPLY
5
Entering edit mode

Read 1, read 2, and orphaned reads.

ADD REPLY
0
Entering edit mode

For what it is worth, +1 bounties to both of you :)

ADD REPLY
0
Entering edit mode

Agree! This is very helpful!

ADD REPLY
4
Entering edit mode
8.1 years ago

Apart from Devon suggestion, when in doubt, always check the Trace DB https://trace.ncbi.nlm.nih.gov/Traces/sra/?run=ERR011087

There you can see "This run has 2 reads per spot", of 44bp each.

Click in the "Reads" tab there. You can see the individual reads and estimate total number of read-pairs (by checking progressive numbers 1,2,3,.. in header). Click on "1164098" or put that in search bar. It will show you some of the very last reads. Do they look unusual? Does that answer your Q?

ADD COMMENT
0
Entering edit mode

With crappy SRA files you cannot rely in what is displayed on that web page. This web page just displays data taken from the SRA file. There do exist some SRA files which have been generated from inconsistent datasets.

ADD REPLY
0
Entering edit mode

A concrete example would have been much more useful than an opinion. Neither do I believe that a the sra website can get better information than what is present in sra file. But the OP's Q was to get the best out of the crappy sra, if you would like to say so.

ADD REPLY
0
Entering edit mode

Any complex technical system has some errors. You may encounter them if you use the system intensively. I am pretty sure you could also find some faulty submissions if you would do some serious analysis of SRA data sets, especially older Illumina data from around 2012. OP asked for a data set from 2011. ERR033684 and companions is an example, where even the FASTQ files offered by ENA are faulty.

ADD REPLY
1
Entering edit mode
2.9 years ago

I have seen that some of this files have the following format:

@sequence_id_1.1 \ sequence \ + \ quality

@sequence_id_1.2 \ sequence \ + \ quality

@sequence_id_2.1 \ sequence \ + \ quality

@sequence_id_2.2 \ sequence \ + \ quality

Assuming that sequence pairs are defined by sequence_id_number.1(forward) and sequence_id_number.2(reverse) we can use awk:

zcat file.fastq.gz | awk 'NR%8==1{c=4} c&&c--' > file_1.fastq

and

zcat file.fastq.gz | awk 'NR%8==5{c=4} c&&c--' > file_2.fastq
ADD COMMENT

Login before adding your answer.

Traffic: 1961 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6