Aligning paired end fastq files dumped from SRA
10.0 years ago

Greetings,

I've downloaded a Short Read Archive (SRA) experiment and dumped it to fastq.

~/tools/sratoolkit.2.4.2-centos_linux64/bin/fastq-dump -I --split-files --gzip SRR1514952/SRR1514952.sra

BWA mem is throwing an error when I align the mate pairs:

[mem_sam_pe] paired reads have different names: "SRR1514950.1.1", "SRR1514950.1.2"
[mem_sam_pe] paired reads have different names: "SRR1514950.2.1", "SRR1514950.2.2"
[mem_sam_pe] paired reads have different names: "SRR1514950.3.1", "SRR1514950.3.2"

I'm checking that the files aren't truncated and contain the same number of reads. Has anyone run into this problem before?
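
The truncation and read-count check mentioned above could be sketched roughly as follows. This is only a minimal example, assuming gzipped FASTQ with exactly four lines per record; the function name count_reads and the file names in the usage comment are illustrative, not taken from the original post.

```shell
# Count the records in a gzipped FASTQ file.
# Assumes four lines per record; a fractional result suggests a truncated file.
count_reads() {
    # gzip -t verifies the integrity of the compressed stream first.
    gzip -t "$1" || return 1
    gzip -cd "$1" | awk 'END { print NR / 4 }'
}

# Hypothetical usage on the two files written by --split-files:
# count_reads SRR1514950_1.fastq.gz
# count_reads SRR1514950_2.fastq.gz
```

If both calls print the same whole number, the two files hold the same number of reads.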

paired sra fastq bwa mem • 7.8k views
10.0 years ago

This seemed to work: I just needed to ask for the original read format with --origfmt.

~/tools/sratoolkit.2.4.2-centos_linux64/bin/fastq-dump --origfmt -I --split-files --gzip SRR1514950/SRR1514950.sra
10.0 years ago
Adrian Pelin ★ 2.6k

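It's probably because this isn't the way paired reads are usually named, so bwa is confused. Try a quick sed on the uncompressed files:

sed -i -E 's,^(@[SED]RR[0-9]+\.[0-9]+)\.1( |$),\1/1\2,' file1 and sed -i -E 's,^(@[SED]RR[0-9]+\.[0-9]+)\.2( |$),\1/2\2,' file2. The dot is escaped and the pattern is anchored to the end of the read name, so only the trailing mate number changes and the sequence and quality lines are left alone. The first pair then becomes

SRR1514950.1/1 and SRR1514950.1/2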

Hope this works.
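
One way to check whether a renaming like this worked is to compare the read names of the two files after dropping any trailing /1 or /2. The sketch below assumes gzipped FASTQ with four lines per record and that mktemp is available. The function names read_names and pairs_match are made up for illustration.

```shell
# Print the read name of every record, without a trailing /1 or /2.
read_names() {
    gzip -cd "$1" | awk 'NR % 4 == 1 { name = $1; sub(/\/[12]$/, "", name); print name }'
}

# Return 0 if both files list the same read names in the same order.
pairs_match() {
    tmp1=$(mktemp) && tmp2=$(mktemp) || return 2
    read_names "$1" > "$tmp1"
    read_names "$2" > "$tmp2"
    cmp -s "$tmp1" "$tmp2"
    status=$?
    rm -f "$tmp1" "$tmp2"
    return "$status"
}

# Hypothetical usage:
# pairs_match fixed_1.fastq.gz fixed_2.fastq.gz && echo "names agree"
```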

7.8 years ago
Christian ★ 3.1k

The following command worked for me:

cat sra.fq | perl -ne 's/\.([12]) /\/$1 /; print $_' > sra.fix.fq
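
A variant of the same idea that only touches the header line of each record might look like the sketch below. It assumes four-line FASTQ records and names dumped with -I, so that every read name ends in .1 or .2. The function name fix_fastq_names is hypothetical.

```shell
# Rewrite a trailing ".1" or ".2" in the read name as "/1" or "/2".
# Only the first line of each four-line record is modified.
fix_fastq_names() {
    awk '
        NR % 4 == 1 {
            name = $1                              # read name, up to the first space
            rest = substr($0, length(name) + 1)    # remainder of the header, if any
            if (name ~ /\.[12]$/) {
                name = substr(name, 1, length(name) - 2) "/" substr(name, length(name))
            }
            $0 = name rest
        }
        { print }
    '
}

# Hypothetical usage:
# fix_fastq_names < sra.fq > sra.fix.fq
```

Unlike the one-liner, this leaves the "+" line untouched and cannot match anything in the sequence or quality lines.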
7.0 years ago
seelament • 0

I had something similar. The reads I got from SRA look like so:

@SRR1531517.4.1 D3NH4HQ1:58:D091WACXX:7:1101:1448:2140 length=75
AACTTCCAGTGGAAATGAGATTCTGATTCTACCAAAAATGGCCCTCCGAATAGTCAGCATGTAGTTTGTTTGCCC
+SRR1531517.4.1 D3NH4HQ1:58:D091WACXX:7:1101:1448:2140 length=75
CCCFFFFFHHHHGIJIJIJJJJJJJJJJIJJJJJJIJJIJJIGIGIJJIIJIIIIIIJJJJIGIJJJIIJJJHHH

I tried something like this to make it compatible with BWA. It works with both forward and reverse files. I prefer to pipe (and zip) it to another file to keep the original as a backup.

sed 's;@SRR1531517\.\([0-9.]*\)\([0-9]\) \([a-zA-Z:0-9]*\) length=[0-9]*;@\3/\2;' sra.fq | gzip > sra.fix.fq.gz

Which gives me:

@D3NH4HQ1:58:D091WACXX:7:1101:1448:2140/1
AACTTCCAGTGGAAATGAGATTCTGATTCTACCAAAAATGGCCCTCCGAATAGTCAGCATGTAGTTTGTTTGCCC
+SRR1531517.4.1 D3NH4HQ1:58:D091WACXX:7:1101:1448:2140 length=75
CCCFFFFFHHHHGIJIJIJJJJJJJJJJIJJJJJJIJJIJJIGIGIJJIIJIIIIIIJJJJIGIJJJIIJJJHHH

If you had used the -F option (short for --origfmt) when dumping the reads with fastq-dump, you would not have needed this transformation: you would have recovered the original Illumina-format FASTQ headers.
