Question

paired reads have different names (bwa-mem)

7

7.5 years ago

AISHA ▴ 140

Hi, I am having a problem running BWA-MEM on a paired-end FASTQ file downloaded from NCBI SRA. When I run BWA-MEM, it stops with this error:

[mem_sam_pe] paired reads have different names: "SRR3239806.1.1", "SRR3239806.1.2"

Example Fastq file:

@SRR3239806.1.1 1 length=100
TTGTGTAGGGTGGGTAGGCTCCATGTTTCCCAGCAAAGCTGGAGACATACAGACTACCTGGTGTTACATTTATTTCAGTGCCTCCTGAGTGTCTCTAAAT
+SRR3239806.1.1 1 length=100
B@CDDFEFFHFFFI@GHIJGIJJJIJIIJJJGGHHIIIJJJJHIIGEHGIJIIIEGIGGHI@A=AHHFDEFFFFFEDEEECCDDDDD3<@CCCDDAC@CC

@SRR3239806.1.2 1 length=100

@SRR3239806.2.1 2 length=100

@SRR3239806.2.2 2 length=100

(I've pasted only the headers of the remaining reads for the sake of brevity.) Can anyone explain how I can fix this error?

next-gen • 21k views

ADD COMMENT • link updated 6.2 years ago by Sasha Fokin ▴ 80 • written 7.5 years ago by AISHA ▴ 140

0

Is it actually an error or just a warning? Did you download the fastq files or convert them with SRAtools?

ADD REPLY • link 7.5 years ago by Devon Ryan 104k

0

It's an error. I downloaded the FASTQ file directly. It was a single file.

ADD REPLY • link 7.5 years ago by AISHA ▴ 140

score 11 · Answer 1 · 2017-06-22

11

7.4 years ago

mmfansler ▴ 460

It appears that when the FASTQ file was dumped from the SRA file, fastq-dump was run with the -I | --readids option, which appends .1 or .2 to the read name of each mate. BWA requires that the two reads of a pair have identical read names, so files written with this option won't work as they are.

You could process the file(s) to remove those appended .(1|2)s,

sed -E "s/^((@|\+)SRR[^.]+\.[^.]+)\.(1|2)/\1/" SRR3239806.fastq > SRR3239806.fixed.fastq

or you could rerun the dump from SRA to FASTQ (which could be just as fast if the SRA is cached):

fastq-dump --split-files SRR3239806

or, if you'd like to keep working with an interleaved file:

fastq-dump --split-spot SRR3239806

ADD COMMENT • link 6.7 years ago by mmfansler ▴ 460

1

I had the same issue with files produced by SRA fastq-dump, but I had no access to the .sra file and my files were fastq.gz, so I came up with:

gunzip -c test.fastq.gz | sed -E 's/(^[@+]SRR[0-9]+\.[0-9]+)\.[12]/\1/' | gzip -c > test.fixed.fastq.gz

It's a bit slow on a 4GB file but it worked.
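One thing to watch with the sed approaches above: they test every line, so a quality string that happens to begin with something like @SRR9.9.1 would be edited as well. A minimal sketch of a variant that only touches lines 1 and 3 of each 4-line record is below. The function name strip_mate_suffix is made up for illustration, and it assumes strictly 4-line records and SRA-style names (letters, digits, .spot, .1 or .2).

```shell
# Hypothetical helper: reads FASTQ on stdin, writes the fixed FASTQ to stdout.
# Assumes every record is exactly 4 lines (no wrapped sequences).
strip_mate_suffix() {
    awk '
    NR % 4 == 1 || NR % 4 == 3 {
        # Split the header into the read name and everything after the first space.
        sp = index($0, " ")
        name = sp ? substr($0, 1, sp - 1) : $0
        rest = sp ? substr($0, sp) : ""
        # Only strip when the name looks like @SRR123.45.1 or +SRR123.45.2.
        if (name ~ /^[@+][A-Za-z]+[0-9]+\.[0-9]+\.[12]$/)
            sub(/\.[12]$/, "", name)
        print name rest
        next
    }
    { print }   # sequence and quality lines pass through unchanged
    '
}

# Usage (file names are placeholders):
#   gunzip -c test.fastq.gz | strip_mate_suffix | gzip -c > test.fixed.fastq.gz
```

It is not likely to be faster than sed, since most of the time goes into gunzip and gzip anyway.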

ADD REPLY • link 5.4 years ago by Oliver Slay ▴ 60

score 1 · Answer 2 · 2017-05-23

1

7.5 years ago

GenoMax 147k

It appears that those reads are interleaved in the file you downloaded.

I suggest you download the fastq files directly from EBI-ENA where you will find the two reads (R1/R2) in separate files.

ADD COMMENT • link 7.5 years ago by GenoMax 147k

2

Interleaved files are not a problem for BWA - that's what the -p flag is for.
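With an interleaved file, records 1 and 2 are expected to be mates, then 3 and 4, and so on, so it can be worth checking the names before starting a long alignment. The sketch below is only a rough pre-check under some assumptions: strictly 4-line records, a strict comparison of the first whitespace-delimited field of each header (so it would also flag names that differ only by a trailing /1 and /2), and a made-up function name check_interleaved_names.

```shell
# Hypothetical helper: reads an interleaved FASTQ on stdin.
# Prints the first pair whose read names differ and returns exit status 1;
# prints nothing and returns 0 when every consecutive pair shares one name.
check_interleaved_names() {
    awk '
    NR % 4 == 1 {
        n++                              # n counts records seen so far
        if (n % 2 == 1) {
            first = $1                   # first mate of the pair
        } else if ($1 != first) {
            printf "pair %d: %s vs %s\n", n / 2, first, $1
            bad = 1
            exit                         # jumps to END
        }
    }
    END { exit bad + 0 }
    '
}

# Usage (reads.fastq and ref.fa are placeholders):
#   check_interleaved_names < reads.fastq && bwa mem -p ref.fa reads.fastq > out.sam
```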

ADD REPLY • link 7.4 years ago by mmfansler ▴ 460

0

Yes! I downloaded the interleaved FASTQ file. Is there any way to fix the file so that this error goes away?

ADD REPLY • link 7.5 years ago by AISHA ▴ 140

1

The reads you downloaded use modified SRA headers (if you used fastq-dump to get the data, you should have used the -F option to retrieve the original Illumina headers). You could mess with the file you have, but I suggest that you get the FASTQ files from ENA or do a new fastq-dump.

ADD REPLY • link 7.5 years ago by GenoMax 147k

score 0 · Answer 3 · 2018-09-03

This error can also occur when the second and fourth lines of a FASTQ record (the sequence and its quality string) differ in length. In that case line 1 (the sequence identifier) of the record can be perfectly correct,

but the program will still report the same error:

[mem_sam_pe] paired reads have different names: "@E00576:153:HK75TCCXY:2:1101:23470:1713"

For example:

first read (lines 2 and 4 differ in length, and that is the problem):

@E00576:153:HK75TCCXY:2:1101:23470:1713 2:N:0:NTTACTCG+AGGCTATA
AATAATAATAAAATAAAATAATGTGCTATAAGGTCTTATTTGCAAGCTTCATGGTAGCCTCAATTAAACAAACCTGCAAACAAAAAATAAAAAATAAAAA
+
JJJJFFJJFFJJFJJJJJJJJJJJJFJFJJAFFJJJJFJJAFJJFAJF<JJJF<--7A<F<F7FJJFJJJJFAFA)7<<F<---7-<7AFJJFJ<<FFJ

second read (ok):

@E00576:153:HK75TCCXY:2:1101:23470:1713 1:N:0:NTTACTCG+AGGCTATA
GCAGGCTTCTGTGAAGGTGATTTTCTCTGGTGGAATGTTTTAATTTCCTGCTTTTTATTTTTTTTTTCTTGGTTGCAGTTTTGTTTAATTGAGGATACCATGAAGTTTGCAAATAAGACCTTATAGCATTTTATTTTATTTTATTATTAT
+
AAAFFJJJJJJJJFAFJJJ-<-FFJFJFJJJFJJJJJFJ<F-JA-J-FFJJJJJ-F<-FJJJJ<JFJF<JFAAF-A-F-AAJ<FFJA-<A--A-7AF---7<-77-7FJ7<7FJJ<AJA--<FA<-7---7J7AJAJ-<FFA-7FAAFAF
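To find records like the first one above without eyeballing the file, something along these lines could be used. It is a minimal sketch, assuming strictly 4-line records; the function name check_seq_qual_lengths is made up for illustration.

```shell
# Hypothetical helper: reads FASTQ on stdin and reports every record whose
# sequence (line 2) and quality string (line 4) differ in length.
# Returns exit status 1 if at least one such record was found, otherwise 0.
check_seq_qual_lengths() {
    awk '
    NR % 4 == 1 { header = $0 }
    NR % 4 == 2 { seqlen = length($0) }
    NR % 4 == 0 && length($0) != seqlen {
        printf "record %d (%s): sequence %d, quality %d\n", NR / 4, header, seqlen, length($0)
        bad = 1
    }
    END { exit bad + 0 }
    '
}

# Usage (file name is a placeholder):
#   check_seq_qual_lengths < reads_R2.fastq
```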