Question

'QNAME' format differs in 'bwa mem' output SAM file and doesn't contain Illumina sequence header information

0

6.3 years ago

JoLY ▴ 10

Hello,

Hoping someone can help me with this one as I'm failing to find a solution anywhere online as yet.

I generated SAM files using 'bwa mem' as follows:

bwa mem -M -t 28 mm10bwaidx 1.fastq.gz 2.fastq.gz > output.sam

The data were paired-end (PE) 75 bp reads, and as I had only one pair of FASTQ files per sample, I chose not to include any read group (RG).

I expected the QNAME in the SAM file to be the Illumina FASTQ sequence header/ID, for example:

K00103:94:H73C2BBXX:7:1103:14194:9737

Rather, what I have is QNAMEs that look like this:

ERR174324.81165065

This seems to be causing me problems as far as detecting and marking optical duplicates using Picard is concerned.

Does anyone know why this is happening and how to redress the issue?

Best Wishes

next-gen sequencing alignment • 2.1k views

ADD COMMENT • link 6.3 years ago by JoLY ▴ 10

1

How and where did you download this data from? SRA or EBI? Using the -F option with fastq-dump would have given you the fastq headers in original Illumina format.

Note: the ENA FASTQ version has headers like this:

@ERR174324.1 HSQ1009_86:1:1101:1192:2116/1

fastq-dump with -F produces

@HSQ1009_86:1:1101:1192:2116

ADD REPLY • link 6.3 years ago by GenoMax 147k
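
The ENA header quoted above helps explain the QNAME in the question: bwa mem takes the read name from the header text up to the first whitespace, so the ENA accession ends up as the QNAME and the instrument read name after the space is treated as a comment. The lines below are a minimal sketch that mimics that split with plain shell tools. It does not call bwa, and the variable names are only illustrative.

```shell
# ENA-style header quoted in the comment above.
header='@ERR174324.1 HSQ1009_86:1:1101:1192:2116/1'

# Drop the leading '@' and keep only the first space-delimited token,
# which approximates the name that lands in the QNAME column.
qname=$(printf '%s\n' "$header" | sed 's/^@//' | cut -d' ' -f1)

printf '%s\n' "$qname"
```

The printed name has the same accession.number form as the QNAMEs reported in the question.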

0

Thank you very much for your response. The data were indeed downloaded from the EBI ENA.

ADD REPLY • link 6.3 years ago by JoLY ▴ 10

score 0 · Accepted Answer · 2018-08-16

Thank you for your help, GenoMax. As you pointed out in your comment, the header in the EBI ENA version of the FASTQ starts with an ENA-specific ID. Being more familiar with sed than fastq-dump, I tried removing the ENA ID from the FASTQ as follows:

gzip -cd ENA_formatted.fastq.gz | sed '/^@/ s/.* /@/g' | gzip > new.fastq.gz

This let me generate a valid BAM file with 'bwa mem' and Picard in which the QNAME is the Illumina FASTQ header. Picard's 'MarkDuplicates' then ran with optical duplicate detection and without any warnings or errors.
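
To see what the sed expression does before running it on a full file, it can be tried on a small in-line sample. This is only a sketch: the first header is the one quoted earlier in the thread, while the second record (its coordinates and the quality string starting with '@') is hypothetical and is there to show that a quality line without a space is left untouched. The function name strip_ena_id is also just an illustrative label.

```shell
# Same substitution as in the answer, wrapped in a function for reuse.
strip_ena_id() {
  # On lines starting with '@', delete everything up to the last space
  # and put '@' back in front of the remaining instrument read name.
  sed '/^@/ s/.* /@/'
}

# Two FASTQ records in ENA header style; the second one is made up.
sample='@ERR174324.1 HSQ1009_86:1:1101:1192:2116/1
ACGT
+
IIII
@ERR174324.2 HSQ1009_86:1:1101:1300:2200/1
ACGT
+
@III'

result=$(printf '%s\n' "$sample" | strip_ena_id)
printf '%s\n' "$result"
```

Only the two header lines change. The quality line '@III' starts with '@' but contains no space, so the substitution does not match it.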