Duplicate @Sq Lines In Sam File Header
1
1
Entering edit mode
11.2 years ago
Ram 44k

Hi,

I'm working on RNA-seq data and have aligned my reads to the assembled transcript using bwa-mem. When I tried to run Picard Tools (SamFormatConverter.jar), the SAM dictionary threw an exception

"Cannot add sequence that already exists in SAMSequenceDictionary".

I then checked for duplication in the SAM file header and sure enough, two entries were duplicated. How do I correct this? Commands used are given below:

bwa index contigs.fa
bwa mem contigs.fa reads_1.fq reads_2.fq >samFile.sam
java -Xmx4g -jar SamFormatConverter.jar INPUT=samFile.sam OUTPUT=bamFile.bam

This is where I faced the error.

I checked for duplicates using:

samtools view -HS samFile.sam | sort | uniq -c

2 @SQ entries were present >1 times.

Any help would be appreciated, thanks!

samtools duplicates • 9.7k views
ADD COMMENT
1
Entering edit mode

I do not know whether it's right nor not, but when using bwa mem, you do not use -M parameter, Mark short split hits as secondary (which for Picard compatibility), may be you should try this before go to picard.

ADD REPLY
0
Entering edit mode

I think that's the problem! I'll run bwa mem again with the -M option. Thank you!

ADD REPLY
1
Entering edit mode

Hi, A fresh run with the -M option turned on still gives the error when I try Picard Tools! :-(

ADD REPLY
6
Entering edit mode
11.2 years ago

The @SQ headers that bwa emits come directly from the reference sequences in the index you're mapping against.

So the likely explanation for your problem is that the same entries are duplicated in your contigs.fa file. You can check by looking at the > headers in your fasta file:

grep '>' contigs.fa | sort | uniq -c

(If there are header lines like >foo blah and >foo blah blah then these will lead to duplicate @SQ SN:foo SAM headers while not being shown as exact duplicate lines by that uniq command. So you may have to be a little careful in interpreting the results of that command.)

ADD COMMENT
1
Entering edit mode

that was my first thought, something up with the contigs file. (full disclosure, I am involved in this project and had no problem running picard post BWA-mem when I was using the trinity.fasta file) These contigs are really ORFs that were generated from a subset of the Trinity contigs, using EMBOSS)

ADD REPLY
1
Entering edit mode

Right on target! I was working with a manually created contigs.fa file and I guess I extracted a couple of sequences twice from the source super-set. Silly me! Thank you so much, @John Marshall!

ADD REPLY
0
Entering edit mode

the "source super-set" I like that :)

ADD REPLY

Login before adding your answer.

Traffic: 1672 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6