pysam duplicated sequence error
1
0
Entering edit mode
8.9 years ago
xcalle91 ▴ 20

Hi,

I did an alignment with STAR, I got the sam file which I can open easily from the terminal:

samtools view -S STAR2Aligned.out.sam

Now I have to work with it in python, so I import the pysam module and try to open the file by:

file = pysam.AlignmentFile("/home/lpp/Desktop/Star_Results/STAR2Aligned.out.sam", "r")

It prints the following error:

[W::sam_hdr_parse] duplicated sequence 'NODE_4_length_21_cov_1.000000'
[W::sam_hdr_parse] duplicated sequence 'NODE_18_length_23_cov_1.000000'

I tried to find the error in different forums but the most similar one to my problem has no answer: http://seqanswers.com/forums/showthread.php?t=58219

Does anybody now where this error is coming from?

Thanks in advance!

sam alignment • 6.3k views
ADD COMMENT
1
Entering edit mode

If you do a

samtools view -H /home/lpp/Desktop/Star_Results/STAR2Aligned.out.sam | grep NODE_4_length_21_cov_1.000000

then do you get more than one line?

ADD REPLY
0
Entering edit mode

It says:

[bam_header_read] EOF marker is absent. The input is probably truncated.
[bam_header_read] invalid BAM binary header (this is not a BAM file).
ADD REPLY
0
Entering edit mode

Oh, there should be an -S flag given to samtools as well, mea culpa.

ADD REPLY
0
Entering edit mode

thankss,

Now It says:

[samopen] SAM header is present: 1457 sequences.
@SQ    SN:NODE_4_length_21_cov_1.000000    LN:41
@SQ    SN:NODE_4_length_21_cov_1.000000    LN:41
ADD REPLY
1
Entering edit mode

I should add that it's likely that you have multiple contigs with the same name in your reference genome. This will simply not work, though STAR won't complain (its output, however, will be broken unless the duplicately-named entries also have duplicate sequence...though even then the MAPQ values and such will be wrong).

ADD REPLY
2
Entering edit mode
8.9 years ago

Samtools is correct, you have a duplicate header. I suspect that you actually have the same sequence twice, but you might want to double check that. You can use the -t option with samtools and give it a two column file with the deduplicated contig names and length and then you'll be able to get a BAM file. Alternatively, remove the duplicate fasta sequences and remap. I suspect that the former method will be both faster and easier.

ADD COMMENT
0
Entering edit mode

Thanks Devon Ryan, that was the problem. I check the contigs and there were two duplicates.

ADD REPLY

Login before adding your answer.

Traffic: 1814 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6