Pysam read reference positions and sequence do not match
0
0
Entering edit mode
7.1 years ago
rlorigro ▴ 40

Hi,

In order to parse cigar strings and make my own pileup I need to find the reference sequence at the position which each read aligns. However, read.get_reference_positions() seems to be offset from the actual region of alignment, because comparing samtools tview at the first position of alignment yields a completely different reference sequence than read.get_reference_sequence().

Additionally, read.get_reference_sequence() returns a reference that appears to already be aligned, with dashes '-' inserted. Why is this?

Example:

bamFile = "NA12878.np.chr3.100kb.0.bam"
referenceFile = "hg19.chr3.9mb.fa.txt"

sam = pysam.AlignmentFile(bamFile, "rb" )

start = 80000
end = start+1
chr = str(3)

for read in sam.fetch("chr"+chr, start=start, end=end):
    alignedRefPositions = read.get_reference_positions()
    refStart = alignedRefPositions[0]

    refSequence = read.get_reference_sequence()
    readSequence = read.query_alignment_sequence

Yields:

refStart: 79766
ref:  TCCATCCACAGTCAGAGC-AG-ACTTAATATTAATGACT-CTCCC...
read: TCCATCCACAGTCAGAGCAGACTTAATATTAATGACTCTCCCAGA...

which has no relationship to samtools tview at the read.get_reference_positions[0] or 79766:

 samtools tview -p chr3:79766 -d T NA12878.np.chr3.100kb.0.bam hg19.chr3.9mb.fa 
     79771      79781      79791      79801     79811     79821     79831       
cctgatggtat*ggtgttaagatg*taggag*agatattgcatgattaaaatcctccaattaaatcttagtGTGTGTTTT
........... ............ ...... ................................................
,,,,,,,,,,,*,,,,,,,,,,,,*,,,,,,*,,,,,,,,,,,,,,,,,,,,,*,,,,,,,,,,,,,,,,,,********
***,,,,,,,,*a,*,**c,,,,,*,,,,,***,,,,,,,,,,,,,,,,,,,,,,,,,c*c,,,,,,,,,,,,,**,,**
,,,,,,**,,a*,,,,,,,,,,,,*,,,,,,*,,,**,,*****,c,,,,,,,*,,,,,,,,,c,*,,,,,,,,**,,**
 ...C.*....A......*....CT......A........**..*...*.......*......................*

So:

  1. Why doesn't the first reference position correspond with the results of samtools view at that position?
  2. Am I using the correct command to retrieve the reference sequence?
  3. If so, why is it interspersed with dashes?
pysam samtools tview pileup • 6.1k views
ADD COMMENT
0
Entering edit mode

any luck with this?

ADD REPLY

Login before adding your answer.

Traffic: 2516 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6