Question

Pysam read reference positions and sequence do not match

0

Entering edit mode

7.6 years ago

rlorigro ▴ 50

Hi,

In order to parse cigar strings and make my own pileup I need to find the reference sequence at the position which each read aligns. However, read.get_reference_positions() seems to be offset from the actual region of alignment, because comparing samtools tview at the first position of alignment yields a completely different reference sequence than read.get_reference_sequence().

Additionally, read.get_reference_sequence() returns a reference that appears to already be aligned, with dashes '-' inserted. Why is this?

Example:

bamFile = "NA12878.np.chr3.100kb.0.bam"
referenceFile = "hg19.chr3.9mb.fa.txt"

sam = pysam.AlignmentFile(bamFile, "rb" )

start = 80000
end = start+1
chr = str(3)

for read in sam.fetch("chr"+chr, start=start, end=end):
    alignedRefPositions = read.get_reference_positions()
    refStart = alignedRefPositions[0]

    refSequence = read.get_reference_sequence()
    readSequence = read.query_alignment_sequence

Yields:

refStart: 79766
ref:  TCCATCCACAGTCAGAGC-AG-ACTTAATATTAATGACT-CTCCC...
read: TCCATCCACAGTCAGAGCAGACTTAATATTAATGACTCTCCCAGA...

which has no relationship to samtools tview at the read.get_reference_positions[0] or 79766:

 samtools tview -p chr3:79766 -d T NA12878.np.chr3.100kb.0.bam hg19.chr3.9mb.fa 
     79771      79781      79791      79801     79811     79821     79831       
cctgatggtat*ggtgttaagatg*taggag*agatattgcatgattaaaatcctccaattaaatcttagtGTGTGTTTT
........... ............ ...... ................................................
,,,,,,,,,,,*,,,,,,,,,,,,*,,,,,,*,,,,,,,,,,,,,,,,,,,,,*,,,,,,,,,,,,,,,,,,********
***,,,,,,,,*a,*,**c,,,,,*,,,,,***,,,,,,,,,,,,,,,,,,,,,,,,,c*c,,,,,,,,,,,,,**,,**
,,,,,,**,,a*,,,,,,,,,,,,*,,,,,,*,,,**,,*****,c,,,,,,,*,,,,,,,,,c,*,,,,,,,,**,,**
 ...C.*....A......*....CT......A........**..*...*.......*......................*

So:

Why doesn't the first reference position correspond with the results of samtools view at that position?
Am I using the correct command to retrieve the reference sequence?
If so, why is it interspersed with dashes?

pysam samtools tview pileup • 6.4k views

ADD COMMENT • link updated 7.5 years ago by Biostar 20 • written 7.6 years ago by rlorigro ▴ 50

0

Entering edit mode

any luck with this?

ADD REPLY • link 6.6 years ago by elzedleeu ▴ 20