Hi,
In order to parse cigar strings and make my own pileup I need to find the reference sequence at the position which each read aligns. However, read.get_reference_positions() seems to be offset from the actual region of alignment, because comparing samtools tview at the first position of alignment yields a completely different reference sequence than read.get_reference_sequence().
Additionally, read.get_reference_sequence() returns a reference that appears to already be aligned, with dashes '-' inserted. Why is this?
Example:
bamFile = "NA12878.np.chr3.100kb.0.bam"
referenceFile = "hg19.chr3.9mb.fa.txt"
sam = pysam.AlignmentFile(bamFile, "rb" )
start = 80000
end = start+1
chr = str(3)
for read in sam.fetch("chr"+chr, start=start, end=end):
alignedRefPositions = read.get_reference_positions()
refStart = alignedRefPositions[0]
refSequence = read.get_reference_sequence()
readSequence = read.query_alignment_sequence
Yields:
refStart: 79766
ref: TCCATCCACAGTCAGAGC-AG-ACTTAATATTAATGACT-CTCCC...
read: TCCATCCACAGTCAGAGCAGACTTAATATTAATGACTCTCCCAGA...
which has no relationship to samtools tview at the read.get_reference_positions[0] or 79766:
samtools tview -p chr3:79766 -d T NA12878.np.chr3.100kb.0.bam hg19.chr3.9mb.fa
79771 79781 79791 79801 79811 79821 79831
cctgatggtat*ggtgttaagatg*taggag*agatattgcatgattaaaatcctccaattaaatcttagtGTGTGTTTT
........... ............ ...... ................................................
,,,,,,,,,,,*,,,,,,,,,,,,*,,,,,,*,,,,,,,,,,,,,,,,,,,,,*,,,,,,,,,,,,,,,,,,********
***,,,,,,,,*a,*,**c,,,,,*,,,,,***,,,,,,,,,,,,,,,,,,,,,,,,,c*c,,,,,,,,,,,,,**,,**
,,,,,,**,,a*,,,,,,,,,,,,*,,,,,,*,,,**,,*****,c,,,,,,,*,,,,,,,,,c,*,,,,,,,,**,,**
...C.*....A......*....CT......A........**..*...*.......*......................*
So:
- Why doesn't the first reference position correspond with the results of samtools view at that position?
- Am I using the correct command to retrieve the reference sequence?
- If so, why is it interspersed with dashes?
any luck with this?