Hi!!
I want to read and process BAM files using my own python scripts. Pysam is a python library designed to do this, but I don't understand how to use it.
For example, I would like get a human-readable print of a BAM file using pysam. The same output I get when I use the following command:
samtools view file.bam
I tried to do this using this code:
import sys
import pysam
def main(BAM):
samfile = pysam.Samfile( BAM, "rb" )
for row in samfile:
print row
if __name__ == '__main__':
main(sys.argv[1])
But I get an output that is similar than a SAM file, but doesn't have the CIGAR line and have strange fields that I don't understand:
ERR030860.11710_HWI-BRUNOP16X_0001:5:1:10517:2607#0 0 23 43534648 0 [(0, 89), (3, 285), (0, 11)] -1 -1 100 ATGAAGCAATTGCTGAATTGGATACGCTGAATGAAGAGTCTTATAAAGACAGCACTCTGATCCTGCAGTTACTTAGGGACAATCTCCCTCTGTGGACATC 5565555555@GGAGGGGGC5=><554455DDD.D>?A@?A<?>>444444452*4344555*55DFDDDFDDD<55445C??=8,+14444444<A?@@ [('NM', 2), ('IH', 1), ('HI', 1), ('XS', '+')]
ERR030860.11718_HWI-BRUNOP16X_0001:5:1:11798:2613#0 0 8 30267395 102 [(0, 91)] -1 -1 91 CAAAGGAAGCTCTTGGTATACGGAAAAAGTATTCAACAATAAATTAGGCATGGTTGCTTCCATTTTCTGCCTCACATACTTTTTTTTTTCG GHFGHHHGHHHHHHEHHHHDDFBDD44445HHHHHHHHHHGHHHHDC=GGHHHHHHHHHHHEH=HHHHHHHHG>HHHHEHHHHHHCCCBC@ [('NM', 3), ('IH', 1), ('HI', 1)]
ERR030860.11720_HWI-BRUNOP16X_0001:5:1:11974:2612#0 0 27 133472526 0 [(0, 21), (3, 791), (0, 38)] -1 -1 59 AGTTCTATGGGTCAAAAGAGGATCCACAGACTTTCTATTATGCTGTTGCTGTGGTGACG HHHHHHHGHH<AD@D444(4445553444/45544EEGGGHHGHHHHHHDGGGGGD:,D [('NM', 1), ('IH', 1), ('HI', 1), ('XS', '+')]
ERR030860.11722_HWI-BRUNOP16X_0001:5:1:12095:2607#0 0 9 10608405 0 [(0, 95)] -1 -1 95 GCCATTGAAGACTGGAATAATGAAAGCAGCATGCCCTGTTGTGTCCTTCAGCTTGGCGATATCATCGATGGATATAATGCACAGTATAATGCATC HHHHHHHHHHHHHHHBD?DD45556@@A?AEEEHHA?DDA4446/BDDADH=HHH>+=<:/4655F?BAFC/5A@4345*1424531*+2@D<5? [('NM', 1), ('IH', 1), ('HI', 1)]
ERR030860.11724_HWI-BRUNOP16X_0001:5:1:12418:2610#0 0 2 4114112 0 [(0, 64)] -1 -1 64 TTTGGGGAAGGGGTGGCTAAGGAAGATGGTATAGGTGAAGGCGGCTGTGTGACCACTTTACCCC GGHHHHGHHHHHHHHBCGHH55555425515444514455C4CCCCE?E244553A=@>09?=A [('NM', 1), ('IH', 1), ('HI', 1)]
ERR030860.11725_HWI-BRUNOP16X_0001:5:1:13271:2613#0 0 27 9482322 0 [(0, 60), (3, 894), (0, 27)] -1 -1 87 GACACTATTAATAAGACTGAATTGGCCTGTAATAACACAGTTATTGGTTCCCAAATGCAGTTACAGCTGGGACGAGTCACTCGTGTT HBHHHHHHHFHHHHHHHGHHEHHHHEHHHEF@BFFFHHEHHGHHHHH:HHHHHHEHHHHHHHHHHHEHHHB@*D743150414536- [('NM', 1), ('IH', 1), ('HI', 1), ('XS', '+')]
Can anybody tell me how to get a SAM output using pysam??
I'm also interested in other ways to read BAM files using python scripts.
But when I print row.cigar I get the tuples, not the CIGAR. Well... I at least need to understand the meaning of this tuples... for example I have tuples like this: [(0, 25), (3, 1642), (0, 44)] .... It's means that I have 25 matches, one intron of 1642 and 44 matches ??
Yes, you don't get the original string, but a processed version, which is what people usually want. If you need a Python script to simply access the original string representation of the aligned reads in the samefile, then it's probably easier to use Python's subprocess module and read the output from samtools view.
I'm used to directly analyse the SAM lines, but that I understand this processed version, it's even easier manipulate BAM/SAM files. THANKS!!