What are the columns from pysam fetch?
1
0
Entering edit mode
3.5 years ago
Eric • 0

Hello,

I am working on parsing some BAM files from pysam and I am was looking for some information on what the output from pysam fetch produdes.

This is what I am working with code wise:

import pysam
import sys
from ast import literal_eval as make_tuple
samfile = pysam.AlignmentFile('sample1.bam', "rb")
for c in contigs:
    for read in samfile.fetch(c):
        r=str(read).split("\t")
        cig=read.cigarstring
        MD_TAG=make_tuple(r[-1])[-2][1] 
        print(read)

This pulls some information such as the CIGAR string and MD Tag. I want to know what the other features that are being output are though from fetch (I looked around but I haven't found a satisfactory explanation). This is the output example from one of my BAM files:

SRR5720260.4404562.1    83      0       1135779 40      144M    0       1135607 144     AATTATTGTCACTTTGCCAATATTTTCATTTAATTCCTCAAATCCAATTATGATTGAAGGGTCAATTGGTGTGGTAAGGCTAGAAGGTTTGTAAATAGTAGATTCAGTTTGAGTCGGTTGATTTTCCCCAGTCCAGTTTGATAG       array('B', [36, 36, 36, 36, 36, 36, 36, 36, 27, 36, 36, 36, 36, 36, 36, 36, 32, 36, 32, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 32, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 32, 36, 36, 36, 36, 36, 36, 36, 32, 36, 14, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 32, 36, 36, 36, 36, 36, 14, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 32, 32, 32, 32, 32])  [('AS', -30), ('XN', 0), ('XM', 6), ('XO', 0), ('XG', 0), ('NM', 6), ('MD', '9A58T0C8T31C32A0'), ('YS', -10), ('YT', 'CP')]

These are the columns I have been able to extrapolate:

  1. Read --> SRR5720260.4404562.1
  2. unknowcn --> 83
  3. unknown --> 0
  4. unknown --> 1135779 40
  5. CIGAR --> 144M
  6. unknown --> 0
  7. unknown --> 1135607 144
  8. query sequence: AATTATTGTCACTTTGCCAATATTTTCATTTAATTCCTCAAATCCAATTATGATTGAAGGGTCAATTGGTGTGGTAAGGCTAGAAGGTTTGTAAATAGTAGATTCAGTTTGAGTCGGTTGATTTTCCCCAGTCCAGTTTGATAG
  9. unknown (I think this is quality scores) --> array('B', [36, 36, 36, 36, 36, 36, 36, 36, 27, 36, 36, 36, 36, 36, 36, 36, 32, 36, 32, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 32, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 32, 36, 36, 36, 36, 36, 36, 36, 32, 36, 14, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 32, 36, 36, 36, 36, 36, 14, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 32, 32, 32, 32, 32])
  10. Read info --> [('AS', -30), ('XN', 0), ('XM', 6), ('XO', 0), ('XG', 0), ('NM', 6), ('MD', '9A58T0C8T31C32A0'), ('YS', -10), ('YT', 'CP')]
        Known values: 
        - Paired-end alignment quality (AS)
        - Amplicon name tag (XN)
        - Edit distance tag (NM)
        - the number of mismatches in the alignment(XM) 
        - the number of gap opens, for both read and reference gaps, in the alignment (XO)
        - the number of gap extensions, for both read and reference gaps, in the alignment.(XG)
        - edit distance to the reference, including ambiguous bases but excluding clipping(NM) 
        - MD tag (MD)
        Unknown values:
        - YS
        - YT
    

Any help would be appreciated and please correct me if I have misunderstood any of these values!

pysam • 2.8k views
ADD COMMENT
0
Entering edit mode

Eric - Do not delete posts that have received feedback.

ADD REPLY
0
Entering edit mode
3.5 years ago

pysam.AlignmentFile.fetch() pulls back objects of class AlignedSegment, which is basically a wrapper around the delegated c structure from htslib representing an alignment.

You can see the code of the class here: https://github.com/pysam-developers/pysam/blob/bd4b335f03583f150f1fd161af84a6ab2edb468b/pysam/libcalignedsegment.pyx#L906

In particular, the code of AlignedSegment.__str__() is here: https://github.com/pysam-developers/pysam/blob/bd4b335f03583f150f1fd161af84a6ab2edb468b/pysam/libcalignedsegment.pyx#L967

Its really only one line and is very easy to understand.

The output from str(read) is very similar, but not quite identical to SAM format. The columns are:

  1. Query Name
  2. Flag
  3. Reference id (note: this is different from reference name. BAM files don't store the reference name in string format, but store a reference id which the header maps to name strings. This is why a BAM file is invalid without a header).
  4. Reference start
  5. Mapping quality
  6. CIGAR string
  7. Next reference id (i.e. the reference id of the mate)
  8. Next reference start
  9. Query sequence
  10. Query quality in decimal
  11. Tags

All of these are better accessed from the corresponding class accessor than via parsing str, and you want to write to file you are better off with AlignmentFile.write() - even if you want to write stdout, you are better off creating an AlignmentFile on stdout (AlignmentFile("-", "w", template=inbam)).

If you want a SAM representation for debugging, then you are probably better off with AlignedSegment.to_string which gives you the actual SAM line, rather than something that just looks a lot like it.

ADD COMMENT

Login before adding your answer.

Traffic: 1815 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6