Question

Bug in bx-python maf?

0

Entering edit mode

10.0 years ago

Endre Bakken Stovner ▴ 970

I've downloaded the 7-way hg38 file from http://hgdownload-test.sdsc.edu/goldenPath/hg38/multiz7way/ and checked that the md5sum was correct.

I made an index using

maf_build_index.py

Then I tried looking up the specific sequence I need:

python bx_python.py  hg38.7way.maf hg38 chr1 67134378 67134397

(The script I use is this one: Help With Maf'S In Bx-Python)

This results in a

endrebak@havpryd ~/b/p/data> python bx_python.py  hg38.7way.maf hg38 chr1 67134378 67134397
Range error: 67134397 not in 67134365-67134381

The error is from this code:

https://bitbucket.org/james_taylor/bx-python/src/a3b93fa3e507377b3de63b8b6d2cbce452d57996/lib/bx/align/core.py?at=default#cl-323 (you need to wrap the string in an exception to get it to work).

The sequence seems to be fine and conserved, so that is not the problem:

http://genome-euro.ucsc.edu/cgi-bin/hgTracks?db=hg38&position=chr1:67134378-67134397&hgsid=200715944_BistQamLcK1zoIuhBe7NKLn8VJba

Anyone have an inkling what might be wrong?

Tips for other ways to look up sequence conservation appreciated.

The maf file looks like the following (region before and after what I want to look up) - i.e. the region I want to look up is not conserved, still, shouldn't give an error:

s hg38.chr1     67134365 16 + 248956422 GGAGTATTATTGTGGG
s panTro4.chr1  67749272 16 + 228333871 GCAGTATTATTGTGGG
i panTro4.chr1 C 0 C 0
s rheMac3.chr1  70514523 16 + 229590362 GGAGTATTATTGTGGG
i rheMac3.chr1 C 0 C 0
s mm10.chr4    103297296 14 + 156508116 GGAACACTTTTATA--
i mm10.chr4    I 2110 C 0
s canFam3.chr5  45500988 15 -  88915250 GGAGTATTATTTTGA-
i canFam3.chr5 C 0 C 0
e rn5.chr5     126698426 2158 + 177180328 I
e monDom5.chr8 243126344  0 + 312544902 I

a score=425734.000000
s hg38.chr1     67134381 1047 + 248956422 CGACTAATCAGATAATGATTGCAAGAATTGATTAGCCAGCTCGAAATCGCAGCACAATTACCGCAGGGGCGATCAG
s panTro4.chr1  67749288 1045 + 228333871 TGACTAATCAGATAATGGTTGGAAGAATTGATTAGCCAGCTCGAAATCGCAGCACAATTACCGCAGGGGCGATCAG
i panTro4.chr1 C 0 C 0
s rheMac3.chr1  70514539 1024 + 229590362 CAACTAATCTGATAATAATTGGAAGAATTGATTAACCAGCTCGAAATCGCAGCACAATTACCGCAGGGGTGATCAG
i rheMac3.chr1 C 0 C 0
s mm10.chr4    103297310  943 + 156508116 CAACTAATCAAATAAAGGTTCAAAGAGTTAATAGGCTGACCTGAAATTATACTATAGTCTCAGTAAGAGTTGATAG
i mm10.chr4    C 0 I 882
s rn5.chr5     126700584  948 + 177180328 CAGCTAATAAAATAATGGTTCAAAGAATTAATGGGCCAACCTGAAATTGTACTATAATCTCCACAAGAGTCGATAG
i rn5.chr5     I 2158 C 0
s canFam3.chr5  45501003 1103 -  88915250 TAACCAATCAGATAACAATTGGAAGAAT-----------CTCGAAATT---TCACAATTGTTGCAAGAGTGATCAA
i canFam3.chr5 C 0 C 0
e monDom5.chr8 243126344    0 + 312544902 I

bx maf • 3.0k views

ADD COMMENT • link updated 2.7 years ago by Ram 44k • written 10.0 years ago by Endre Bakken Stovner ▴ 970

0

Entering edit mode

http://biopython.org/wiki/Multiple_Alignment_Format seems to be an alternative; will try out.

ADD REPLY • link 10.0 years ago by Endre Bakken Stovner ▴ 970

0

Entering edit mode

I suspect the reason is that the new maf files are whole-genome, while the old ones were split by chromosome. Will look further into this. Edit: This was not the reason.

ADD REPLY • link 10.0 years ago by Endre Bakken Stovner ▴ 970

score 1 · Answer 1 · 2014-12-05

Yes, it seems to not be able to handle a few cases where the region you want to look up isnt a proper subregion of the block it looks up in:

endrebak@havpryd ~/local> python bx_python.py chromo_mafs/hg38.7way.chr1.maf hg38 chr1 67134378 67134382
index.get [<bx.align.core.Alignment object at 0x7fc3cb7cda90>, <bx.align.core.Alignment object at 0x7fc3caf87150>]
Traceback (most recent call last):
  File "bx_python.py", line 31, in <module>
    main(*sys.argv[1:])
  File "bx_python.py", line 19, in main
    region_align = align.slice_by_component(region_name, start, end)
  File "/local/home/endrebak/anaconda/lib/python2.7/site-packages/bx/align/core.py", line 137, in slice_by_component
    end_col = ref.coord_to_col( end )
  File "/local/home/endrebak/anaconda/lib/python2.7/site-packages/bx/align/core.py", line 322, in coord_to_col
    raise Exception("Range error: %d not in %d-%d" % ( pos, start, end ))
Exception: Range error: 67134382 not in 67134365-67134381

score 1 · Answer 2 · 2014-12-09

Get bioruby instead (depends on Kyoto-cabinet):

maf_extract --maf test.maf --mode slice --interval hg38.chr1:67134378-67134397
##maf version=1
a score=25734.000000
s hg38.chr1                67134378  3 + 248956422 GGG
s panTro4.chr1             67749285  3 + 228333871 GGG
s rheMac3.chr1             70514536  3 + 229590362 GGG
s mm10.chr4               103297309  1 + 156508116 A--
s canFam3.chr5             45501001  2 -  88915250 GA-

a score=425734.000000
s hg38.chr1                67134381 16 + 248956422 CGACTAATCAGATAAT
s panTro4.chr1             67749288 16 + 228333871 TGACTAATCAGATAAT
s rheMac3.chr1             70514539 16 + 229590362 CAACTAATCTGATAAT
s mm10.chr4               103297310 16 + 156508116 CAACTAATCAAATAAA
s rn5.chr5                126700584 16 + 177180328 CAGCTAATAAAATAAT
s canFam3.chr5             45501003 16 -  88915250 TAACCAATCAGATAAC

score 0 · Answer 3 · 2014-12-12

0

Entering edit mode

10.0 years ago

Endre Bakken Stovner ▴ 970

# Biopython can also handle these contiguous blocks:
# http://biopython.org/wiki/Multiple_Alignment_Format

import sys

# replace "./alignio-maf" with the full path of the alignio-maf branch
# you cloned from github, or keep if it's in the current directory
sys.path.insert(1, "./alignio-maf")

try:
    from Bio.AlignIO import MafIO
except ImportError:
    print "oops, the import didn't work"

# idx = MafIO.MafIndex(sqlite_file, maf_file, target_seqname)
idx = MafIO.MafIndex("chr1.mafindex", "chr1.maf", "hg38.chr1")

results = idx.get_spliced([67134378], [67134397])
for r in results:
    print r

ADD COMMENT • link 10.0 years ago by Endre Bakken Stovner ▴ 970

0

Entering edit mode

Hi, I am a little confused.

Where does the 'sqlite_file' come from - or in this specific case - 'chr1.mafindex'?