I've downloaded the 7-way hg38 file from http://hgdownload-test.sdsc.edu/goldenPath/hg38/multiz7way/ and checked that the md5sum was correct.
I made an index using
maf_build_index.py
Then I tried looking up the specific sequence I need:
python bx_python.py hg38.7way.maf hg38 chr1 67134378 67134397
(The script I use is this one: Help With Maf'S In Bx-Python)
This results in a range error:
endrebak@havpryd ~/b/p/data> python bx_python.py hg38.7way.maf hg38 chr1 67134378 67134397
Range error: 67134397 not in 67134365-67134381
The error is from this code:
https://bitbucket.org/james_taylor/bx-python/src/a3b93fa3e507377b3de63b8b6d2cbce452d57996/lib/bx/align/core.py?at=default#cl-323 (you need to wrap the string in an exception to get it to work).
The sequence seems to be fine and conserved, so that is not the problem.
Does anyone have an inkling of what might be wrong?
Tips for other ways to look up sequence conservation are appreciated.
The MAF file looks like the following (the blocks before and after the region I want to look up). That is, the region I want to look up is not conserved; still, that shouldn't give an error:
s hg38.chr1 67134365 16 + 248956422 GGAGTATTATTGTGGG
s panTro4.chr1 67749272 16 + 228333871 GCAGTATTATTGTGGG
i panTro4.chr1 C 0 C 0
s rheMac3.chr1 70514523 16 + 229590362 GGAGTATTATTGTGGG
i rheMac3.chr1 C 0 C 0
s mm10.chr4 103297296 14 + 156508116 GGAACACTTTTATA--
i mm10.chr4 I 2110 C 0
s canFam3.chr5 45500988 15 - 88915250 GGAGTATTATTTTGA-
i canFam3.chr5 C 0 C 0
e rn5.chr5 126698426 2158 + 177180328 I
e monDom5.chr8 243126344 0 + 312544902 I
a score=425734.000000
s hg38.chr1 67134381 1047 + 248956422 CGACTAATCAGATAATGATTGCAAGAATTGATTAGCCAGCTCGAAATCGCAGCACAATTACCGCAGGGGCGATCAG
s panTro4.chr1 67749288 1045 + 228333871 TGACTAATCAGATAATGGTTGGAAGAATTGATTAGCCAGCTCGAAATCGCAGCACAATTACCGCAGGGGCGATCAG
i panTro4.chr1 C 0 C 0
s rheMac3.chr1 70514539 1024 + 229590362 CAACTAATCTGATAATAATTGGAAGAATTGATTAACCAGCTCGAAATCGCAGCACAATTACCGCAGGGGTGATCAG
i rheMac3.chr1 C 0 C 0
s mm10.chr4 103297310 943 + 156508116 CAACTAATCAAATAAAGGTTCAAAGAGTTAATAGGCTGACCTGAAATTATACTATAGTCTCAGTAAGAGTTGATAG
i mm10.chr4 C 0 I 882
s rn5.chr5 126700584 948 + 177180328 CAGCTAATAAAATAATGGTTCAAAGAATTAATGGGCCAACCTGAAATTGTACTATAATCTCCACAAGAGTCGATAG
i rn5.chr5 I 2158 C 0
s canFam3.chr5 45501003 1103 - 88915250 TAACCAATCAGATAACAATTGGAAGAAT-----------CTCGAAATT---TCACAATTGTTGCAAGAGTGATCAA
i canFam3.chr5 C 0 C 0
e monDom5.chr8 243126344 0 + 312544902 I
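A quick sanity check on the coordinates, as a minimal sketch in plain Python (no bx-python involved). It assumes the usual MAF convention that an "s" line gives a 0-based start and a size, so a block covers [start, start + size). The helper name clip_to_block is made up for illustration. With the two hg38.chr1 lines from the excerpt above, the query 67134378-67134397 straddles the boundary between the two blocks:

```python
# Each tuple is (start, size) copied from the two hg38.chr1 "s" lines above.
# Assumption: MAF starts are 0-based and a block covers [start, start + size).
blocks = [
    (67134365, 16),    # first hg38.chr1 block
    (67134381, 1047),  # second hg38.chr1 block
]

# The interval passed to the lookup script on the command line.
query_start, query_end = 67134378, 67134397


def clip_to_block(block_start, block_size, start, end):
    """Hypothetical helper: return the part of [start, end) that lies
    inside the block, or None if the two intervals do not overlap."""
    block_end = block_start + block_size
    lo = max(start, block_start)
    hi = min(end, block_end)
    return (lo, hi) if lo < hi else None


clipped = [clip_to_block(s, n, query_start, query_end) for s, n in blocks]
print(clipped)
# [(67134378, 67134381), (67134381, 67134397)]
```

The first block ends at 67134381, which matches the upper bound in the error message, so the error may come from slicing that block with the unclipped end coordinate 67134397. That is only a guess from the numbers and is not confirmed against the bx-python source.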
http://biopython.org/wiki/Multiple_Alignment_Format seems to be an alternative; I will try it out.
I suspect the reason is that the new maf files are whole-genome, while the old ones were split by chromosome. Will look further into this. Edit: This was not the reason.