Question

parsing alignment file

3

Entering edit mode

8.2 years ago

a.rex ▴ 350

I am having some difficulty producing a script for parsing an alignment file in the following format generated from RepeatMasker. For example:

665  28.45  2.93  5.02  g5129s420  7350  7882  (1924)  C  MIR#SINE/MIR  (1)  261  28  3

  g5129s420         7350 ATCATAACAAACATTTAT--GGTGCCTCCTATGGAGCAGGGATTTTGCTT 7397   
                           v     v           i i  i v     viv    v i v v  v
C MIR#SINE/MIR       261 ATAATAACCAACATTTATTGAGCGCTTACTATGTGCCAGGCACTGTTCTA 212    

  g5129s420         7398 AGGACTCTGAACTATAT---CTTACTT-GTCTTCATTAAAAACCTTATGA 7443   
                           vi  i iv   i        i i   i  i    i  v    i     
C MIR#SINE/MIR       211 AGCGCTTTACA-TGTATTAACTCATTTAATCCTCA-CAACAACCCTATGA 164    

  g5129s420         7444 AAAAGGTACTATTATTAACTGGGGXTGGGTTGTTTAACAGATAAGAAAGC 7787   
                         iiv              v i         iii   v      i  i  i 
C MIR#SINE/MIR       163 GGTAGGTACTATTATTATCC---------CCATTTTACAGATGAGGAAAC 123    

  g5129s420         7788 TTAAGAATTAGAGAGATAAATTATCTTGCTTAAGGTAACACAGTTAACAA 7837   
                          v i v  i      i v  v  v     ii     v      i  ii  
C MIR#SINE/MIR       122 TGAGGCA-CAGAGAGGTTAAGTAACTTGCCCAAGGTCACACAGCTAGTAA 74     

  g5129s420         7838 GCATTAG-GTCAAAGTTTGAACTCGGGCAGTCTGACTACAGAGCCC 7882   
                          iivi    i iiii  i    i i         i  v     i  
C MIR#SINE/MIR        73 GTGGCAGAGCCGGGATTCGAACCCAGGCAGTCTGGCTCCAGAGTCC 28     

Transitions / transversions = 1.96 (45 / 23)
Gap_init rate = 0.03 (8 / 234), avg. gap size = 2.38 (19 / 8)

I would like to parse the file using BioPython such that I obtain the chromosome/scaffold name g5129s420, start + end (7350 7882), and the Transitions/transversions. Any ideas on how to write this script would be most welcome, as I am a complete novice to scripting.

genome python parsing • 1.9k views

ADD COMMENT • link 8.2 years ago by a.rex ▴ 350

0

Entering edit mode

8.2 years ago

a.rex ▴ 350

Thank you so much for this - helpful to see examples of how to parse this file. Unfortunately, I wish to parse out all sequences in the file and do not have the names of particular scaffolds. Is there a simpler way of doing this using BioPython?

ADD COMMENT • link 8.2 years ago by a.rex ▴ 350

0

Entering edit mode

You can use above script to parse all the sequences in the file but you need sequence names. It would be very difficult to parse without seq names, some BioPython module may do.

ADD REPLY • link 8.2 years ago by second_exon ▴ 210

0

Entering edit mode

Please use ADD COMMENT to reply to earlier posts, as such this thread remains logically structured and easy to follow.

ADD REPLY • link 8.2 years ago by WouterDeCoster 47k

score 2 · Accepted Answer · 2016-09-12

There may be some BioPython modules or some other parser exists for this job. Here is my solution based on Python 3.x if you know chromosome/scaffold name

seqname = ['g5129s420','g5129s421'] #list sequence names here
dic, trans = {}, {}
for names in seqname:
    dic[names] = []
    trans[names] = []
with open("file.txt") as f:
    for line in f:
        line = line.rstrip().lstrip().split(' ')
        line = [item for item in line if item is not '']
        if len(line)>0:
            #print(line)
            if line[0] in seqname:
                dic[line[0]].append(line[1])
                dic[line[0]].append(line[-1])
                seqname1 = line[0]
            elif line[0] == 'Transitions':
                trans[seqname1].append(line[4])
                trans[seqname1].append(line[5].strip('('))
                trans[seqname1].append(line[7].strip(')'))

for key, values in dic.items():
    if len(dic[key])>0:
        print(key,':', min(values), max(values), trans[key][0], trans[key][1], trans[key][2])

Output:

g5129s420 : 7350 7882 1.96 45 23