Identify Domains In Dna Sequences
1
0
Entering edit mode
11.9 years ago

Dear all,

I have the following output from a HMM.

>hg19_dna range=chr7:113770949-115775846 5'pad=0 3'pad=0 strand=+ repeatMasking=none
?V UUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUU

?V UUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUU

?V UUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUU

?V UUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUU

?V UUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUU

and also with Ns. For example:

?V
NNNNNNNNNNNNNNNNNNNNNNNN

These are all stored in a file. I would like to identify and take those regions with Us and those with Ns and make a BED file out of those according to the coordinates that are specified above. Also to generalize for more symbols, not only two (Us and Ns) but for now two is enough. Anyone that could help???

Thank you in advance

perl python script • 2.3k views
ADD COMMENT
0
Entering edit mode
11.9 years ago

You could convert your FASTA input (assuming it is FASTA) with the following Perl script:

#!/usr/bin/env perl

use strict;
use warnings;

while (<>) {
    if ($_ =~ /^>/) {
        chomp;
        my $ln = $_;
        $ln =~ s/^>//s;
        my ($id, $rangeField, $fivePad, $threePad, $strandField, $maskFlag) = split(/[\s]/, $ln);
        my ($rangeKey, $chr, $start, $stop) = split(/[=|:|-]/, $rangeField);
        my ($strandKey, $strand) = split(/[=]/, $strandField);
        my $score = 0;
        print STDOUT join("\t", ($chr, $start, $stop, $id, $score, $strand))."\n";
    }
}

To use it, for example:

$ hmm_fasta2bed.pl < my_hmm.fa
chr7    113770949       115775846       hg19_dna        0       +
...

To save the result to a sorted BED file, for example:

$ hmm_fasta2bed.pl < my_hmm.fa | sort-bed - > my_hmm.bed
ADD COMMENT

Login before adding your answer.

Traffic: 3634 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6