Extracting information from a large fasta header formatted file
1
0
Entering edit mode
8.2 years ago
User 6777 ▴ 20

I have a large fasta header formatted file (input.txt) like:

>NC_23689
#
# XYZ
# Copyright (c)  BLASC
#
# Predicted binding regions
#   No.                Start         End      Length
#   1                      1          25          25
#   2                     39          47           9
#
>68469409
#
# XYZ
# Copyright (c)  BLASC
#
# Predicted binding regions
#   None.
#
# Prediction profile output:
#   Columns:
#   1 - Amino acid number
#   2 - One letter code
#   3 -  probability value
#   4 - output
#
1   M     0.1325        0
2   S     0.1341        0
3   S     0.1384        0
>68464675
#
# XYZ
# Copyright (c)  BLASC
#
# Predicted binding regions
#   No.                Start         End      Length
#   1                     13          24          12
#   2                     31          53          23
#   3                     81          95          15
#   4                    115         164          50
#
...
...

I want to extract each fasta header and its corresponding start-end value(s) (after Predicted binding regions line) in a (output.txt file). For the above (input.txt), the output will be:

NC_23689: 1-25, 39-47
68464675: 13-24, 31-53, 81-95, 115-164

I have tried some regular perl and python scripts ut it looks like not that much straightforward to me. I was wondering whats the best way to get these lines out from the input file for every single header?

Thanks

perl python • 1.7k views
ADD COMMENT
1
Entering edit mode

Interesting way of describing the file (fasta header formatted). A fasta formatted file is expected to have sequence data in it (protein or DNA). This has neither.

ADD REPLY
2
Entering edit mode
8.2 years ago

As your file is not a FASTA file, you can't use existing FASTA parsers. Here is a naive perl approach. Modify as appropriate.

#!/usr/bin/perl

use strict;
use warnings;

$/ = ">";
my %data;
open my $fh,"<", $ARGV[0] or die "\nERROR: Can't read file $ARGV[0]: $!";
while(my $record =<$fh>) {
  my ($header, @table) = split(/\n/,$record);
  my $read = 0;
  foreach my $i(0..$#table) {
    if ($table[$i]=~/Predicted binding region/i && $table[$i+1]=~/Start\s+End/i) {
      $read = 1;
      next
    }
    if ($read) {
      next if ($table[$i]=~/Start\s+End/i);
      my @cells = split(/\s+/,$table[$i]);
      next unless ($cells[1]);
      push @{$data{$header}},"$cells[2]-$cells[3]";
    }
  }
}
close $fh;
foreach my $header(keys %data) {
  print "$header: ",join(", ",@{$data{$header}}),"\n";
}
ADD COMMENT
0
Entering edit mode

Exactly what I need. Thank you

ADD REPLY
0
Entering edit mode

Go ahead and accept the answer (use the check mark against @Jean-Karim's post.

ADD REPLY

Login before adding your answer.

Traffic: 2649 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6