Question

Extracting information from a large fasta header formatted file

0

Entering edit mode

8.2 years ago

User 6777 ▴ 20

I have a large fasta header formatted file (input.txt) like:

>NC_23689
#
# XYZ
# Copyright (c)  BLASC
#
# Predicted binding regions
#   No.                Start         End      Length
#   1                      1          25          25
#   2                     39          47           9
#
>68469409
#
# XYZ
# Copyright (c)  BLASC
#
# Predicted binding regions
#   None.
#
# Prediction profile output:
#   Columns:
#   1 - Amino acid number
#   2 - One letter code
#   3 -  probability value
#   4 - output
#
1   M     0.1325        0
2   S     0.1341        0
3   S     0.1384        0
>68464675
#
# XYZ
# Copyright (c)  BLASC
#
# Predicted binding regions
#   No.                Start         End      Length
#   1                     13          24          12
#   2                     31          53          23
#   3                     81          95          15
#   4                    115         164          50
#
...
...

I want to extract each fasta header and its corresponding start-end value(s) (after Predicted binding regions line) in a (output.txt file). For the above (input.txt), the output will be:

NC_23689: 1-25, 39-47
68464675: 13-24, 31-53, 81-95, 115-164

I have tried some regular perl and python scripts ut it looks like not that much straightforward to me. I was wondering whats the best way to get these lines out from the input file for every single header?

Thanks

perl python • 1.7k views

ADD COMMENT • link updated 20 months ago by Ram 44k • written 8.2 years ago by User 6777 ▴ 20

1

Entering edit mode

Interesting way of describing the file (fasta header formatted). A fasta formatted file is expected to have sequence data in it (protein or DNA). This has neither.

ADD REPLY • link 8.2 years ago by GenoMax 147k

score 2 · Accepted Answer · 2016-08-30

2

Entering edit mode

8.2 years ago

Jean-Karim Heriche 27k

As your file is not a FASTA file, you can't use existing FASTA parsers. Here is a naive perl approach. Modify as appropriate.

#!/usr/bin/perl

use strict;
use warnings;

$/ = ">";
my %data;
open my $fh,"<", $ARGV[0] or die "\nERROR: Can't read file $ARGV[0]: $!";
while(my $record =<$fh>) {
  my ($header, @table) = split(/\n/,$record);
  my $read = 0;
  foreach my $i(0..$#table) {
    if ($table[$i]=~/Predicted binding region/i && $table[$i+1]=~/Start\s+End/i) {
      $read = 1;
      next
    }
    if ($read) {
      next if ($table[$i]=~/Start\s+End/i);
      my @cells = split(/\s+/,$table[$i]);
      next unless ($cells[1]);
      push @{$data{$header}},"$cells[2]-$cells[3]";
    }
  }
}
close $fh;
foreach my $header(keys %data) {
  print "$header: ",join(", ",@{$data{$header}}),"\n";
}

ADD COMMENT • link 8.2 years ago by Jean-Karim Heriche 27k

0

Entering edit mode

Exactly what I need. Thank you

ADD REPLY • link 8.2 years ago by User 6777 ▴ 20

0

Entering edit mode

Go ahead and accept the answer (use the check mark against @Jean-Karim's post.

ADD REPLY • link 8.2 years ago by GenoMax 147k