Counting N'S Within Fasta
10
4
Entering edit mode
12.6 years ago
PoGibas 5.1k

My multiple fasta looks like that:

>ptnt1  
ATCTAGTCGATTTTTCGTCTAGTnnnnnnnnnnnnnnnnnnnnnCTAGTCCCTATGCCTGATATA  
>ptnt2  
GGCGGATGGTnnnnnnnnnnnnnnnnnTGTCTGGAnnnnnnnnnnnnnnnnCATGCTAG  
>ptnt3  
CACTGATTGGGTAACACTTAGTATGCTTAGCCAnnnnnnnnnnnnnn

I would like to:
1.count the length of every sequence;
2.count the ratio of nnnnn & normalACTG (this one is too tricky for me. I don't know how to count n's). Perl preferably (sorry guys forgot to mention this before).

Wanted output would look like (name; length; length of n's; ratio n/ACTG). Hope someone can help me with this.

fasta counts perl • 21k views
ADD COMMENT
0
Entering edit mode

The FASTA format includes > before the sequence name.

ADD REPLY
15
Entering edit mode
12.6 years ago

You can parse the fasta file with BioPython or whatever custom parser you have. Then use the count method in python to count the occurrence of nucleotides. For example:

#would give you counts of 'N'. I used lower() to convert the sequence to lowercase just in case it's not.
nCount = fastaSequence.lower().count('n') 
#would give you length of the sequence
length = len(fastaSequence)
ADD COMMENT
13
Entering edit mode
12.6 years ago
Neilfws 49k

EDIT: added line to convert n to N before count.

First, if you want a Perl solution: learn Bioperl and in particular study the Bio::SeqIO module. When you have mastered that, you will be able to read in any FASTA file and do whatever you like with it. In other words, your problem is simply a specific case of the more general problem "How to read a sequence file and do (operation X) to it."

Now, some Bioperl code, assuming that your sequences are in the file seqs.fa:

#!/usr/bin/perl -w 
use strict;
use Bio::SeqIO;

my $seqio = Bio::SeqIO->new(-file => "seqs.fa", -format => 'fasta');
while(my $seq = $seqio->next_seq) {
  my $s = $seq->seq;
  $s =~ s/n/N/g;
  my $id  = $seq->display_id; # name (ID)
  my $len = $seq->length; # length
  my %count;
  $count{$_}++ foreach split //, $s; # count bases
  $count{"N"} = 0 if !exists($count{"N"}); # create N = 0 if no N in sequence
  my $ratio = $count{"N"} / ($len - $count{"N"}); # ratio of N to not N
  my @output = ($id, $len, $count{"N"}, $ratio);
  print join(",", @output), "\n";
}

Result using your sequences:

ptNt1,65,21,0.477272727272727
ptNt2,59,33,1.26923076923077
ptNt3,47,14,0.424242424242424
ADD COMMENT
1
Entering edit mode

I don't think this will work the way it is written, you probably need $count{"n"}. Also, I don't mean to be critical because the approach is clever, but it does not seem efficient to read the sequence into a hash just to get a count. If you try this on a large number of sequences (or really long sequences), transliterate will be much faster.

ADD REPLY
1
Entering edit mode

You should be critical as you are correct on both points. I changed n to N while testing the code and forgot to change it back - now fixed. And yes, transliteration is better for counting large sequences, provided that altering the string is not a problem.

ADD REPLY
0
Entering edit mode

I assume you mean this by transliteration?

$countN= $seq->seq =~ tr/Nn/Nn/;

Also then no need for "$s =~ s/n/N/g;"

ADD REPLY
0
Entering edit mode

Yes, that's the one.

ADD REPLY
9
Entering edit mode
12.6 years ago

in C:

#include <stdio.h>
int main(int argc,char** argv)
    {
    int N=0;
    int T=0;
    for(;;)
        {
        int c=fgetc(stdin);
        switch(c)
            {
            case EOF: case '>':
                {
                if(T>0) printf("\t%d\t%d\t%f\n",N,T,N/(double)T);
                if(c==EOF) return 0;
                N=0;
                T=0;
                while((c=fgetc(stdin))!=EOF && c!='\n')
                    {
                    fputc(c,stdout);
                    }
                fputc('\t',stdout);
                break;
                }
            case ' ':case '\n': case '\r': break;
            case 'N': case 'n': ++N;/* continue */
            default: ++T;
            }
        }
    return 0;
    }

.

gcc -O3 -Wall input.c

.

./a.out < hg19.fa
chr1        23970000    249250621    0.096168
chr2        4994855    243199373    0.020538
chr3        3225295    198022430    0.016288
chr4        3492600    191154276    0.018271
chr5        3220000    180915260    0.017798
chr6        3720001    171115067    0.021740
chr7        3785000    159138663    0.023784
chr8        3475100    146364022    0.023743
chr9        21070000    141213431    0.149207
chr10        4220009    135534747    0.031136
chr11        3877000    135006516    0.028717
chr12        3370502    133851895    0.025181
chr13        19580000    115169878    0.170010
chr14        19060000    107349540    0.177551
chr15        20836626    102531392    0.203222
chr16        11470000    90354753    0.126944
chr17        3400000    81195210    0.041874
chr18        3420019    78077248    0.043803
chr19        3320000    59128983    0.056148
chr20        3520000    63025520    0.055850
chr21        13023253    48129895    0.270586
chr22        16410021    51304566    0.319855
chrX        4170000    155270560    0.026856
chrY        33720000    59373566    0.567930
chrM        0    16571    0.000000
ADD COMMENT
8
Entering edit mode
12.6 years ago
lh3 33k

With seqtk:

seqtk comp input.fa | awk '{x=$3+$4+$5+$6;y=$2;print $1,y-x,y,(y-x)/y}'

Identical output to Pierre's for GRCh37. seqtk comp reads gzipped files and both fasta and fastq. It also reports some other statistics.

For long sequences, C can be several times faster than scripts and even faster than Bio*.

ADD COMMENT
5
Entering edit mode
12.6 years ago

Use Biopieces:

Per sequence:

read_fasta -i input.fna | analyze_seq

Summary:

read_fasta -i input.fna | analyze_seq | analyze_vals -k SEQ_LEN,HARD_MASK% -x
ADD COMMENT
4
Entering edit mode
12.6 years ago
Anima Mundi ★ 2.9k

This Python function could help you. As it is, it works as a script.py that calls the function and takes the input from the terminal. If you remove the raw_input statement and the function call (last line) you obtain the function per se. Hope this helps.

def Counter():

    #Counter evaluates N nucleotides ratio.

    initial_string = raw_input("Enter the string:\n")

    print

    string = str.upper(initial_string)

    n_A = float(string.count("A"))
    n_T = float(string.count("T"))
    n_C = float(string.count("C"))
    n_G = float(string.count("G"))
    n_N = float(string.count("N"))

    print "A occurred", int(n_A), "times"
    print "T occurred", int(n_T), "times"
    print "C occurred", int(n_C), "times"
    print "G occurred", int(n_G), "times"
    print "N occurred", int(n_N), "times"
    print

    tot_len_string_nucl = n_A + n_T + n_C + n_G + n_N #Counts just the nucleotides.
    tot_len_string_all = float(len(string)) #Includes any "garbage" typing.

    tot_differ = tot_len_string_all - tot_len_string_nucl

    if tot_differ != 0.0:
        print "Non-nucleotide typing are present in the sequence. They were ignored"
        print "Total (nucleotide) length is", int(tot_len_string_nucl)
    else:
        print "Total length is", int(tot_len_string_nucl)

    print

    perc_A = 100*(n_A / tot_len_string_nucl)
    perc_T = 100*(n_T / tot_len_string_nucl)
    perc_C = 100*(n_C / tot_len_string_nucl)
    perc_G = 100*(n_G / tot_len_string_nucl)
    perc_N = 100*(n_N / tot_len_string_nucl)

    print "%A = " + str(perc_A) + "%"
    print "%T = " + str(perc_T) + "%"
    print "%C = " + str(perc_C) + "%"
    print "%G = " + str(perc_G) + "%"
    print "%N = " + str(perc_N) + "%"
    print


Counter()
ADD COMMENT
4
Entering edit mode
12.6 years ago
SES 8.6k

Here is another BioPerl example that may be easier to understand. This will print the percentage of Ns also, which is probably more informative than the ratio.

#!/usr/bin/env perl

use strict;
use warnings;
use Bio::SeqIO;

my $usage = "$0 infile\n";
my $in = shift or die $usage;

my $seq_in = Bio::SeqIO->new(-file => $in, -format => 'fasta');

my ($n_count, $total, $seq_ct, $n_ratio, $n_ratio_t, $n_perc, $n_perc_t) = (0, 0, 0, 0, 0, 0, 0);

print "Seq_ID\tLength\tN_count\tN_ratio\tN_perc\n";

while(my $seq = $seq_in->next_seq) {
    if ($seq->length > 0) {
        $seq_ct++;
        $total += $seq->length;
        $n_count = ($seq->seq =~ tr/Nn//);
        $n_perc  = sprintf("%.2f",$n_count/$seq->length);
        $n_ratio = sprintf("%.2f",$n_count/($seq->length - $n_count));
        $n_ratio_t += $n_ratio;
        $n_perc_t  += $n_perc; 
        print join("\t",($seq->id,$seq->length,$n_count,$n_ratio,$n_perc)),"\n";
    }
}

my $len_ave     = sprintf("%.0f",$total/$seq_ct);
my $n_ratio_ave = sprintf("%.2f",$n_ratio_t/$seq_ct);
my $n_perc_ave  = sprintf("%.2f",$n_perc_t/$seq_ct);

print "\nAverage length, N-ratio and N-percent for $in :   $len_ave\t$n_ratio_ave\t$n_perc_ave\n";

Assuming your sequences are in a file called "N.fas" save the script as "bp_count_n.pl" (or whatever you want) then execute it:

perl bp_count_n.pl N.fas

and see the results (which you could redirect to a file).

Seq_ID    Length    N_count    N_ratio    N_perc
ptnt1    65    21    0.48    0.32
ptnt2    59    33    1.27    0.56
ptnt3    47    14    0.42    0.30

Average length, N-ratio and N-percent for N.fas : 57    0.72    0.39
ADD COMMENT
4
Entering edit mode
11.8 years ago

There is also a tool from UCSC: faCount

Example:

faCount reads.fa

#seq    len    A    C    G    T    N    cpg
ptnt1    65    9    10    7    18    21    2
ptnt2    59    4    4    11    7    33    1
ptnt3    47    9    7    7    10    14    0
total    171    22    21    25    35    68    3

Your question (Wanted output would look like (name; length; length of n's; ratio n/ACTG):

faCount reads.fa | perl -ane 'next if(/\#/); print "$F[0]\t$F[1]\t$F[6]\t".($F[6]/($F[1]-$F[6]))."\n";'

ptnt1    65    21    0.477272727272727
ptnt2    59    33    1.26923076923077
ptnt3    47    14    0.424242424242424
total    171    68    0.660194174757282
ADD COMMENT
2
Entering edit mode
12.6 years ago
ngsgene ▴ 380

If you were to use perl without bio perl, using an if ($l !~ />/) loop in the script can get you to your sequence after the fasta description and length() can give you the length. For counting n's you can split the sequence and increment a hash when an n comes while another hash for counting a/c/g/t

Store all this information in variables for you to print, for name you can use ($l =~ />($1)/) and then grab $1 and print it for the name, length from length(), length of n's can be determined from the hash, and you can calculate the ratio using both the hashes.

The key is to print the variables without over writing them.

ADD COMMENT
2
Entering edit mode
11.8 years ago
fo3c ▴ 450

Something simpler, in perl as requested, without bio- dependencies.

#!/usr/bin/perl

my ($h, $n, $l);

open(I,$ARGV[0]) or die($!);
while(<I>){
  chomp;
  if(/^>/){
    $h=substr($_,1);
  }else{
    $n=($_=~tr/nN/nN/);
    $l=length($_);
    print $h,"\t",$l,"\t",$n,"\t",$n/($l-$n),"\n";
  }
}
close(I);

To use it: $ perl script_name.pl sequences.fasta

ADD COMMENT

Login before adding your answer.

Traffic: 2573 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6