Unique-Ify The Contents Of A Fasta-Formatted File And Associated Read Numbers

11.7 years ago by 2011101101 ▴ 110

I have a file whose format is shown below:

>PH1_2044586      3
CAAGT 
>SM1_2913489      2
CAAGT 
>PH2_3543355      2
TTCTG
>WP2_579263      1
TTCTG

The desired result looks like this:

>1    5
CAAGT
>2    3
TTCTG

11.7 years ago by David Langenberger

You can use groupBy from bedtools:

cat your.fa | paste - - | perl -ane 'print join("\t",@F)."\n";' | sort -k3 | groupBy -g 3 -c 2 -o sum | perl -ane '$a++;print ">$a\t$F[1]\n$F[0]\n";'

You are assuming that every sequence will be very short; your paste command will break on longer (multi-line) records. Also, there is no need for cat, just run paste - - < your.fa | perl ....
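
For records that do wrap, a small Perl linearizer can replace the paste step entirely; a minimal sketch (the script name linearize.pl is a placeholder, not from the thread), emitting the same tab-separated "header, count, sequence" lines the sort and groupBy steps above expect:

#!/usr/bin/env perl
# linearize.pl -- a sketch: joins wrapped sequence lines so each record
# becomes one "header<TAB>count<TAB>sequence" line, the intermediate
# format the sort/groupBy pipeline above consumes, but safe for FASTA
# records that span several lines.
use strict;
use warnings;

my (@header, $seq);

sub flush_record {
    print join("\t", @header, $seq), "\n" if @header;
}

while (my $line = <>) {
    $line =~ s/\s+$//;                 # strip newline and trailing blanks
    if ($line =~ /^>/) {
        flush_record();                # emit the previous record, if any
        @header = split ' ', $line;    # e.g. (">PH1_2044586", "3")
        $seq = '';
    } else {
        $seq .= $line;                 # concatenate wrapped sequence lines
    }
}
flush_record();                        # emit the final record

The rest of the pipeline then stays the same: perl linearize.pl your.fa | sort -k3 | groupBy -g 3 -c 2 -o sum | ...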


You're right, it can be written much more nicely. I just wanted to show one way of solving the problem, and I always start with a 'cat' or a 'more' and build up the pipe step by step, without cleaning up the previous call(s). And I assume that the sequences are NGS reads, so they will not be of great length and 'paste' should work fine here. But if there are millions of reads, the 'sort' might run forever. As I said, it's just one possibility: with millions of entries, a Perl hash needs a lot of main memory, while a 'sort' needs a lot of time. :)


A Perl hash won't use that much memory (an array will use more): storing 5 million 100 bp seqs used 1.5 GB of RAM on my machine, and if that is prohibitive you don't have to store things in memory; the downside is that disk IO is a lot slower than memory. The memory usage should be low for this data set if these are the actual read lengths.
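
When memory is prohibitive, a single streaming pass does the job once the records are linearized and sorted by sequence, so identical sequences sit next to each other; a minimal sketch (not from the thread), reading the tab-separated "header, count, sequence" lines described above:

#!/usr/bin/env perl
# A constant-memory sketch (an assumption, not from the thread): reads
# "header<TAB>count<TAB>sequence" lines pre-sorted on the sequence column
# (e.g. with sort -k3), so only one sequence and one running total are
# ever held in memory.
use strict;
use warnings;

my ($prev, $sum, $id) = (undef, 0, 0);
while (<>) {
    chomp;
    my (undef, $count, $seq) = split /\t/;
    if (defined $prev && $seq ne $prev) {
        $id++;
        print ">$id\t$sum\n$prev\n";   # emit the finished group
        $sum = 0;
    }
    $prev = $seq;
    $sum += $count;
}
if (defined $prev) {
    $id++;
    print ">$id\t$sum\n$prev\n";       # emit the last group
}

This trades RAM for the cost of the external sort, which is the same trade-off the groupBy pipeline above makes.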


Well, current NGS experiments can have up to 200 million 100 bp sequences. But you are right. Here is the command-line code with a Perl hash:

perl -ane 'if(/>/){$c=$F[1];}else{$h{$F[0]}+=$c;}END{foreach(keys %h){$a++;print ">$a\t$h{$_}\n$_\n";}}' test.fa
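
The same one-liner expanded into a script for readability (functionally identical, just commented):

#!/usr/bin/env perl
# The one-liner above, written out: remember the count from each header
# line, sum it into a hash keyed by sequence, then print numbered
# unique records.
use strict;
use warnings;

my (%counts, $count);
while (<>) {
    my @F = split;
    if (/^>/) {
        $count = $F[1];               # the count is the second header field
    } else {
        $counts{$F[0]} += $count;     # sum counts per unique sequence
    }
}
my $id = 0;
for my $seq (keys %counts) {
    $id++;
    print ">$id\t$counts{$seq}\n$seq\n";
}

Run it with the FASTA file as the argument, e.g. perl script.pl test.fa.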


You are right about NGS data sets and memory usage; I would take a different approach in that case. We don't know anything about the data/question in this case, so I think any approach should be equally fine.


Thank you for your answer. My data is small RNA, with about 700 million unique reads, but I have solved my question using David Langenberger's command.

11.7 years ago by SES 8.6k

Here is a BioPerl version:

#!/usr/bin/env perl

use strict;
use warnings;
use Bio::SeqIO;

my %seqs;    # sequence => summed read count
my $cts = 0;
my $seqio = Bio::SeqIO->new(-fh => \*DATA, -format => 'fasta');

# The read count sits in the description field of each header;
# sum it per unique sequence.
while (my $seq = $seqio->next_seq) {
    $seqs{$seq->seq} += $seq->desc;
}

# Print the unique sequences, numbered, from highest to lowest count.
for my $mer (reverse sort { $seqs{$a} <=> $seqs{$b} } keys %seqs) {
    $cts++;
    print ">$cts  $seqs{$mer}\n$mer\n";
}

__DATA__
>PH1_2044586      3
CAAGT
>SM1_2913489      2
CAAGT
>PH2_3543355      2
TTCTG
>WP2_579263      1
TTCTG

Output:

$ perl biostars67954.pl 
>1  5
CAAGT
>2  3
TTCTG
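
To run it on a file instead of the embedded __DATA__ records, only the constructor changes (the filename test.fa is a placeholder):

my $seqio = Bio::SeqIO->new(-file => 'test.fa', -format => 'fasta');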

Thank you very much. I will learn BioPerl. Thank you again!
