Unique-Ify The Contents Of A Fasta-Formatted File And Associated Read Numbers

11.7 years ago by 2011101101 ▴ 110

I have a file whose format is shown below:

>PH1_2044586      3
CAAGT 
>SM1_2913489      2
CAAGT 
>PH2_3543355      2
TTCTG
>WP2_579263      1
TTCTG

The desired result looks like this:

>1    5
CAAGT
>2    3
TTCTG

11.7 years ago by David Langenberger

You can use groupBy from bedtools:

cat your.fa | paste - - | perl -ane 'print join("\t",@F)."\n";' | sort -k3 | groupBy -g 3 -c 2 -o sum | perl -ane '$a++;print ">$a\t$F[1]\n$F[0]\n";'

You are assuming that every sequence will be very short; your paste command will break on longer (multi-line) records. Also, there is no need for cat, just run paste - - < your.fa | perl ....
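
For records that do wrap, a small Perl linearizer can replace the paste step entirely; a minimal sketch (the script name linearize.pl is a placeholder, not from the thread), emitting the same tab-separated "header, count, sequence" lines the sort and groupBy steps above expect:

#!/usr/bin/env perl
# linearize.pl -- a sketch: joins wrapped sequence lines so each record
# becomes one "header<TAB>count<TAB>sequence" line, the intermediate
# format the sort/groupBy pipeline above consumes, but safe for FASTA
# records that span several lines.
use strict;
use warnings;

my (@header, $seq);

sub flush_record {
    print join("\t", @header, $seq), "\n" if @header;
}

while (my $line = <>) {
    $line =~ s/\s+$//;                 # strip newline and trailing blanks
    if ($line =~ /^>/) {
        flush_record();                # emit the previous record, if any
        @header = split ' ', $line;    # e.g. (">PH1_2044586", "3")
        $seq = '';
    } else {
        $seq .= $line;                 # concatenate wrapped sequence lines
    }
}
flush_record();                        # emit the final record

The rest of the pipeline then stays the same: perl linearize.pl your.fa | sort -k3 | groupBy -g 3 -c 2 -o sum | ...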


You're right, it can be written much more nicely. I just wanted to show one way of solving the problem, and I always start with a 'cat' or a 'more' and build up the pipe step by step, without cleaning up the previous call(s). And I assume that the sequences are NGS reads, so they will not be of great length and 'paste' should work fine here. But if there are millions of reads, the 'sort' might run forever. As I said, it's just one possibility: with millions of entries, a Perl hash needs a lot of main memory, while a 'sort' needs a lot of time. :)


A Perl hash won't use that much memory (an array will use more): storing 5 million 100 bp seqs used 1.5 GB of RAM on my machine, and if that is prohibitive you don't have to store things in memory; the downside is that disk IO is a lot slower than memory. The memory usage should be low for this data set if these are the actual read lengths.
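
When memory is prohibitive, a single streaming pass does the job once the records are linearized and sorted by sequence, so identical sequences sit next to each other; a minimal sketch (not from the thread), reading the tab-separated "header, count, sequence" lines described above:

#!/usr/bin/env perl
# A constant-memory sketch (an assumption, not from the thread): reads
# "header<TAB>count<TAB>sequence" lines pre-sorted on the sequence column
# (e.g. with sort -k3), so only one sequence and one running total are
# ever held in memory.
use strict;
use warnings;

my ($prev, $sum, $id) = (undef, 0, 0);
while (<>) {
    chomp;
    my (undef, $count, $seq) = split /\t/;
    if (defined $prev && $seq ne $prev) {
        $id++;
        print ">$id\t$sum\n$prev\n";   # emit the finished group
        $sum = 0;
    }
    $prev = $seq;
    $sum += $count;
}
if (defined $prev) {
    $id++;
    print ">$id\t$sum\n$prev\n";       # emit the last group
}

This trades RAM for the cost of the external sort, which is the same trade-off the groupBy pipeline above makes.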


Well, current NGS experiments can have up to 200 million 100 bp sequences. But you are right. Here is the command-line code with a Perl hash:

perl -ane 'if(/>/){$c=$F[1];}else{$h{$F[0]}+=$c;}END{foreach(keys %h){$a++;print ">$a\t$h{$_}\n$_\n";}}' test.fa
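
The same one-liner expanded into a script for readability (functionally identical, just commented):

#!/usr/bin/env perl
# The one-liner above, written out: remember the count from each header
# line, sum it into a hash keyed by sequence, then print numbered
# unique records.
use strict;
use warnings;

my (%counts, $count);
while (<>) {
    my @F = split;
    if (/^>/) {
        $count = $F[1];               # the count is the second header field
    } else {
        $counts{$F[0]} += $count;     # sum counts per unique sequence
    }
}
my $id = 0;
for my $seq (keys %counts) {
    $id++;
    print ">$id\t$counts{$seq}\n$seq\n";
}

Run it with the FASTA file as the argument, e.g. perl script.pl test.fa.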


You are right about NGS data sets and memory usage; I would take a different approach in that case. We don't know anything about the data/question in this case, so I think any approach should be equally fine.


Thank you for your answer. My data is small RNA, with about 700 million unique reads, but I have solved my question using David Langenberger's command.

11.7 years ago by SES 8.6k

Here is a BioPerl version:

#!/usr/bin/env perl

use strict;
use warnings;
use Bio::SeqIO;

my %seqs;    # sequence => summed read count
my $cts = 0;
my $seqio = Bio::SeqIO->new(-fh => \*DATA, -format => 'fasta');

# The read count sits in the description field of each header;
# sum it per unique sequence.
while (my $seq = $seqio->next_seq) {
    $seqs{$seq->seq} += $seq->desc;
}

# Print the unique sequences, numbered, from highest to lowest count.
for my $mer (reverse sort { $seqs{$a} <=> $seqs{$b} } keys %seqs) {
    $cts++;
    print ">$cts  $seqs{$mer}\n$mer\n";
}

__DATA__
>PH1_2044586      3
CAAGT
>SM1_2913489      2
CAAGT
>PH2_3543355      2
TTCTG
>WP2_579263      1
TTCTG

Output:

$ perl biostars67954.pl 
>1  5
CAAGT
>2  3
TTCTG
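
To run it on a file instead of the embedded __DATA__ records, only the constructor changes (the filename test.fa is a placeholder):

my $seqio = Bio::SeqIO->new(-file => 'test.fa', -format => 'fasta');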

Thank you very much. I will learn BioPerl. Thank you again!
