11.7 years ago
2011101101
I have a file and its format is shown below:
>PH1_2044586 3
CAAGT
>SM1_2913489 2
CAAGT
>PH2_3543355 2
TTCTG
>WP2_579263 1
TTCTG
The result I want is like this:
>1 5
CAAGT
>2 3
TTCTG
You are assuming that every sequence will be very short; your paste command will break on longer ones. Also, there is no need to use cat, just run paste - - your.fa | perl ...
You're right, it can be written much nicer. I just wanted to show one way of solving the problem, and I always start with a 'cat' or a 'more' and build up the pipe step by step, without cleaning up the previous call(s). And I assume the sequences are NGS reads, so they will not be of great length and 'paste' should work fine here. But I think that with millions of reads the 'sort' might run forever. As I said... just one possibility. With millions of entries, a Perl hash needs a lot of main memory and a 'sort' needs a lot of time... :)
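For reference, a minimal sketch of the kind of paste/sort pipeline being discussed might look like the following. This is not the exact command from either post; it assumes every record is exactly two lines (header, then sequence) and uses a hypothetical file name test.fa:

    # join header and sequence onto one line, group identical sequences,
    # then sum the per-read counts within each group
    paste - - < test.fa \
        | sort -k3,3 \
        | perl -ane '
            # after paste/sort: $F[0] = old id, $F[1] = count, $F[2] = sequence
            if (defined $prev && $F[2] ne $prev) {
                # sequence changed: emit the previous group with its summed count
                printf ">%d %d\n%s\n", ++$i, $sum, $prev;
                $sum = 0;
            }
            $prev  = $F[2];
            $sum  += $F[1];
            END {
                # flush the last group
                printf ">%d %d\n%s\n", ++$i, $sum, $prev if defined $prev;
            }
        '

Because the reads are grouped by sort on disk rather than held in a hash, only one group is in memory at a time, which is the trade-off being debated above.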
A Perl hash won't use that much memory (an array would use more): storing 5 million 100 bp sequences used 1.5 GB of RAM on my machine. And if that is prohibitive, you don't have to store things in memory at all; the downside there is that memory is a lot faster than disk IO. The memory usage should be low for this data set if these are the actual read lengths.
Well... current NGS experiments can have up to 200 million 100 bp sequences. But you are right. Here is the command-line code with a Perl hash:

    perl -ane 'if(/>/){$c=$F[1];}else{$h{$F[0]}+=$c;}END{foreach(keys %h){$a++;print ">$a\t$h{$_}\n$_\n";}}' test.fa
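Laid out for readability, the same one-liner reads as follows (comments added; the logic is unchanged):

    perl -ane '
        if (/>/) {
            # header line: remember the count in the second column
            $c = $F[1];
        } else {
            # sequence line: add the remembered count to this sequence
            $h{$F[0]} += $c;
        }
        END {
            # print each distinct sequence with a new numeric id and its summed count
            foreach (keys %h) {
                $a++;
                print ">$a\t$h{$_}\n$_\n";
            }
        }
    ' test.fa

Note that keys %h returns the sequences in arbitrary order, so the numbering of the collapsed records is not guaranteed to match the order shown in the question.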
You are right about NGS data sets and memory usage; I would take a different approach in that case. We don't know anything about the data/question in this case, so I think any approach should be equally fine.
Thank you for your answer. My data is small RNA with about 700 million unique reads, but I have solved my question by using David Langenberger's command.