Hello,
I have 10 FASTA files of sequenced reads, with read sizes from 15 to 35 nt. I have combined the reads, collapsed them into unique reads, and filtered for unique reads 18-26 bp long. Now I want to count each unique read's appearances in all the FASTA files and make a table with sample names as columns and reads as rows. I tried using "grep -w "sequence" filename" to count the tags, but this takes a long time. Does anyone know how to do this faster?
Edit: Sorry for the confusion. Here are the input and output.
This is the query file containing the unique sequences. I have more than a million such unique sequences.
Query:
>tag1
TCGGA
>tag2
TCTCA
>tag3
TCTCGC
These are the sample files. For example, I am showing 3 files here, but I have more than 20 such files, each containing more than 10 million sequences.
File1:
>file1_id1
TCGGA
>file1_id2
TCGGAT
>file1_id3
TCTCA
>file1_id4
TCTCA
File2:
>file2_id1
TCTCA
>file2_id2
TCTCA
>file2_id3
TCTCACTA
>file2_id4
TCTCGC
>file2_id5
TCTCGCCTAT
>file2_id6
TCTCGC
File3:
>file3_id1
TCGGA
>file3_id2
TCGGAT
>file3_id3
TCTCGC
>file3_id4
TCTCGCCTAT
>file3_id5
TCTCGC
I need the following output; the search has to be exact for the count.
Output:
tag sequence file1 file2 file3
tag1 TCGGA 1 0 1
tag2 TCTCA 2 2 0
tag3 TCTCGC 0 2 2
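A single awk pass should be much faster than running grep once per tag: it loads the query sequences into a hash table and then streams each sample file exactly once. Here is a minimal sketch, assuming each FASTA record has its sequence on one line (true for reads this short) and using hypothetical filenames query.fa, file1.fa, file2.fa, file3.fa:

awk '
    # First file on the command line is the query FASTA:
    # remember every tag name and its sequence, in input order.
    FNR == NR {
        if (/^>/) { tag = substr($0, 2) }
        else      { order[++n] = $0; name[$0] = tag }
        next
    }
    # Each later file is a sample; record its name once.
    FNR == 1 { nf++; files[nf] = FILENAME }
    # Exact match: the whole line must equal a query sequence.
    !/^>/ && ($0 in name) { count[$0, nf]++ }
    END {
        printf "tag\tsequence"
        for (f = 1; f <= nf; f++) printf "\t%s", files[f]
        printf "\n"
        for (i = 1; i <= n; i++) {
            seq = order[i]
            printf "%s\t%s", name[seq], seq
            for (f = 1; f <= nf; f++) printf "\t%d", count[seq, f] + 0
            printf "\n"
        }
    }
' query.fa file1.fa file2.fa file3.fa > counts.tsv

Because every sequence lookup is a hash-table hit, the total work is one scan over each file, instead of one full scan per tag as with grep, which matters when you have a million tags and 10 million reads per file.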
Perfect! This works really well and is easy to understand as well. Something new I learnt today. Thank you!
You're most welcome!