Hello,
I have 10 FASTA files of sequenced reads, with read sizes from 15 to 35 nt. I have combined the reads, collapsed them into unique reads, and filtered for unique reads 18-26 bp long. Now I want to count each unique read's appearances in all the FASTA files and make a table with sample names as columns and reads as rows. I tried using "grep -w "sequence" filename" to count the tags, but this takes a long time. Does anyone know how to do this faster?
Edit: Sorry for the confusion. Here are the input and output.
This is the query file containing the unique sequences. I have more than a million such unique sequences.
Query:
>tag1
TCGGA
>tag2
TCTCA
>tag3
TCTCGC
These are the sample files. For example, I am showing 3 files here, but I have more than 20 such files, each containing more than 10 million sequences.
File1:
>file1_id1
TCGGA
>file1_id2
TCGGAT
>file1_id3
TCTCA
>file1_id4
TCTCA
File2:
>file2_id1
TCTCA
>file2_id2
TCTCA
>file2_id3
TCTCACTA
>file2_id4
TCTCGC
>file2_id5
TCTCGCCTAT
>file2_id6
TCTCGC
File3:
>file3_id1
TCGGA
>file3_id2
TCGGAT
>file3_id3
TCTCGC
>file3_id4
TCTCGCCTAT
>file3_id5
TCTCGC
I need the following output; the search has to be exact for the count.
Output:
tag sequence file1 file2 file3
tag1 TCGGA 1 0 1
tag2 TCTCA 2 2 0
tag3 TCTCGC 0 2 2
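A single awk pass should be much faster than running grep once per tag: it loads the query sequences into a hash table and then streams each sample file exactly once. Here is a minimal sketch, assuming each FASTA record has its sequence on one line (true for reads this short) and using hypothetical filenames query.fa, file1.fa, file2.fa, file3.fa:

awk '
    # First file on the command line is the query FASTA:
    # remember every tag name and its sequence, in input order.
    FNR == NR {
        if (/^>/) { tag = substr($0, 2) }
        else      { order[++n] = $0; name[$0] = tag }
        next
    }
    # Each later file is a sample; record its name once.
    FNR == 1 { nf++; files[nf] = FILENAME }
    # Exact match: the whole line must equal a query sequence.
    !/^>/ && ($0 in name) { count[$0, nf]++ }
    END {
        printf "tag\tsequence"
        for (f = 1; f <= nf; f++) printf "\t%s", files[f]
        printf "\n"
        for (i = 1; i <= n; i++) {
            seq = order[i]
            printf "%s\t%s", name[seq], seq
            for (f = 1; f <= nf; f++) printf "\t%d", count[seq, f] + 0
            printf "\n"
        }
    }
' query.fa file1.fa file2.fa file3.fa > counts.tsv

Because every sequence lookup is a hash-table hit, the total work is one scan over each file, instead of one full scan per tag as with grep, which matters when you have a million tags and 10 million reads per file.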
Perfect! This works really well and is easy to understand as well. Something new I learnt today. Thank you!
You're most welcome!