Creating a consensus based on 'x' number of fasta files
1
1
Entering edit mode
8.7 years ago

I am looking for a way to create a consensus sequence based on the occurrence of individual sequences for a given tag. For example, I have the following FASTA sequences in one file for a given tag (i.e. CTAGGCAC). The number in the header refers to their occurrence.

>A_2795
TCAGAAAGAACCTC

>B_10
TCAGAAAGCACCTC

>C_3
TCTGAAAGCACTTC

>Consensus
TCAGAAAGAACCTC

Manually this is not a problem. However, it would be great to be able to make use of perl or bash (including bio packages) to create a consensus based on occurrence of each sequence. I appreciate anyone's thoughts/help.

Thanks!

alignment sequence perl bash • 2.3k views
ADD COMMENT
1
Entering edit mode
8.7 years ago

for fun, using Bioalcidae: https://github.com/lindenb/jvarkit/wiki/BioAlcidae

echo -e ">A_2795\nTCAGAAAGAACCTC\n>B_10\nTCAGAAAGCACCTC\n>C_3\nTCTGAAAGCACTTC" | java -jar ~/src/jvarkit-git/dist/bioalcidae.jar -F fasta -e 'var consensus=[];while(iter.hasNext()) { var seq=iter.next();out.printlnseq.name+"\t"+seq.sequence);for(var i=0;i< seq.length();++i) {while(consensus.length <= i) consensus.push({}); var b = seq.charAt(i);if(b in consensus[i]) {consensus[i][b]++;} else {consensus[i][b]=1;} } } out.print("Cons.\t"); for(var i=0;i< consensus.length;i++) {var best=0,base="N"; for(var b in consensus[i]) {if(consensus[i][b]>best) { best= consensus[i][b];base=b;}} out.print(base);} out.println();' 


A_2795      TCAGAAAGAACCTC
B_10         TCAGAAAGCACCTC
C_3           TCTGAAAGCACTTC
Cons.        TCAGAAAGCACCTC
ADD COMMENT
0
Entering edit mode

My apologies for being long overdue, but thanks Pierre! This is exactly what I was looking for.

ADD REPLY

Login before adding your answer.

Traffic: 1660 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6