I was asked to identify which ASVs came from which samples. My understanding is that an ASV can be present in multiple samples, but that led me to the question of whether an ASV hit in a sample corresponds to 100% alignment identity. That is, does DADA2 count only EXACT matches when it builds the counts table?
For example, if you had the following table:
          ASV_1  ASV_2  ASV_3
Sample_1      0    100      5
Sample_2      1     10    200
Sample_3     10     20     30
Consider ASV_2 to be an amplicon sequence variant, GTCATGCATGCAGAGAGACGAGTCA
Would that mean that 100 paired-end reads from Sample_1 mapped to ASV_2 with this EXACT sequence at 100% identity?
That would also mean there were 10 paired-end reads from Sample_2 and 20 paired-end reads from Sample_3 that mapped at 100% identity.
That is, 100, 10, and 20 paired-end reads from Sample_1, Sample_2, and Sample_3, respectively, aligned at 100% identity to ASV_2.
I don't think "mapped" is the right term for what dada2 actually does. The ASV is a product of a single read-pair. The fact that one ASV is found in two different samples is because read-pairs from these samples generate the exact same sequence.
That makes sense and thank you for clarifying. I know the implementation is much more sophisticated, but is this the general idea of how it works?
The approach used by dada2 is a little bit different. Basically, for each sample you create a sample-by-sequence matrix by counting how many times a certain sequence appears in that sample (dada2 does that by collapsing identical sequences), and then you merge the per-sample matrices into a single matrix (the "OTU" table). See multiSample.R for a detailed description of the workflow.
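To make the counting-and-merging idea concrete, here is a minimal Python sketch. It is not dada2 code and does not model denoising or read merging; it only assumes that each sample has already been reduced to a list of final sequences. The function names `count_sequences` and `merge_tables` and the toy sequences are made up for illustration.

```python
from collections import Counter


def count_sequences(sequences):
    """Collapse identical sequences into a {sequence: count} mapping."""
    return Counter(sequences)


def merge_tables(per_sample_counts):
    """Merge per-sample counts into one sample-by-sequence table.

    Returns the sorted list of column sequences and a dict mapping each
    sample name to its row of counts (0 where the sequence is absent).
    """
    all_seqs = sorted({seq for counts in per_sample_counts.values() for seq in counts})
    table = {
        sample: [counts.get(seq, 0) for seq in all_seqs]
        for sample, counts in per_sample_counts.items()
    }
    return all_seqs, table


# Toy input (hypothetical sequences, not real data)
samples = {
    "Sample_1": ["ACGT", "ACGT", "TTGA"],
    "Sample_2": ["ACGT", "GGCC", "GGCC"],
}

per_sample = {name: count_sequences(seqs) for name, seqs in samples.items()}
columns, table = merge_tables(per_sample)

print(columns)
for sample, row in table.items():
    print(sample, row)
```

A sequence ends up in the same column for two samples only when the strings are identical, which is why the same ASV can carry counts from several samples.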
By the way, this is not the complex part of the dada2 pipeline. At this point dada2 already knows the ASVs in each sample.
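For the original question of which ASVs came from which samples, the counts table itself is enough: an ASV belongs to every sample where its count is non-zero. A small Python sketch using the example table from the question (the helper name `samples_per_asv` is hypothetical, and the table is typed in by hand rather than read from dada2 output):

```python
# Counts table from the question, keyed by sample, then by ASV.
counts = {
    "Sample_1": {"ASV_1": 0, "ASV_2": 100, "ASV_3": 5},
    "Sample_2": {"ASV_1": 1, "ASV_2": 10, "ASV_3": 200},
    "Sample_3": {"ASV_1": 10, "ASV_2": 20, "ASV_3": 30},
}


def samples_per_asv(counts):
    """Map each ASV to the list of samples in which its count is > 0."""
    result = {}
    for sample, row in counts.items():
        for asv, n in row.items():
            if n > 0:
                result.setdefault(asv, []).append(sample)
    return result


presence = samples_per_asv(counts)
for asv in sorted(presence):
    print(asv, presence[asv])
```

Here ASV_1 is reported for Sample_2 and Sample_3 only, because its count in Sample_1 is 0.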