Question

When counting ASVs in DADA2, does a hit/count only occur at 100% alignment percent identity?

2

Entering edit mode

16 months ago

O.rka ▴ 740

I was asked to identify which ASVs came from which samples. My understanding is that an ASV can be present in multiple samples but then that led me to the question of whether or not an ASV hit in a sample corresponds to 100% alignment percent identity. That is, are they only EXACT matches in DADA2 when it's building the counts table?

For example, if you had the following table:

    ASV_1   ASV_2   ASV_3
Sample_1    0   100 5
Sample_2    1   10  200
Sample_3    10  20  30

Consider ASV_2 to be an amplicon sequence variant, GTCATGCATGCAGAGAGACGAGTCA

Would that mean that 100 paired-end reads from Sample_1 mapped to ASV_2 with this EXACT sequence at 100% identity?

That would also mean there were 10 paired-end reads from Sample_2 and 20 paired-end reads from Sample_3 that mapped at 100% identity.

That is, 100, 10, and 20 paired reads aligned at 100% identity to ASV_1 from the different samples.

metagenomics qiime2 amplicon 16s dada2 • 1.0k views

ADD COMMENT • link updated 16 months ago by andres.firrincieli 3.8k • written 16 months ago by O.rka ▴ 740

0

Entering edit mode

That would also mean there were 10 paired-end reads from Sample_2 and 20 paired-end reads from Sample_3 that mapped at 100% identity.

I don't think mapped is the right term for what dada2 actually does. The ASV is a product of a single read-pair. The fact that one ASV is found in two different samples is because read-pairs from these samples generate the exact same sequence.

ADD REPLY • link 16 months ago by andres.firrincieli 3.8k

0

Entering edit mode

That makes sense and thank you for clarifying. I know the implementation is much more sophisticated, but is this the general idea of how it works?

# Create a dictionary of dictionaries where each value is an integer
counts_table = defaultdict(lambda: defaultdict(int))

# Samples to sequences
samples_to_sequences = {"sample_a":[seq_1, seq_2, etc}

# Iterate through samples and sequences 
for id_sample, asv_sequences in samples_to_sequences.items():
     for seq in asv_sequences:
          id_hash = hash(seq)
          counts_table[id_sample][id_hash] += 1

# Reformat table
counts_table = pd.DataFrame(counts_table)

ADD REPLY • link 16 months ago by O.rka ▴ 740

1

Entering edit mode

The approach used by dada2 is a little bit different. Basically, for each sample you create a sample-by-sequence matrix by counting how many time a ceraint sequense appear in that sample (dada2 does that by collapsing identical sequence) and then merge the sample-by-sequence matrices into a sigle matrix ("OTU" table). See multiSample.R for a detailed description of the workflow.

By the way this is not the complex part of the dada2 pipeline. At this point dada2 already knows the ASV per sample

ADD REPLY • link 16 months ago by andres.firrincieli 3.8k