Hello All,
I would like to set up a pipeline to analyse 16S rRNA data, but I am not sure whether to include a duplicate-removal step. The aim of metagenomics is to compare samples and see how the microbiota differs between samples from different conditions/time points.
After alignment, each sequence will match a sequence in a database, which is then mapped to a bacterium and assigned a taxonomy. The aim is to find how the microbial content changes in each sample. If I remove duplicates, each sequence will be represented only once, so the taxonomy for that organism will be assigned only once. That means I am losing data here, namely which bacteria are present in a sample and in what quantity.
I am not sure whether I am understanding this correctly. I would appreciate your comments.
Regards,
Bioinfguy
Thanks for the reply. However, I found that the mothur pipeline does remove duplicates! If you search for "unique.seqs" in the link you sent, you will find it. I am now confused about why they do this, given that it makes us lose important data.
I don't think so. They still count the total number of reads. They remove the duplicates to make the algorithm run faster, but the count stays the same. Look at the later steps, they always have these
Yes, they do keep the total number of reads in the statistics, but the output fasta file will contain only one representative sequence for each group of duplicates. That representative will be aligned to the reference only once, right?
Yep. The output fasta has only the unique sequences, but the count file keeps track of how many reads each one represents, so in the end the duplicates are still counted.
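To make the idea concrete, here is a minimal Python sketch of dereplication that keeps a count per representative. It is only an illustration of the principle, not mothur's actual implementation or file format; the function name `dereplicate` and the toy reads are made up for the example.

```python
def dereplicate(records):
    """Collapse identical sequences while remembering how many reads each one had.

    records: iterable of (read_id, sequence) pairs.
    Returns (unique, counts):
      unique: list of (representative_id, sequence), one entry per distinct sequence
      counts: dict mapping representative_id -> number of identical reads
    """
    rep_for_seq = {}  # sequence -> id of the first read seen with that sequence
    counts = {}       # representative id -> number of reads it stands for
    for read_id, seq in records:
        seq = seq.upper()  # treat lower/upper case bases as the same sequence
        rep = rep_for_seq.setdefault(seq, read_id)
        counts[rep] = counts.get(rep, 0) + 1
    unique = [(rep, seq) for seq, rep in rep_for_seq.items()]
    return unique, counts


# Hypothetical toy data: three copies of one sequence, one copy of another.
reads = [("r1", "ACGT"), ("r2", "ACGT"), ("r3", "TTGA"), ("r4", "acgt")]
unique, counts = dereplicate(reads)

for rep, seq in unique:
    print(rep, seq, counts[rep])
```

Only the entries in `unique` need to be aligned, but the counts still add up to the original number of reads, so no abundance information is lost.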
So in this case I would have to continue the alignment in mothur. I was planning to use my custom pipeline, where I take the fasta file output by unique.seqs, BLAST it with a self-made tool, and then proceed with the rest of the pipeline. In that case I have to take the count file into account to preserve abundance.
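A rough Python sketch of how a custom pipeline could restore abundance after classifying only the representative sequences. It assumes two hypothetical inputs: a dict of taxonomy assignments per representative and a dict of read counts per representative (for example, parsed from a count file). The names and values below are invented for illustration.

```python
from collections import Counter


def taxon_abundance(assignments, counts):
    """Sum read counts per taxon.

    assignments: dict mapping representative_id -> taxon name
    counts:      dict mapping representative_id -> number of reads it represents
    A representative missing from counts raises KeyError instead of being
    silently counted once.
    """
    abundance = Counter()
    for rep_id, taxon in assignments.items():
        abundance[taxon] += counts[rep_id]
    return abundance


# Hypothetical example values, not real results.
assignments = {"r1": "Bacteroides", "r3": "Lactobacillus", "r7": "Bacteroides"}
counts = {"r1": 3, "r3": 1, "r7": 5}

abundance = taxon_abundance(assignments, counts)
for taxon, n in abundance.most_common():
    print(taxon, n)
```

Each taxon is weighted by the number of reads behind its representatives rather than by the number of unique sequences.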
You don't have to use mothur, that was just an example. You can use QIIME; it has many ready-made pipelines, and you can also modify them. Keep in mind that clustering and aligning in QIIME differ from mothur. I don't know how big your data is, but BLASTing it might not be such a good idea ;)
I have heard of QIIME, but I had problems trying to install it. Is there an easier way to install it than using a virtual machine?
Download "anaconda" and install QIIME using pip; it is easy.