Hi,
I am trying to figure out the necessary pre-processing steps before using sequencing data retrieved from online databases. I work with metagenomes from the human gut microbiome. I have figured out that the three main steps for this type of data are:
1) Identify and mask human reads
2) Remove duplicate reads
3) Trim low-quality bases
Here is an example of a study from which I would like to use data.
I can't figure out at what stage those data are. I believe human read masking should have been performed already, since it relates to subject privacy/ethics, but I can't find clear information telling me whether this is the case. Are sequencing data available in online repositories always already cleared of human reads?
Thank you, Camille
Assume that the provided data are raw if there are no notes about them having been processed. If all reads are the same length, they have most likely not even been adapter-scanned or quality-trimmed, since trimming produces reads of variable length.
I suggest that you use the removehuman decontamination protocol from the BBMap suite. Other tools in the suite cover the remaining steps: clumpify.sh will help you remove duplicates, and bbduk.sh will help you trim the data. Rough example commands for all three steps are sketched below.
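For illustration, here is a minimal sketch of the three steps with these tools, assuming paired-end gzipped FASTQ input (reads_1.fq.gz / reads_2.fq.gz) and a local human reference FASTA (human_ref.fa.gz); every file name and the minid / trimq / minlen cutoffs are placeholder values to adapt to your data, not fixed recommendations.

# 1) Human read removal: map against a human reference and keep only
#    the unmapped (non-human) read pairs
bbmap.sh in=reads_1.fq.gz in2=reads_2.fq.gz ref=human_ref.fa.gz minid=0.95 \
    outu=nohuman_1.fq.gz outu2=nohuman_2.fq.gz

# 2) Duplicate removal
clumpify.sh in=nohuman_1.fq.gz in2=nohuman_2.fq.gz \
    out=dedup_1.fq.gz out2=dedup_2.fq.gz dedupe

# 3) Quality trimming: trim both read ends to Q10 and drop reads shorter than 50 bp
bbduk.sh in=dedup_1.fq.gz in2=dedup_2.fq.gz \
    out=trimmed_1.fq.gz out2=trimmed_2.fq.gz \
    qtrim=rl trimq=10 minlen=50

Run each script without arguments to see its full parameter list; as far as I know, the dedicated removehuman.sh wrapper in the suite is preconfigured for JGI's internal reference paths, which is why the generic bbmap.sh call is shown above.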
Great, thanks for the tool recommendations!