19 days ago
Maxwell
I'm wondering whether publicly available microbiome data have already had host sequences removed before being uploaded to the SRA or other public databases.
Is this typical, or is it not uniform?
In my experience most datasets are host-removed. Is this correct?
Thanks for any help.
Bioinformatician's creed #1:
Always be skeptical of others' data.
Yeah, well, that's why I'm asking, because you can't exactly check 300,000 accessions if that's your analysis.
I'm wondering whether this is what is done in the publicly available data or not. So you're saying that it's variable, then?
For example, this is on the SRA website, but it's not clear whether it applies to shotgun data, amplicon data, or both. I think it's both, but I'm not sure:
Metagenomic data: Human metagenomic studies may contain human sequences and require that the donor provide consent to archive their data in an unprotected database. If you would like to archive human metagenomic sequences in the public SRA database please contact the SRA and we will screen and remove human sequence contaminants from your submission.
So then maybe only human host reads are removed, sure, but what about other host species? Those are probably not removed because there are no privacy concerns?
Exactly, most institutes' ethics committees will not allow submission of human data to public repositories without broad consent for very valid privacy reasons.
I don't think it is such a problem for mice or other hosts, but the researchers might have screened the host reads out anyway, reasoning that they are not useful for other researchers.
I'm trying to get at a rigorous answer: what is typically done? Is this required for submission? How could a submission include both host and metagenome reads and still be labeled as metagenome? Is there a way to tell whether or not the uploaded data includes both?
I think saying that researchers might have screened the host reads out anyway isn't really getting at my question, unfortunately. I get that already!