Hi,
I am downloading samples from the MetaHIT project (metagenomes from faecal samples). The paper states that the 'raw illumina reads are deposited at ENA with accession number ERP003612', so there
However, when I download the files (submitted fastq), the names are of the form 'MetaHIT-MH0318_110425.clean.rmhost.1.fq.gz'.
The 'clean' and 'rmhost' make me wonder whether these reads have already been filtered and had contaminant DNA (especially human) removed. Are 'clean' and 'rmhost' common nomenclature, such that I can safely assume that the reads have already been filtered and that I can use them directly?
I am not a bioinformatician at all, so if possible I would like to avoid going through the whole filtering process.
Alternatively, can you recommend an all-in-one tool with which I could do this properly? I looked into BBMap but could not get it to work. I have also heard about MOCAT; does anyone have feedback on it?
Many thanks, Camille
Where do you see these file names? I am looking at one of the samples here and I only see regular ENA file names.
Thanks for your reply. I see those names on the downloaded files. In the link you sent, if you click on the link for the submitted fastq ('Fastq file 1'), the name of the file is 'MetaHIT-MH0001_081026.rmHuman.rmHost.&.fq.gz'. I get files with either 'rmHuman.rmHost' or 'clean.rmHost', depending on the sample.
Then you should refer to the paper/supplemental materials to see if there is additional information about what those names mean. Files under fastq FTP should be the original data. You could compare the pair and see how they differ. If you don't want to do the processing yourself then using the cleaned files may be the easier option.
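One rough way to compare the pair is to count reads and bases in each file. The sketch below is only an illustration, not a validated pipeline: it assumes plain 4-line FASTQ records (no wrapped sequence lines) compressed with gzip, and the function names `fastq_stats` and `compare` are made up for this example. If the submitted file contains fewer reads or bases than the ENA-generated one, that would suggest filtering happened somewhere between the two; identical counts would suggest the files carry the same reads.

```python
import gzip
import sys


def fastq_stats(path):
    """Return (read_count, total_bases) for a gzipped FASTQ file.

    Assumes strict 4-line records: header, sequence, '+', qualities.
    """
    reads = 0
    bases = 0
    with gzip.open(path, "rt") as handle:
        for i, line in enumerate(handle):
            # The sequence is the second line of every 4-line record.
            if i % 4 == 1:
                reads += 1
                bases += len(line.strip())
    return reads, bases


def compare(ena_path, submitted_path):
    """Summarise how two FASTQ files differ in read and base counts."""
    ena_reads, ena_bases = fastq_stats(ena_path)
    sub_reads, sub_bases = fastq_stats(submitted_path)
    return {
        "ena": (ena_reads, ena_bases),
        "submitted": (sub_reads, sub_bases),
        # Positive values mean the submitted file has fewer reads/bases.
        "reads_removed": ena_reads - sub_reads,
        "bases_removed": ena_bases - sub_bases,
    }


if __name__ == "__main__" and len(sys.argv) == 3:
    # Usage: python compare_fastq.py <ena_file.fastq.gz> <submitted_file.fq.gz>
    print(compare(sys.argv[1], sys.argv[2]))
```

Counts alone will not tell you what was removed (low-quality reads, host reads, or both), so the paper's supplementary methods remain the place to confirm what 'clean' and 'rmhost' mean.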