Question

State of processing in ENA, metaHIT samples

6.4 years ago

CAnna ▴ 20

Hi,

I am downloading samples from the MetaHIT project (metagenomes from faecal samples). The paper states that the 'raw illumina reads are deposited at ENA with accession number ERP003612', so there

However, when I download the files (submitted fastq), the names are of the form 'MetaHIT-MH0318_110425.clean.rmhost.1.fq.gz'.

The 'clean' and 'rmhost' make me wonder whether these reads have in fact already been filtered and had contaminant DNA (especially human DNA) removed. Are 'clean' and 'rmhost' common nomenclature, such that I can safely assume these reads have already been filtered and that I can use them directly?

I am not a bioinformatician at all, so if possible I would like to avoid going through the whole filtering process.

Alternatively, can you recommend an all-in-one tool with which I could do this properly? I looked into BBMap but could not get it to work. I have also heard about MOCAT; does anyone have feedback on it?

Many thanks, Camille

filtering metahit ENA • 1.3k views

Where do you see these file names? I am looking at one of the samples here and I only see regular ENA file names.

6.4 years ago by GenoMax 147k

Thanks for your reply. I see those names on the downloaded files. In the link you sent, if you click on the link for the submitted fastq ('Fastq file 1'), the name of the file is 'MetaHIT-MH0001_081026.rmHuman.rmHost.&.fq.gz'. I get files with either 'rmHuman.rmHost' or 'clean.rmHost', depending on the sample.
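For anyone who wants to check which naming variant each downloaded file uses, here is a minimal Python sketch. It assumes every submitted file follows the pattern seen in this thread (`MetaHIT-<sample>.<tag>.<tag>.<mate>.fq.gz`); the function name `processing_tags` is made up for illustration, and the tags only reflect what the file name says, not what was actually done to the reads.

```python
import re

# Assumed pattern, based only on the two file names quoted in this thread.
NAME_PATTERN = re.compile(r"^MetaHIT-(?P<sample>[^.]+)\.(?P<rest>.+)\.fq\.gz$")


def processing_tags(filename):
    """Split a MetaHIT-style submitted file name into (sample, tags, mate).

    The last dot-separated field before '.fq.gz' is treated as the mate
    label; everything between the sample ID and the mate is returned as
    lower-cased processing tags.
    """
    match = NAME_PATTERN.match(filename)
    if match is None:
        raise ValueError(f"unexpected file name: {filename!r}")
    parts = match.group("rest").split(".")
    mate = parts[-1]
    tags = [p.lower() for p in parts[:-1]]
    return match.group("sample"), tags, mate


for name in [
    "MetaHIT-MH0318_110425.clean.rmhost.1.fq.gz",
    "MetaHIT-MH0001_081026.rmHuman.rmHost.1.fq.gz",
]:
    print(processing_tags(name))
```

Grouping samples by the returned tag list would show how many files fall into the 'clean.rmhost' group and how many into the 'rmHuman.rmHost' group.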

6.4 years ago by CAnna ▴ 20

Then you should refer to the paper/supplemental materials to see if there is additional information about what those names mean. Files under fastq FTP should be the original data. You could compare the pair and see how they differ. If you don't want to do the processing yourself then using the cleaned files may be the easier option.
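One simple way to compare the pair is to count the reads in the submitted file and in the ENA fastq file for the same run. The sketch below uses only the Python standard library and assumes plain four-line FASTQ records in gzipped files; the function names and the paths in the usage comment are hypothetical.

```python
import gzip


def count_reads(path):
    """Count records in a gzipped FASTQ file, assuming 4 lines per record."""
    with gzip.open(path, "rt") as handle:
        n_lines = sum(1 for _ in handle)
    if n_lines % 4 != 0:
        raise ValueError(f"{path}: line count {n_lines} is not a multiple of 4")
    return n_lines // 4


def compare_read_counts(submitted_path, generated_path):
    """Return the read counts of both files and their difference."""
    submitted = count_reads(submitted_path)
    generated = count_reads(generated_path)
    return {
        "submitted": submitted,
        "generated": generated,
        "difference": generated - submitted,
    }


# Example usage (hypothetical local paths, after downloading both files):
# print(compare_read_counts("submitted_1.fq.gz", "ena_1.fastq.gz"))
```

Equal counts would suggest that both files hold the same set of reads. That alone would not show whether filtering was done before submission, so the paper or the authors remain the better source for that.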

6.4 years ago by GenoMax 147k

score 0 · Accepted Answer · 2018-06-20

Thanks for your advice. Following it, I emailed the authors, who confirmed:

"All fastq files correspond to processed reads, i.e. filtered to remove both low quality reads and contaminant human sequences. I suppose that the terminology heterogeneity comes from the fact that these metagenomic samples were deposited by our colleagues from BGI at two different times, according to their use in two different publications"

I am posting this in case it helps other people in the future.