State of processing in ENA, metaHIT samples
6.4 years ago
CAnna ▴ 20

Hi,

I am downloading samples from the MetaHIT project (metagenomes from faecal samples). The paper states that the 'raw illumina reads are deposited at ENA with accession number ERP003612', so there

However when downloading the files (submitted fastq) the naming is of the form 'MetaHIT-MH0318_110425.clean.rmhost.1.fq.gz'.

The 'clean' and 'rmhost' tags make me wonder whether these reads have actually already been filtered and had contaminant DNA (especially human contaminants) removed. Are 'clean' and 'rmhost' common nomenclature, such that I can safely assume the reads have already been filtered and use them directly?

I am not a bioinformatician, so if possible I would like to avoid going through the whole filtering process.

Alternatively, can you recommend an all-in-one tool with which I could do this properly? I looked into BBMap but could not get it to work. I have also heard about MOCAT; does anyone have feedback on it?

Many thanks, Camille

filtering metahit ENA • 1.3k views

Where do you see these file names? I am looking at one of the samples here and I only see regular ENA file names.


Thanks for your reply. I see those names on the downloaded files. In the link you sent, if you click on the link for the submitted fastq ('Fastq file 1'), the name of the file is 'MetaHIT-MH0001_081026.rmHuman.rmHost.&.fq.gz'. I get files with either 'rmHuman.rmHost' or 'clean.rmHost', depending on the sample.
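To see which samples carry which naming variant, the downloaded file names can be grouped by the tags between the sample identifier and the mate number. This is only a rough sketch: the pattern `<sample>.<tags>.<mate>.fq.gz` is an assumption inferred from the two names quoted in this thread, and `processing_tags` and `group_by_tags` are hypothetical helper names. The tags describe only the naming, not what was actually done to the reads.

```python
import re
from collections import defaultdict

# Assumed layout, inferred from the names in this thread:
#   <sample>.<tag>[.<tag>...].<mate>.fq.gz
NAME_RE = re.compile(r"^(?P<sample>[^.]+)\.(?P<tags>.+)\.(?P<mate>[^.]+)\.fq\.gz$")


def processing_tags(filename):
    """Return the lower-cased tags in a file name, or None if it does not match."""
    match = NAME_RE.match(filename)
    if match is None:
        return None
    return tuple(tag.lower() for tag in match.group("tags").split("."))


def group_by_tags(filenames):
    """Map each tag combination to the list of file names that carry it."""
    groups = defaultdict(list)
    for name in filenames:
        groups[processing_tags(name)].append(name)
    return dict(groups)


if __name__ == "__main__":
    names = [
        "MetaHIT-MH0318_110425.clean.rmhost.1.fq.gz",
        "MetaHIT-MH0001_081026.rmHuman.rmHost.1.fq.gz",  # mate number assumed
    ]
    for tags, files in group_by_tags(names).items():
        print(tags, len(files))
```

Tags are lower-cased so that 'rmhost' and 'rmHost' fall into the same group; names that do not fit the assumed pattern end up under `None`.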


Then you should refer to the paper/supplemental materials to see if there is additional information about what those names mean. Files under fastq FTP should be the original data. You could compare the pair and see how they differ. If you don't want to do the processing yourself then using the cleaned files may be the easier option.
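One way to compare the pair is to count reads and bases in each file. The following is a minimal sketch, assuming both files are standard four-line FASTQ (gzip-compressed or plain) and have already been downloaded; the function names and the example paths are placeholders, not anything provided by ENA. Read names may differ between the submitted and the ENA-generated files, so only counts and lengths are compared.

```python
import gzip
from itertools import islice


def fastq_stats(path):
    """Count reads and bases in a four-line-per-record FASTQ file."""
    opener = gzip.open if str(path).endswith(".gz") else open
    reads = 0
    bases = 0
    min_len = None
    max_len = None
    with opener(path, "rt") as handle:
        while True:
            record = list(islice(handle, 4))  # header, sequence, '+', qualities
            if not record:
                break
            if len(record) < 4 or not record[0].startswith("@"):
                raise ValueError(f"malformed FASTQ record near read {reads + 1} in {path}")
            length = len(record[1].strip())
            reads += 1
            bases += length
            min_len = length if min_len is None else min(min_len, length)
            max_len = length if max_len is None else max(max_len, length)
    return {"reads": reads, "bases": bases, "min_len": min_len, "max_len": max_len}


def compare_fastq(path_a, path_b):
    """Return the stats of both files and the differences (a minus b)."""
    stats_a = fastq_stats(path_a)
    stats_b = fastq_stats(path_b)
    return {
        "a": stats_a,
        "b": stats_b,
        "read_difference": stats_a["reads"] - stats_b["reads"],
        "base_difference": stats_a["bases"] - stats_b["bases"],
    }


if __name__ == "__main__":
    # Placeholder paths: replace with the two files you downloaded.
    print(compare_fastq("submitted_1.fq.gz", "generated_1.fastq.gz"))
```

Identical read and base counts would suggest that the two files hold the same reads; they would not by themselves show whether filtering was applied before submission.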

6.4 years ago
CAnna ▴ 20

Thanks for your advice. Following it, I emailed the authors, who confirmed:

"All fastq files correspond to processed reads, i.e. filtered to remove both low quality reads and contaminant human sequences. I suppose that the terminology heterogeneity comes from the fact that these metagenomic samples were deposited by our colleagues from BGI at two different times, according to their use in two different publications"

I hope this helps other people in the future.
