Question

Help obtaining metahit metadata

11.1 years ago

tal.kr • 0

I really hope this is the correct place to ask.

I've been wanting to check something in the metahit sequencing data. I've looked at the data deposited here, which is referenced in this article by Nielsen et al.

My problem is that in the supplementary tables provided by this article, as well as in other articles from the metahit consortium, samples are named MHXXXX (e.g. MH0186) or use one of two other kinds of codes (such as O2.UC53.0 or V1.CD25.4). The sample files in the repository are named either ERRXXXXXXX or MHXXXX_<date>, and there are sometimes multiple files per MHXXXX, occasionally with dates that are very far apart.

Unlike with other publicly available datasets, I couldn't find any mapping between the two naming schemes, and I was wondering whether anyone could offer some advice on the matter.

public-data sequencing • 3.6k views

updated 2.5 years ago by Ram • written 11.1 years ago by tal.kr

I still face the same problem with unique gene naming, not only in the metahit data but also in the integrated gene catalog (IGC).

They use unique gene names (e.g. MH0321_GL0035043), unlike public databases.

So how did you manage to get common gene names from those unique names?

Thanks!

updated 3.8 years ago by Ram • written 7.0 years ago by ShahdEzzeldin

Ram · Answer 1 · 2014-07-10

11.1 years ago

Josh Herr 5.8k

I'm not entirely sure what your question is: you want to download the METAHIT data, but you provided a link to the data repositories. Do you need to write a script to download the data from the FTP? Are you unsure of the differences between the data file types?

I think the problem you're facing is the conundrum of weird naming schemes for the archive data files. These names can be hard to understand, but the bottom line is that you can download the data (which you can from your link) and then you'll need to tie the sequence data to the metadata provided. It doesn't seem like you are accessing the metadata file from the database (the one in XML format with the sample data).

updated 3.8 years ago by Ram • written 11.1 years ago by Josh Herr

Hi Josh, I've had no problem at all downloading the data. My problem is exactly with tying the sequence data to the metadata. Which XML metadata file were you referring to? Perhaps that's what I'm missing. I was looking at the supplementary tables of the article for metadata.

written 11.1 years ago by tal.kr

Yes, the XML file will have the metadata for the samples. This XML file can be parsed. The sample names can be confusing, but you have to make sure you have all the sample information.
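As a rough illustration of parsing such an XML file, here is a minimal Python sketch. The tag and attribute names (`SAMPLE`, `accession`, `alias`, `SAMPLE_ATTRIBUTE`, `TAG`, `VALUE`) and the record itself are assumptions made for the example, so check them against the file you actually download.

```python
import xml.etree.ElementTree as ET

# Made-up record for illustration only; the real XML may use
# different tags, attributes, and values.
SAMPLE_XML = """<SAMPLE_SET>
  <SAMPLE accession="ERS000001" alias="MH0001">
    <SAMPLE_ATTRIBUTES>
      <SAMPLE_ATTRIBUTE><TAG>country</TAG><VALUE>Denmark</VALUE></SAMPLE_ATTRIBUTE>
    </SAMPLE_ATTRIBUTES>
  </SAMPLE>
</SAMPLE_SET>"""


def parse_samples(xml_text):
    """Return {accession: {"alias": ..., <tag>: <value>, ...}} for each SAMPLE."""
    root = ET.fromstring(xml_text)
    samples = {}
    for sample in root.iter("SAMPLE"):
        record = {"alias": sample.get("alias")}
        # Collect every TAG/VALUE pair attached to this sample.
        for attr in sample.iter("SAMPLE_ATTRIBUTE"):
            tag = attr.findtext("TAG")
            if tag is not None:
                record[tag] = attr.findtext("VALUE")
        samples[sample.get("accession")] = record
    return samples


samples = parse_samples(SAMPLE_XML)
```

The resulting dictionary can then be joined to the supplementary tables on whichever field holds the MHXXXX-style name.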

updated 3.8 years ago by Ram • written 11.1 years ago by Josh Herr

Ram · Answer 2 · 2014-07-14

If you download the "TEXT" file from the ENA, you can get the sample names in the metadata file from the second-to-last column (some minor parsing needed). As for the dates being far apart, that is true. They sampled the participants at intervals averaging about 6 months, I believe (I only skimmed the article the day it came out).
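A minimal Python sketch of that parsing step follows. It assumes the TEXT file is tab-separated with a header row and that the run accession is in the first column; the column names and rows below are invented placeholders, not real ENA output.

```python
import csv
import io

# Invented rows standing in for the ENA "TEXT" report; the real
# column names, order, and values may differ.
REPORT = (
    "run_accession\tfastq_ftp\tsample_name\tsample_title\n"
    "ERR0000001\tplaceholder_path_1\tMH0001\tplaceholder\n"
    "ERR0000002\tplaceholder_path_2\tV1.UC57-0 \tplaceholder\n"
)


def run_to_sample(report_text, column=-2):
    """Map the first column (assumed run accession) to the chosen column."""
    reader = csv.reader(io.StringIO(report_text), delimiter="\t")
    next(reader, None)  # skip the header row
    mapping = {}
    for row in reader:
        if len(row) < abs(column):
            continue  # skip blank or truncated lines
        # Strip stray whitespace around the sample name.
        mapping[row[0]] = row[column].strip()
    return mapping


mapping = run_to_sample(REPORT)
```

For a real file, read it with `open(path, newline="")` and pass the contents in; adjust `column` if the sample name turns out to sit elsewhere.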

Oh, and you should probably do fuzzy matching between them. For some reason "V1.UC57.0" turned into "V1.UC57-0" in the ENA TEXT file.
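One way to sketch that matching in Python is to normalize the separators first and fall back to `difflib` only when there is no exact match after normalization. The function names and the 0.9 cutoff are arbitrary choices for this example, not anything prescribed by the ENA or the paper.

```python
import difflib
import re


def normalize(name):
    """Treat '.', '-', '_' and whitespace as one separator and ignore case."""
    return re.sub(r"[.\-_\s]+", ".", name.strip()).upper()


def match_names(paper_names, ena_names, cutoff=0.9):
    """Map each name from the paper to the closest ENA name, or None."""
    by_norm = {normalize(n): n for n in ena_names}
    matches = {}
    for name in paper_names:
        key = normalize(name)
        if key in by_norm:
            # Exact match once separators are normalized.
            matches[name] = by_norm[key]
            continue
        # Fallback: closest normalized name above the similarity cutoff.
        close = difflib.get_close_matches(key, list(by_norm), n=1, cutoff=cutoff)
        matches[name] = by_norm[close[0]] if close else None
    return matches


result = match_names(["V1.UC57.0", "ZZZ"], ["V1.UC57-0", "MH0186"])
```

Because many sample codes differ by a single character, any pair that comes from the `difflib` fallback rather than the exact normalized match is worth checking by hand.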

I hope that helps! Otherwise I really suggest contacting the corresponding author.