Question

Forum: How do you process your reference data? (in genomics, metabolomics, transcriptomics, proteomics?)

0

7.9 years ago

Chloe Riou ▴ 40

Hi everyone,

I need your help and advice (the actual question is at the end of this paragraph, for the really busy). I am currently working on BioMAJ, a workflow engine dedicated to data synchronization and processing. Its purpose is to manage and supervise databank updates; it can also run pre- or post-processing on each bank (GenBank, EMBL, DDBJ, PDB...). This saves a considerable amount of time: you do not have to remember when to check a databank for updates, or keep track of which version you have locally. BioMAJ aims to adapt to new ways of using databases, and to do this I need your advice as a bioinformatician/bioinformaticist, biologist, or anyone interested in the question:

How do you process your reference databank or reference genome? (for example BLAST, indexing, extraction of gene sets, RDF/SPARQL queries...) Which data types do you work with?

Thank you for reading this post. I would be grateful for each of your contributions!

https://github.com/genouest/biomaj/wiki

OMICS Databanks • 2.1k views

ADD COMMENT • link updated 18 months ago by Ram 44k • written 7.9 years ago by Chloe Riou ▴ 40

6

How do you season your data?

Spicy all the way!

On a more serious note: What exactly did you want to ask there? How does one pre-process data?

ADD REPLY • link 7.9 years ago by GenoMax 146k

0

I was waiting for this one! And really?

I would like to know which operations you usually run on your reference data. For example, if your reference data is a genome, do you extract a set of genes? A chromosome? Do you index it? Do you BLAST it? Another example: if you download all of GenBank, what do you do with the data? Do you extract part of it with RDF/SPARQL? Is that clearer?

ADD REPLY • link 7.9 years ago by Chloe Riou ▴ 40

0

Ideally you would not want to mess with the reference (in terms of changing things). When available from an authoritative source (e.g. NCBI, Ensembl, UCSC) I would get the source data (sequence, indexes) as is. If there are derived things needed (e.g. BBMap indexes) then I build them using the reference files mentioned before. This is pretty much a do it once and not repeat until absolutely needed thing.

There are tools to extract information about a gene/chromosome that can be run on the fly (e.g. bedtools, eutils) so it does not make sense to precompute those things.
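As a rough illustration of the on-the-fly idea (this is not how bedtools or eutils work internally), here is a minimal pure-Python sketch that pulls a region out of a FASTA file at request time instead of precomputing it. The function name `extract_region` and the BED-style 0-based, half-open coordinates are assumptions made for the example.

```python
def extract_region(fasta_path, name, start, end):
    """Return bases [start, end) of the FASTA record called `name`.

    Coordinates are 0-based and half-open (an assumption, chosen to
    match the BED convention). The file is streamed line by line, so
    nothing is precomputed or stored.
    """
    pieces = []
    in_record = False
    found = False
    offset = 0  # number of bases of the wanted record seen so far
    with open(fasta_path) as handle:
        for raw in handle:
            line = raw.strip()
            if line.startswith(">"):
                if in_record:
                    break  # the wanted record has ended
                # The record name is the first word after '>'
                in_record = line[1:].split(None, 1)[:1] == [name]
                found = found or in_record
                continue
            if not in_record:
                continue
            line_start, line_end = offset, offset + len(line)
            if line_end > start and line_start < end:
                pieces.append(line[max(start - line_start, 0):end - line_start])
            offset = line_end
            if offset >= end:
                break  # everything requested has been read
    if not found:
        raise KeyError(f"record {name!r} not found in {fasta_path}")
    return "".join(pieces)
```

For large genomes a dedicated tool with an index will be much faster; the point is only that a region can be derived from the untouched reference whenever it is needed.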

If you start processing the reference data internally then you get on a slippery slope. You would need to redo that processing every time a new version of the reference data comes out (e.g. every night with GenBank).
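One way to keep the "build once, repeat only when needed" rule honest is to record a checksum of the reference next to the derived index and rebuild only when it changes. The sketch below assumes a hypothetical stamp file holding that checksum; the helper names are illustrative, not part of any tool mentioned in this thread.

```python
import hashlib
from pathlib import Path


def file_sha256(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def needs_rebuild(reference, stamp):
    """True if no index was recorded yet or the reference has changed."""
    stamp = Path(stamp)
    if not stamp.exists():
        return True
    return stamp.read_text().strip() != file_sha256(reference)


def mark_built(reference, stamp):
    """Record the checksum of the reference the index was built from."""
    Path(stamp).write_text(file_sha256(reference) + "\n")
```

Typical use would be to call `needs_rebuild` before running the indexer, and `mark_built` only after the indexer has finished successfully.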

ADD REPLY • link 7.9 years ago by GenoMax 146k

0

I hear a James Bond dialogue here:

Villain: How would you like your data, Sir?
Bond: I like my data how I like my cars. Fast, wild and European.

ADD REPLY • link 7.9 years ago by Istvan Albert 101k

0

Et tu Istvan? Then fall NCBI.

On a serious note, do you prefer EMBL over NCBI? Any reason?

ADD REPLY • link 7.9 years ago by Ram 44k

1

Yes he has been on record saying so .. for annotations at least :)

ADD REPLY • link 7.9 years ago by GenoMax 146k

1

hey it is Bond that said it not me :-)

But yes, as genomax2 stated, I think human gene annotations are better at Ensembl. And a lot of data is much easier to obtain from their FTP sites.

ADD REPLY • link 7.9 years ago by Istvan Albert 101k

1

Hi- Just a suggestion... From the github wiki you link I read:

BioMAJ (BIOlogie Mise A Jour) is a workflow engine dedicated to data synchronization and processing. The Software automates the update cycle and the supervision of the locally mirrored databank repository.

I think there is a little too much jargon and technicality here, especially if you are trying to attract an audience that is not very familiar with the topic. Maybe a more gentle explanation would help...?

ADD REPLY • link 7.9 years ago by dariober 15k

0

Thanks for your remark. BioMAJ is software for updating and managing databanks on your own computer. For example, if you want to download every new version of the human genome, you can do that with BioMAJ: it checks whether a newer version of the genome is available and downloads it automatically. Maybe I should remove the link, because I just want information about the processing that users run on their databanks, not really to talk about BioMAJ.

ADD REPLY • link 7.9 years ago by Chloe Riou ▴ 40

1

Umm -- that's not a good thing man. I don't want my reference genome updated in the middle of an analysis :-/ Gosh, on some projects I'm involved in we're using software and data that is over 5 years old. Deliberately.

ADD REPLY • link 7.9 years ago by John 13k

3

More precisely, with BioMAJ you choose which version you want to use, and you can deliberately keep the old one (it will "publish" the new version only if you want it to).
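To make the idea of keeping several versions and "publishing" one of them concrete, here is a minimal sketch. It assumes a hypothetical layout of one directory per version plus a `current` symlink, on a POSIX filesystem; it illustrates the general pattern and is not a description of BioMAJ's actual implementation.

```python
import os
from pathlib import Path


def publish(bank_dir, version):
    """Point `<bank_dir>/current` at `<bank_dir>/<version>`.

    Older version directories are left untouched, so an analysis can
    keep using a pinned version by referring to it explicitly.
    The layout and the name `current` are assumptions for this sketch.
    """
    bank_dir = Path(bank_dir)
    target = bank_dir / version
    if not target.is_dir():
        raise FileNotFoundError(f"no such version directory: {target}")
    tmp_link = bank_dir / "current.tmp"
    if tmp_link.is_symlink() or tmp_link.exists():
        tmp_link.unlink()
    # Create the new link under a temporary name, then rename it over
    # the old one, so readers never see a missing `current`.
    os.symlink(version, tmp_link)
    os.replace(tmp_link, bank_dir / "current")
    return bank_dir / "current"
```

Pipelines that must not change mid-analysis would use the versioned path directly, while interactive users could follow `current`.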

ADD REPLY • link 7.9 years ago by Chloe Riou ▴ 40

1

Ah, ok, that's pretty convenient :)

ADD REPLY • link 7.9 years ago by John 13k

0

How do you process your reference databank?

I'm not sure if I understood this correctly, but a reference genome is indexed for alignment tools such as bwa/tophat2/hisat2/STAR..., so that's a 'useful' processing step.
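For illustration, the indexing step could be scripted roughly as below. The sketch only assembles the command lines and does not run them. The file names and the `_star` directory suffix are placeholders, and tophat2 is represented by `bowtie2-build` on the assumption that it aligns against a Bowtie 2 index.

```python
import shlex


def index_commands(fasta, out_prefix):
    """Return one index-building command (as an argv list) per aligner.

    `fasta` and `out_prefix` are placeholder names for this sketch.
    """
    return {
        "bwa": ["bwa", "index", fasta],
        "bowtie2": ["bowtie2-build", fasta, out_prefix],
        "hisat2": ["hisat2-build", fasta, out_prefix],
        "STAR": [
            "STAR", "--runMode", "genomeGenerate",
            "--genomeDir", out_prefix + "_star",
            "--genomeFastaFiles", fasta,
        ],
    }


if __name__ == "__main__":
    # Print the commands instead of running them.
    for tool, argv in index_commands("genome.fa", "genome").items():
        print(tool, "->", shlex.join(argv))
```

Each list can be passed to `subprocess.run(argv, check=True)` once the corresponding tool is installed.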

ADD REPLY • link 7.9 years ago by WouterDeCoster 47k

0

Yes, this is one of the possible answers. Is that what you do with reference data? Do you do other types of processing? Which domain do you work in? (genomics, metabolomics, transcriptomics, proteomics, etc.)

ADD REPLY • link 7.9 years ago by Chloe Riou ▴ 40