Hi everyone,
I need your help and advice (the ###question### is at the end of the paragraph, for the really busy). I am currently working on BioMAJ, a workflow engine dedicated to data synchronization and processing. Its purpose is to manage and supervise databank updates; it can also run the pre- or post-processing steps for each bank (GenBank, EMBL, DDBJ, PDB...). This management saves a considerable amount of time: you do not have to think about when to check a databank for updates or which version of it you have locally. BioMAJ aims to adapt to new uses of databases, and to do this I need your advice as a bioinformatician/bioinformaticist, biologist, or anyone interested in the question:
How do you process your reference databank or reference genome (for example BLAST, indexing, extraction of gene sets, queries in RDF/SPARQL...)? Which data types do you work with?
Thank you for reading this post. I would be grateful for each of your contributions!
Spicey all the way!
On a more serious note: What exactly did you want to ask there? How does one pre-process data?
I was waiting for this one! And really?
I would like to know which operations you usually run on your reference data. For example, if your reference data is a genome, do you extract a set of genes? A chromosome? Do you index it? Do you BLAST it? Another example: if you download all of GenBank, what will you do with the data? Will you extract part of it with RDF/SPARQL? Is that clearer?
Ideally you would not want to mess with the reference (in terms of changing things). When available from an authoritative source (e.g. NCBI, Ensembl, UCSC) I would get the source data (sequence, indexes) as is. If there are derived things needed (e.g. BBMap indexes) then I build them using the reference files mentioned before. This is pretty much a do it once and not repeat until absolutely needed thing.
There are tools to extract information about a gene/chromosome that can be run on the fly (e.g. bedtools, eutils) so it does not make sense to precompute those things.
If you start processing the reference data internally, you get on a slippery slope. You would need to redo this every time a new version of the reference data comes out (e.g. with GenBank, every night).
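To make the "build derived files once, redo only when the reference changes" idea concrete, here is a minimal Python sketch. It assumes a checksum stamp file kept next to the reference; the helper names (`file_sha256`, `needs_rebuild`, `record_build`) and the stamp file are hypothetical and are not part of BioMAJ or any of the tools mentioned above.

```python
import hashlib
from pathlib import Path


def file_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def needs_rebuild(reference: Path, stamp: Path) -> bool:
    """True if nothing was built yet or the reference changed since then."""
    if not stamp.exists():
        return True
    return stamp.read_text().strip() != file_sha256(reference)


def record_build(reference: Path, stamp: Path) -> None:
    """Store the reference checksum after a successful build (hypothetical stamp file)."""
    stamp.write_text(file_sha256(reference) + "\n")
```

The intended use would be to call `needs_rebuild` before an expensive indexing step and `record_build` right after it succeeds, so an unchanged reference never triggers a rebuild.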
I hear a James Bond dialogue here:
Et tu Istvan? Then fall NCBI.
On a serious note, do you prefer EMBL over NCBI? Any reason?
Yes he has been on record saying so .. for annotations at least :)
hey it is Bond that said it not me :-)
But yes, as genomax2 stated, I think human gene annotations are better at Ensembl. And a lot of data is much easier to obtain from their FTP sites.
Hi- Just a suggestion... From the github wiki you link I read:
I think there is a little too much jargon and technicality here, especially if you are trying to attract an audience that is not very familiar with it. Maybe a gentler explanation would help...?
Thanks for your remark. BioMAJ is software for updating and managing your databanks on your computer. For example, if you want to download every new version of the human genome, you can do that with BioMAJ: it will check whether there is a newer version of the genome and download it automatically. Maybe I could remove the link, because I just want information about the processes users run on their databanks, not really to talk about BioMAJ.
Umm -- that's not a good thing man. I don't want my reference genome updated in the middle of an analysis :-/ Gosh, on some projects I'm involved in we're using software and data that are over 5 years old. Deliberately.
More precisely, with BioMAJ you choose which version you want to use, and you can deliberately keep the old one (it will "publish" the new version only if you want it to).
Ah, ok, that's pretty convenient :)
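As a rough illustration of what "publishing" a chosen version can look like on disk, here is a small Python sketch. It is an assumption for illustration only, not BioMAJ's actual implementation: it supposes one directory per release inside a bank directory and a `current` symlink that analyses read from. The `publish` function and the directory layout are hypothetical.

```python
import os
from pathlib import Path


def publish(bank_dir: Path, version: str, link_name: str = "current") -> Path:
    """Point bank_dir/<link_name> at bank_dir/<version>; return the link path."""
    target = bank_dir / version
    if not target.is_dir():
        raise FileNotFoundError(f"no such release directory: {target}")
    link = bank_dir / link_name
    tmp_link = bank_dir / f".{link_name}.tmp"
    if tmp_link.is_symlink():
        tmp_link.unlink()  # clean up a leftover from an interrupted run
    # A relative target keeps the link valid if bank_dir is moved.
    os.symlink(version, tmp_link)
    # Renaming over the old link swaps versions in one step on POSIX systems.
    os.replace(tmp_link, link)
    return link
```

With this layout, older releases stay on disk untouched, and an analysis that needs a frozen version can point at its release directory directly instead of at `current`.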
I'm not sure if I understood this correctly, but a reference genome is indexed for alignment tools such as bwa/tophat2/hisat2/STAR..., so that's a 'useful' processing step.
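For readers unfamiliar with that step, here is a hedged Python sketch that only assembles the usual index-building command lines for a few of these aligners. The `index_command` helper and the output layout are hypothetical; the commands themselves (`bwa index`, `bowtie2-build`, `hisat2-build`, `STAR --runMode genomeGenerate`) are the standard entry points, but real runs often need extra options such as thread counts or annotation files.

```python
from pathlib import Path
from typing import List


def index_command(tool: str, fasta: Path, out_dir: Path) -> List[str]:
    """Return the argument list that builds an index of `fasta` for `tool`."""
    prefix = str(out_dir / fasta.stem)
    if tool == "bwa":
        return ["bwa", "index", "-p", prefix, str(fasta)]
    if tool == "bowtie2":  # TopHat2 aligns against Bowtie 2 indexes
        return ["bowtie2-build", str(fasta), prefix]
    if tool == "hisat2":
        return ["hisat2-build", str(fasta), prefix]
    if tool == "star":
        return ["STAR", "--runMode", "genomeGenerate",
                "--genomeDir", str(out_dir),
                "--genomeFastaFiles", str(fasta)]
    raise ValueError(f"unsupported tool: {tool}")
```

The returned list can be passed to `subprocess.run(cmd, check=True)` once the corresponding tool is installed.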
Yes, this is one of the possible answers. Is that what you do with reference data? Do you run other types of processing? In which domain do you work (genomics, metabolomics, transcriptomics, proteomics, etc.)?