Question

store sequence in a local database

0

10.8 years ago

ravihansa82 ▴ 130

Dear friends, I have a set of about 4000 sequences in FASTA format, as shown below:

>ENSG00000127837;ENST00000248450;AAMP;sequence length: 431
GTGAGAACTGCCGCTCCTCAGGCCATGGGACAGGAGACGCTCACCCCTGGCCTCTGACTCCTGCTT

I need to store these sequences for further analysis. I tried to store them in MS Access, but I could not get each sequence into a single row, and I could not extract the sequence length (for example, 431 above). When I import the FASTA file into Access, the whole "sequence length: 431" part comes in as one field, and the DNA sequence is not placed in one row but is split across several rows in the first column.

I need to store each sequence as one record, in the following format. Please give me your advice and suggestions.

gene ID            transcript ID      gene symbol    sequence length    sequence
ENSG00000127837    ENST00000248450    AAMP           431                GTGAGAACTGCC

windows sequence database • 5.2k views

ADD COMMENT • link updated 3.4 years ago by Ram 45k • written 10.8 years ago by ravihansa82 ▴ 130

3

Do you just want one file containing all your different FASTA sequences, or do you want a real database (e.g. MySQL)?

In my opinion one file should be sufficient, since 4000 sequences are not that many.

Can you please tell us what your preferred programming language is, if you program at all?

ADD REPLY • link 10.8 years ago by linus ▴ 360

0

Thanks for the reply, friend. Yes, I need one file at the moment rather than putting the sequences into a real database. I work with Java, Perl and Python.

ADD REPLY • link 10.8 years ago by ravihansa82 ▴ 130

3

You need to show us a bit more about the import process and provide the SQL code or whatever you are using; I do not believe there is a 'FASTA importer' that comes with MS Access. As a general recommendation, I would resort to open-source solutions such as MySQL or PostgreSQL. These tools are far more widely accepted in the bioinformatics community, and so our ability to help is greater too.

Please consider whether your data needs to be stored in a database at all. Most likely, leaving small sequence data in FASTA files is sufficient for the purpose of any analysis. Storing sequence data in an Access database most likely provides no advantage you could utilize unless you have an extremely specific type of analysis pipeline in your company or department.

ADD REPLY • link updated 3.4 years ago by Ram 45k • written 10.8 years ago by Michael 55k

0

Thank you, friend.

I imported the set of sequences stored in the text file through Access: I imported them as a text file with ";" as the delimiter. Having read your comment here, I think I tried something that can't be done with Access, so I have to try another way, don't I?

ADD REPLY • link updated 3.4 years ago by Ram 45k • written 10.8 years ago by ravihansa82 ▴ 130

0

Thanks, friend.

Yes, actually this is part of my study. I have these sequences in FASTA format and want to work with them further; that is why I wanted to store the data in one file or database. My study is expected to extend towards pattern matching on these sequences (they are used to find a set of cis-elements), so I want to be able to readily identify the gene name, transcript name and length. That is why I want to store the data in a particular format, so that it is easy to look at and retrieve.

ADD REPLY • link updated 3.4 years ago by Ram 45k • written 10.8 years ago by ravihansa82 ▴ 130

Ram · Answer 1 · 2014-07-31

I do not see why you want a database for so little data.

Here is an idea/solution without any database, just a single file containing all your information:

It is quite simple. You store all your data in a CSV file. An example would be:

gene ID;           transcript ID;     gene symbol;    sequence length;    sequence
ENSG00000127837;   ENST00000248450;   AAMP;           431;                GTGAGAACTGCC

Without the alignment whitespace, of course; the semicolon is the only field separator.

This kind of file has two advantages. First, if your sequences are not too long, you can still do an easy lookup in Excel. Second, every proper programming language has a ready-made CSV reader/writer, giving you easy access to your data. For example, in Python there is: https://docs.python.org/2/library/csv.html
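As a rough illustration of that second point, here is a minimal sketch of reading such a file back with Python's standard csv module. It assumes the semicolon-delimited layout from the example above with the padding whitespace removed; the function name load_records and the choice to key records by transcript ID are mine, not part of any library.

```python
import csv


def load_records(csv_handle):
    """Read a semicolon-delimited sequence table into a dict keyed by transcript ID.

    Assumes the header row is exactly:
    gene ID;transcript ID;gene symbol;sequence length;sequence
    """
    records = {}
    for row in csv.DictReader(csv_handle, delimiter=";"):
        # csv returns every field as text, so convert the length explicitly.
        row["sequence length"] = int(row["sequence length"])
        records[row["transcript ID"]] = row
    return records
```

With a real file you would call it as `with open("sequences.csv", newline="") as fh: records = load_records(fh)` (the file name is a placeholder) and then look up, for example, `records["ENST00000248450"]["gene symbol"]`.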

So basically you need to parse your multi-FASTA files in, for example, Python and then write them into a single CSV file.
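A minimal sketch of that conversion, using only the standard library, could look like the following. It assumes every header follows the pattern from the question (gene ID;transcript ID;gene symbol;sequence length: N) and that a sequence may be wrapped over several lines. The function names are made up for illustration, and the length is taken from the header as-is rather than recomputed from the sequence.

```python
import csv


def parse_fasta(handle):
    """Yield (header, sequence) pairs from a multi-FASTA file handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)  # sequence lines are joined into one string
    if header is not None:
        yield header, "".join(chunks)


def header_to_fields(header):
    """Split 'GENE;TRANSCRIPT;SYMBOL;sequence length: N' into four fields."""
    gene_id, transcript_id, symbol, length_part = header.split(";")
    length = int(length_part.split(":")[1])
    return gene_id, transcript_id, symbol, length


def fasta_to_csv(fasta_handle, csv_handle):
    """Write one semicolon-delimited row per FASTA record."""
    writer = csv.writer(csv_handle, delimiter=";", lineterminator="\n")
    writer.writerow(
        ["gene ID", "transcript ID", "gene symbol", "sequence length", "sequence"]
    )
    for header, sequence in parse_fasta(fasta_handle):
        writer.writerow([*header_to_fields(header), sequence])
```

Usage would be along the lines of `with open("sequences.fasta") as fin, open("sequences.csv", "w", newline="") as fout: fasta_to_csv(fin, fout)`, where both file names are placeholders. A header that does not have exactly four semicolon-separated parts will raise a ValueError, which is probably what you want for catching malformed records early.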

I hope my idea helps. If you describe your use cases in more detail, we can probably give more or better advice.