How to create a DNA database that can be analyzed with R?
1
0
5.1 years ago
Gautier • 0

Hi, I'm not a computer scientist and I only have basic knowledge of bioinformatics, but I would like to create a private database holding all my DNA sequences obtained through next-generation sequencing. The idea is to be able to process all my sequences in RStudio. The database should be able to hold all of my sequences, which means a library of 400 samples. Each sample is constituted of 200 000 rows and 350 columns.

So how can I create such a database that is easy to manage and that I can query and analyse with R? Thank you in advance for your help.

sequencing DNA Database R NGS • 2.2k views
1

Is there any reason you want to go to the trouble of creating a new data/file format rather than using existing bioinformatics software and formats?

0

What kind of software would you recommend? I was wondering what the easiest way would be to have all my data structured in a single database that could be processed with R.

0

What do you want to achieve?

0

So, I'll have to:

  • Find similarities between the sequences of my samples
  • Calculate frequencies, indices and enrichment
  • Plot the results
  • Run statistical analyses

All of this will be based on the sequences of my samples.

0

Store the data in a database, e.g. SQLite; you can then import chunks of data using the sqldf package.

But I'd rather look for existing solutions (including non-R solutions).

What do the rows and columns represent, and what kind of data is it? If the files are in a standard format, there may be no need for a database: you can use fast readers and writers to access the data directly from the files; see data.table::fread and data.table::fwrite.
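
As a rough illustration of the file-based route, the sketch below reads per-sample files with data.table::fread and stacks them into one table that carries a sample identifier. The directory, the file names, and the columns (`sequence`, `n_reads`) are made up for the example; the thread does not say what the real rows and columns contain.

```r
library(data.table)

# Hypothetical setup: write two tiny per-sample CSV files to a temp directory.
# Real files would have about 200,000 rows and 350 columns each.
data_dir <- file.path(tempdir(), "samples_demo")
dir.create(data_dir, showWarnings = FALSE)
fwrite(data.table(sequence = c("ACGT", "TTGA"), n_reads = c(10L, 3L)),
       file.path(data_dir, "sample_A.csv"))
fwrite(data.table(sequence = c("ACGT", "GGCC"), n_reads = c(7L, 1L)),
       file.path(data_dir, "sample_B.csv"))

# Read every file and stack them, keeping the sample name in a column.
files <- list.files(data_dir, pattern = "\\.csv$", full.names = TRUE)
names(files) <- tools::file_path_sans_ext(basename(files))
all_samples <- rbindlist(lapply(files, fread), idcol = "sample")

# Example cross-sample query: in how many samples does each sequence occur?
shared <- all_samples[, .(n_samples = uniqueN(sample)), by = "sequence"]
```

With everything in one table, comparisons across samples no longer require knowing in advance which files to open. Whether 400 full-size samples fit in memory this way depends on the column types and the machine, so this is only a starting point.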

0

I heard that SQL was not the best base for a dynamic database (I will have to upload around 10 new samples per week). Moreover, I'm not familiar with SQL.

I could keep each file without any database, but with 400 samples I don't think 400 separate files will suit my analysis well. I also need to look for similarities between files without knowing in advance which files to compare.

I need to perform statistical analyses and I'm familiar with R. That's why I would like the files (or database) to be easy to access from R.

2

I heard that SQL was not the best base for a dynamic database

https://xkcd.com/285/

0

Isn't that the case? As I said, I'm not against using SQL, but I would like to start with the right tool so as not to lose time.

0

Based on your description ('a library of 400 samples. Each sample is constituted of 200 000 rows and 350 columns'), I don't see anything special about your datasets. I therefore assume any relational database, such as MySQL, will work. Plain CSV files would probably work too.

1
5.1 years ago

Use sqlite3 or any other SQL database and store your data with that SQL engine.
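
A minimal sketch of what that could look like from R, assuming the DBI and RSQLite packages (one common way to reach SQLite from R, not something named in this thread). The database path, the `reads` table, and its columns are hypothetical; the real schema depends on what the 350 columns hold.

```r
library(DBI)

# Hypothetical database file in a temp directory, recreated on each run.
db_path <- file.path(tempdir(), "sequences_demo.sqlite")
unlink(db_path)
con <- dbConnect(RSQLite::SQLite(), db_path)

# Helper (illustrative name): append one sample to a single long table.
# New weekly samples would be added the same way.
add_sample <- function(con, sample_id, df) {
  df$sample <- sample_id
  dbWriteTable(con, "reads", df, append = TRUE)
}

add_sample(con, "sample_A",
           data.frame(sequence = c("ACGT", "TTGA"), n_reads = c(10L, 3L)))
add_sample(con, "sample_B",
           data.frame(sequence = c("ACGT", "GGCC"), n_reads = c(7L, 1L)))

# Pull only the rows needed for an analysis instead of loading everything:
# here, sequences that occur in more than one sample.
shared <- dbGetQuery(con, "
  SELECT sequence,
         COUNT(DISTINCT sample) AS n_samples,
         SUM(n_reads)           AS total_reads
  FROM reads
  GROUP BY sequence
  HAVING COUNT(DISTINCT sample) > 1
")

dbDisconnect(con)
```

The query returns an ordinary data frame, so plotting and statistics can continue in R as usual. Only the SELECT statement needs SQL; loading data goes through `dbWriteTable`.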
