Hi, I'm not a computer scientist and I only have basic knowledge of bioinformatics. But I would like to create a private database with all my DNA sequences obtained through next-generation sequencing. The idea is to be able to process all my sequences through RStudio. The database should be able to hold all of my sequences, which means a library of 400 samples. Each sample consists of 200 000 rows and 350 columns.
So how can I create such a database that will be easy to manage and that I can call and analyse with R? Thank you in advance for your help.
Any reason you want to go to the trouble of creating a new data/file format rather than using existing bioinformatics software and formats?
What kind of software would you recommend? I was wondering what would be the easiest way to have all my data structured in a single database that could be processed with R.
What do you want to achieve?
So, I'll have to:
All of this will be based on the sequences of my sample.
Store the data in a database, e.g. SQLite; then you can import chunks of data using the sqldf package.
But I'd rather look for existing solutions (including non R solutions).
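For what it's worth, here is a minimal sketch of the SQLite idea. It uses the DBI and RSQLite packages instead of sqldf (a swap made here for illustration), and the table name `reads` and the columns `sample_id`, `position` and `coverage` are made up, since the thread does not say what the real columns are. Each sample is appended to the same table, and a query then pulls back only the chunk that is needed:

```r
library(DBI)

# The whole database is a single file on disk (a temporary file in this sketch).
db_path <- tempfile(fileext = ".sqlite")
con <- dbConnect(RSQLite::SQLite(), db_path)

# Stand-in for one sample; a real one would have ~200 000 rows and 350 columns.
# Column names are hypothetical.
make_sample <- function(id, n = 5) {
  data.frame(sample_id = id,
             position  = seq_len(n),
             coverage  = seq_len(n) * 10,
             stringsAsFactors = FALSE)
}

# Append samples one at a time, e.g. as new ones arrive each week.
# With append = TRUE the table is created on the first write.
for (id in c("S001", "S002", "S003")) {
  dbWriteTable(con, "reads", make_sample(id), append = TRUE)
}

# Import only a chunk: one sample, filtered on one column.
chunk <- dbGetQuery(
  con,
  "SELECT * FROM reads WHERE sample_id = ? AND coverage >= ?",
  params = list("S002", 30)
)

dbDisconnect(con)
```

Only the rows matching the query are loaded into R, so the full set of samples never has to fit in memory at once.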
What do the rows and columns represent, and what kind of data is it? If the files are in a standard format, there may be no need for a database; you can use fast reading and writing to access the data directly from the files, see data.table::fread and fwrite.
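A rough sketch of the file-based route, assuming one CSV file per sample. The folder, file names and columns below are invented for the demonstration; only `fread`, `fwrite` and `rbindlist` come from data.table itself:

```r
library(data.table)

# Write three small stand-in files to a temporary folder (hypothetical names).
demo_dir <- file.path(tempdir(), "samples_demo")
dir.create(demo_dir, showWarnings = FALSE)
for (id in c("S001", "S002", "S003")) {
  fwrite(data.table(position = 1:4, coverage = (1:4) * 10),
         file.path(demo_dir, paste0(id, ".csv")))
}

# Read every file and stack them, keeping the file name as a sample label.
files <- list.files(demo_dir, pattern = "\\.csv$", full.names = TRUE)
names(files) <- sub("\\.csv$", "", basename(files))
all_samples <- rbindlist(lapply(files, fread), idcol = "sample_id")

# If only some columns are needed, fread can skip the rest.
coverage_only <- fread(files[["S001"]], select = "coverage")
```

New samples are then just new files dropped into the folder, and the `sample_id` column makes it possible to compare samples within one table.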
I heard that SQL is not the best basis for a dynamic database (I will have to upload around 10 new samples per week). Moreover, I'm not familiar with SQL.
I could keep each file without any database, but with 400 samples I believe that 400 separate files will not suit my analysis well. I also need to look for similarities between files without knowing in advance which files to compare.
I need to perform statistical analysis and I'm familiar with R. That's why I would like the files (or database) to be easily accessible from R.
Isn't that the case? As I said, I'm not against the use of SQL, but I would just like to start with the right thing so as not to lose time.
Based on your description ('a library of 400 samples. Each sample is constituted of 200 000 rows and 350 columns'), I don't see anything special in your datasets. Thus, I assume any relational database will work, such as MySQL. Probably even plain CSV files will work.