In my research on bioinformatics relate to the process of moving data into the database.
Constraint occurs because the data to be inserted into the database is very large (approximately 1-2 million data), thus affecting the speed performance. Is there an algorithm that is able to assist the process of moving data faster. or is there any other solution?
-in this study i use mysql with php programming language.
This case is, I have 2 files:
1. output.vcf (2 MB)
with the contents of the data like this:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT /home/abrari/bioinf/jobs/39/sorting/output.bam
10 389659 . T C 4.13 . DP=2;RPB=9.668049e-01;AF1=0.4999;AC1=1;DP4=0,1,0,1;MQ=42;FQ=4.93;PV4=1,1,1,1 GT:PL:GQ 0/1:32,0,35:31
10 1221790 . A T 17.8 . DP=2;VDB=6.560000e-02;AF1=1;AC1=2;DP4=0,0,0,2;MQ=42;FQ=-33 GT:PL:GQ 1/1:49,6,0:10
....
....
up to 20.000 lines
2. align.bam (7GB)
with the contents of the data like this:
ACTGTACGTACGGGATT.....
to enter data from case files "output.vcf" This database has been solved, but the problem that occurs is the data per row is entered into the database has the inclusion of a long process. it actually does not include the case of entering the data from the file align.bam.
simple example is like this:
from output.vcf:
9 46607305 . A G 222 . DP=90;VDB=0.000000e+00;AF1=1;AC1=2;DP4=0,0,90,0;MQ=42;FQ=-282 GT:PL:GQ 1/1:255,255,0:99
from align.bam:
.....AAAAATCCCCC.....
bold letters "T" that was a position to 46607305 (refer to file output.vcf).
and some important information will be entered into the database is processed:
- 9 //is chrom
- A //ref
- G //alt
- 46607305 //is position
- AAAAA //is 5 the position of the data before the file align.bam
- CCCCC //is 5 the position of the data after the file align.bam
and is processed to row 20,000 lines.
I believe this process is very long to get into the database. due to data entry of case files output.vcf alone is long, not to mention taking the data from the file align.bam is very large file.
is there any solution that could solve this case?