I'm trying to define a new way to create a "unique_key" for each genotype in a vcf file to use as an index into an SQL database.
For example for this two lines:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003
20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.
20 1110696 rs6040355 A G,T 67 PASS NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2 2/2:35:4
I would get a list of unique indexes like this:
20-14370-G-G
20-14370-A-G
20-14370-A-A
20-1110696-G-T
20-1110696-T-T
Obs: For 1|2
and 2|1
I would get the same id 20-1110696-G-T
My primary reason for this would be to quickly compare and filter some genotypes from two tables, for example:
table1 has 40k rows and table2 has 100 million rows
If I had a "unique id" like that I could quickly eliminate genotypes from table1 that would be present in table2.
Does it make any sense to create a unique id for each genotype or am I trying to reinvent the wheel somehow ?
Is there any other more correct way of achieving this goal of generating a unique id for each genotype present in the VCF?
PS: Yes I know how to intersect positions from a BED or a VCF file, but this time I need to find a way to generate a unique key for each genotype in order to do this comparison quickly using an SQL query.
Thank you for your attention! :)
Thanks a lot Pierre! Once I come up with this unique id, I could use md5 for generating a hash with it. That's a good idea!
My concern was more about generating this unique_id by translating the genotypes in the last column of the VCF (Ex. 0/0, 0/1 and 1/1) using the columns REF and ALT (Ex. G and A) to something like this G-G, G-A and A-A and combining this with the 20-14370 part.
Linking two tables, one with chr-position and another with the genotypes, its all I want to avoid because of performance issues :) This task will have to be repeated a lot of times!
About GEMINI, i checked their table and I didn't see anything like that in here http://gemini.readthedocs.org/en/latest/content/database_schema.html#the-variants-table
For me it's strange no one ever defined a unique id to express exactly the chr-pos-genotype of an individual. The only way I could imagine how to do this would be by combining some columns of the VCF.
I'm assuming there is no simple way to do this, so my algorithm will be like this:
1) Get columns CHROM (1), POS (2), REF (4), ALT (5) and columns with genotypes
2) Generate a list with REF and ALT columns Ex. [A,G,T]
3) Translate 0/0, 0/1, 1/1 to A-A, A-G, A-T
Here i have to define a way to sort the genotypes cause i want 0/1 == 1/0 == 0|1 == 1|0 in order to generate the same unique id for this genotypes. So G-A would become A-G and T-A would become A-T, just to get the same "md5" for both of them.
4) Integrate the previous string with the chr-position part (20-14370) to FINALLY get my UNIQUE ID for each genotype:
20-14370-G-A
Does it sound too complicated ? :)
Cheers!
I think people work with indexed VCF and tools like vcf-tools or GATK, people don't need indexes on genotype.
you could also use a nosql solution like berkeleydb, using chrom-pos as the key, and an array of A1/A2 as the value.
You are right, I guess im the only one trying to generate indexes on genotypes ... :)
I have used mongodb and I did exactly what you said chr-pos for the key and genotypes in the array.
What happened was like that, I would query the database to get the records with the same key "chr-pos" in both tables and them compare the genotypes in the array programatically. I guess nowadays everyone compares the genotypes in this way.
But men, this takes a lot longer than using this unique key, specially when you have a lot of genotypes to compare. I guess I will implement this and report back the results, there is no better way to prove my point :)
Thanks a lot again!