Hello,
I am attempting to compare two genomes from the same individual and find small differences between them. My idea is to write a script that finds small windows (~15-20 bp) that are present in one genome (let's call it abnormal) but not in the other (normal). The program will break each genome into ~10-20 bp windows using a sliding window. The windows go into two databases, one for windows from a normal cell genome and one for windows from an abnormal cell genome. Each database would be keyed by the window sequence itself, and each key would link to every position in the genome where that window is found.
The idea is to do the following:
- For each window in the abnormal database, see if it is also present in the normal database. If it is, delete it.
- Return all windows that remain in the abnormal database, prioritizing those with the highest number of occurrences.
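For the small test case, the two steps above can be sketched roughly as follows. This is only a minimal in-memory illustration, not the actual script: the function names `index_windows` and `abnormal_only` are made up for this example, it assumes both genomes fit in memory as plain strings, and it ignores reverse complements and ambiguous bases.

```python
from collections import defaultdict


def index_windows(genome, k):
    """Map each length-k window to the list of 0-based start positions where it occurs."""
    index = defaultdict(list)
    for i in range(len(genome) - k + 1):
        index[genome[i:i + k]].append(i)
    return index


def abnormal_only(normal, abnormal, k):
    """Return (window, positions) pairs found only in the abnormal genome.

    Results are ordered by number of occurrences (highest first),
    with ties broken alphabetically by window.
    """
    normal_index = index_windows(normal, k)
    abnormal_index = index_windows(abnormal, k)
    # Drop every abnormal window that also appears in the normal genome.
    unique = {w: pos for w, pos in abnormal_index.items() if w not in normal_index}
    return sorted(unique.items(), key=lambda item: (-len(item[1]), item[0]))


if __name__ == "__main__":
    # Toy sequences with a single substitution at position 4.
    for window, positions in abnormal_only("ACGTACGTAC", "ACGTTCGTAC", 4):
        print(window, len(positions), positions)
```

A single substitution changes up to k overlapping windows, so one real difference shows up as a cluster of adjacent abnormal-only windows rather than a single hit.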
I have done this for small test data. For real human genomes, there will be about 6 billion or so windows per database. Right now I have two ways I can think of handling this:
- Install the OS and MySQL on a 1 TB SSD and use a script to populate the databases directly. Use a normal RAID for the mass storage needed for the original genomes. Use database queries to compare the genomes.
- Install the OS on a 1 TB SSD, with around 700 GB set aside for swap. Use the RAID for mass storage of the genomes and for a MySQL database holding the results of the comparison. Instead of comparing databases within MySQL, implement simple hash tables in a Perl script and let the OS go to swap as the structure builds. Do the comparisons within the Perl script and dump the results to the MySQL database on the RAID.
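To get a feel for the sizes involved in either option, here is a rough back-of-the-envelope sketch. It assumes a fixed window length and that each base is packed into 2 bits, which is my own assumption and not something either setup above requires. The names `pack_window` and `raw_key_bytes` are hypothetical, and the estimate covers only the keys themselves, not the position lists or any hash table or database overhead.

```python
# 2-bit codes for the four unambiguous bases (an assumed encoding for this sketch).
CODES = {"A": 0, "C": 1, "G": 2, "T": 3}


def pack_window(window):
    """Pack an A/C/G/T window into an integer, 2 bits per base.

    Returns None if the window contains any other character (for example N).
    Only meaningful when all windows share the same length.
    """
    value = 0
    for base in window:
        code = CODES.get(base)
        if code is None:
            return None
        value = (value << 2) | code
    return value


def raw_key_bytes(num_windows, k):
    """Bytes needed to store num_windows packed keys of length k, keys only."""
    bytes_per_key = (2 * k + 7) // 8  # round up to whole bytes
    return num_windows * bytes_per_key


if __name__ == "__main__":
    print(pack_window("ACGT"))                   # 27
    print(raw_key_bytes(6_000_000_000, 20))      # 30000000000
```

So at 20 bp, 6 billion packed keys come to about 30 GB before positions and overhead are counted, compared with about 120 GB if each window were stored as a 20-character string.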
What would be the best way to handle this? Is there an existing solution out there that does exactly what I'm trying to do?
Thank you in advance!
Do you have a reference genome for this organism? If so, why not perform variant calling and compare the results?
That’s a good idea and I considered that too.
Unfortunately, the application I had in mind only works on a per-individual basis. The typical approach I've seen others take is to derive drug targets based on genes found across an entire population. This idea is definitely in the personalized care category. Even the small variations between different individuals would throw things off. That's part of the reason I want to run this little script. I already know the variation between individuals in a population is too much, but the genomes of cells within one person may be too similar, which would make the idea unworkable. It would be nice to vet the idea computationally before trying any wet work.