I am currently working on a project whose goal is, in a sense, to replicate what services like Ancestry and 23andMe are doing. The end result is a report of the alleles at various SNPs.
I am taking as input a FASTA file for a human chromosome. I don't have any real human FASTA data, so I am mutating a reference and using that as input. Typically I work with chromosome 21, since it's smaller than the others.
I am then doing a global alignment of this to the corresponding reference chromosome.
I will then look at various positions based on SNP data to see what the alleles are at the aligned positions, and report whatever information can be ascertained from this data.
I am using stretcher as my global alignment program. I originally tried needle, but it said my files were too big, so I switched to stretcher.
My concerns are:
- It seems to be taking a long time on my 3.3 GHz dual-core i7 system. It appears the global alignment is single-threaded. I am not sure what a realistic estimate is for how long an alignment of a human chromosome takes. I mutated it by about 1%. 10 hours? 20 hours? 10 days?
- I am not sure what to expect after alignment, as far as how to correlate indexes to the aligned data. Obviously the original indexes shift with alignment, so how is this typically accounted for? For example, SNP rs4477212 is at position 82154. How can I accurately query that position in an alignment file so that I grab the correct alleles (Biopython)?
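One common way to handle the shifted indexes in the second point is to walk the gapped reference row of the alignment and count only non-gap characters until the reference coordinate is reached; the column found that way is then read from the sample row. The following is a minimal sketch in plain Python, assuming the two gapped rows have already been read into strings (for example with Biopython's `AlignIO`) and that positions are 1-based. The function names `ref_pos_to_column` and `allele_at` are hypothetical, chosen only for illustration.

```python
def ref_pos_to_column(aligned_ref, ref_pos):
    """Return the 0-based alignment column that holds the 1-based
    reference position ref_pos. Gaps ('-') in the reference row are
    skipped because they do not consume a reference coordinate."""
    seen = 0
    for col, base in enumerate(aligned_ref):
        if base != "-":
            seen += 1
            if seen == ref_pos:
                return col
    raise ValueError(f"reference position {ref_pos} is beyond the alignment")


def allele_at(aligned_ref, aligned_query, ref_pos):
    """Return (reference allele, sample allele) at a reference position.
    The sample allele is '-' if the sample has a deletion there."""
    col = ref_pos_to_column(aligned_ref, ref_pos)
    return aligned_ref[col], aligned_query[col]


# Toy alignment: the sample has one inserted base and one deleted base.
aligned_ref = "AC-GTAC"
aligned_query = "ACTGT-C"
print(allele_at(aligned_ref, aligned_query, 3))
print(allele_at(aligned_ref, aligned_query, 5))
```

For many lookups on a chromosome-sized alignment, it would be cheaper to build the position-to-column map once in a single pass rather than rescanning for each SNP.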
Brian
BTW, I have no clue what "alleles of various SNPs" could mean. Perhaps you mean something like "phased SNPs" or haplotypes?
Edit: Moved to a comment, since this doesn't exactly answer your question.
Devon, thanks for your reply. I am not sure I want to be aligning whole chromosome files. I simply want to build an example of how some input data is taken in FASTA format (perhaps a gene, not a full chromosome is better), aligned to a reference sequence, and then positions are looked at to examine the SNP. Then of course some probability of characteristics can be gleaned from that information (eye color, etc). I don't want to re-invent the wheel.
I have no actual human FASTA data other than reference data. So I was taking chromosome reference data, mutating it, and then calling that my input. I realize now that maybe that's way too much data to be working with.
What I meant by alleles of various SNPs is this: when you get a report from one of the companies out there, such as 23andMe, Ancestry, etc., it looks like this:
So for each SNP, it shows your allele against the reference allele.
So, for example, I wish to take data, perhaps just a gene, say one that contains chromosome 1, position 798959. That is inside the genes SAMD11 and AL645608.1. So maybe I download reference sequence for those genes, align it to candidate data I find (or create by mutating reference data), and then examine the position to see the status of SNP rs11240777. A lot of the SNPs are not inside genes. I think I gravitated toward chromosomes because a chromosome is a container that I can query, download, etc. Perhaps it's easier to work with genes as a smaller piece of data. I would still need to align to the full reference chromosome, right?
Ah, you're trying to make your life more difficult than needed. There's no need to do any alignment with that sort of data, you already know what the alignment is and what the resulting SNP calls will be (it's your input after all).
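To illustrate why no alignment is needed here: if the simulated sample differs from the reference only by substitutions (an assumption; insertions or deletions would shift coordinates and change the picture), then position N in the sample is still position N in the reference, and the alleles can be read by direct indexing. A minimal sketch, where `mutate_substitutions` and `call_snps` are hypothetical helper names and positions are 1-based:

```python
import random


def mutate_substitutions(ref, rate, seed=0):
    """Return a copy of ref in which each A/C/G/T base is replaced by a
    different base with probability `rate`. Only substitutions are made,
    so the result has the same length and coordinates as ref."""
    rng = random.Random(seed)
    out = []
    for base in ref:
        if base in "ACGT" and rng.random() < rate:
            out.append(rng.choice([b for b in "ACGT" if b != base]))
        else:
            out.append(base)
    return "".join(out)


def call_snps(ref, sample, positions):
    """Map each 1-based position to (reference allele, sample allele).
    Valid only when ref and sample share coordinates (no indels)."""
    return {pos: (ref[pos - 1], sample[pos - 1]) for pos in positions}


ref = "ACGTACGTACGTACGTACGT"
sample = mutate_substitutions(ref, rate=0.01)
print(call_snps(ref, sample, [3, 10]))
```

With real data the same lookup would use the SNP's coordinate in whichever reference build the position was taken from.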
You could look at things like SnpEff or Ensembl's VEP (Variant Effect Predictor) for predicting actual functional outcomes. However, note that these variants may or may not be directly causative themselves. Particularly with 23andMe, you have data from a SNP array of some sort, so you're really just looking at a fixed set of things that may or may not be linked to a large number of possible traits. So the trick will be getting a large number of samples and associated phenotypic information.