Hello,
I want to ask about generating a consensus sequence from variant data. Let's say I have a region like the one below:
ACATGACGATACTAACGGAACC
In that region, I found 2 SNPs, at the 3rd and 10th nucleotides:
POS -- REF -- ALT
3 -- A -- C
10 -- T -- A
If I apply the consensus function, there are 2 possible outcomes:
Heterozygous sequences where one sequence carries only the mutation at the 3rd nucleotide and the other carries only the mutation at the 10th nucleotide.
Heterozygous sequences where one sequence is identical to the reference and the other carries both mutations, at the 3rd and 10th nucleotides.
My question is, how do I decide which of these best represents the consensus sequence?
The question is: What do you want to do with the consensus sequence? What's the biological question you are trying to answer?
I want to check the protein translated from the variant sequence. A single SNP could change the start codon or a stop codon, or the variants could change some amino acids, and I want to see whether they affect the protein sequence or not. That is what I want to find in the data.
So (assuming your organism of interest is diploid) you would need to know whether those two variants are in cis or in trans.
My organism is human, so I think I will need to know that. Basically, I wrote a simple program to map the variants onto the transcript sequence, and I want to know what transcript sequences (in FASTA format) result when the variants are substituted into the reference transcript.
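A minimal sketch of that check in Python, using the example region from the question. It assumes, purely for illustration, that the ATG at positions 3-5 of the region is the start codon; the real CDS start would come from the transcript annotation. The helper names apply_snp and translate are made up for this example.

```python
# Standard genetic code, with codons ordered T, C, A, G at each position.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def apply_snp(seq, pos, ref, alt):
    """Return seq with one SNP applied (pos is 1-based); verify REF first."""
    if seq[pos - 1] != ref:
        raise ValueError(f"expected {ref} at position {pos}, found {seq[pos - 1]}")
    return seq[:pos - 1] + alt + seq[pos:]


def translate(seq, cds_start):
    """Translate from a 1-based CDS start; a trailing partial codon is ignored.

    Stop codons are reported as '*' and translation does not stop at them.
    """
    cds = seq[cds_start - 1:]
    return "".join(CODON_TABLE[cds[i:i + 3]] for i in range(0, len(cds) - 2, 3))


region = "ACATGACGATACTAACGGAACC"
CDS_START = 3  # assumption: the ATG at positions 3-5 is the start codon

print(translate(region, CDS_START))                           # MTILTE
print(translate(apply_snp(region, 3, "A", "C"), CDS_START))   # LTILTE, ATG became CTG
print(translate(apply_snp(region, 10, "T", "A"), CDS_START))  # MTKLTE, ATA became AAA
```

Under that assumed frame, the SNP at position 3 removes the start codon and the SNP at position 10 changes an isoleucine to a lysine.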
Essentially, you need to know if both variants are on the same allele/chromosome or not. This is called phasing variants. That's trivial if you have reads spanning from one position to the other.
Please state that from the beginning when asking questions. Try to be as informative as possible.
Can you please explain phasing variants a bit more? Also, what do you mean by "trivial if you have reads spanning from one position to the other"? I am really new to variant data.
Sorry, I forgot to mention the organism; I will add that.
Anyway, I am currently thinking of generating all possible haplotype combinations, because I think it is not that hard and there probably will not be that many variants in one transcript. What do you think about that?
Since your organism is diploid, for a given combination of two SNPs there are two possible scenarios. Either the SNPs are from the same chromosome/allele, or from different chromosomes.
Scenario 1:
maternal: ACCTGACGAAACTAACGGAACC
paternal: ACATGACGATACTAACGGAACC
Scenario 2:
maternal: ACCTGACGATACTAACGGAACC
paternal: ACATGACGAAACTAACGGAACC
Obviously, which scenario applies strongly influences how the resulting protein is affected!
Phasing variants means figuring out which of those scenarios you have. When you have reads spanning from SNP1 to SNP2 that is quite trivial, because you can see whether the two variants always occur in the same read or never do (and therefore whether they originate from the same molecule/chromosome).
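The two scenarios can be rebuilt from the reference and the SNP list with a few lines of Python. This is only a sketch: the function names are invented for the example, and which copy is called maternal or paternal is arbitrary.

```python
REFERENCE = "ACATGACGATACTAACGGAACC"
# (1-based POS, REF, ALT), taken from the question
SNP_A = (3, "A", "C")
SNP_B = (10, "T", "A")


def apply_snps(seq, snps):
    """Substitute each SNP into seq after checking that REF matches."""
    bases = list(seq)
    for pos, ref, alt in snps:
        if bases[pos - 1] != ref:
            raise ValueError(f"expected {ref} at position {pos}")
        bases[pos - 1] = alt
    return "".join(bases)


def two_snp_scenarios(reference, snp_a, snp_b):
    """Return both possible haplotype pairs for two heterozygous SNPs."""
    return {
        # Scenario 1: both ALT alleles on the same chromosome (cis)
        "cis": (apply_snps(reference, [snp_a, snp_b]), reference),
        # Scenario 2: one ALT allele on each chromosome (trans)
        "trans": (apply_snps(reference, [snp_a]), apply_snps(reference, [snp_b])),
    }


for name, (hap1, hap2) in two_snp_scenarios(REFERENCE, SNP_A, SNP_B).items():
    print(name, hap1, hap2)
```

The "cis" pair reproduces Scenario 1 and the "trans" pair reproduces Scenario 2.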
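To make the read-spanning idea concrete, here is a toy sketch in Python. It assumes reads are given as (1-based start, sequence) pairs aligned without gaps or clipping, and the four reads are invented for illustration. With real data the reads would come from an alignment file, and a dedicated read-based phasing tool would be the more robust route.

```python
def base_at(read_start, read_seq, pos):
    """Base that a gapless read shows at reference position pos, or None."""
    offset = pos - read_start
    if 0 <= offset < len(read_seq):
        return read_seq[offset]
    return None


def count_phase_support(reads, snp_a, snp_b):
    """Count reads supporting cis (REF+REF or ALT+ALT) versus trans (mixed)."""
    cis = trans = 0
    for start, seq in reads:
        a = base_at(start, seq, snp_a[0])
        b = base_at(start, seq, snp_b[0])
        # Skip reads that do not span both SNPs or show an unexpected base
        if a not in snp_a[1:] or b not in snp_b[1:]:
            continue
        if (a == snp_a[2]) == (b == snp_b[2]):
            cis += 1
        else:
            trans += 1
    return cis, trans


SNP_A = (3, "A", "C")   # (POS, REF, ALT)
SNP_B = (10, "T", "A")
toy_reads = [            # hypothetical reads, not real data
    (1, "ACCTGACGAAACTAAC"),    # ALT at both positions
    (2, "CATGACGATACTAACGG"),   # REF at both positions
    (3, "CTGACGAAACTAACGGAA"),  # ALT at both positions
    (5, "GACGATACTAACGGAACC"),  # does not cover position 3, so it is ignored
]
print(count_phase_support(toy_reads, SNP_A, SNP_B))  # (3, 0)
```

Here every spanning read shows the two alleles together, which would point to the cis arrangement (Scenario 1). Real reads contain sequencing errors, so in practice the decision rests on a clear majority rather than a single read.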
Ok. So, after running some simulations, I see that the number of scenarios increases with the number of heterozygous variants in a region. For 4 heterozygous variants, I will have 8 scenarios. So, do you know how to figure out which scenario is the best? Are there any tools or software you know of? Thank you.
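For reference, n heterozygous SNPs in a diploid sample give 2^(n-1) distinct phasings (2 for two SNPs, 8 for four), because swapping the two chromosomes does not create a new scenario. A brute-force enumeration could look like the sketch below. The third and fourth SNPs are hypothetical and added only to show the count; enumeration lists the candidates, it does not tell you which one is real.

```python
from itertools import product


def apply_snps(seq, snps):
    """Substitute each (POS, REF, ALT) SNP into seq, checking REF first."""
    bases = list(seq)
    for pos, ref, alt in snps:
        if bases[pos - 1] != ref:
            raise ValueError(f"expected {ref} at position {pos}")
        bases[pos - 1] = alt
    return "".join(bases)


def enumerate_phasings(reference, snps):
    """Yield every distinct (haplotype 1, haplotype 2) pair for heterozygous SNPs."""
    if not snps:
        yield reference, reference
        return
    first, rest = snps[0], snps[1:]
    # Pin the first ALT allele to haplotype 1 so mirror images are not counted twice
    for on_hap1 in product((True, False), repeat=len(rest)):
        hap1 = [first] + [s for s, keep in zip(rest, on_hap1) if keep]
        hap2 = [s for s, keep in zip(rest, on_hap1) if not keep]
        yield apply_snps(reference, hap1), apply_snps(reference, hap2)


reference = "ACATGACGATACTAACGGAACC"
snps = [
    (3, "A", "C"),
    (10, "T", "A"),
    (15, "A", "G"),  # hypothetical SNP, for illustration only
    (20, "A", "T"),  # hypothetical SNP, for illustration only
]
for hap1, hap2 in enumerate_phasings(reference, snps):
    print(hap1, hap2)
print(len(list(enumerate_phasings(reference, snps))))  # 8
```

The count doubles with every extra heterozygous SNP, so enumeration is only practical for a handful of variants per transcript.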