Reference genome, VCF files, genotype
2.6 years ago
tea.vuki ▴ 20

Hello,

I have trouble grasping certain concepts while analysing a VCF file. I need to find all transitions. I have REF A, ALT G and a 1/1 genotype, so I would say that the person has a GG genotype and that 2 transitions happened (2x A->G). Correct me if I'm wrong.

However, when I think about this more I realise that I don't understand certain things (a lot of things :( ) :

  1. Does this mean that the REF genome has A on the fw strand and T on the other, while the sequenced sample has G on one and C on the other (for both chromosomes)? How can I know what the reference is for the other chromosome in the pair?

  2. How can the reference genome contain information regarding only one chromosome? What about alleles? If we know that a gene is "defined" by a combination of alleles, what does this information in the reference genome actually tell us?

  3. Let's be very visual - I isolate DNA from a certain person and I am interested in a certain gene, let's call it gene X. That gene comes in two copies: one on mom's and one on dad's chromosome, each having 2 strands (4 strands overall). What do we sequence here during paired-end sequencing? Everything, fw and rev strand from both chromosomes?

  4. What exactly are the reads that I am aligning? Do I just take all reads from fw strands and then align them to the reference genome? Or do I align both fw and rev?

vcf • 1.1k views

tea.vuki

Why did you delete the post?

2.6 years ago

Welcome to bioinformatics - where reality always turns out to be more complicated than we anticipated

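A transition is a substitution between two purines (A<->G) or between two pyrimidines (C<->T), and it can happen anywhere in the genome. In a VCF, REF and ALT are reported relative to the forward strand of the reference genome, so A->G here refers to the forward strand (an A->G change on one strand is a T->C change on the other, which is still a transition). Strand only matters when we describe the effect on a coding region, because the coding sequence is defined relative to the strand the gene is on.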
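To make the strand point concrete, here is a minimal Python sketch that classifies a single-base substitution and shows that the class does not change when the same change is read from the opposite strand. The function names `classify_snv` and `on_other_strand` are made up for this illustration and do not come from any library.

```python
# Purines and pyrimidines; a transition stays within one class.
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def classify_snv(ref: str, alt: str) -> str:
    """Return 'transition' or 'transversion' for a single-base substitution."""
    ref, alt = ref.upper(), alt.upper()
    if ref == alt or ref not in COMPLEMENT or alt not in COMPLEMENT:
        raise ValueError(f"not a single-base substitution: {ref}>{alt}")
    same_class = {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


def on_other_strand(ref: str, alt: str) -> tuple:
    """Express the same substitution as seen on the complementary strand."""
    return COMPLEMENT[ref.upper()], COMPLEMENT[alt.upper()]


if __name__ == "__main__":
    print(classify_snv("A", "G"))                     # forward strand
    print(classify_snv(*on_other_strand("A", "G")))   # same change, other strand
```

Because complementing swaps A with T and G with C, a purine pair always maps to a pyrimidine pair and vice versa, so the transition/transversion label is the same on either strand.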

Reference genomes are a means to represent information. A reference genome does not have to be a "real" or "functional" sequence. We use it as a coordinate system to identify changes, or the lack of changes, and to ensure that we are all talking about the same changes/differences.

Having different alleles for the same sequence does indeed complicate every analysis.

When we sequence DNA we are sequencing small fragments, usually from all chromosomes that are present, but we never know which read came from which copy :-). Resolving that can again be a fairly complex process; it is called phasing the variants.

The reads come from the double-stranded DNA fragments that are broken up into single strands. Not all fragments will be sequenced. About half of the reads will come from each strand. When we align, we align against the forward strand of the reference only; a read that came from the reverse strand is matched as its reverse complement.
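As a small illustration of why reads from both strands can be compared to a forward-strand reference, the sketch below reverse-complements a read. The helper names `reverse_complement` and `orient_to_forward` are hypothetical; real aligners handle this internally.

```python
# Translation table for complementing bases (N stays N).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Complement every base, then reverse the sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def orient_to_forward(read: str, from_reverse_strand: bool) -> str:
    """Return the read as it would appear on the forward strand."""
    return reverse_complement(read) if from_reverse_strand else read


if __name__ == "__main__":
    forward_reference = "AAGCTT"
    reverse_strand_read = reverse_complement(forward_reference[:4])
    # After re-orienting, the read matches the forward reference again.
    print(orient_to_forward(reverse_strand_read, from_reverse_strand=True))
```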


Thank you so much! Now the only part that confuses me is counting the number of transitions, in this exact example (from my VCF file):

chr1    924533  .   A   G   50  PASS    .   GT:DP:ADALL:AD:GQ:IGT:IPS:PS    1|1:456:0,226:0,0:259:1/1:.:PATMAT

The sequenced sample has 1/1 -> GG genotype. REF is A, ALT is G. Does this mean that 2 transitions occurred? Can I call this an SNP? Sorry to bother you, but I honestly didn't understand the part about the coding sequence and the reference genome that you wrote :(


Just to clarify: when we talk about SNPs, they are reported relative to the forward strand. It is only when we talk about the effect on a coding region that we take the bases in the proper orientation for that gene into account.

1|1 means that the genotype is diploid, homozygous, and phased. Both copies carry the mutation.
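For the record quoted above, the genotype can be decoded with a few lines of Python. This is only a sketch that assumes a single-sample, whitespace-separated VCF line; `describe_genotype` is a made-up helper, not part of any VCF library. It shows one variant site (A->G) with two ALT copies; whether that is reported as one transition (per site) or two (per allele copy) depends on the counting convention of the analysis.

```python
import re


def describe_genotype(vcf_line: str) -> dict:
    """Decode the GT field of a single-sample VCF data line."""
    fields = vcf_line.split()                 # assumes no spaces inside fields
    ref, alt = fields[3], fields[4]
    format_keys = fields[8].split(":")
    sample = dict(zip(format_keys, fields[9].split(":")))

    gt = sample["GT"]
    alleles = [ref] + alt.split(",")          # index 0 = REF, 1.. = ALT alleles
    indices = re.split(r"[|/]", gt)           # '|' = phased, '/' = unphased
    called = [i for i in indices if i != "."] # drop missing calls

    return {
        "bases": [alleles[int(i)] for i in called],
        "phased": "|" in gt,
        "homozygous": len(called) == len(indices) and len(set(called)) == 1,
        "alt_copies": sum(1 for i in called if i != "0"),
    }


if __name__ == "__main__":
    line = ("chr1 924533 . A G 50 PASS . "
            "GT:DP:ADALL:AD:GQ:IGT:IPS:PS 1|1:456:0,226:0,0:259:1/1:.:PATMAT")
    print(describe_genotype(line))
```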
