When reading data about genomic variants, I saw notation like "chr19:5237294" used to indicate a location in a sequence. I am confused about whether the “19” is the pair number or the chromosome number.
- If it is the pair number, how can we determine which chromosome in that pair is meant?
- If it is the chromosome number, is there notation like “chr25, chr26, ...”? I have never seen those.
- In either case, a chromosome has 2 strands, so how can we know which strand a variant (for example, a SNP) is on?
Could you please help me understand the points above?
Thank you so much
Thank you @cmdcolin. From your reply, I understand the following:
The variants in a VCF file are listed relative to the reference genome, not relative to the sample genomes.
The reference genome is haploid, so for each pair of chromosomes there is only one corresponding reference "strand". So there are 22 reference strands for the 22 chromosome pairs, plus one reference strand for X and one for Y.
If my understanding is correct, each location on the reference strand would be compared with 4 possible values from the sample genome (2 chromosomes, each with 2 strands). In VCF files, I saw that there are two columns, REF and ALT, where REF is the value from the reference strand and ALT contains values from the actual samples. However, most ALT entries have only 1 or 2 values (separated by ','), not 4. What am I misunderstanding here?
Base pairing in DNA generally means you don't need to represent the two strands separately: if one strand has a G, the other has a C, and so on. If you know one strand, you automatically know the other, so all of the information is reported relative to the reference genome, which again is single-stranded and haploid.
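To make the base-pairing point concrete, here is a minimal Python sketch. It is only an illustration: the helper names `parse_locus` and `reverse_complement` are made up for this post, and the REF/ALT values are a hypothetical SNP, not taken from any real VCF record.

```python
# Watson-Crick pairing: each base fixes its partner on the opposite strand.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C", "N": "N"}


def reverse_complement(seq):
    """Return the opposite strand's sequence, read in its own 5'->3' direction."""
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))


def parse_locus(locus):
    """Split a 'chrom:position' string (hypothetical helper) into its two parts."""
    chrom, pos = locus.split(":")
    return chrom, int(pos)


# The locus string is the one from the question; REF and ALT are assumed values.
chrom, pos = parse_locus("chr19:5237294")
ref, alt = "G", "A"

print(chrom, pos)
# The same variant as seen from the other strand: no extra information is needed.
print(reverse_complement(ref), reverse_complement(alt))
```

Because the second strand is fully determined by the first, a single REF/ALT pair describes the change on both strands at once, which is why a record does not need one value per strand.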