Question

Chromosome pair numbers in variant annotation.

2.2 years ago

Peter • 0

When I read data about genomic variants, I saw something like "chr19:5237294" used to indicate a location in a sequence. I am confused about whether the "19" is the pair number or the chromosome number.

If it is the pair number, how can we determine which chromosome in that pair is meant?
If it is the chromosome number, is there notation like "chr25, chr26, ..."? I have never seen it.
In both cases, a chromosome has 2 strands, so how can we know which strand a variant (for example, a SNP) occurs on?

Could you please help me understand the points above?

Thank you so much

number chromosome annotation • 957 views

ADD COMMENT • link updated 2.2 years ago by cmdcolin ★ 4.0k • written 2.2 years ago by Peter • 0

score 4 · Accepted Answer · 2022-08-29

2.2 years ago

cmdcolin ★ 4.0k

If you are referring to the diploid pairs of chromosomes in the human genome, then this is sort of "ignored" or "not relevant" for the human reference genome.

The human reference genome (which you can download as a FASTA file from e.g. http://hgdownload.soe.ucsc.edu/downloads.html#human) is "haploid", meaning it only contains one copy of chr1-22 and X and Y. So the "19" in "chr19" names the chromosome, not one particular member of a pair.

For the strandedness, only a single strand is represented in the reference genome, so coordinates are given relative to that strand (a single string of letters). Data in e.g. a VCF file (which contains variants) describes variants relative to the reference genome, on the same strand that is in the reference genome.
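As a rough illustration of "one string of letters per chromosome", here is a minimal Python sketch. The sequences in `TOY_REFERENCE` are made up and the helper names (`parse_locus`, `reference_base`) are hypothetical; a real reference would be read from the FASTA file. It also assumes positions are 1-based, as they are in VCF.

```python
# Toy haploid reference: one forward-strand sequence per chromosome name.
# These sequences are invented for illustration; a real reference comes from FASTA.
TOY_REFERENCE = {
    "chr19": "ACGTTGCAAG",
    "chrX": "TTGACCGTAA",
}


def parse_locus(locus):
    """Split a string such as 'chr19:5237294' into ('chr19', 5237294)."""
    chrom, pos = locus.split(":")
    return chrom, int(pos)


def reference_base(reference, locus):
    """Return the reference base at a 1-based position on the stored strand."""
    chrom, pos = parse_locus(locus)
    return reference[chrom][pos - 1]  # convert 1-based position to 0-based index


print(parse_locus("chr19:5237294"))               # ('chr19', 5237294)
print(reference_base(TOY_REFERENCE, "chr19:4"))   # T
```

Note that there is exactly one entry per chromosome name, with no separate entries for the two members of a pair or for the two strands.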

Footnote 1: This may just be trivia or add confusion, but there is/was a notion of strandedness in dbSNP (https://www.ncbi.nlm.nih.gov/core/assets/snp/docs/RefSNP_orientation_updates.pdf). You likely do not need to be concerned with it.

Footnote 2: The human reference genome is "haploid", but people can still assemble a "diploid" genome, e.g. if they were analyzing you. It can be tricky to disentangle the maternal and paternal parts of your genome, but things like phasing and trio binning can separate the maternal from the paternal alleles. Long reads and programs like hifiasm demonstrate this.

ADD COMMENT • link 2.2 years ago by cmdcolin ★ 4.0k

Thank you @cmdcolin. From your reply, I understand the following:

The variants in a VCF file are listed relative to the reference genome, not relative to the sample genomes.
The reference genome is haploid, so for each pair of chromosomes there is only one corresponding reference "strand". So there are 22 reference strands for the 22 chromosome pairs, plus one reference strand for X and one for Y.

If my understanding is correct, each location on the reference strand would be compared with 4 possible values from the sample genome (2 chromosomes, each with 2 strands). In VCF files, I saw that there are two columns, REF and ALT, where REF is the value from the reference strand and ALT contains values from the actual samples. However, most ALT entries have only 1 or 2 values (separated by ','), not 4. What am I misunderstanding here?

ADD REPLY • link 2.2 years ago by Peter • 0

The base pairing of DNA generally means you don't need to represent the strands separately: if one strand has a G, the other strand has a C, and so on. If you know one strand, you automatically know the other, so everything is reported relative to the reference genome, which again is single stranded and haploid.
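A small Python sketch of why 1 or 2 values are enough, under some assumptions. The record (REF=G, ALT=A, genotype `0/1`) is hypothetical, and the helper names are made up. The decoding follows the VCF convention that a GT index of 0 means REF and 1 means the first ALT allele; missing genotypes such as `./.` are not handled here.

```python
# Watson-Crick pairing: knowing the base on one strand fixes the other.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def other_strand(base):
    """Return the base paired with `base` on the opposite strand."""
    return COMPLEMENT[base]


def sample_alleles(ref, alt, genotype):
    """Decode a GT value such as '0/1' into the bases on the two chromosome copies.

    Index 0 refers to REF, 1 to the first ALT allele, 2 to the second, and so on.
    """
    alleles = [ref] + alt.split(",")
    indexes = genotype.replace("|", "/").split("/")
    return [alleles[int(i)] for i in indexes]


# Hypothetical heterozygous record: REF=G, ALT=A, GT=0/1.
copies = sample_alleles("G", "A", "0/1")
print(copies)                             # ['G', 'A']
print([other_strand(b) for b in copies])  # ['C', 'T']
```

In this sketch the two chromosome copies account for at most two distinct bases at a site, and the opposite strand of each copy is derived rather than stored, which is why ALT does not need 4 values.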

ADD REPLY • link 2.2 years ago by cmdcolin ★ 4.0k