Understanding Used Assembly: Why aren't authors specific about patch version?
18 months ago

I am having difficulty finding the exact assembly version (e.g. patch version) of GRCh38 used for major databases.

For instance, gnomAD says "GRCh38". But the only information on the version for v3.1 comes from here, which says it "uses an updated version of Variant Effect Predictor (VEP) based on the most recent Gencode v35". When I click that link, I discover that Gencode v35 is based on GRCh38.p13, which is great, except it doesn't tell me whether the GenBank (GCA_000001405.28) or RefSeq (GCF_000001405.39) assembly was used. This is important as those versions are not the same (unlike early patches of GRCh38). Additionally, the files I downloaded were v3.1.2, where no assembly information beyond "GRCh38" is given.

Then I decided to look up the latest update to the 1000 Genomes Project (e.g. Byrska-Bishop et al. 2022). After a quick scan, I am not finding anything about the patch version. Do I simply assume the original 2013 release was used?

The fact that patch-version does matter for looking up variants at an exact position makes me wonder if I am approaching this wrong... 1) Are people generally not using positions to look up variant information and instead using rsIDs? What about rare variants found by whole-genome sequencing, which don't have an rsID? 2) Is there some quick way/tool to figure out the patch version?

For my work, being able to extract variants using a position and mapping them exactly onto the chromosome is important. However, I am trying to incorporate multiple databases, which used different patches. Thus, I am also wondering...

3) Is there a reliable way to remove patches and convert VCF files to the original 2013 version of GRCh38? Is this considered bad practice?

Really appreciate any feedback

Thank you!

1000genomes GRCh38 gnomad assembly freeze • 1.5k views
18 months ago
GenoMax 147k

You may have seen the explanation of patches from the GRC --> (LINK).

"The fact that patch-version does matter for looking up variants at an exact position makes me wonder if I am approaching this wrong."

An important thing to keep in mind:


Patches are accessioned scaffold sequences that represent assembly updates. They add information to the assembly without disrupting the chromosome coordinates.
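One rough way to see this in your own files is to list the contigs declared in a VCF header and check whether anything beyond the chromosomes and unlocalized/unplaced scaffolds is present. The sketch below assumes UCSC-style hg38 contig names, where fix patches end in `_fix` and alternate loci end in `_alt` (as far as I know, novel patches share the `_alt` suffix); other naming schemes (Ensembl, RefSeq accessions) would need different rules. The function names are hypothetical, made up for illustration.

```python
import re
from collections import Counter
from typing import Dict, Iterable, List

# Matches header lines such as ##contig=<ID=chr1,length=...> or ##contig=<ID=chr1>
_CONTIG_RE = re.compile(r"^##contig=<ID=([^,>]+)")
# Chromosomes 1-22, X, Y and the mitochondrion, with or without a "chr" prefix
_PRIMARY_RE = re.compile(r"^(chr)?([1-9]|1[0-9]|2[0-2]|X|Y|M|MT)$")


def contig_ids(header_lines: Iterable[str]) -> List[str]:
    """Collect contig IDs from the ##contig lines of a VCF header."""
    ids = []
    for line in header_lines:
        if not line.startswith("#"):
            break  # the header ends where the data lines begin
        match = _CONTIG_RE.match(line)
        if match:
            ids.append(match.group(1))
    return ids


def classify_contig(name: str) -> str:
    """Classify a contig by name (assumption: UCSC-style hg38 suffixes)."""
    if _PRIMARY_RE.match(name):
        return "chromosome"
    if name.endswith("_fix"):
        return "fix_patch"
    if name.endswith("_alt"):
        return "alt"
    if name.endswith("_random") or (
        name.startswith("chrUn_") and not name.endswith("_decoy")
    ):
        return "unlocalized_or_unplaced"
    return "other"  # decoys, HLA contigs, or an unrecognized naming scheme


def summarize_contigs(header_lines: Iterable[str]) -> Dict[str, int]:
    """Count how many declared contigs fall into each category."""
    return dict(Counter(classify_contig(c) for c in contig_ids(header_lines)))
```

If the summary shows only `chromosome` and `unlocalized_or_unplaced` entries, the calls were made against the primary assembly, and the patch release of the reference does not change where those variants sit.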



GenoMax how does this work, formally?


Did you check the link above?

18 months ago

"The fact that patch-version does matter for looking up variants at an exact position"

This is incorrect. A freeze is a freeze. No one aligns to contigs outside the primary assembly (which includes unscaffolded contigs but not patches) unless they are doing really esoteric work.

It is true that you may see differences in annotation that are (IMO unfortunately and confusingly) linked to patch releases, but those shouldn't affect your ability to use (chrom, pos, ref, alt) to reliably locate variants in your GRCh38 VCF. The "id" field can be affected by the dbSNP version, but the variants themselves shouldn't move within a freeze.
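As a minimal sketch of what matching on (chrom, pos, ref, alt) can look like when combining two GRCh38 sources, the snippet below builds keys from plain-text VCF data lines and intersects them. It assumes both files are on the same freeze and represent indels the same way (no left-alignment or trimming is done here), and it only reconciles the `chr` prefix and `M`/`MT` naming. The helper names are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

VariantKey = Tuple[str, int, str, str]  # (chrom, pos, ref, alt)


def normalize_chrom(chrom: str) -> str:
    """Drop an optional 'chr' prefix and map 'M' to 'MT' so naming styles agree."""
    name = chrom[3:] if chrom.lower().startswith("chr") else chrom
    return "MT" if name == "M" else name


def variant_keys(vcf_line: str) -> List[VariantKey]:
    """Return one key per ALT allele of a tab-separated VCF data line."""
    chrom, pos, _id, ref, alts = vcf_line.rstrip("\n").split("\t")[:5]
    return [
        (normalize_chrom(chrom), int(pos), ref.upper(), alt.upper())
        for alt in alts.split(",")
        if alt != "."  # skip records with no ALT allele
    ]


def index_vcf(lines: Iterable[str]) -> Dict[VariantKey, str]:
    """Map each variant key to the VCF line it came from, skipping the header."""
    index: Dict[VariantKey, str] = {}
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        for key in variant_keys(line):
            index[key] = line
    return index


def shared_variants(a: Iterable[str], b: Iterable[str]) -> List[VariantKey]:
    """Sorted keys present in both inputs (assumption: same freeze for both)."""
    return sorted(index_vcf(a).keys() & index_vcf(b).keys())
```

Note that the ID column is deliberately ignored when building the key, so a variant with an rsID in one source and `.` in the other still matches.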

So, to answer your questions:

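  1. Yes, people do use chr-pos-ref-alt.
  2. Not a thing: patches don't change primary-assembly coordinates, so there is no patch version to work out for position lookups.
  3. Not a thing, for the same reason: there is nothing to remove or convert.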