GRCh38 BAM with hg38 VCFs
3
6
Entering edit mode
10.1 years ago

I have a BAM file containing an alignment to the GRCh38 "analysis pipeline" set with decoys (from ftp.ncbi.nlm.nih.gov/genbank/genomes/Eukaryotes/vertebrates_mammals/Homo_sapiens/GRCh38/seqs_for_alignment_pipelines/ ).

I remapped NCBI's "resource bundle" VCF files to UCSC's hg38 using a combination of UCSC's chain files, CrossMap, and custom scripts. I want to run GATK indel realignment, base calibration, and variant calling on my BAM file using some of these remapped VCF files (e.g. -known on RealignerTargetCreator).

The BAM and VCF come from different assemblies (GRCh38 with decoys and hg38, respectively). I can change GRCh38 chromosome names to hg38 chromosome names (NCBI provides a map), but is this sufficient to ensure that tools interpret the VCF files' coordinates correctly?

genome next-gen • 8.5k views
ADD COMMENT
2
Entering edit mode
10.1 years ago
Denise CS ★ 5.2k

Unplaced: not known which chromosome the scaffold belongs to.

Unlocalized: although the chromosome within which the scaffold occurs is known, the scaffold's position or orientation is not known.

ADD COMMENT
1
Entering edit mode
10.1 years ago
Denise CS ★ 5.2k

GRCh38 and hg38 are different names for the same (latest) human assembly. The former is the official name by the Genome Reference Consortium whereas the latter is the UCSC Genome Browser Assembly ID. So the coordinates should be exactly the same.

More info can be found here.

ADD COMMENT
0
Entering edit mode

Thanks.

The sequences for GRCh38 and hg38 are nearly identical. However, hg38 seems to have more 'N's and 'n's, at least. Second, the GRCh38 "analysis set" contains coordinates that hg38 (and GRCh38) doesn't. Third, the headers differ wildly:

GRCh38: >gi|568336022|gb|CM000664.2| Homo sapiens chromosome 2, GRCh38 reference primary assembly
hg38: >chr2

My concern is that to GATK RealignerTargetCreator I'm passing the GRCh38 analysis set using --reference_sequence and a BAM file aligned to this assembly using --input_file, but VCF files mapped to hg38 using --known. How does GATK know that '568336022' is the same as 'chr2'?

I confirmed that RealignerTargetCreator produces different target intervals if I remove that argument (I also reran with --known to confirm that it's completely deterministic). This is a great sign, but it doesn't prove that the VCF file is being treated correctly.

So my question is how to ensure that tools (especially GATK) treat the mixed references (GRCh38 analysis set and hg38) correctly. Is it enough to change 'chr2' to '568336022'?

ADD REPLY
1
Entering edit mode
10.1 years ago
Denise CS ★ 5.2k

Hmm, now I'm curious. Got in touch with Kerstin Howe from GRC (that's the beauty of working on the same campus). Can you send me some examples so that I can pass them to her please? The headers are not a problem per se but the Ns are.

ADD COMMENT
1
Entering edit mode

Re your first issue: It seems that the UCSC download replaces all ambiguous bp with Ns. You are losing some information there, but as this only affects about 28 assembly components it probably is not a big problem.

Second issue: I am assuming you are working with the GRCh38 analysis set that includes the alts (note there is no decoy for GRCh38) and are therefore observing that it contains sequence that is not included in any download of the primary assembly (= no alts) from UCSC or directly from the GRC. If you want to avoid this, please use the "no_alts" analysis set offered here. More details can be found here.

ADD REPLY
0
Entering edit mode

Thanks! I'm glad that the masking isn't an issue.

For the second concern, I am using no_alts, but the readme mentions it includes an Epstein-Barr viral sequence as a decoy as well as unplaced and unlocalized scaffolds (aside: what's the difference between unplaced and unlocalized?). My understanding is that these sequences are not included in UCSC's hg38. Is that not correct?

ADD REPLY
0
Entering edit mode

These links are dead. NCBI moved the content to another folder and declared it frozen. Check this.

ADD REPLY

Login before adding your answer.

Traffic: 2071 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6