Question

GATK - Output differences in Calls when Split by chromosome intervals and without splitting

0

Entering edit mode

5 months ago

Maverick ▴ 10

Hello,

I have a GATK pipeline following best practices workflow to call rare germline SNPs and INDELs. I wanted to reduce time and so I split the processing at chromosome level

In one analysis, I split the aligned and duplicates marked bam file by chromosomes and ran base recalibration(BQSR) and haplotype caller (regular mode, not gvcf) and then concatenated the vcf files of each chromosome of a sample into 1.

In another one, I ran the exact same commands without splitting at chromosome level. I filtered the non standard chromosomes from the result and when I tried to compare the results to the earlier vcf file , there were a couple 100 different variant calls, and in the calls that were matching handful of them had different genotype.

At the moment I am comparing the vcf calls using vcfR package and merging based on CHROM, POS, REF and ALT columns.

Is this expected behavior? I thought that base recalibration and haplotype caller don't consider regions across contigs for it's functionality.

Should I filter the non-standard chromosomes even before running base recalibration by providing the standard chromosomes as a list in -L in the analysis where I am not splitting the chromosomes?

Thank you,

Your guidance is greatly appreciated.

GATK PICARD MergeVcf vcftools • 1.2k views

ADD COMMENT • link 4 months ago by Maverick ▴ 10

0

Entering edit mode

hat were matching handful of them had different genotype.

Show us some examples.

ADD REPLY • link 5 months ago by Pierre Lindenbaum 166k

0

Entering edit mode

This is the script (I am running it on JupyterLab):

# Load necessary libraries
library(tidyverse)
library(vcfR)

options(width = 200) 

# Step 1: Read VCF files
cat("Reading VCF files...\n")
vcf1 <- read.vcfR("vcf split by chromosomes")
vcf2 <- read.vcfR("vcf without any splitting")
cat("VCF files read successfully.\n")

# Step 2: Extract fixed fields, INFO fields, and GT (genotype) from both VCF files and convert to data frames
cat("Extracting fields and converting VCFs to data frames...\n")
vcf1_df <- as.data.frame(cbind(getFIX(vcf1), GT = extract.gt(vcf1, as.numeric = FALSE, return.alleles = FALSE)[,1]))
vcf2_df <- as.data.frame(cbind(getFIX(vcf2), GT = extract.gt(vcf2, as.numeric = FALSE, return.alleles = FALSE)[,1]))
cat("VCF to data frame conversion complete. Rows in vcf1_df:", nrow(vcf1_df), ", Rows in vcf2_df:", nrow(vcf2_df), "\n")

# Step 3: Filter vcf2_df to include only standard chromosomes
cat("Filtering for standard chromosomes...\n")
standard_chromosomes <- paste0("chr", c(as.character(1:22), "X", "Y", "M"))
vcf2_df <- vcf2_df %>% filter(CHROM %in% standard_chromosomes)
cat("Standard chromosome filtering complete. Rows remaining in vcf2_df:", nrow(vcf2_df), "\n")

# Step 4: Function to split rows with multiple alternate alleles and create a unique variant identifier
cat("Splitting alternate alleles and creating variant IDs...\n")
split_alternate_alleles <- function(df) {
  df %>%
    separate_rows(ALT, sep = ",") %>%
    mutate(variant_id = paste(CHROM, POS, REF, ALT, sep = "_"))
}
vcf1_df <- split_alternate_alleles(vcf1_df)
vcf2_df <- split_alternate_alleles(vcf2_df)
cat("Alternate alleles split. Rows in vcf1_df:", nrow(vcf1_df), ", Rows in vcf2_df:", nrow(vcf2_df), "\n")

# Step 5: Filter out unwanted genotypes (0/0, ./., or NA)
cat("Filtering out unwanted genotypes (0/0, ./., NA)...\n")
exclude_calls <- c("0/0", "0|0", "./.", NA)
vcf1_df <- vcf1_df %>% filter(!(GT %in% exclude_calls))
vcf2_df <- vcf2_df %>% filter(!(GT %in% exclude_calls))
cat("Genotype filtering complete. Rows in vcf1_df after filter:", nrow(vcf1_df), ", Rows in vcf2_df after filter:", nrow(vcf2_df), "\n")

# Additional Step: Remove matching POS rows in both VCF data frames based on filtered positions
cat("Identifying and removing matching POS rows based on filtered positions...\n")
filtered_positions_vcf1 <- setdiff(vcf1_df$POS, (vcf1_df %>% filter(!(GT %in% exclude_calls)))$POS)
filtered_positions_vcf2 <- setdiff(vcf2_df$POS, (vcf2_df %>% filter(!(GT %in% exclude_calls)))$POS)
positions_to_remove <- union(filtered_positions_vcf1, filtered_positions_vcf2)
vcf1_df <- vcf1_df %>% filter(!(POS %in% positions_to_remove))
vcf2_df <- vcf2_df %>% filter(!(POS %in% positions_to_remove))
cat("POS filtering complete. Rows remaining - vcf1_df:", nrow(vcf1_df), ", vcf2_df:", nrow(vcf2_df), "\n")

# Step 6: Merge the filtered data frames on variant_id
cat("Merging filtered data frames on variant_id...\n")
merged_df <- merge(vcf1_df, vcf2_df, by = "variant_id", suffixes = c("_vcf1", "_vcf2"))
cat("Merge complete. Rows in merged_df:", nrow(merged_df), "\n")

# Step 7: Calculate the percentage of matching variants without considering genotype
cat("Calculating the percentage of matching variants...\n")
total_variants <- nrow(merged_df)
cat("Total number of variants in common (excluding 0/0 and ./.):", total_variants, "\n")
percentage_common_variants <- (total_variants / min(nrow(vcf1_df), nrow(vcf2_df))) * 100
cat("Percentage of matching variants between vcf1 and vcf2 (without considering genotype):", round(percentage_common_variants, 2), "%\n")

# Step 8: Check for any remaining NA values in GT columns after merging
na_gt_vcf1 <- sum(is.na(merged_df$GT_vcf1))
na_gt_vcf2 <- sum(is.na(merged_df$GT_vcf2))
cat("Number of NA values in GT_vcf1 after merging:", na_gt_vcf1, "\n")
cat("Number of NA values in GT_vcf2 after merging:", na_gt_vcf2, "\n")

if (na_gt_vcf1 == 0 && na_gt_vcf2 == 0) {
  # Step 9: Compare genotypes if no NA values remain
  cat("Comparing genotypes for exact matches...\n")
  merged_df <- merged_df %>% mutate(genotype_match = GT_vcf1 == GT_vcf2)
  matching_genotypes <- sum(merged_df$genotype_match)
  percentage_match <- (matching_genotypes / nrow(merged_df)) * 100

  # Output the results
  cat("Number of matching genotypes:", matching_genotypes, "\n")
  cat("Percentage of matching genotypes:", round(percentage_match, 2), "%\n")
} else {
  cat("NA values detected in genotype columns after merging. Please check the GT columns for unexpected values.\n")
}

# Step 10: Find and print unique variants in each VCF
cat("Identifying unique variants in each VCF...\n")
unique_vcf1_variants <- setdiff(vcf1_df$variant_id, vcf2_df$variant_id)
unique_vcf2_variants <- setdiff(vcf2_df$variant_id, vcf1_df$variant_id)
cat("Number of variants unique to vcf1:", length(unique_vcf1_variants), "\n")
cat("Number of variants unique to vcf2:", length(unique_vcf2_variants), "\n")

Output:

Reading VCF files...
Scanning file to determine attributes.
File attributes:
  meta lines: 738
  header_line: 739
  variant count: 216244
  column count: 10
Meta line 738 read in.
All meta lines processed.
gt matrix initialized.
Character matrix gt created.
  Character matrix gt rows: 216244
  Character matrix gt cols: 10
  skip: 0
  nrows: 216244
  row_num: 0
Processed variant: 216244
All variants processed
Scanning file to determine attributes.
File attributes:
  meta lines: 735
  header_line: 736
  variant count: 221663
  column count: 10
Meta line 735 read in.
All meta lines processed.
gt matrix initialized.
Character matrix gt created.
  Character matrix gt rows: 221663
  Character matrix gt cols: 10
  skip: 0
  nrows: 221663
  row_num: 0
Processed variant: 221663
All variants processed
VCF files read successfully.
Extracting fields and converting VCFs to data frames...
VCF to data frame conversion complete. Rows in vcf1_df: 216244 , Rows in vcf2_df: 221663 
Filtering for standard chromosomes...
Standard chromosome filtering complete. Rows remaining in vcf2_df: 216614 
Splitting alternate alleles and creating variant IDs...
Alternate alleles split. Rows in vcf1_df: 217467 , Rows in vcf2_df: 217826 
Filtering out unwanted genotypes (0/0, ./., NA)...
Genotype filtering complete. Rows in vcf1_df after filter: 217467 , Rows in vcf2_df after filter: 217826 
Identifying and removing matching POS rows based on filtered positions...
POS filtering complete. Rows remaining - vcf1_df: 217467 , vcf2_df: 217826 
Merging filtered data frames on variant_id...
Merge complete. Rows in merged_df: 216919 
Calculating the percentage of matching variants...
Total number of variants in common (excluding 0/0 and ./.): 216919 
Percentage of matching variants between vcf1 and vcf2 (without considering genotype): 99.75 %
Number of NA values in GT_vcf1 after merging: 0 
Number of NA values in GT_vcf2 after merging: 0 
Comparing genotypes for exact matches...
Number of matching genotypes: 216876 
Percentage of matching genotypes: 99.98 %
Identifying unique variants in each VCF...
Number of variants unique to vcf1: 548 
Number of variants unique to vcf2: 907

Common Variants:

head(merged_df)

variant_id  CHROM_vcf1  POS_vcf1    ID_vcf1 REF_vcf1    ALT_vcf1    QUAL_vcf1   FILTER_vcf1 GT_vcf1 CHROM_vcf2  POS_vcf2    ID_vcf2 REF_vcf2    ALT_vcf2    QUAL_vcf2   FILTER_vcf2 GT_vcf2 genotype_match
<chr>   <chr>   <chr>   <chr>   <chr>   <chr>   <chr>   <chr>   <chr>   <chr>   <chr>   <chr>   <chr>   <chr>   <chr>   <chr>   <chr>   <lgl>
1   chr10_100024772_A_G chr10   100024772   NA  A   G   37.32   NA  1/1 chr10   100024772   NA  A   G   37.32   NA  1/1 TRUE
2   chr10_1000772_G_A   chr10   1000772 NA  G   A   3085.06 NA  1/1 chr10   1000772 NA  G   A   3086.06 NA  1/1 TRUE
3   chr10_100081396_A_G chr10   100081396   NA  A   G   3687.06 NA  1/1 chr10   100081396   NA  A   G   3680.06 NA  1/1 TRUE
4   chr10_100114749_T_C chr10   100114749   NA  T   C   163.96  NA  1/1 chr10   100114749   NA  T   C   163.96  NA  1/1 TRUE
5   chr10_100220702_A_G chr10   100220702   NA  A   G   540.06  NA  1/1 chr10   100220702   NA  A   G   541.06  NA  1/1 TRUE
6   chr10_100233166_TA_T    chr10   100233166   NA  TA  T   79.60   NA  0/1 chr10   100233166   NA  TA  T   79.60   NA  0/1 TRUE

Unique Variants:

print("Unique variants in vcf1:")

head(unique_vcf1_df)

print("Unique variants in vcf2:")

head(unique_vcf2_df)

[1] "Unique variants in vcf1:"
A tibble: 6 × 9
CHROM   POS ID  REF ALT QUAL    FILTER  GT  variant_id
<chr>   <chr>   <chr>   <chr>   <chr>   <chr>   <chr>   <chr>   <chr>
chr1    14932   NA  G   T   31.64   NA  0/1 chr1_14932_G_T
chr1    1308516 NA  C   T   30.48   NA  1/1 chr1_1308516_C_T
chr1    11878153    NA  G   A   78.32   NA  1/1 chr1_11878153_G_A
chr1    11878154    NA  C   A   78.32   NA  1/1 chr1_11878154_C_A
chr1    15599722    NA  G   A   30.48   NA  1/1 chr1_15599722_G_A
chr1    15990021    NA  T   C   30.48   NA  1/1 chr1_15990021_T_C
[1] "Unique variants in vcf2:"
A tibble: 6 × 9
CHROM   POS ID  REF ALT QUAL    FILTER  GT  variant_id
<chr>   <chr>   <chr>   <chr>   <chr>   <chr>   <chr>   <chr>   <chr>
chr1    36178158    NA  A   C   71.64   NA  0/1 chr1_36178158_A_C
chr1    36178159    NA  G   C   68.64   NA  0/1 chr1_36178159_G_C
chr1    36178164    NA  G   C   46.64   NA  0/1 chr1_36178164_G_C
chr1    44753958    NA  CTTTTT  C   35.44   NA  1/1 chr1_44753958_CTTTTT_C
chr1    54836263    NA  T   G   35.48   NA  1/1 chr1_54836263_T_G
chr1    54836282    NA  CGCCTGCCTGCCTGCCTGCCTGCCT   C   35.44   NA  1/1 chr1_54836282_CGCCTGCCTGCCTGCCTGCCTGCCT_C

ADD REPLY • link 5 months ago by Maverick ▴ 10

1

Entering edit mode

thanks but that is not what I meant. Please show us those variants in the VCF files, i want to see the quality, the FILTER, the DEPTH, etc...

for example chr1-14932 on grch38 has usually a very bad quality https://gnomad.broadinstitute.org/variant/1-14932-G-T?dataset=gnomad_r4

ADD REPLY • link 5 months ago by Pierre Lindenbaum 166k

1

Entering edit mode

furthermore, you'd better use bcftools isec to compare VCF files.

ADD REPLY • link 5 months ago by Pierre Lindenbaum 166k

0

Entering edit mode

I did not compare them after filtration. this was as soon as I called the haplotype caller to verify my pipeline's integrity after splitting by chromosomes.

So these are the calls that are common but had different genotypes:

REF_vcf1    ALT_vcf1    GT_vcf1 variant_id  CHROM_vcf2  POS_vcf2    REF_vcf2    ALT_vcf2    GT_vcf2
<chr>   <dbl>   <chr>   <chr>   <chr>   <chr>   <chr>   <dbl>   <chr>   <chr>   <chr>
chr1    39684254    C   CCCCGGGTGCCTCGCCTCCTCCCAT   0/1 chr1_39684254_C_CCCCGGGTGCCTCGCCTCCTCCCAT   chr1    39684254    C   CCCCGGGTGCCTCGCCTCCTCCCAT   1/1
chr1    215640844   CAA C   1/2 chr1_215640844_CAA_C    chr1    215640844   CAA C   0/1
chr10   1015399 C   CCTGGGGTCCTGAGCGCTGAGCCTGGGAGCGGGG  0/1 chr10_1015399_C_CCTGGGGTCCTGAGCGCTGAGCCTGGGAGCGGGG  chr10   1015399 C   CCTGGGGTCCTGAGCGCTGAGCCTGGGAGCGGGG  1/1
chr10   71730924    G   A   1/1 chr10_71730924_G_A  chr10   71730924    G   A   0/1
chr11   6244585 TACACACACACACACACACAC   T   1/2 chr11_6244585_TACACACACACACACACACAC_T   chr11   6244585 TACACACACACACACACACAC   T   0/1
chr11   24976706    CAAAAAA C   1/2 chr11_24976706_CAAAAAA_C    chr11   24976706    CAAAAAA C   0/1

This is the INFO field comparison for these:

Index   CHROM   POS INFO_vcf1   INFO_vcf2
1   chr1    39684254    AC=1;AF=0.500;AN=2;BaseQRankSum=0.688;DP=23;ExcessHet=0.0000;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=58.91;MQRankSum=0.000;QD=34.80;ReadPosRankSum=1.046;SOR=1.739  AC=2;AF=1.00;AN=2;BaseQRankSum=0.757;DP=24;ExcessHet=0.0000;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=58.95;MQRankSum=0.000;QD=31.46;ReadPosRankSum=1.089;SOR=1.863
2   chr1    215640844   AC=1,1;AF=0.500,0.500;AN=2;BaseQRankSum=2.362;DP=23;ExcessHet=0.0000;FS=0.000;MLEAC=1,1;MLEAF=0.500,0.500;MQ=60.00;MQRankSum=0.000;QD=15.59;ReadPosRankSum=1.922;SOR=1.179  AC=1;AF=0.500;AN=2;BaseQRankSum=1.981;DP=22;ExcessHet=0.0000;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=60.00;MQRankSum=0.000;QD=23.86;ReadPosRankSum=1.981;SOR=0.527
3   chr10   1015399 AC=1;AF=0.500;AN=2;BaseQRankSum=0.319;DP=4;ExcessHet=0.0000;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=44.84;MQRankSum=1.150;QD=28.90;ReadPosRankSum=0.319;SOR=1.609   AC=2;AF=1.00;AN=2;DP=3;ExcessHet=0.0000;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=46.05;QD=33.26;SOR=2.833
4   chr10   71730924    AC=2;AF=1.00;AN=2;DP=2;ExcessHet=0.0000;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=60.00;QD=29.31;SOR=2.303    AC=1;AF=0.500;AN=2;BaseQRankSum=0.000;DP=3;ExcessHet=0.0000;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=60.00;MQRankSum=0.000;QD=23.21;ReadPosRankSum=0.000;SOR=1.179
5   chr11   6244585 AC=1,1;AF=0.500,0.500;AN=2;DP=9;ExcessHet=0.0000;FS=0.000;MLEAC=1,1;MLEAF=0.500,0.500;MQ=60.00;QD=28.29;SOR=4.174   AC=1;AF=0.500;AN=2;DP=7;ExcessHet=0.0000;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=60.00;MQRankSum=0.000;QD=33.27;SOR=2.303
6   chr11   24976706    AC=1,1;AF=0.500,0.500;AN=2;DP=17;ExcessHet=0.0000;FS=0.000;MLEAC=1,1;MLEAF=0.500,0.500;MQ=60.00;QD=23.86;SOR=1.022  AC=1;AF=0.500;AN=2;DP=16;ExcessHet=0.0000;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=60.00;QD=20.11;SOR=0.693

ADD REPLY • link 5 months ago by Maverick ▴ 10

0

Entering edit mode

When I did base recalibration(BQSR) without splitting the chromosomes, but split only at haplotype caller stage, the vcf files are a 100% match. So I can see that base recalibration is not specific to contigs and uses data across chromosomes. Is this right?

ADD REPLY • link 5 months ago by Maverick ▴ 10

0

Entering edit mode

For anyone else with similar issues, I used BQSR to get the recal table for each chromosome but then I did not combine them together and instead used the individual recal table generated for the specific chromosome in the ApplyBQSR step... After combining them into a single recal table using GatherBQSRReports and using the same recal table for ApplyBQSR for each chromosome again, gave me the same results as not splitting by chromosomes.

Cheers

ADD REPLY • link 4 months ago by Maverick ▴ 10