Question

Liftover - Picard and plink issues when performing liftover from CanFam4 to CanFam3

0

Entering edit mode

11 months ago

ipediez93 ▴ 40

Hi all,

I'm trying to liftover a publicly available dataset at ENA (PRJEB62420) from CanFam4 (their reference) to CanFam3, to make it compatible with my current data. I need to stick to CanFam3, so this is the only option for me. The data I currenlty have is in plink files format (.bed, .bim, .fam) and I need to integrate both datasets using plink. I've currently tried two different softwares for the liftover: picard liftover and CrossMap, and I've run into different issues with each one. This is my first time doing this, so I'm not sure if I'm doing a silly mistake or it's something more complex.

The files I'm using are:

The dataset's .vcf.gz downloaded from ENA
The chain file canFam4ToCanFam3.over.chain.gz from UCSC
The reference genome canFam3.fa.gz from UCSC

Below you can find the code for both approaches I've tried (Picard and CrossMap), the code I've used and the issues and errors I've run into. If you spot something I might be doing wrong, any advice about how to do this better, or have any other ideas about how to solve the issues, please let me know!

Picard Liftover

When trying to liftover with picard, all the variants get rejected. I'm aware this can be caused due to different chromosome coding, but both dataset's vcf and reference genome have 'chr1' cromosome coding style, so I guess that's not the issue.

The code I'm using is:

picard CreateSequenceDictionary R=CanFam3/canFam3.fa O=CanFam3/canFam3.dict

picard LiftoverVcf I=ENA/Dog10K_phase1_merged.vcf.gz O=picard_CanFam3/Dog10K_phase1_canFam3.vcf CHAIN=CanFam3/canFam4ToCanFam3.over.chain REJECT=picard_CanFam3/rejected_variants.vcf R=CanFam3/canFam3.fa

CrossMap

The issue with crossmap is that it ends the liftover process, but then I'm not able to transform the output into plink format as it raises the error 'ALT allele duplicates REF allele on line X of .vcf file' and it stops the process. I've tried the solution proposed here, but after executing the bcftools + plink code I'm only left with 100 variants (out of ~17M, clearly something has gone wrong). I'm also trying the first solution proposed here, but the vcf is so large it takes an eternity.

The code I'm using is:

CrossMap.py vcf CanFam3/canFam4ToCanFam3.over.chain ENA/Dog10K_phase1_merged.vcf.gz CanFam3/canFam3.fa 10K_CanFam3/Dog10K_phase1_canFam3.vcf.gz --compress

plink --vcf 10K_CanFam3/Dog10K_phase1_canFam3.vcf.gz --dog --allow-extra-chr --make-bed --out 10K_CanFam3/Dog10K_phase1_canFam3

The software versions are:

CrossMap - v0.6.5 (bioconda::crossmap)
Picard - v3.1.1 (bioconda::picard)
Plink - v1.9

picard liftover crossmap plink • 1.2k views

ADD COMMENT • link updated 11 months ago by Giulio Genovese ▴ 570 • written 11 months ago by ipediez93 ▴ 40

score 2 · Answer 1 · 2024-01-10

2

Entering edit mode

11 months ago

Giulio Genovese ▴ 570

You can perform this task with BCFtools/liftover from here using:

wget ftp://ftp.sra.ebi.ac.uk/vol1/ERZ183/ERZ18321977/Dog10K_phase1_merged.vcf.gz
wget https://hgdownload.soe.ucsc.edu/goldenPath/canFam4/liftOver/canFam4ToCanFam3.over.chain.gz
wget https://hgdownload.soe.ucsc.edu/goldenPath/canFam3/bigZips/canFam3.fa.gz
wget https://hgdownload.soe.ucsc.edu/goldenPath/canFam4/bigZips/canFam4.fa.gz
gunzip canFam3.fa.gz canFam4.fa.gz
samtools faidx canFam3.fa.gz
samtools faidx canFam4.fa.gz
bcftools +liftover --no-version -Ou Dog10K_phase1_merged.vcf.gz -- \
  -c canFam4ToCanFam3.over.chain.gz -s canFam4.fa -f canFam3.fa | \
bcftools sort -Oz -o Dog10K_phase1_merged.canFam3.vcf.gz --write-index

I was also able to perform the liftover with Picard/LiftoverVcf so your issue might be related to the size of the input VCF. However, your VCF contains a lot of indels and both Picard/LiftoverVcf and CrossMap/VCF do not support indels when reference and alternate alleles are swapped leading to corrupt data. I would advise to avoid these software for these use cases

ADD COMMENT • link 11 months ago by Giulio Genovese ▴ 570

0

Entering edit mode

I tested what you suggested, and in fact the indels and the format of the VCF file seemed to be the issue. After running bcftools norm and bcftools view --types snps, the software you suggested performed the conversion as Picard/LiftoverVcf and CrossMap did.

Nevertheless, in all the cases only 200-700 variants matched the variants I have on my previous data, which has around 140.000 chromosome positions. This seems really odd, as I expected much more overlapping between the previous dataset (that comes both from sequencing and a commercial array) and this new sequencing experiment.

What do you think about this? I guess I'm only left with starting the pipeline from 0 mapping the raw data with CanFam3 reference

ADD REPLY • link 11 months ago by ipediez93 ▴ 40

0

Entering edit mode

The canFam4 VCF includes 17,050,435 variants across 39 chromosomes. Is your data including 140,000 chromosome positions coming from a DNA microarray? I would assume a fair amount of overlap required. Can you show some examples of variants in your dataset that have no match in the canFam4 VCF?

ADD REPLY • link 11 months ago by Giulio Genovese ▴ 570

0

Entering edit mode

Yes, the data is coming from a DNA array. I'll give you a sample of chromosomic positions in my canFam3 data that are not available in the canFam4 VCF after the Liftover process:

[1] "chr21_6293694" "chr31_8754785" "chr12_62060676" "chr1_49375431" "chr12_71264803" "chr10_60715901" "chr9_5521942" "chr16_53270558" [9] "chr4_2613846" "chr6_8371135" "chr2_78841858" "chr29_12189334" "chr8_60466605" "chr1_30451365" "chr14_48776060" "chr17_51849657" [17] "chr27_34092675" "chr13_44032386" "chr21_32848739" "chr12_64190787" "chr16_32283718" "chr18_9205983" "chr4_44897379" "chr15_56606959" [25] "chr8_10958226" "chr5_48667616" "chr2_17790407" "chr27_13500486" "chr8_40567049" "chr15_57793235" "chr8_55479099" "chr1_38006868" [33] "chr23_43938043" "chr14_50740815" "chr21_40833334" "chr7_59032445" "chr5_6859691" "chr1_92826004" "chr20_18801744" "chr4_36968310"

The following list are chromosomic positions that match between both datasets:

[1] "chr1_2507972" "chr8_21434928" "chr9_37885371" "chr9_13549079" "chr38_2186793" "chr5_70822672" "chr23_27332882" "chr1_51574252" [9] "chr26_6210595" "chr14_26417137" "chr22_27420987" "chr28_1177318" "chr10_4491780" "chr5_60108228" "chr6_75815688" "chr2_69140959" [17] "chr20_6831290" "chr15_25504183" "chr10_9058093" "chr37_7692993" "chr24_25359880" "chr19_39636332" "chr34_33743377" "chr18_23250690" [25] "chr17_63285636" "chr37_3778941" "chr5_25844897" "chr18_6388052" "chr34_33009451" "chr3_27411558" "chr37_28799873" "chr2_76722130" [33] "chr35_9793796" "chr25_12184561" "chr4_42476019" "chr4_72306783" "chr33_20410081" "chr32_13362092" "chr19_49167341" "chr11_44319151"

ADD REPLY • link 11 months ago by ipediez93 ▴ 40

0

Entering edit mode

I see those coordinates in the CanineHD_B.csv Illumina manifest file but those are not canFam3 coordinates. I think they might be for canFam2. That might be the real problem here. You can use BCFtools/gtc2vcf to convert your DNA microarray data to the reference genome of your choice, regardless of what the Illumina manifest file was designed for. That way you might get your DNA microarray data straight for canFam4 and even avoid the liftover step

ADD REPLY • link 11 months ago by Giulio Genovese ▴ 570

score 0 · Answer 2 · 2024-01-10

Since you have Plink (.bed/.bim/.fam) and canFam4ToCanFam3.over.chain files, you can convert Plink .bim file into a bed file and use liftOver tool to liftover genome coordinates from canFam4 into canFam3 easily. Please check the approach in this link below- https://genome.sph.umich.edu/wiki/LiftOver