Hi all,
I'm trying to liftover a publicly available dataset at ENA (PRJEB62420) from CanFam4 (their reference) to CanFam3, to make it compatible with my current data. I need to stick to CanFam3, so this is the only option for me. The data I currenlty have is in plink files format (.bed, .bim, .fam) and I need to integrate both datasets using plink. I've currently tried two different softwares for the liftover: picard liftover and CrossMap, and I've run into different issues with each one. This is my first time doing this, so I'm not sure if I'm doing a silly mistake or it's something more complex.
The files I'm using are:
- The dataset's .vcf.gz downloaded from ENA
- The chain file canFam4ToCanFam3.over.chain.gz from UCSC
- The reference genome canFam3.fa.gz from UCSC
Below you can find the code for both approaches I've tried (Picard and CrossMap), the code I've used and the issues and errors I've run into. If you spot something I might be doing wrong, any advice about how to do this better, or have any other ideas about how to solve the issues, please let me know!
Picard Liftover
When trying to liftover with picard, all the variants get rejected. I'm aware this can be caused due to different chromosome coding, but both dataset's vcf and reference genome have 'chr1' cromosome coding style, so I guess that's not the issue.
The code I'm using is:
picard CreateSequenceDictionary R=CanFam3/canFam3.fa O=CanFam3/canFam3.dict
picard LiftoverVcf I=ENA/Dog10K_phase1_merged.vcf.gz O=picard_CanFam3/Dog10K_phase1_canFam3.vcf CHAIN=CanFam3/canFam4ToCanFam3.over.chain REJECT=picard_CanFam3/rejected_variants.vcf R=CanFam3/canFam3.fa
CrossMap
The issue with crossmap is that it ends the liftover process, but then I'm not able to transform the output into plink format as it raises the error 'ALT allele duplicates REF allele on line X of .vcf file' and it stops the process. I've tried the solution proposed here, but after executing the bcftools + plink code I'm only left with 100 variants (out of ~17M, clearly something has gone wrong). I'm also trying the first solution proposed here, but the vcf is so large it takes an eternity.
The code I'm using is:
CrossMap.py vcf CanFam3/canFam4ToCanFam3.over.chain ENA/Dog10K_phase1_merged.vcf.gz CanFam3/canFam3.fa 10K_CanFam3/Dog10K_phase1_canFam3.vcf.gz --compress
plink --vcf 10K_CanFam3/Dog10K_phase1_canFam3.vcf.gz --dog --allow-extra-chr --make-bed --out 10K_CanFam3/Dog10K_phase1_canFam3
The software versions are:
- CrossMap - v0.6.5 (bioconda::crossmap)
- Picard - v3.1.1 (bioconda::picard)
- Plink - v1.9
I tested what you suggested, and in fact the indels and the format of the VCF file seemed to be the issue. After running bcftools norm and bcftools view --types snps, the software you suggested performed the conversion as Picard/LiftoverVcf and CrossMap did.
Nevertheless, in all the cases only 200-700 variants matched the variants I have on my previous data, which has around 140.000 chromosome positions. This seems really odd, as I expected much more overlapping between the previous dataset (that comes both from sequencing and a commercial array) and this new sequencing experiment.
What do you think about this? I guess I'm only left with starting the pipeline from 0 mapping the raw data with CanFam3 reference
The canFam4 VCF includes 17,050,435 variants across 39 chromosomes. Is your data including 140,000 chromosome positions coming from a DNA microarray? I would assume a fair amount of overlap required. Can you show some examples of variants in your dataset that have no match in the canFam4 VCF?
Yes, the data is coming from a DNA array. I'll give you a sample of chromosomic positions in my canFam3 data that are not available in the canFam4 VCF after the Liftover process:
[1] "chr21_6293694" "chr31_8754785" "chr12_62060676" "chr1_49375431" "chr12_71264803" "chr10_60715901" "chr9_5521942" "chr16_53270558" [9] "chr4_2613846" "chr6_8371135" "chr2_78841858" "chr29_12189334" "chr8_60466605" "chr1_30451365" "chr14_48776060" "chr17_51849657" [17] "chr27_34092675" "chr13_44032386" "chr21_32848739" "chr12_64190787" "chr16_32283718" "chr18_9205983" "chr4_44897379" "chr15_56606959" [25] "chr8_10958226" "chr5_48667616" "chr2_17790407" "chr27_13500486" "chr8_40567049" "chr15_57793235" "chr8_55479099" "chr1_38006868" [33] "chr23_43938043" "chr14_50740815" "chr21_40833334" "chr7_59032445" "chr5_6859691" "chr1_92826004" "chr20_18801744" "chr4_36968310"
The following list are chromosomic positions that match between both datasets:
[1] "chr1_2507972" "chr8_21434928" "chr9_37885371" "chr9_13549079" "chr38_2186793" "chr5_70822672" "chr23_27332882" "chr1_51574252" [9] "chr26_6210595" "chr14_26417137" "chr22_27420987" "chr28_1177318" "chr10_4491780" "chr5_60108228" "chr6_75815688" "chr2_69140959" [17] "chr20_6831290" "chr15_25504183" "chr10_9058093" "chr37_7692993" "chr24_25359880" "chr19_39636332" "chr34_33743377" "chr18_23250690" [25] "chr17_63285636" "chr37_3778941" "chr5_25844897" "chr18_6388052" "chr34_33009451" "chr3_27411558" "chr37_28799873" "chr2_76722130" [33] "chr35_9793796" "chr25_12184561" "chr4_42476019" "chr4_72306783" "chr33_20410081" "chr32_13362092" "chr19_49167341" "chr11_44319151"
I see those coordinates in the
CanineHD_B.csv
Illumina manifest file but those are not canFam3 coordinates. I think they might be for canFam2. That might be the real problem here. You can use BCFtools/gtc2vcf to convert your DNA microarray data to the reference genome of your choice, regardless of what the Illumina manifest file was designed for. That way you might get your DNA microarray data straight for canFam4 and even avoid the liftover step