How to clean a VCF genetic data file:
0. Check whether REF and ALT are correct; if not, fix them.
`bcftools norm -t "^24,25,26" -m-any --check-ref s -f hg19.fa Exome_QC.vcf.gz -Ov`
1. Remove chr0 records (`--out` is a prefix, so vcftools writes Exome_QC.clean.recode.vcf)
`vcftools --vcf All_samples_Exome_QC.vcf --not-chr 0 --recode --out Exome_QC.clean`
2. Remove variants at duplicated positions (duplicate markers)
`bcftools norm -d both --threads=32 All_samples_Exome.vcf -Ov -o Exome.norm.vcf`
3. Remove all variants with ALT="-" or REF="-"
`bcftools view -e 'ALT ="-" | REF ="-"' All_samples_Exome.vcf.gz -Ov -o Exome_clean.vcf`
4. Remove duplicate markers according to chr, start, end, ref and alt: see this script
`sh remove_VCF_duplicates.sh All_samples_Exome.vcf.gz > All_samples.undup.vcf`
5. Change "chr1" to "1": see this script
6. Check that REF/ALT agree with the reference genome or the phasing reference panel (Beagle)
7. Install vt and use it to normalize the VCF (recommended by RS)
8. Use MuSiCa to check the mutational profile
9. Use the R package maftools to convert VCF to MAF
10. Remove low-quality genotypes (vcftools sets genotypes with GQ below 90 to missing rather than dropping the whole variant): `vcftools --vcf a.vcf --minGQ 90 --out b --recode`
11. Install the most frequently used genetic analysis tools
12. List, include and remove samples from a VCF: `bcftools query -l input.vcf`
13. Use sciClone to infer the subclonal architecture of tumors [validated on Ubuntu 18.04]
14. Change chromosome names:
rm -f chr_name_conv.txt
for i in {1..22} X Y M; do echo "chr$i $i" >> chr_name_conv.txt; done
bcftools annotate --rename-chrs chr_name_conv.txt Schrodi_IL23_IL17_combined_RECAL_SNP_INDEL_PASS_variants.VA.vcf.gz -Oz -o Schrodi_IL23_IL17_combined_RECAL_SNP_INDEL_PASS_NUM.variants.VA.vcf.gz
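The script referenced in step 4 is not shown here, so the following is only a minimal sketch of the idea, not the original `remove_VCF_duplicates.sh`. It assumes an uncompressed, tab-separated VCF and treats two records as duplicates when CHROM, POS, REF and ALT are identical (the end position is implied by POS and the length of REF). The file names `toy.vcf` and `toy.undup.vcf` are made up for illustration.

```shell
# Build a tiny toy VCF (hypothetical file name) with one exact duplicate record.
printf '##fileformat=VCFv4.2\n' > toy.vcf
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n' >> toy.vcf
printf '1\t100\trs1\tA\tG\t.\tPASS\t.\n' >> toy.vcf
printf '1\t100\trs1_dup\tA\tG\t.\tPASS\t.\n' >> toy.vcf
printf '1\t100\trs2\tA\tT\t.\tPASS\t.\n' >> toy.vcf
printf '2\t200\trs3\tC\tT\t.\tPASS\t.\n' >> toy.vcf

# Pass header lines through unchanged; for data lines keep only the
# first record seen for each CHROM/POS/REF/ALT combination.
awk -F'\t' '/^#/ { print; next } !seen[$1 FS $2 FS $4 FS $5]++' toy.vcf > toy.undup.vcf

# Number of data records left after de-duplication.
grep -vc '^#' toy.undup.vcf
```

For a compressed input, the same awk filter can read from `zcat All_samples_Exome.vcf.gz` instead of a plain file. Unlike `bcftools norm -d`, this sketch does no normalization, so differently represented but equivalent indels are not detected.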
Out of interest, where would chr0 records come from?
In many genome projects, chr0 is used to group contigs that could not (yet) be assigned to a specific chromosome. It is a pseudo-chromosome that collects all the left-over contigs and scaffolds, and so it has no biological meaning.
It should be noted that this is for standard biallelic sites used in most genetic analyses of diploid organisms. In many other cases, especially in the context of gene editing, mosaicism often results in multi-allelic variants, which can also be handled by "bcftools norm".
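To get a feel for how many records are multi-allelic before deciding whether to split them (the `-m-any` option in step 0 is what does the splitting), one could count the records whose ALT column lists more than one allele. This is a rough sketch on a toy file; `multi.vcf` and `multi.count` are hypothetical names.

```shell
# Toy VCF (hypothetical file name): one biallelic and two multi-allelic records.
printf '##fileformat=VCFv4.2\n' > multi.vcf
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n' >> multi.vcf
printf '1\t100\t.\tA\tG\t.\tPASS\t.\n' >> multi.vcf
printf '1\t150\t.\tC\tT,G\t.\tPASS\t.\n' >> multi.vcf
printf '2\t300\t.\tG\tA,C,T\t.\tPASS\t.\n' >> multi.vcf

# Count data records whose ALT column (field 5) contains a comma,
# i.e. more than one alternate allele. "n + 0" prints 0 if none match.
awk -F'\t' '!/^#/ && $5 ~ /,/ { n++ } END { print n + 0 }' multi.vcf > multi.count
cat multi.count
```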
the remaining task includes:
This "vcf cleaning procedure" seems to be specific to your use case. Do you know of anyone else that does this exact procedure that you do?
Thanks for sharing, excellent. The VCF file represents each individual as a column and each position as a row. This format is fine, but I prefer to have my data in the long-and-skinny format rather than the short-and-fat format. Group-by operations are more flexible with long-and-skinny data, and everyone loves group-bys.
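As an illustration of the long-and-skinny layout, here is a small awk sketch that turns each sample column into its own row (one row per variant per sample). It assumes a tab-separated, uncompressed VCF in which GT is the first FORMAT key; `wide.vcf` and `long.tsv` are made-up names.

```shell
# Toy VCF (hypothetical file name) with two samples, S1 and S2.
printf '##fileformat=VCFv4.2\n' > wide.vcf
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\n' >> wide.vcf
printf '1\t100\t.\tA\tG\t.\tPASS\t.\tGT:GQ\t0/1:99\t1/1:80\n' >> wide.vcf
printf '2\t200\t.\tC\tT\t.\tPASS\t.\tGT:GQ\t0/0:95\t0/1:60\n' >> wide.vcf

awk -F'\t' -v OFS='\t' '
  /^##/ { next }                       # skip meta-information lines
  /^#CHROM/ {                          # remember sample names (columns 10 onward)
    for (i = 10; i <= NF; i++) sample[i] = $i
    print "CHROM", "POS", "REF", "ALT", "SAMPLE", "GT"
    next
  }
  {
    for (i = 10; i <= NF; i++) {
      split($i, f, ":")                # assumes GT is the first FORMAT key
      print $1, $2, $4, $5, sample[i], f[1]
    }
  }' wide.vcf > long.tsv

cat long.tsv
```

With bcftools installed, `bcftools query -f '[%CHROM\t%POS\t%REF\t%ALT\t%SAMPLE\t%GT\n]' input.vcf` should give a similar per-sample table without the header row.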