Tutorial:How to do data cleaning for VCF genetic file
1
11
Entering edit mode
5.4 years ago
Shicheng Guo ★ 9.5k

How to do data cleaning for VCF genetic file:

0. check REF and ALT is correct or not, if not correct, revise them.

`bcftools norm -t "^24,25,26" -m-any --check-ref s -f hg19.fa Exome_QC.vcf.gz -Ov`

1. remove chr0 records

`vcftools --vcf All_samples_Exome_QC.vcf --not-chr 0 --recode --out Exome_QC.clean.vcf`

2. remove duplicated location variants (Duplicate marker)

`bcftools norm -d both --threads=32 All_samples_Exome.vcf -Ov  -o Exome.norm.vcf`

3. remove all the variants whose ALT="-" or REF="-"

`bcftools view -e 'ALT ="-" | REF ="-"' All_samples_Exome.vcf.gz  -Ov -o Exome_clean.vcf`

4. How to remove duplicate markers according to chr, start, end, ref and alt: check this script

sh remove_VCF_duplicates.sh All_samples_Exome.vcf.gz \> All_samples.undup.vcf

5. How to change "chr1" to "1". check this script

6. check REF/ALT same with Reference Genome or Phase Reference (beagle)

7. Install vt and try to use vt to normalize vcf recommended by RS

8. Apply MuSiCa to check mutation profile

9. Apply R package maftools to convert VCF to MAF

10. Remove variants with low quality : vcftools --vcf a.vcf --minGQ 90 --out b --recode

11. install most frequent used genetic analysis tools

12. list, include and remove samples from VCF bcftools query -l input.vcf

13. sciclone for inferring the subclonal architecture of tumors [validated in Ubuntu 18.04]

14. change chromosome name:

rm chr_name_conv.txt
for i in {1..22} X Y M; do echo "chr$i $i" >> chr_name_conv.txt; done
bcftools annotate --rename-chrs chr_name_conv.txt Schrodi_IL23_IL17_combined_RECAL_SNP_INDEL_PASS_variants.VA.vcf.gz -Oz -o Schrodi_IL23_IL17_combined_RECAL_SNP_INDEL_PASS_NUM.variants.VA.vcf.gz
vcf • 7.2k views
ADD COMMENT
2
Entering edit mode

Out of interest, where would chr0 records come from?

ADD REPLY
3
Entering edit mode

In many genome projects, chr0 is used to 'group' contigs that could not be assigned (yet) to a specific chromosome. So it's a pseudo-chromosome to collect all the left-over contigs and scaffolds. (which thus has no biological meaning of course)

ADD REPLY
2
Entering edit mode

It should be noted that this is for standard bialletic sites used in most genetic analysis of diploid organisms. In a lot of other cases, especially in the context of gene editing, mosaicism often results in multi-allelic variants, which could be handled by "bcftools norm", too.

ADD REPLY
1
Entering edit mode

the remaining task includes:

Statistics: 
Alternative allele frequency > 0.5 sites: 387,454
Reference Overlap: 93.55% 
Match: 2,010,797
Allele switch: 0
Strand flip: 0
Strand flip and allele switch: 0
A/T, C/G genotypes: 0
Filtered sites: 
Filter flag set: 0
Invalid alleles: 377,897
Duplicated sites: 358
NonSNP sites: 0
Monomorphic sites: 11,421
Allele mismatch: 6,377
SNPs call rate < 90%: 108,308
ADD REPLY
1
Entering edit mode

This "vcf cleaning procedure" seems to be specific to your use case. Do you know of anyone else that does this exact procedure that you do?

ADD REPLY
0
Entering edit mode

thanks for your share...excellent...he VCF file represents each individual as a column and each position as a row. This format is fine, but I prefer to have my data in the long-and-skinny format, rather than the short-and-fat format. Group-by operations are more flexible with long-and-skinny data, and everyone loves group-bys.

ADD REPLY
7
Entering edit mode
5.4 years ago
Ram 44k

With respect to #3, do not use awk to manipulate VCF files. There is a possibility that it might work in an unexpected manner in one of the header lines, and bcftools has been extensively tested and is a lot more VCF-format-specific than awk.

Also, your #5 is the bcftools version of #3, so #3 need not exist.

The bash script is not needed for duplicate removal, when bcftools norm exists. It uses a loop hard-coded with contig names, so it has multiple pitfalls.

Do not use sh to run a shell script unless you're 100% sure the script is POSIX compliant.

Your script to change chrX to X runs through the file twice, producing an unnecessary intermediate file, which bcftools annotate --rename-chrs can do much more reliably and reproducibly.

Your script to change X to chrX won't work for any contig that does not map as str -> "chr"+str exactly. Which means, it will fail on any VCF with additional contigs.

ADD COMMENT
0
Entering edit mode

Awesome! Thanks RamRS!

ADD REPLY
0
Entering edit mode

Hi RamRS, do you have more comprehensive vcf file cleaning procedures? Thanks.

ADD REPLY
4
Entering edit mode

Step-1: bcftools norm <split-multiallelic> <normalize-and-left_align-with-ref>

or

Better Step-1: vt decompose -s in.vcf | vt normalize - >out.vcf

Step-2: Annotate with VEP

Everything else is based on what you need downstream

ADD REPLY
1
Entering edit mode

Step 1 like this way?

bcftools norm -m-any --check-ref s -f hg19.fa Exome_QC.vcf.gz -Ov

Step 2, VEP is okay. I prefer the latest version of ANNOVAR which I found have a better annotation to VEP.

table_annovar.pl -vcfinput input.vcf ~/humandb/ --thread 32 -buildver hg19 -out output -remove -protocol refGene,dbnsfp33a -operation gx,f -nastring . -otherinfo -polish -xref ~/humandb/gene_fullxref.txt
ADD REPLY
2
Entering edit mode

Yes, step-1 is accurate. vt norm is better because it retains an INFO attribute on modified variants.

I used to prefer ANNOVAR too, but VEP is more accurate, IMO. Plus, annovar's flags confuse me.

ADD REPLY
0
Entering edit mode

What's vt norm ?

ADD REPLY
1
Entering edit mode
ADD REPLY
0
Entering edit mode

I see. they should give it a better name. xxtools is more popular.

ADD REPLY

Login before adding your answer.

Traffic: 2460 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6