Question

difference in genotypes in a multisample vcf file

0

Entering edit mode

7.8 years ago

prasundutta87 ▴ 670

Hi,

I have two individuals of the same animal (one male and one female water buffalo). I have RNAseq data from both the animals and I used mpileup and bcftools 'call' programs for variant calling. My aim is to find out allelic imbalance in a set of genes of interest.

Let me take an example of one variant in the resuted VCF file. For a particular variant ,one individual is heterozygous and the other is homozygous.

Since, allelic imbalance can only be found out through a heterozygous SNP, should I go ahead with the variants that are heterozygous in both the individuals? Or should I go ahead with the variants that are atleast heterozygous in at least one individual.

Thanks. PS- I am new to NGS and doing variant analysis for the first time

SNP next-gen RNA-Seq • 2.5k views

ADD COMMENT • link 7.8 years ago by prasundutta87 ▴ 670

0

Entering edit mode

7.8 years ago

prasundutta87 ▴ 670

Thanks for your reply Petr. Just to add, I did call the SNPs using both individual's data (alignment file) simultaneously. Also, both of my alignment files are sorted and indexed before I used them for calling variants. The resulted VCF file is a multi-sample VCF file. Hence, I have genotype data side by side for both the individuals and that's where my question arose. I am aware that one individual can be homozygous and another heterozygous for one variant, which is due to diversity. I used this command for calling variants:

samtools mpileup -f reference.fa -g individual_1.bam .indivudual_2.bam|bcftools call -mv > Final_result.vcf

For filtering, I used default varFilter default parameters first, followed by variant annotation using SnpEff and then further filtering of variants using SnpSift using the following parameters: QUAL > 20, MQ >= 30, DP >= 10 (parameters range selected from various published literatures). I also filtered out the variants that was null in either of the individuals.

If you could kindly comment on the procedure, I can go ahead with selecting one variant that is heterozygous in at least one individual.

ADD COMMENT • link 7.8 years ago by prasundutta87 ▴ 670

0

Entering edit mode

This is a good starting point. You might want to change tools and filter parameters later to increase sensitivity

ADD REPLY • link 7.8 years ago by Petr Ponomarenko ★ 2.8k

0

Entering edit mode

Thanks a lot..am trying out GATK as well although the tools take too much time to run.

ADD REPLY • link 7.8 years ago by prasundutta87 ▴ 670

0

Entering edit mode

Please use ADD COMMENT to reply to earlier answers, as such this thread remains logically structured and easy to follow.

ADD REPLY • link 7.8 years ago by WouterDeCoster 47k

score 2 · Accepted Answer · 2017-02-14

At least one heterozygous. It will be easier if you merge vcf files into a singlemultisample file with vcftools.

vcf-merge A.vcf.gz B.vcf.gz | bgzip -c > out.vcf.gz

With such single vcf you are interested in all lines that are different in genotype GT field and maybe you want to filter by quality of the call criteria.

Be careful with variant calling and comparison with sex chromosomes. Keep in mind that vcf usually has only mutation data unless you prepared data to go otherwise. So you most likely do not have homozygous reference calls there, but that does not mean a call is homozygous reference, because you may have very low coverage. If you have alignment files than you can use

samtools bedcov regions.bed aln.sorted.bam

Where regions.bed is regions to calculate coverage and aln.sorted.bam is your sorted alignment file.

In case it is not sorted you can sort and index it like this

samtools sort -T tmp.aln -o aln.sorted.bam aln.bam
samtoold index aln.sorted.bam

Another option if you have alignment files is to do variant calling on multiple files simultaneously. That way you will generate multisample vcf file and will have coverage data for potentially homozygous reference and low coverage positions in it already.