Entering edit mode
6.8 years ago
Sharon
▴
610
Hi all
I understand that snp array is better for proving relatedness, but we have 2 vcfs generated from exom data, and we need to show if there is any relatedness between them.
I tried these two commands from vcftools and both get the same output:
vcftools --vcf p1p2.vcf --relatedness
vcftools --vcf p1p2.vcf --relatedness2
The output is:
INDV1 INDV2 RELATEDNESS_AJK
P1 P1 -nan
P1 P2 -nan
P2 P2 -nan
Any hint why I am getting nans?
Thanks
Can you paste the header of your VCF(s) and also some of the variants?
Thanks Kevin, here it is:
Hi Sharon! Okay, let me take a look in a while. I can already see potential problems.
Sure, thanks Kevin. Take your time. Much appreciated. Thanks so much.
Could you try the following steps:
gzip MyVars.vcf
tabix -p vcf MyVars.vcf.gz
bcftools norm -m-any -Oz MyVars.vcf.gz > MyVars.norm.vcf.gz
Then, retry the relatedness part.
Thanks Kevin, I tried this now:
Just gzip doesn't work and I had to change to --gzvcf, I got this:
Oh, some amount of progress... and, apologies, yes, we use bgzip and not gzip for this.
Can you take a look at the site 1:17086003 just to see what's there? You may have to manually remove it but, better to check the variant call to see what exactly is happening.
You could do something quick like
bcftools view norm.vcf.gz | grep -e "17086003"
Thanks Kevin so much. I keep getting positions that the error says are ployploidy that looks like this:
I keep removing and I keep getting more, they seem too much. Should I continue doing this manully?
I am also not sure why they show up, this is human? How do you think? Thanks
How did you call these variants? To me, this looks like issues with the variant caller (or aligner in aligning reads to a homopolymeric region).
If you want a quick fix just for the relatedness command, then use bcftools in order to only retain a maximum of 2 alleles (ref / alt) at each site: how to remove multiallelic from VCF (I assume that it will just retain the first listed in the VCF).
I did not do the call, they are old files in the lab, generated 5 years ago or so, and professor wanted to check relatedness lately. So, I will try that step of removing multiallelic from VCF.
Thanks Kevin so much, very much appreciated.
I removed the multiallelic from vcf, I am back now to the original issue :(
That's strange because it kept all 40993 sites in your file this time. Where are your samples from Do you expect relatedness?
I ran this on monozygotic twins and other brohter-pairs in the past, by the way, and the metric does work. There could be something quirky about your VCF..
If you want to paste some of it again, I can check, or upload it somewhere online so that I can try.
Thanks Kevin. Yes, it works with me too on other data we have. I am also surprised it is like this here. I posted my question just in case there is an issue I did not see. But yes, it works with other stuff I have.
It should be a father/son relationship. But for some other evidence, we claim that there is no relatedness between them. So it is weird/risky question too ! Here is a sample:
Oh!, there you go (I think...)! There are no properly encoded genotypes? I just see a '1' and not 1/1, 0/1, etc. What happened there?
Oh, yes, THANK YOU KEVIN, did not notice. If this is a sample of the original vcf before merging or anything, then this vcf has to be regenerated, right? I still see "1" instead of "1/0" or "1/1" in some cases. Thanks ALOT
Yes, where did you generate it? Custom annotations are permitted in the VCF format but not for the GT, I think.
I did not generate the original vcfs, I was given them. i just merged them. I will ask for the fastq if possible to regenerate and double check this issue.
Thanks so much Kevin. VERY MUCH APPRECIATED.