I have a VCF file that includes variant and invariant sites for every locus. Is there an easy way to obtain the length of each of these loci?
Check the VCF header. I have lots of .vcf files that list the length of each contig there, in the ##contig lines.
You're saying there are more contigs in the VCF header than in your CHROM column? Then you can trust the VCF header and just use the subset you're interested in. Either way, going back to an earlier stage is less error-prone: look at the reference genome itself, or at the SAM header, for the contig lengths. The VCF probably copied the data from there.
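If the header does carry ##contig lines, pulling the lengths out can be sketched roughly like this. It is a minimal sketch, assuming the header uses the usual ##contig=<ID=...,length=...> form; the function name and the sample header lines are made up for illustration.

```python
import re


def contig_lengths_from_header(lines):
    """Return {contig_id: length} parsed from '##contig=<...>' header lines."""
    lengths = {}
    for line in lines:
        if not line.startswith("#"):
            break  # the header ends where the records begin
        if line.startswith("##contig="):
            id_match = re.search(r"[<,]ID=([^,>]+)", line)
            len_match = re.search(r"[<,]length=(\d+)", line)
            # Skip contig lines that do not declare a length.
            if id_match and len_match:
                lengths[id_match.group(1)] = int(len_match.group(1))
    return lengths


# Hypothetical header, for illustration only.
header = [
    "##fileformat=VCFv4.2",
    "##contig=<ID=locus_1,length=150>",
    "##contig=<ID=locus_2,length=98>",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
]
contig_lengths = contig_lengths_from_header(header)
print(contig_lengths)
```

For a real file you would pass an open file handle instead of the list (for a .vcf.gz, one opened with gzip.open in text mode), then keep only the contigs that actually show up in the CHROM column.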
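Every invariant site? If that's the case, just use this logic (pseudo-SQL; "sites" is a placeholder for a table holding the CHROM and POS columns, and the + 1 counts both the first and the last position):
SELECT chr, MAX(pos) - MIN(pos) + 1 AS length FROM sites GROUP BY chr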
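The same group-by logic can be run straight on the VCF records without a database. This is only a sketch, assuming an uncompressed, tab-delimited VCF in which every site of each locus is present; the function name and sample records are invented for illustration. It adds 1 to max minus min so that both endpoints are counted.

```python
def locus_spans(lines):
    """Return {chrom: max(POS) - min(POS) + 1} over all records."""
    bounds = {}
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        chrom, pos = line.split("\t", 2)[:2]
        pos = int(pos)
        lo, hi = bounds.get(chrom, (pos, pos))
        bounds[chrom] = (min(lo, pos), max(hi, pos))
    # +1 so that a locus covering positions 1..3 has length 3, not 2.
    return {chrom: hi - lo + 1 for chrom, (lo, hi) in bounds.items()}


# Hypothetical records, for illustration only.
records = [
    "#CHROM\tPOS\tID\tREF\tALT",
    "locus_1\t1\t.\tA\t.",
    "locus_1\t2\t.\tC\tT",
    "locus_1\t3\t.\tG\t.",
    "locus_2\t5\t.\tT\t.",
    "locus_2\t9\t.\tA\tG",
]
spans = locus_spans(records)
print(spans)
```

Note that this measures the span covered by the records, not the contig length: if sites at the start or end of a locus were filtered out, the result will be shorter than what the header or the reference reports.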