We have a list of a few thousand CNVs/SVs identified using array-based CNV calling methods. They are from five individuals sequenced by the 1000 Genomes Project. We would like to compare the breakpoints of the CNVs identified by our calling with those identified by 1000 Genomes.
So the question: how can I generate a VCF file (presumably using tabix and vcftools) for a selected set of individuals, restricted to variants overlapping a list of specified CNVs like the ones below? Moreover, can I do this for thousands of regions in a single step? Should the regions I query be smaller or larger than the CNVs we identified? And can we filter the results to keep only those above a threshold size?
Note: it appears BrentP has an answer about how to get multiple regions at once from 1kG here, but if I am pulling all of these from the 1kG FTP server, must I do it chromosome by chromosome? And how would pulling regions handle imperfect overlaps, as mentioned above?
Individuals:
NA10851
NA18505
Regions:
chr22:22680529-22726814
chr22:22613016-22670785
chr22:41234550-41276824
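To make the question concrete, here is a rough sketch of how I imagine turning the list above into a single tab-delimited regions file (CHROM, start, end; 1-based, inclusive), which as I understand it is the kind of file `tabix -R` accepts. The function names, the `pad` value, and the output filename are all hypothetical, and I am assuming the 1kG VCFs name the chromosome `22` rather than `chr22`.

```python
# Sketch only: parse_region, region_lines, pad and the filename are illustrative names.

def parse_region(region, pad=0, strip_chr=True):
    """Turn 'chr22:100-200' into (chrom, start, end), widened by `pad` bp per side."""
    chrom, span = region.split(":")
    start, end = (int(x) for x in span.split("-"))
    if strip_chr and chrom.startswith("chr"):
        # Assumption: the target VCF uses '22', not 'chr22'.
        chrom = chrom[3:]
    # Clamp at 1 so padding never produces a non-positive coordinate.
    return chrom, max(1, start - pad), end + pad


def region_lines(regions, pad=0):
    """Return sorted, tab-delimited 'chrom<TAB>start<TAB>end' lines."""
    parsed = sorted(parse_region(r, pad) for r in regions)
    return ["%s\t%d\t%d" % p for p in parsed]


regions = [
    "chr22:22680529-22726814",
    "chr22:22613016-22670785",
    "chr22:41234550-41276824",
]

if __name__ == "__main__":
    # Pad each CNV by 1 kb per side to allow for imprecise array breakpoints.
    with open("cnv_regions.tsv", "w") as fh:
        fh.write("\n".join(region_lines(regions, pad=1000)) + "\n")
```

Padding is the knob I am unsure about: a larger `pad` would catch calls whose breakpoints differ from ours, at the cost of pulling in unrelated neighbouring variants.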
Thanks, Ryan
A little tip: you can get better answers if you ask only one question per topic. Otherwise, it's difficult for people to reply, and you will only get general answers.