I'm trying to use the 1000 Genomes Project integrated SV map ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/integrated_sv_map/ALL.wgs.integrated_sv_map_v2.20130502.svs.genotypes.vcf.gz to filter out common SVs, in order to look for rare/novel SVs in disease samples.
Questions:
I guess the integrated map is NOT really comprehensive, correct? It contains only the high-confidence, validated SVs, while ambiguous ones were excluded. But SV calls from real-world data usually contain many such "ambiguous" calls (lots of false positives caused by misalignment in repetitive sequence, etc., or by systematic errors/bias in the caller itself). So if we use this "integrated" map as the "gold standard" for filtering, we'll end up retaining many "ambiguous" false positives.
For those analyzing tumor samples, you naturally have matched controls. But I'm working on a complex disease, so one solution I can think of is to run many CONTROL samples (for example, CEU controls from the 1000 Genomes Project) through the same pipeline and remove whatever is seen in the CONTROLs, which hopefully removes many of the "ambiguous" calls.
What other approaches could I try?
I randomly picked several SV calls that show up as common deletions in my CONTROLs but are, interestingly, absent from the integrated SV map. To my surprise, they are all in non-conserved LINEs, as pictured below:
The deletion absent from the 1000 Genomes Project integrated map is the gap in the middle. I'm wondering why it is missing?
Thanks
Just a comment on "run[ning] many CONTROL samples": my company works on cancer, but we lack matched normal tissue. Therefore, we use exactly your approach: we remove variants that occur frequently in genome resequencing projects.
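For what it's worth, here is a minimal sketch of that kind of control-based filter. It is not an actual production pipeline: the `SV` record, the function names, the 50% reciprocal-overlap rule, and the 1% control-frequency cutoff are all assumptions chosen for illustration, and the toy coordinates are made up.

```python
from collections import namedtuple

# Hypothetical minimal SV record: chromosome, start, end, type (e.g. "DEL"), sample ID.
SV = namedtuple("SV", ["chrom", "start", "end", "svtype", "sample"])


def reciprocal_overlap(a, b):
    """Return the smaller of the two overlap fractions (0.0 if there is no overlap)."""
    if a.chrom != b.chrom or a.svtype != b.svtype:
        return 0.0
    overlap = min(a.end, b.end) - max(a.start, b.start)
    if overlap <= 0:
        return 0.0
    return min(overlap / (a.end - a.start), overlap / (b.end - b.start))


def filter_against_controls(case_calls, control_calls, n_controls,
                            min_overlap=0.5, max_control_freq=0.01):
    """Keep case calls that match calls in at most max_control_freq of the controls.

    Both thresholds are illustrative assumptions, not recommended values.
    """
    kept = []
    for call in case_calls:
        # Count distinct control samples carrying a matching call.
        carriers = {c.sample for c in control_calls
                    if reciprocal_overlap(call, c) >= min_overlap}
        if len(carriers) / n_controls <= max_control_freq:
            kept.append(call)
    return kept


# Toy data (made-up coordinates): the first case deletion is seen in two of
# ten controls, the second has no match on the same chromosome.
case = [SV("1", 1000, 2000, "DEL", "case1"),
        SV("1", 50000, 52000, "DEL", "case1")]
controls = [SV("1", 1100, 2100, "DEL", "ctrl1"),
            SV("1", 1050, 1950, "DEL", "ctrl2"),
            SV("2", 50000, 52000, "DEL", "ctrl1")]

rare = filter_against_controls(case, controls, n_controls=10)
print(rare)
```

The point of calling the controls with the same caller is that caller-specific artifacts show up in both sets and get removed, which a curated map cannot do. This loop compares every case call against every control call, so for real cohorts you would likely want an interval index or a dedicated tool instead.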
The 1000 Genomes Project (1kg) provides some of the calls that didn't make the final cut in its working directories. I'd recommend downloading those.