I'm looking for rare variants in whole-genome sequencing data. I found a "rare" SNP in my patient sample that has never been reported in any database, including the latest 1000 Genomes and exome sequencing databases. However, when I checked this position in 4 randomly chosen control whole genomes from 1000G, it turned out to lie within a GC-rich region that is barely covered by any reads (in my data, the sequencer got through this GC-rich region and gave good coverage).
So I'm not sure whether the SNP I found is really rare, or is a common one that NGS missed in 1000G because PCR simply cannot amplify through the GC-rich region.
But 1000G has a huge number of samples and calls SNPs/indels from all of them simultaneously; it should be almost impossible for a given region not to be covered by any read, right?
So should I trust the 1000 Genomes SNP/indel database for such GC-rich regions?
Due to various filtering steps, 1000G will miss a small fraction of common SNPs, which can hardly be avoided. Checking the unfiltered SNP calls is a better way to confirm whether yours is really rare. I do not know if the unfiltered calls are still available.
Don't trust the indels. 1000G still has a lot of trouble with them. They are trying hard to improve indel calling.
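One rough way to judge whether absence from a database means anything at this site is to look at how GC-rich the surrounding window is and how many samples actually have usable depth there. The sketch below is only an illustration: the function names, the `min_depth` cutoff of 8, the window sequence, and the depth values are all made up, and in practice the per-sample depths would come from the BAM files (for example via a tool such as `samtools depth`).

```python
from typing import Dict, List


def gc_fraction(seq: str) -> float:
    """Fraction of G/C among the unambiguous A/C/G/T bases in seq."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0  # nothing but N or other ambiguity codes
    return (seq.count("G") + seq.count("C")) / acgt


def callable_samples(depth_by_sample: Dict[str, int], min_depth: int = 8) -> List[str]:
    """Names of samples whose depth at the site is at least min_depth.

    min_depth is a hypothetical cutoff, not a 1000G setting.
    """
    return sorted(s for s, d in depth_by_sample.items() if d >= min_depth)


# Made-up values for illustration only.
window = "GCGGCCGCGGGCACGCCGCG"  # reference bases around the SNP
depths = {"patient": 31, "ctrl1": 0, "ctrl2": 1, "ctrl3": 0, "ctrl4": 2}

ok = callable_samples(depths)
print(f"GC fraction of window: {gc_fraction(window):.2f}")
print(f"{len(ok)} of {len(depths)} samples have enough depth: {ok}")
```

If only a handful of the controls reach the cutoff, the fact that the SNP is missing from the call set says little about its real frequency.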
Which sequencer did you use for your data?
The sequencer is a HiSeq 2000.