A recent paper by Heng Li discusses two major filters that can be used to remove artifacts when calling variants. They are:
- Filtering by excessive depth (currently being accomplished using vcfutils -D <large depth>, ~1000)
- Filtering out low-complexity regions (LCRs). He identifies these with mdust and subtracts the regions using a BED file.
I am trying to apply these filters to sequencing data from C. elegans and am currently stuck on the second filter. Heng Li provides an LCR file for the human genome on GitHub [here][2], "LCR-hs38.bed.gz".
I need to find or generate a file like this for C. elegans. Once this BED file is produced, it can be used to subtract LCR variants from a VCF or GFF using bedtools subtract.
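To make the subtraction step concrete, here is a minimal sketch of how it might look once a C. elegans LCR BED file exists. The function name and the file names are hypothetical; `-header` and `-A` are standard `bedtools subtract` options, but whether dropping whole records with `-A` is what you want depends on your variant types.

```shell
# Remove variants that fall inside low-complexity regions.
# $1: VCF (or GFF) of variant calls; $2: BED file of LCR intervals.
# Both arguments are placeholders chosen for this sketch.
subtract_lcr() {
    local variants="$1" lcr_bed="$2"
    # -header keeps the header lines of the -a file; -A drops a record
    # entirely if any part of it overlaps an LCR interval.
    bedtools subtract -header -A -a "$variants" -b "$lcr_bed"
}

# Example (hypothetical file names):
# subtract_lcr calls.vcf LCR_ce10.bed > calls.noLCR.vcf
```

Without `-A`, bedtools trims only the overlapping portion of each feature, which matters little for single-base SNPs but can matter for longer GFF features.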
As a starting point - a masked version of the C. elegans ce10 genome exists at UCSC in fasta format: http://hgdownload.cse.ucsc.edu/goldenPath/ce10/bigZips/chromFaMasked.tar.gz
UCSC uses RepeatMasker to do the masking.
EDIT
For those curious - here is how I generated the necessary file:
wget 'http://hgdownload.soe.ucsc.edu/goldenPath/ce10/database/rmsk.txt.gz' -O LCR_rmsk.txt.gz
gunzip -kfc LCR_rmsk.txt.gz | cut -f 6,7,8 > LCR_rmsk.txt
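Note that the commands above keep every RepeatMasker annotation, not only low-complexity ones. If you want something closer to an LCR-only file, one possible variation is sketched below. It assumes the standard UCSC rmsk table layout (genoName, genoStart, genoEnd in columns 6-8 and repClass in column 12) and that the relevant classes are named `Low_complexity` and `Simple_repeat`; the function name and output file name are made up for illustration, and this is not necessarily equivalent to running mdust.

```shell
# Read a UCSC rmsk table on stdin and print only low-complexity and
# simple-repeat records as sorted 3-column BED (chrom, start, end).
# Assumption: repClass is column 12 and coordinates are columns 6-8.
rmsk_to_lcr_bed() {
    awk -F'\t' -v OFS='\t' \
        '$12 == "Low_complexity" || $12 == "Simple_repeat" { print $6, $7, $8 }' |
        LC_ALL=C sort -k1,1 -k2,2n
}

# Example (hypothetical output name):
# gunzip -c LCR_rmsk.txt.gz | rmsk_to_lcr_bed > LCR_only.bed
```

It is worth checking which repClass values actually occur in the ce10 table (for example with `cut -f 12 | sort | uniq -c`) before relying on the two names used here.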
Thanks Vivek!
Thanks, I used the following code to take care of this!