To remove variants that I consider irrelevant, I want to filter out those located in highly repetitive regions such as centromeres and telomeres, as well as other repeat regions of the genome.
Therefore, I'm looking for a database of such repeat regions.
If your data is in genomic coordinates, you could use the UCSC Genome Browser's Table Browser tool to extract repeat element information from the RepeatMasker track.
If you have sequences, you could use RepeatMasker and RepBase to determine which parts of your sequences are repetitive in nature.
UCSC has a simpleRepeat table for tandem repeats; the raw data is here (.txt.gz):
Or through mysql:
mysql --user=genome --host=genome-mysql.cse.ucsc.edu -N -AB \
-e "SELECT chrom, chromStart, chromEnd from simpleRepeat;" hg19 \
> simpleRepeats.bed
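If you go with the raw .txt.gz download instead of mysql, a rough sketch of getting to the same 3-column BED, and then dropping variants that land in it, might look like the following. The function names (dump_to_bed, drop_variants_in_repeats) and file names are made up for illustration. It assumes the dump's first column is UCSC's "bin" index followed by chrom, chromStart and chromEnd, that your VCF uses the same chromosome naming as UCSC ("chr1", not "1"), and it only tests the POS base of each record.

```shell
# Hypothetical helper: turn a UCSC table dump (e.g. simpleRepeat.txt.gz)
# into 3-column BED. Assumes column 1 is the "bin" index, so the
# coordinates live in columns 2-4.
dump_to_bed() {
    gzip -dc "$1" | cut -f2-4
}

# Hypothetical helper: print the VCF lines (headers included) whose POS
# does not fall inside any interval of a non-empty BED file.
# BED is 0-based, half-open and VCF POS is 1-based, so a position p lies
# inside [start, end) when start < p <= end.
# This is a plain linear scan per variant: fine for a small test, slow
# for a whole genome.
drop_variants_in_repeats() {
    bed=$1
    vcf=$2
    awk -F '\t' '
        # First file (BED): store the intervals per chromosome.
        NR == FNR {
            n[$1]++
            s[$1, n[$1]] = $2 + 0
            e[$1, n[$1]] = $3 + 0
            next
        }
        # Second file (VCF): pass header lines through unchanged.
        /^#/ { print; next }
        {
            pos = $2 + 0
            hit = 0
            for (i = 1; i <= n[$1] + 0; i++) {
                if (s[$1, i] < pos && pos <= e[$1, i]) { hit = 1; break }
            }
            if (!hit) print
        }
    ' "$bed" "$vcf"
}
```

Usage would be along the lines of dump_to_bed simpleRepeat.txt.gz > simpleRepeats.bed, then drop_variants_in_repeats simpleRepeats.bed variants.vcf > filtered.vcf. For real data, a dedicated interval tool (for example bedtools intersect with -v) is the more usual choice; the awk version is only meant to show the coordinate logic.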
You could also have a look at the mappability tables. Their description is:
These tracks display the level of sequence uniqueness of the reference
GRCh37/hg19 genome assembly. They were generated using different window
sizes, and high signal will be found in areas where the sequence is unique.
Eric's answer is fine if you wish to use a public source to do the filtering. If you wish to do this in-house, then grab a library of human repeats; the RepBase data would be best for this.
Alastair also provides key points to accomplish this task.
I'm not sure if this is a tangent that should really be a separate thread for discussion, but I noticed that RepBase is having to change its method of support.
Given that I would use RepBase for the command-line version of RepeatMasker, I am not sure how this affects things (for a .bed or .gtf track, I would download the table from UCSC, as recommended in other responses). However, if you want to learn more about the repeat references and annotations, the sequences in RepBase are worth a look.
My answer in this thread should be what you need.