Finding the Refseq Soft-masked Regions
2
0
Entering edit mode
4.6 years ago

I noticed that in IGV there are lower case regions (soft-masked). I want to exclude these in an analysis and was wondering if there was an annotation for the sites that are soft-masked. Like a bed file or GTF with all the specific soft-masked locations.

I could go through and identify the lower case regions manually but before I do that maybe I'm overlooking an available data source?

IGV Refseq • 2.6k views
ADD COMMENT
0
Entering edit mode

I'm looking at human data (hg38) but this is probably not super relevant to the question

ADD REPLY
0
Entering edit mode

A recent past thread about this : Repeat masked gtf files from ensembl

I don't think GTF files with masked locations are available.

ADD REPLY
1
Entering edit mode
4.6 years ago

To solve this I used the script from steve found here: find positions of a short sequence in a genome using the pattern [agct]. The square brackets are standard for python regex and allow for matching to one of any of the characters within the brackets.

The script is used to find the index of all matches from the given short sequence to a fasta file.

The input fasta in my case was Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa from Ensembl

Then I collapsed all the identified regions in R and converted their Ensembl style chromosomes (i.e. "1","MT") to UCSC format (i.e. "chr1","chrM").

Be aware that this requires a lot of memory as I'm loading the location of all the soft-masked sites (~34GB file) into R. The whole method could be simplified and sped up significantly by identifying and converting all matches for each chromosome in parallel (the current method took a little over one hour).

library(rtracklayer)
library(GenomicRanges)

sites <- import.bed('output_from_script')
sites.fix <- reduce(sites)
seqlevels(sites.fix) <- gsub(" " ,"",seqlevels(sites.fix))
seqlevelsStyle(sites.fix) <- 'UCSC'

export(sites.fixed,'soft_masked_sites_hg38.bed')

The final bed file is 135 MB.

ADD COMMENT
4
Entering edit mode
4.6 years ago
vkkodali_ncbi ★ 3.8k

If you are downloading the genome FASTA file from NCBI, the corresponding repeat masker file is also available in the same FTP path that you can use to construct a BED file. For example, you can download the current human genome data from the following FTP path: ftp://ftp.ncbi.nlm.nih.gov/genomes/all/annotation_releases/9606/109.20200228/GCF_000001405.39_GRCh38.p13. The FASTA file is GCF_000001405.39_GRCh38.p13_genomic.fna.gz and the RepeatMasker file GCF_000001405.39_GRCh38.p13_rm.out.gz has the data about the masked regions. Specifically, columns 5-7 in this file are query_sequence, start_position and end_position that you can use to construct a BED file. You may have to offset the start_position by 1 to make the coordinates BED-compatible.

ADD COMMENT

Login before adding your answer.

Traffic: 1612 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6