Hi,
Wondering the most efficient way to remove CTCF sites from a BED file? Thanks.
Rob.
bedtools intersect can probably get the job done. Take your regions.bed file and a separate BED file containing CTCF sites, then use the -v option to output only the regions that do not overlap a CTCF site, like this:
bedtools intersect -v -a regions.bed -b CTCF.bed > regions_noCTCF.bed
By default, any region that overlaps the B file by even a single base pair is removed. You can change that so a minimum amount of overlap is required for removal, for example with the -f option, which sets the minimum overlap as a fraction of each A interval.
With BEDOPS bedops:
$ bedops --not-element-of -1 regions.bed CTCF.bed > regionsWithoutCTCFOverlaps.bed
Using --not-element-of preserves the original intervals in regions.bed and any additional columns they have (ID, score, strand, etc.).
If you actually wanted to carve out the space taken up by CTCF intervals, you could use --difference:
$ bedops --difference regions.bed CTCF.bed > answer.bed
This calculates new intervals, discarding additional columns in regions.bed.
If I understand correctly, the first option leaves me with a file where the CTCF sites are identified in the BED, and the second option totally drops them out of the BED record? Thanks again!
The first option removes any elements that overlap CTCF sites by one or more bases. The second option keeps the elements but removes the part of each one that is occupied by a CTCF site. The cartoons in the BEDOPS docs explain this graphically.
Very helpful info, thank you. In the second case, once the genomic space is removed, does the interval get split into two if the CTCF site isn't on one end or the other? Just trying to see a signature of this in the number of intervals at the end.
Yes, you'd get two or more pieces. It's like painting a wall and pulling away pieces of masking tape from within the middle of the wall, if that analogy is useful.
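To see that signature in the line counts, here is a toy sketch that mimics the two behaviours with plain awk, so it runs without bedtools or BEDOPS installed. The coordinates and the toy_*.bed file names are made up for illustration, and the awk logic is only a stand-in that assumes sorted, tab-separated input; it is not how either tool is implemented.

```shell
# Toy inputs (hypothetical coordinates): three regions, one CTCF site
# sitting in the middle of the second region.
printf 'chr1\t100\t200\nchr1\t300\t400\nchr1\t500\t600\n' > toy_regions.bed
printf 'chr1\t340\t360\n' > toy_CTCF.bed

# Option 1 (like "intersect -v" / "--not-element-of"):
# drop every region that overlaps a CTCF site by at least 1 bp.
awk 'BEGIN { FS = OFS = "\t" }
     NR == FNR { c[NR] = $1; s[NR] = $2 + 0; e[NR] = $3 + 0; n = NR; next }
     { hit = 0
       for (i = 1; i <= n; i++)
         if ($1 == c[i] && $2 + 0 < e[i] && $3 + 0 > s[i]) { hit = 1; break }
       if (!hit) print }' toy_CTCF.bed toy_regions.bed > toy_dropped.bed

# Option 2 (like "--difference"):
# keep the regions but cut out the space covered by CTCF sites.
awk 'BEGIN { FS = OFS = "\t" }
     NR == FNR { c[NR] = $1; s[NR] = $2 + 0; e[NR] = $3 + 0; n = NR; next }
     { start = $2 + 0; stop = $3 + 0
       for (i = 1; i <= n; i++) {
         if ($1 != c[i] || e[i] <= start || s[i] >= stop) continue
         if (s[i] > start) print $1, start, s[i]   # piece left of the site
         start = e[i]                              # resume after the site
       }
       if (start < stop) print $1, start, stop }' toy_CTCF.bed toy_regions.bed > toy_carved.bed

# Compare interval counts: input, option 1, option 2.
wc -l toy_regions.bed toy_dropped.bed toy_carved.bed
```

With these toy inputs, option 1 ends up with fewer intervals than the input (the middle region is gone), while option 2 ends up with more (the middle region is split into a left and a right piece around the site).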
However, an easier tool to use for that would be bedmap:
Then run wc -l on regionsThatEntirelyContainCTCFSites.bed and regions.bed to get counts. This would give an accurate account of relative, full CTCF occupancy.
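To make the idea of "regions that entirely contain a CTCF site" concrete without relying on any particular bedmap options, here is a small awk sketch on made-up data. The toy_occ_*.bed names and coordinates are hypothetical, and the containment test (site start and end both inside the region) is my assumption of what "entirely contain" means here.

```shell
# Toy inputs (hypothetical): one CTCF site lies fully inside the second
# region, another only partially overlaps the third.
printf 'chr1\t100\t200\nchr1\t300\t400\nchr1\t500\t600\n' > toy_occ_regions.bed
printf 'chr1\t340\t360\nchr1\t590\t610\n' > toy_occ_CTCF.bed

# Keep each region that fully contains at least one CTCF site.
awk 'BEGIN { FS = OFS = "\t" }
     NR == FNR { c[NR] = $1; s[NR] = $2 + 0; e[NR] = $3 + 0; n = NR; next }
     { for (i = 1; i <= n; i++)
         if ($1 == c[i] && $2 + 0 <= s[i] && e[i] <= $3 + 0) { print; break } }' \
    toy_occ_CTCF.bed toy_occ_regions.bed > toy_occ_containing.bed

# The ratio of these two counts is the fraction of regions with a fully
# contained CTCF site.
wc -l toy_occ_containing.bed toy_occ_regions.bed
```

Only the second region qualifies in this toy case; the third is excluded because its CTCF site runs past the region's end.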