Closest gene to set of intervals
1
0
Entering edit mode
7.4 years ago
rbronste ▴ 420

Hi,

I know this question has been asked before, but have some additional variations:

I have a bed file of regions, want to see closest gene.

  1. Need a file containing one line for each gene instead of what table browser outputs, which is several lines for each refSeq gene. Not entirely sure how to get this.
  2. Want output to be the gene names themselves instead of a list of intervals, I guess a further awk operation to get there?
  3. Want closest gene irrespective of which direction it is from interval.

Any opinions appreciated, thanks!

bedops bedtools • 3.5k views
ADD COMMENT
1
Entering edit mode

Use BEDOPS closest-features --closest --no-ref to get the closest element (ties go upstream) and leave out the reference element (i.e. reporting the nearest element only). Pipe to cut to get the ID from the fourth column:

$ closest-features --closest --no-ref regions.bed refseq.bed | cut -f4 - > nearest-refseq-IDs.txt

If your annotations are in a different format, you can convert directly:

$ closest-features --closest --no-ref regions.bed <(gff2bed < refseq.gff) | cut -f4 - > nearest-refseq-IDs.txt
ADD REPLY
0
Entering edit mode

Very helpful thanks. Seems like I am constantly pulling refSeq files from table browser that have 35K+ lines in the BED, is there an easy way on there to just list every gene once? Seems like I am getting alternative transcript locations. Also does this command restrict the distance at which the element can exist from internal in my regions.bed file, I ran the refSeq file against my 500 intervals but only got one gene. Thanks again!

ADD REPLY
1
Entering edit mode

You could pipe ids to sort | uniq to get a sorted and unique list. You'd lose any associations with coordinates but then you aren't getting those, anyways.

ADD REPLY
0
Entering edit mode

Great, that works well. Wondering if there would be any reason for the regions.bed file to be scanned only across first interval for nearby gene?

ADD REPLY
0
Entering edit mode

I'm sorry, but I don't quite understand. Can you describe what you mean?

ADD REPLY
0
Entering edit mode

So I run the following command:

closest-features --closest --no-ref Atf2_sorted.bed refGene_062817_bedops_sorted.bed | cut -f4 - > nearest-refseq-IDs.txt

The Atf2_sorted.bed file has ~700 intervals however only the first one is considered when comparing to the refGene annotation - and I get an output of the closest gene to that first interval only.

ADD REPLY
0
Entering edit mode

I think its actually the format of my BED regions file, is there a specific type of BED that bedops accepts?

ADD REPLY
1
Entering edit mode

BEDOPS takes input that is a sort-bed-sorted, minimally three-column BED file, in most cases.

Are you working with files from Microsoft tools like Word, Excel or Windows? You'd need to strip any carriage return characters (\r) with tr or dos2unix etc. Having those characters can cause parsing problems.

You could use cat -te <file> to verify that your file is tab-delimited and has proper line endings.

ADD REPLY
0
Entering edit mode

No there are not edited in MsOffice at all, learned my lesson with that a while back. However when I do:

cat -te <file>

I get a jumble that doesn't look organized. This is confirmed with:

wc -l <file>

Which shows me that I have zero lines in my file though I have a bunch there. It may have something to do that this is an export from R and may not be perfectly processed as a BED file. Is there a quick way to take such a file and format it for BED format? Or see what the issue is? It has the right number of columns and looks TAB delimited. Thanks.

ADD REPLY
0
Entering edit mode

On re-reading, if wc -l returns no lines, that's definitely odd. Instead of posting the result of cat -te | head, which will likely just spit out the entire file, can you please post what you're using in R to export your data?

ADD REPLY
0
Entering edit mode

I am using outputting a BED file (or maybe now it seems as though its BED-file like) from DiffBind in the following way:

> report <- dba.report(tamoxifen, th=.05,                  DataType=DBA_DATA_FRAME,method=DBA_EDGER)

> scores <- -10*(log10(report$FDR))

> sites  <- cbind(report[,1:3],rownames(report),scores)

> gains  <- report$Fold > 2
> losses <- report$Fold < -2

​> write.table(sites[gains,],
              "DBsitesGains.bed", quote=FALSE, sep="\t",
              row.names=FALSE, col.names=FALSE)

> write.table(sites[losses,],
              "DBsitesLosses.bed", quote=FALSE, sep="\t",
              row.names=FALSE, col.names=FALSE)
ADD REPLY
0
Entering edit mode

I don't see anything obvious that suggests a problem. If you want to post DBsitesGains.bed or the losses file somewhere I can download them, I can take a look at those directly.

ADD REPLY
0
Entering edit mode

Actually I just figured out its only running the first line of my regions.bed file against the refSeq.bed, and giving me that one closest gene. Strange.

ADD REPLY
0
Entering edit mode

One quick question on this approach, if in my regions.bed there is a column indicating the FDR of these particular called regions, is there a good way to maintain that column in the final nearest-refseq-IDs.txt output here? Thanks!

ADD REPLY
1
Entering edit mode

If you know how many columns are in regions.bed, and that number is consistent throughout the entire regions file, you can take out the --no-ref option and cut two columns.

For example, say your regions.bed contains five columns and the FDR is in the score column (fifth column, per BED specification). The refSeq IDs are still in the fourth column of the annotation file.

Here, you could cut from the output the fifth column (FDR) and the ninth column (5+4=9, refSeq ID):

$ closest-features --closest regions.bed <(gff2bed < refseq.gff) | cut -f5,9 - > nearest-FDRs-with-refseq-IDs.txt

The output nearest-FDRs-with-refseq-IDs.txt will contain the FDR score from regions.bed and the refSeq ID of the refSeq element closest to that region.

You can adjust this approach, depending on your regions.bed file's characteristics and what columns you want.

ADD REPLY
0
Entering edit mode

Thank you! Very helpful!

ADD REPLY
0
Entering edit mode

One other quick question, how can I maintain a header through sort-bed and closest-features - as it gives me an error indicating my first line in closest-features is not a chromosome start even when I have the --header flag on. Thanks!

ADD REPLY
0
Entering edit mode

If you're using --header and it isn't skipping the header, then that might be a bug. Please file an issue on the Github site if that's the case and we'll look into it.

Otherwise, you could strip the header manually and reapply it at the end via standard input/output streams:

$ closest-features --closest <(tail -n+2 regions.bed) <(gff2bed < refseq.gff) | cut -f5,9 - | cat <(head -1 regions.bed) - > nearest-FDRs-with-refseq-IDs.txt

That's pretty ugly, I guess, but it should work. I'm not sure if the header columns would match up with the two-column output in this example, but perhaps tweak this, depending on what columns you want.

ADD REPLY
0
Entering edit mode
7.4 years ago
Rohit ★ 1.5k

I am not sure if you have tried, bedtools closest already does everything (apart from name extraction) what you need. You can use the -D or -d options with -ref (with respect to reference) to check which is the closest with/without direction involved.

ADD COMMENT
0
Entering edit mode

Does this negate the need for a -b file and pulls straight from a reference genome? Or you still need a BED of TSSs or whole genes etc?

ADD REPLY
0
Entering edit mode

You would need a bed file of gene boundaries for this

ADD REPLY

Login before adding your answer.

Traffic: 2512 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6