Entering edit mode
7.4 years ago
rbronste
▴
420
Hi,
I know this question has been asked before, but have some additional variations:
I have a bed file of regions, want to see closest gene.
- Need a file containing one line for each gene instead of what table browser outputs, which is several lines for each refSeq gene. Not entirely sure how to get this.
- Want output to be the gene names themselves instead of a list of intervals, I guess a further awk operation to get there?
- Want closest gene irrespective of which direction it is from interval.
Any opinions appreciated, thanks!
Use BEDOPS
closest-features --closest --no-ref
to get the closest element (ties go upstream) and leave out the reference element (i.e. reporting the nearest element only). Pipe tocut
to get the ID from the fourth column:If your annotations are in a different format, you can convert directly:
Very helpful thanks. Seems like I am constantly pulling refSeq files from table browser that have 35K+ lines in the BED, is there an easy way on there to just list every gene once? Seems like I am getting alternative transcript locations. Also does this command restrict the distance at which the element can exist from internal in my regions.bed file, I ran the refSeq file against my 500 intervals but only got one gene. Thanks again!
You could pipe ids to
sort | uniq
to get a sorted and unique list. You'd lose any associations with coordinates but then you aren't getting those, anyways.Great, that works well. Wondering if there would be any reason for the regions.bed file to be scanned only across first interval for nearby gene?
I'm sorry, but I don't quite understand. Can you describe what you mean?
So I run the following command:
The Atf2_sorted.bed file has ~700 intervals however only the first one is considered when comparing to the refGene annotation - and I get an output of the closest gene to that first interval only.
I think its actually the format of my BED regions file, is there a specific type of BED that bedops accepts?
BEDOPS takes input that is a
sort-bed
-sorted, minimally three-column BED file, in most cases.Are you working with files from Microsoft tools like Word, Excel or Windows? You'd need to strip any carriage return characters (
\r
) withtr
ordos2unix
etc. Having those characters can cause parsing problems.You could use
cat -te <file>
to verify that your file is tab-delimited and has proper line endings.No there are not edited in MsOffice at all, learned my lesson with that a while back. However when I do:
I get a jumble that doesn't look organized. This is confirmed with:
Which shows me that I have zero lines in my file though I have a bunch there. It may have something to do that this is an export from R and may not be perfectly processed as a BED file. Is there a quick way to take such a file and format it for BED format? Or see what the issue is? It has the right number of columns and looks TAB delimited. Thanks.
On re-reading, if
wc -l
returns no lines, that's definitely odd. Instead of posting the result ofcat -te | head
, which will likely just spit out the entire file, can you please post what you're using in R to export your data?I am using outputting a BED file (or maybe now it seems as though its BED-file like) from DiffBind in the following way:
I don't see anything obvious that suggests a problem. If you want to post
DBsitesGains.bed
or the losses file somewhere I can download them, I can take a look at those directly.Actually I just figured out its only running the first line of my regions.bed file against the refSeq.bed, and giving me that one closest gene. Strange.
One quick question on this approach, if in my regions.bed there is a column indicating the FDR of these particular called regions, is there a good way to maintain that column in the final
nearest-refseq-IDs.txt
output here? Thanks!If you know how many columns are in
regions.bed
, and that number is consistent throughout the entire regions file, you can take out the--no-ref
option and cut two columns.For example, say your
regions.bed
contains five columns and the FDR is in the score column (fifth column, per BED specification). The refSeq IDs are still in the fourth column of the annotation file.Here, you could cut from the output the fifth column (FDR) and the ninth column (5+4=9, refSeq ID):
The output
nearest-FDRs-with-refseq-IDs.txt
will contain the FDR score fromregions.bed
and the refSeq ID of the refSeq element closest to that region.You can adjust this approach, depending on your
regions.bed
file's characteristics and what columns you want.Thank you! Very helpful!
One other quick question, how can I maintain a header through sort-bed and closest-features - as it gives me an error indicating my first line in closest-features is not a chromosome start even when I have the --header flag on. Thanks!
If you're using
--header
and it isn't skipping the header, then that might be a bug. Please file an issue on the Github site if that's the case and we'll look into it.Otherwise, you could strip the header manually and reapply it at the end via standard input/output streams:
That's pretty ugly, I guess, but it should work. I'm not sure if the header columns would match up with the two-column output in this example, but perhaps tweak this, depending on what columns you want.