BEDOPS closest features help
9.1 years ago
shewalesj ▴ 20

Hello Everyone,

I've got an output file from RnBeads, an R package, with differentially methylated regions. None of these regions come with gene IDs. I'm trying to use the BEDOPS closest-features tool to get the closest gene ID for each region in my dataset Excel file, but I'm unable to do this successfully. Every time I run the closest-features command, the output file suggests that none of my locations matched anything in the reference file. I'm assuming this because, in the last column, every row ends with "|NA".

My question is, what am I doing wrong? Are my files in the wrong format? Did I download the correct reference file? Why are none of my regions of interest returning the closest region?

The files below are available at this dropbox folder: https://www.dropbox.com/sh/xa43onjwg3d7npv/AAAk8Wzwe-rj95eDrdHBuHona?dl=0

  1. tiling_fourm.xlsx <-- This Excel file contains the regions I'm trying to find the closest genes for.
  2. tiling_forum.bed <-- I removed the first row (headings), took the chromosome, start, and end columns, saved them as a tab-delimited txt file, and changed the extension to .bed.
  3. hg19.bed <-- This is the reference file (analogous to the <inputfile> on the BEDOPS closest-features user guide page). I created it by going to https://genome.ucsc.edu/cgi-bin/hgTables and selecting the following: clade: Human, genome: Human, assembly: Feb. 2009 (GRCh37/hg19), group: mapping and sequencing, track: UCSC Genes, table: known gene, region: genome, output format: select fields from primary and related tables. The fields selected were chrom, chromstart and chromend.
  4. outputforum.txt <-- this is the result file I get after running the following cmd: closest-features --closest hg19.bed tiling_forum.bed > outputforum.txt.

As you can see, every value in the last column of the output file ends with "|NA", which I assume means "not mapped".

If anyone could help me out, it would be greatly appreciated.

Thanks

bed methylation bedops bedtools-closest • 3.6k views

Thanks.

9.1 years ago

There are a few issues with the tiling_forum.bed.txt input file you made:

  1. Text output from either Microsoft Excel for Mac or Excel for Windows will have Microsoft-style line endings that need fixing before using those text files with Unix applications (like BEDOPS).
  2. Ideally, any files used with Unix tools should have a trailing newline.
  3. BED files need to be sorted with BEDOPS sort-bed before using them with other BEDOPS tools.

I'm guessing that you exported the tiling_forum.bed.txt file from Excel for Mac, because each line in your file ends with a carriage return ("CR"), instead of a line feed ("LF") character. If you had exported from Excel for Windows, each line would end with both carriage return and line feed characters (so-called "CRLF").

These characters are "invisible" so it is difficult to immediately see what's going on without investigating with tools that can print out these special characters as something visible. Once you can see them, you can use tools to either remove them or turn them into something else.
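
As a rough illustration of "making them visible", here is a small sketch using od and tr. The file name sample_cr.bed is a hypothetical stand-in built on the spot, so this runs without your real data; swap in your own file name to inspect it.

```shell
# Build a tiny stand-in file with CR-only line endings.
# sample_cr.bed is a hypothetical name used only for this demonstration.
printf 'chr1\t10000\t10615\rchr1\t10615\t177417\r' > sample_cr.bed

# od -c prints every byte, showing CR as \r and LF as \n.
od -c sample_cr.bed | head -n 5

# Count CR and LF bytes separately. A CR-only file reports zero LF bytes,
# and a CRLF file reports equal counts of both.
cr_count=$(tr -cd '\r' < sample_cr.bed | wc -c | tr -d ' ')
lf_count=$(tr -cd '\n' < sample_cr.bed | wc -c | tr -d ' ')
echo "CR: $cr_count  LF: $lf_count"
```

If the LF count is zero, Unix tools will treat the whole file as one very long line, which would explain why nothing matches.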

To fix the tiling_forum.bed.txt file, open up a Terminal session and type the following:

$ (tr '\r' '\n' < tiling_forum.bed.txt; echo) \
    | sort-bed - \
    > tiling_forum.bed.txt.fixed

What this pipeline does is:

  1. Use tr to translate CR (carriage return) characters to LF (line feed or newline) characters and use echo to add a trailing newline.
  2. Sort the translated BED data with sort-bed.
  3. Write the sorted BED data to a new file called tiling_forum.bed.txt.fixed.
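
One caveat: the tr step above assumes CR-only endings. If a file had CRLF endings instead, translating CR to LF would leave an empty line after every record. Below is a sketch of a variant that tolerates both cases. The function name normalize_eol and the demo file names are made up for illustration, and it assumes that no legitimate line in your BED file is blank.

```shell
# normalize_eol is a hypothetical helper name, not part of BEDOPS.
# It converts CR to LF, then drops the empty lines that CRLF input leaves behind.
# awk also terminates every record with LF, so the last line gets a newline.
normalize_eol() {
    tr '\r' '\n' < "$1" | awk 'NF'
}

# Stand-in inputs: one CR-only file and one CRLF file with the same records.
printf 'chr1\t20\t30\rchr1\t5\t8' > demo_cr.txt
printf 'chr1\t20\t30\r\nchr1\t5\t8\r\n' > demo_crlf.txt

normalize_eol demo_cr.txt   > demo_cr.fixed
normalize_eol demo_crlf.txt > demo_crlf.fixed

# Both outputs should now be identical, LF-terminated text.
cmp demo_cr.fixed demo_crlf.fixed && echo "line endings normalized"

# With BEDOPS installed, the result can be piped on to sort-bed:
#   normalize_eol tiling_forum.bed.txt | sort-bed - > tiling_forum.bed.txt.fixed
```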

The hg19.bed.txt file has a couple issues:

  1. We need to strip the header line (#chrom etc.).
  2. We need to sort the stripped BED data with sort-bed.

We can run the following set of commands to fix this file:

$ tail -n +2 hg19.bed.txt | sort-bed - > hg19.bed.txt.fixed
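
If you are not sure whether a given export has a header, a slightly more defensive sketch is to drop lines by pattern rather than by position, and then check the ordering. The demo file names are hypothetical, and plain sort is used here only as an approximation of sort-bed ordering so the example runs without BEDOPS; for real inputs, sort-bed remains the tool to use.

```shell
# Stand-in for the UCSC export: a '#'-prefixed header followed by unsorted rows.
printf '#chrom\tstart\tend\nchr2\t100\t200\nchr1\t65001\t70000\nchr1\t10\t50\n' > demo_ref.txt

# Drop header lines by pattern instead of by position, so a file
# without a header does not lose its first data row.
grep -v '^#' demo_ref.txt > demo_ref.noheader

# Approximate the ordering with POSIX sort (assumption: close to, but not
# guaranteed identical to, what sort-bed produces).
LC_ALL=C sort -k1,1 -k2,2n -k3,3n demo_ref.noheader > demo_ref.sorted

# sort -c exits non-zero if the file is not already in that order.
if LC_ALL=C sort -c -k1,1 -k2,2n -k3,3n demo_ref.sorted 2>/dev/null; then
    echo "demo_ref.sorted is in order"
fi
```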

Once both inputs are fixed, proper BED files, you can run the closest-features application on them:

$ closest-features --closest hg19.bed.txt.fixed tiling_forum.bed.txt.fixed | head -10
chr1    10000    10615|chr1    65001    70000
chr1    10615    177417|chr1    65001    70000
chr1    227417    267719|chr1    65001    70000
chr1    317719    471368|chr1    65001    70000
chr1    521368    632917|chr1    65001    70000
chr1    632917    812484|chr1    65001    70000
chr1    812484    998289|chr1    65001    70000
chr1    998289    1127268|chr1    65001    70000
chr1    1127268    1237427|chr1    65001    70000
chr1    1237427    1319872|chr1    65001    70000
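
To confirm the fix on the full output rather than just the first ten lines, you could count how many rows still end in "|NA". This is a sketch on a made-up stand-in file (demo_closest.txt), assuming the "|" delimiter seen in the output above; point it at your real output file instead.

```shell
# Stand-in for closest-features output: two matched rows and one unmatched row.
printf 'chr1\t10000\t10615|chr1\t65001\t70000\nchr1\t10615\t177417|chr1\t65001\t70000\nchrZ\t1\t2|NA\n' > demo_closest.txt

# Split on the "|" delimiter and count rows whose last field is exactly NA.
na_count=$(awk -F'|' '$NF == "NA" { n++ } END { print n + 0 }' demo_closest.txt)
total=$(wc -l < demo_closest.txt | tr -d ' ')
echo "$na_count of $total rows have no nearest element"
```

If every row is still NA after fixing line endings and sorting, the next thing to check would be whether the chromosome names in the two files use the same naming style.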

Thank you so much for your help Alex, I really appreciate it. I'm new to all this and most of my experience is with the wet lab stuff as a graduate student. Just getting into the field of informatics.


No worries! Unfortunately, the text that comes out of Excel can cause weird issues with non-Microsoft tools. Hopefully this helps give a taste of how to fix things. Cleaning inputs seems to be like 95% of informatics, anyway.
