Question

update unknown enteries in bed with gene names

0

Entering edit mode

8.5 years ago

bioguy24 ▴ 230

I have ~3500 lines in a bed file from ngs data that are "unknown" that I trying to basically update the "unknown" with the "gene". Using genome browser I can go line by line, but I am looking for a better way that is faster and less prone to error? The input is tab-delimited and uses hg19 coordinates.Thank you :).

example of unknown bed:

 chr1   4736396 4736516 chr1:4736396-4736516    .   unknown

desired output

 chr1   4736396 4736516 chr1:4736396-4736516    .   AJAP1

ngs • 3.1k views

ADD COMMENT • link updated 8.5 years ago by Alex Reynolds 36k • written 8.5 years ago by bioguy24 ▴ 230

1

Entering edit mode

Try annotatePeaks.pl from the HOMER suite.

annotatePeaks.pl $BED hg19 > $OUTPUT

ADD REPLY • link 8.5 years ago by Sinji ★ 3.2k

0

Entering edit mode

The output off:

 /home/cmccabe/Desktop/NGS/homer-4.8/bin/annotatePeaks.pl unknown.bed hg19 > /home/cmccabe/Desktop/NGS/homer-4.8/out

Peak file = unknown.bed
Genome = hg19
Organism = human
 sh: 1: bed2pos.pl: not found
 sh: 1: checkPeakFile.pl: not found
 sh: 1: cleanUpPeakFile.pl: not found
Reading Positions...
-----------------------
Finding Closest TSS...
 sh: 1: annotateRelativePosition.pl: not found
 sh: 1: assignGenomeAnnotation: not found
 readline() on closed filehandle IN at /home/cmccabe/Desktop/NGS/homer-4.8/bin/annotatePeaks.pl line 3297.
 rm: cannot remove ‘0.140523784169094.tmp’: No such file or directory
NOTE: If this part takes more than 2 minutes, there is a good chance
    your machine ran out of memory: consider hitting ctrl+C and rerunning
    the command with "-noann"
To capture annotation stats in a file, use "-annStats <filename>" next time
 sh: 1: assignGenomeAnnotation: not found
 readline() on closed filehandle IN at /home/cmccabe/Desktop/NGS/homer-4.8/bin/annotatePeaks.pl line 3330.
Counting Tags in Peaks from each directory...
Organism: human
Loading Gene Informaiton...
Outputing Annotation File...
Done annotating peaks file

is:

 PeakID (cmd=annotatePeaks.pl unknown.bed hg19) Chr Start   End Strand  Peak Score  Focus Ratio/Region Size Annotation  Detailed Annotation Distance to TSS Nearest PromoterID  Entrez ID   Nearest Unigene Nearest Refseq  Nearest Ensembl Gene Name   Gene Alias  Gene Description    Gene Type

Thank you :).

ADD REPLY • link 8.5 years ago by bioguy24 ▴ 230

1

Entering edit mode

Do you have a question about this output?

ADD REPLY • link 8.5 years ago by harold.smith.tarheel ★ 5.0k

1

Entering edit mode

It doesn't look like you installed HOMER or some of its dependencies appropriately. Go back to the HOMER installation page and follow it carefully.

ADD REPLY • link 8.5 years ago by Sinji ★ 3.2k

0

Entering edit mode

I will reread the instructions for installation on ubuntu 14.04, 64-bit. Thank you :).

ADD REPLY • link 8.5 years ago by bioguy24 ▴ 230

0

Entering edit mode

Read the output:

 sh: 1: bed2pos.pl: not found
 sh: 1: checkPeakFile.pl: not found
 sh: 1: cleanUpPeakFile.pl: not found
...

It can't find these scripts. They are probably in the same /bin directory as annotatePeaks.pl. If so, you need to add this directory to your $PATH.

ADD REPLY • link 8.5 years ago by harold.smith.tarheel ★ 5.0k

score 2 · Accepted Answer · 2016-11-18

To demonstrate the annotation process I will describe, we need some genes. Let's assume you're working with human data and that you visit the GENCODE site to grab GENCODE records in GTF format.

You can use BEDOPS gtf2bed to convert this to a sorted UCSC BED-formatted file containing GENCODE genes. For example, here is how to get GENCODE v19 gene annotations for genome build hg19:

$ wget -qO- ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_19/gencode.v19.annotation.gtf.gz
    | gunzip -c - \
    | gtf2bed - \
    | grep "\tgene\t" \
    > gencode.v19.annotation.gtf.genes.bed

Now that you have both data sets in BED format, you can now map your BED file to the GENCODE annotations with BEDOPS bedmap, which returns genes that overlap your elements by one or more bases, e.g.:

$  bedmap --echo --echo-map-id-uniq elements.bed gencode.v19.annotation.gtf.genes.bed > answer.bed

The file answer.bed contains the annotation results. This file is a sorted BED file that contains lines of the format:

[ transcript1 ] | [ gene1-overlapping-transcript1 ] ; ... ; [ geneA-overlapping-transcript1 ]
[ transcript2 ] | [ gene1-overlapping-transcript2 ] ; ... ; [ geneB-overlapping-transcript2 ]
...
[ transcriptN ] | [ gene1-overlapping-transcriptN ] ; ... ; [ geneZ-overlapping-transcriptN ]

The pipe character (|) delimits BED-formatted transcript elements from the IDs or names of overlapping genes. The semi-colon (;) delimits names of overlapping genes for a given transcript.

You can replace these delimiters if you need to with the --delim and --multidelim operators, but basically this lets you easily post-process the results with Perl, Python, awk — whatever scripting or other language you prefer.

If you want the entire gene record and not just the gene name, use --echo-map in place of --echo-map-id-uniq. Other --echo-map-* operators are available to recover different attributes of the mapped element.

The default criteria for overlap between transcript and mapped gene is one or more bases. If you want this to be more stringent, review the overlap criteria section of the BEDOPS bedmap documentation.