genes that map to hg38 coordinates
3
1
Entering edit mode
7.9 years ago
bioguy24 ▴ 230

Given a list of ~55,000 hg38 coordinates is there an easy way to get the gene that maps to it?I guess it is running genome browser in batch. Maybe there is a mysql command that will use the input? Thank you :).

input

 chr1:1013574-1013576
 chr1:1013984-1014478
 chr1:1020163-1020383

desired output

  chr1:1013574-1013576 ISG15
  chr1:1013984-1014478 ISG15
  chr1:1020163-1020383 AGRN
ngs • 5.3k views
ADD COMMENT
0
Entering edit mode

Thank you all for the great solutions :).

ADD REPLY
4
Entering edit mode
7.9 years ago

Another way is to download the HGNC names to a BED file, and use BEDOPS bedmap to map gene IDs to intervals.

First, download gene annotations and convert them to a sorted, four-column BED file:

$ curl -s "http://hgdownload.cse.ucsc.edu/goldenPath/hg38/database/refGene.txt.gz" | gunzip -c | cut -f3,5,6,13 | sort-bed - > genes.bed

Second, convert the intervals from your format into sorted BED and pipe that to bedmap, along with the genes.bed file created in the first step. The hyphen (-) in the bedmap call is a placeholder for standard input, which is data that is coming out of sort-bed upstream in the pipeline:

$ sed $'s/[:-]/\t/g' intervals.txt | sort-bed - | bedmap --echo --echo-map-id-uniq - genes.bed > answer.bed

The file answer.bed will be BED-formatted, but that can be easily munged back into your desired format by piping the output to awk:

$ sed $'s/[:-]/\t/g' intervals.txt | sort-bed - | bedmap --echo --echo-map-id-uniq - genes.bed | awk '{print $1":"$2"-"$3"\t"$4;}' > answer.txt
ADD COMMENT
2
Entering edit mode
7.9 years ago
Charles Plessy ★ 2.9k

You can use R and Bioconductor's GenomicRanges package as follows:

First, represent your regions of interest as genomic ranges. For real data, you may load a BED file with the function rtracklayer::import.bed(). Here, we create by hand the three regions that you gave as example. The output of the commands is displayed in italics.

# In R, load the library.
library(GenomicRanges)

# Create an GRanges object (here called, 'r'). that contains the regions of interest.
r <- c( GRanges("chr1", IRanges(1013574,1013576))
      , GRanges("chr1", IRanges(1013984,1014478))
      , GRanges("chr1", IRanges(1020163,1020383)))
r

Output:

GRanges object with 3 ranges and 0 metadata columns:
      seqnames             ranges strand
         <Rle>          <IRanges>  <Rle>
  [1]     chr1 [1013574, 1013576]      *
  [2]     chr1 [1013984, 1014478]      *
  [3]     chr1 [1020163, 1020383]      *
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

Then, load genomic ranges for an annotation, for instance GENCODE.

# Get GENCODE from the Bioconductor's Annotation Hub.
library(AnnotationHub)
ah <- AnnotationHub()
query(ah, c("Gencode", "gff", "human"))

Output:

AnnotationHub with 9 records
# snapshotDate(): 2016-12-29
# $dataprovider: Gencode
# $species: Homo sapiens
# $rdataclass: GRanges
# additional mcols(): taxonomyid, genome, description, tags,
#   sourceurl, sourcetype 
# retrieve records with, e.g., 'object[["AH49554"]]' 

            title                                                    
  AH49554 | gencode.v23.2wayconspseudos.gff3.gz                      
  AH49555 | gencode.v23.annotation.gff3.gz                           
  AH49556 | gencode.v23.basic.annotation.gff3.gz                     
  AH49557 | gencode.v23.chr_patch_hapl_scaff.annotation.gff3.gz      
  AH49558 | gencode.v23.chr_patch_hapl_scaff.basic.annotation.gff3.gz
  AH49559 | gencode.v23.long_noncoding_RNAs.gff3.gz                  
  AH49560 | gencode.v23.polyAs.gff3.gz                               
  AH49561 | gencode.v23.primary_assembly.annotation.gff3.gz          
  AH49562 | gencode.v23.tRNAs.gff3.gz

The command above returned that the ID for GENCODE 23 is AH49556 and indicated which command to run next.

ah["AH49556"]
gc <- ah[["AH49556"]]

Then, intersect your regions of interest with the annotation.

overlaps <- findOverlaps(r, gc)

This returns an object telling which region overlaps with which annotation. The last step is to extract the gene names, collate them, and prepare your final output.

genes <- extractList(gc$gene_name, as(overlaps, "List"))
genes <- unstrsplit(unique(genes), ";") # Needed in case more than one gene overlaps.
paste(as.character(r), genes)

[1] "chr1:1013574-1013576 ISG15" "chr1:1013984-1014478 ISG15"
[3] "chr1:1020163-1020383 AGRN"

You can find other examples on how to use findOverlaps() in other Biostars posts or in Bioconductor's support forum.

ADD COMMENT
1
Entering edit mode
7.9 years ago
newbiebio ▴ 80

Maybe you could try bedtools. First, go to ensembl and download refgenome's gtf file. Then use 'bedtools intersect -wo -a ref.gtf -b you_input.bed"

ADD COMMENT

Login before adding your answer.

Traffic: 2358 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6