How to annotate gene names for genomic coordinates?
4
10
Entering edit mode
9.4 years ago

I want to annotate gene names for a segmented file by finding overlap with it with a reference file. Would like to know how can it be done using GenomicRanges Package or any other bioconductor packages.

Segmented_file

Chromosome   Start      End
1            5930000    11730000
1            16850000   18010000

reference_file

Chr   Start      End        Gene
1     5930500    6230500    SPSB1
1     6930500    7340500    SPSB2
1     16854500   16950000   TAS1R1
1     17810032   17910064   ENO1

Expected results

Chromosome   Start      End        Gene
1            5930000    11730000   SPSB1,SPSB2
1            16850000   18010000   TAS1R1,ENO1
next-gen genome R • 15k views
ADD COMMENT
3
Entering edit mode

If you can use the command line, you can do this in a simple one-liner with BEDOPS bedmap:

$ bedmap --echo --echo-map-id-uniq segmented_file reference_file > answer.bed

The file answer.bed contains results in the format you expect, except that gene names are delimited with semi-colons. You can change this to commas, if you like, by specifying the --multidelim option.

Just make sure first that segmented_file and reference_file are sorted BED files:

$ sort-bed segmented_file.unsorted > segmented_file

Etc.

ADD REPLY
9
Entering edit mode
9.4 years ago
seidel 11k

Yes, you can use the GenomicRanges library in R to do this. Here is an example:

library(GenomicRanges)
# generate 10 random segments
regions.gr <- GRanges(seqnames=Rle(rep(1,10)), ranges=IRanges(start=sample(1:1000,10), width=100), strand=Rle(rep("*",10)))

# generate 5 random genes
fiveGeneNames <- paste(sample(letters,5), sample(1:100,5), sep="")
genes.gr <- GRanges(seqnames=Rle(rep(1,5)), ranges=IRanges(start=sample(1:1000,5), width=100, names=fiveGeneNames), strand=Rle(sample(c("+","-"),5, replace=T)))

# what regions overlap what genes?
overlapGenes <- findOverlaps(regions.gr, genes.gr)

# Return any genes with an overlap.
# Convert the resulting "Hits" object to a data frame
# and use it as an index
overlapGenes.df <- as.data.frame(overlapGenes)
names(genes.gr[overlapGenes.df$subjectHits])

# extract the regions that hit genes
regionsWithHits <- as.data.frame(regions.gr[overlapGenes.df$queryHits])
# add the names of the genes
regionsWithHits$genes <- names(genes.gr)[overlapGenes.df$subjectHits]

# Some segments might not overlap any gene.
# Find the distance to the nearest gene
distanceToNearest(regions.gr, genes.gr)

# if your genes and segments were in a bed file you could easily import them
library(rtracklayer)
regions <- import.bed("myRegions.bed",asRangedData=FALSE)
genes <- import.bed("myGenes.bed",asRangedData=FALSE)
ADD COMMENT
0
Entering edit mode

Nice one. I tried to use R GRanges for a while, but it was unintuitive as hell. I wanted to merge a de novo annotation to a reference one.

ADD REPLY
0
Entering edit mode

And if you have several genes in a cnv range??? How do you solve this?? thanks a lot

ADD REPLY
0
Entering edit mode

You mean instead of segments to genes, you want to know what genes appear within a given segment? The answer is above. Your cnv range (i.e. a segment) will return the index of the genes it overlaps. However, it can also be useful to turn the problem around, and use your genes as the query. i.e. for all genes, which ones overlap which segments. If neither of these answers your question - then I don't understand your question.

ADD REPLY
4
Entering edit mode
9.4 years ago

In R you can use the genomeIntervals package and the interval_overlap function.

Outside of R I would recommend bedtools intersect.

ADD COMMENT
0
Entering edit mode

@david, could you specify an example for how to use genomeIntervals package in this case?

ADD REPLY
3
Entering edit mode

For two files in bed format, you could do something like this (untested):​

library(genomeIntervals);
args <- commandArgs(TRUE)

genes.file = args[1]
segmented.file  = args[2]
output.file = args[3]

segments <- read.table(segmented.file, stringsAsFactors=F)
segments.ivals <- new("Genome_intervals_stranded", as.matrix(segments[,2:3]), closed=T, annotation=data.frame(seq_name=segments[,1], inter_base=F, strand=factor(segments[,6], levels=c("+", "-"))))

genes <- read.table(genes.file, stringsAsFactors=F)
genes.ivals <- new("Genome_intervals_stranded", as.matrix(genes[,2:3]), closed=T, annotation=data.frame(seq_name=genes[,1], inter_base=F, strand=factor(genes[,6], levels=c("+", "-"))))

n <- interval_overlap(segments.ivals, genes.ivals)
n_idx <- which(unlist(lapply(n, length)) > 0)

write.table(segments[n_idx,], output.file,row.names=F,col.names=F,append=T,quote=F,sep="\t")
ADD REPLY
3
Entering edit mode
9.4 years ago
cyril-cros ▴ 950

EDIT: this answer does not use R, but R is not so good at building new annotations. You can do a lot with an existing annotation, and importing a gtf file into a TxDB object is easy. I offer a solution using bedtools instead, which is simpler:

bedtools intersect -wa -wb -a segmented_file -b reference_file | awk -F'\t' {$1,$2,$3,$7} | bedtools merge -c 4 -o distinct -i

Bedtools intersect gets your data in the format chrS startS EndS chrR startR endR geneR, for each overlap. This is not a bed file, so we use awk to discard unwanted columns and keep chrS startS EndS geneRfor each overlap. We now need to merge all the genes belonging to a given segmented file interval, which is done with merge.

You also need to have your data in bed format (tab separated), and sorted

Also, see How To Get Annotation For Bed File From Another Bed File where they use information on genes strand. You can also find how to convert from bed format to gtf. Merging with a reference annotation can give you the CDS and START/STOP codon positions.

ADD COMMENT
0
Entering edit mode
2.5 years ago
jaafari.omid ▴ 80

Hello dear friend,

Actually I am working on a fish species and I am going to find genes based on a segmented file. In the following there are the commands that I have used. The first command generates a bed file like the bellow image which it is composed of 12 columns. So using the second column I have come to take the 1,2,3 and the 12 chromosomes. Then it follows a sorting and finally merge the genes of each specific coordinates. My clear question is, how can we have the last column much tidier, I mean there are some explanations and maybe in some cases with no genes. I want to have only the gene names in the last output.

Thank you in advance,

Omid

Here are the commands:

>     bedtools intersect -wa -wb -a segmented_file.txt -b annotation-file.gff > output1.bed
>     awk '{print $1 "\t" $2 "\t" $3 "\t" $12}' output1.bed > output2.bed
>     LC_ALL=C sort -t$'\t' -k1,1 -k2,2n output2.bed > output3.bed
>     bedtools merge -c 4 -o distinct -i output3.bed > output4.bed

enter image description here

ADD COMMENT

Login before adding your answer.

Traffic: 2724 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6