How to find intron co-ordinates from the gff file?
1
6
Entering edit mode
9.0 years ago
Chirag Parsania ★ 2.0k

Hi

Does anyone know how can I find intronic co-ordinates from the gff file ? My .gff file contains following features with co-ordinates. Unfortunately there's no intron co-ordinate

chromosome,gene,mRNA,exon,CDS,tRNA,retrotransposon,long_terminal_repeat  snoRNA,blocked_reading_frame,centromere,repeat_region,pseudogene,snRNA,ncRNA ,rRNA

Give your inputs

Thanks!

intron gff bedtools • 8.4k views
ADD COMMENT
0
Entering edit mode

Hi,

Your introns are between each exons within a given gene. They're not precisely labelled, however, one can find their begin/end positions since exons are clearly defined.

ADD REPLY
0
Entering edit mode

Isn't it annoying that to find introns one has to go through a relatively complex procedure, as Devon suggests?

ADD REPLY
0
Entering edit mode

Exactly @dariober..! I can do that in excel as well. But I wonder if somebody has already wrote some program for this. Thanks for your inputs.

ADD REPLY
7
Entering edit mode
9.0 years ago

I ended up needing to do something like this yesterday. Here is an RScript that should more or less work. Note that in my case I wanted introns by genes. If you want introns per transcript then just use intronsByTranscript(), which is much simpler.

library(GenomicFeatures)
library(rtracklayer)

gtf <- makeTxDbFromGFF("genes.gtf") #change me!
exons <- exonsBy(gtf, by="gene")

#make introns
exons <- reduce(exons)
exons <- exons[sapply(exons, length) > 1]

introns <- lapply(exons, function(x) {
    #Make a "gene" GRange object
    gr = GRanges(seqnames=seqnames(x)[1], ranges=IRanges(start=min(start(x)),
        end=max(end(x))), 
        strand=strand(x)[1])
    db = disjoin(c(x, gr))
    ints = db[countOverlaps(db, x) == 0]
    #Add an ID
    if(as.character(strand(ints)[1]) == "-") {
        ints$exon_id = c(length(ints):1)
    } else {
        ints$exon_id = c(1:length(ints))
    }
    ints
})
introns <- GRangesList(introns)
ADD COMMENT
0
Entering edit mode

Hi Devon,

Thanks for the script. I have tried to run that. Uploaded library successfully. But then I got an error that

Error: could not find function "makeTxDbFromGFF"

Here is the first few lines of session info

> sessionInfo()
R version 3.1.2 (2014-10-31)
Platform: x86_64-apple-darwin10.8.0 (64-bit)

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] rtracklayer_1.26.3     GenomicFeatures_1.18.7 AnnotationDbi_1.28.2   Biobase_2.26.0         GenomicRanges_1.18.4

Kindly tell me how can I update package and run function makeTxDbFromGFF

Chirag

ADD REPLY
0
Entering edit mode

Hi Devon,

It worked.! I used makeTranscriptDbFromGFF instead of makeTxDbFromGFF. Thank you so much.

Chirag

ADD REPLY
0
Entering edit mode

Hi Devon

I just found this script of yours for extracting introns...the scripts is running fine but at the last step its stucked for longer and I am not getting any output..any idea why?

ADD REPLY
0
Entering edit mode

Hi @Devon I was trying your above mentioned code and I am getting following errror:

> exons
GRangesList object of length 37989:
$ENSG00000000003 
GRanges object with 20 ranges and 2 metadata columns:
       seqnames                 ranges strand |   exon_id       exon_name
          <Rle>              <IRanges>  <Rle> | <integer>     <character>
   [1]        X [100627109, 100629986]      - |    706021 ENSE00003730948
   [2]        X [100628670, 100629986]      - |    706022 ENSE00001459322
   [3]        X [100630759, 100630866]      - |    706023 ENSE00000868868
   [4]        X [100632063, 100632068]      - |    706024 ENSE00003731560
   [5]        X [100632485, 100632568]      - |    706025 ENSE00000401072
   ...      ...                    ...    ... .       ...             ...
  [16]        X [100635558, 100635746]      - |    706036 ENSE00003733424
  [17]        X [100636191, 100636689]      - |    706037 ENSE00001886883
  [18]        X [100636608, 100636806]      - |    706038 ENSE00001855382
  [19]        X [100636793, 100637104]      - |    706039 ENSE00001863395
  [20]        X [100639945, 100639991]      - |    706040 ENSE00001828996

...
<37988 more elements>
-------
seqinfo: 270 sequences (1 circular) from an unspecified genome; no seqlengths
> 
> introns <- lapply(exons, function(x) {
+     #Make a "gene" GRange object
+     gr = GRanges(seqnames=seqnames(x)[1], ranges=IRanges(start=min(start(x)),
+                                                          end=max(end(x))), 
+                  strand=strand(x)[1])
+     db = disjoin(c(x, gr))
+     ints = db[countOverlaps(db, x) == 0]
+     #Add an ID
+     if(as.character(strand(ints)[1]) == "-") {
+         ints$exon_id = c(length(ints):1)
+     } else {
+         ints$exon_id = c(1:length(ints))
+     }
+     ints
+ })
Error in .Method(..., deparse.level = deparse.level) : 
  number of columns for arg 2 do not match those of first arg
Called from: .Method(..., deparse.level = deparse.level)
Browse[1]>

Any inputs why it is so? and how to overcome this?

ADD REPLY

Login before adding your answer.

Traffic: 1746 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6