Transcriptome in GRanges format
2.9 years ago
alexmondaini ▴ 20

Hello everyone,

I would like some opinions on whether my approach to getting a transcriptome is correct; perhaps this is a naive question. In R:

library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
transcriptome <-  transcripts(txdb)

> transcriptome
GRanges object with 78807 ranges and 2 metadata columns:
          seqnames            ranges strand |     tx_id     tx_name
             <Rle>         <IRanges>  <Rle> | <integer> <character>
      [1]     chr1       11874-14409      + |         1  uc001aaa.3
      [2]     chr1       11874-14409      + |         2  uc010nxq.1
      [3]     chr1       11874-14409      + |         3  uc010nxr.1
      [4]     chr1       69091-70008      + |         4  uc001aal.1
      [5]     chr1     321084-321115      + |         5  uc001aaq.2
      ...      ...               ...    ... .       ...         ...
  [78803]     chrY 27605645-27605678      - |     78803  uc004fwx.1
  [78804]     chrY 27606394-27606421      - |     78804  uc022cpc.1
  [78805]     chrY 27607404-27607432      - |     78805  uc004fwz.3
  [78806]     chrY 27635919-27635954      - |     78806  uc022cpd.1
  [78807]     chrY 59358329-59360854      - |     78807  uc011ncc.1
  -------
  seqinfo: 93 sequences (1 circular) from hg19 genome

Is it safe to call this set of transcripts the whole hg19 transcriptome in GRanges format?

2.9 years ago
ATpoint 85k

I personally prefer to get annotations directly from GENCODE. That way you do not have to rely on any annotation packages. For human hg19 that would be:

library(rtracklayer)

#/ URL to GENCODE GTF, from: https://www.gencodegenes.org/human/release_19.html
url <- "https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_19/gencode.v19.annotation.gtf.gz"

#/ load GTF, output directly as GRanges:
gr <- rtracklayer::import(url)

#/ subset to transcripts only:
gr[gr$type=="transcript"]

GRanges object with 196520 ranges and 21 metadata columns:
           seqnames      ranges strand |   source       type     score     phase           gene_id     transcript_id
              <Rle>   <IRanges>  <Rle> | <factor>   <factor> <numeric> <integer>       <character>       <character>
       [1]     chr1 11869-14409      + |  HAVANA  transcript        NA      <NA> ENSG00000223972.4 ENST00000456328.2
       [2]     chr1 11872-14412      + |  ENSEMBL transcript        NA      <NA> ENSG00000223972.4 ENST00000515242.2
       [3]     chr1 11874-14409      + |  ENSEMBL transcript        NA      <NA> ENSG00000223972.4 ENST00000518655.2
       [4]     chr1 12010-13670      + |  HAVANA  transcript        NA      <NA> ENSG00000223972.4 ENST00000450305.2
       [5]     chr1 14363-29370      - |  ENSEMBL transcript        NA      <NA> ENSG00000227232.4 ENST00000438504.2
       ...      ...         ...    ... .      ...        ...       ...       ...               ...               ...
  [196516]     chrM 14149-14673      - |  ENSEMBL transcript        NA      <NA> ENSG00000198695.2 ENST00000361681.2
  [196517]     chrM 14674-14742      - |  ENSEMBL transcript        NA      <NA> ENSG00000210194.1 ENST00000387459.1
  [196518]     chrM 14747-15887      + |  ENSEMBL transcript        NA      <NA> ENSG00000198727.2 ENST00000361789.2
  [196519]     chrM 15888-15953      + |  ENSEMBL transcript        NA      <NA> ENSG00000210195.2 ENST00000387460.2
  [196520]     chrM 15956-16023      - |  ENSEMBL transcript        NA      <NA> ENSG00000210196.2 ENST00000387461.2
                gene_type gene_status   gene_name        transcript_type transcript_status transcript_name       level
              <character> <character> <character>            <character>       <character>     <character> <character>
       [1]     pseudogene       KNOWN     DDX11L1   processed_transcript             KNOWN     DDX11L1-002           2
       [2]     pseudogene       KNOWN     DDX11L1 transcribed_unproces..             KNOWN     DDX11L1-201           3
       [3]     pseudogene       KNOWN     DDX11L1 transcribed_unproces..             KNOWN     DDX11L1-202           3
       [4]     pseudogene       KNOWN     DDX11L1 transcribed_unproces..             KNOWN     DDX11L1-001           2
       [5]     pseudogene       KNOWN      WASH7P unprocessed_pseudogene             KNOWN      WASH7P-202           3
       ...            ...         ...         ...                    ...               ...             ...         ...
  [196516] protein_coding       KNOWN      MT-ND6         protein_coding             KNOWN      MT-ND6-201           3
  [196517]        Mt_tRNA       KNOWN       MT-TE                Mt_tRNA             KNOWN       MT-TE-201           3
  [196518] protein_coding       KNOWN      MT-CYB         protein_coding             KNOWN      MT-CYB-201           3
  [196519]        Mt_tRNA       KNOWN       MT-TT                Mt_tRNA             KNOWN       MT-TT-201           3
  [196520]        Mt_tRNA       KNOWN       MT-TP                Mt_tRNA             KNOWN       MT-TP-201           3
                    havana_gene              tag    havana_transcript exon_number     exon_id         ont        protein_id
                    <character>      <character>          <character> <character> <character> <character>       <character>
       [1] OTTHUMG00000000961.2            basic OTTHUMT00000362751.1        <NA>        <NA>        <NA>              <NA>
       [2] OTTHUMG00000000961.2             <NA>                 <NA>        <NA>        <NA>        <NA>              <NA>
       [3] OTTHUMG00000000961.2             <NA>                 <NA>        <NA>        <NA>        <NA>              <NA>
       [4] OTTHUMG00000000961.2             <NA> OTTHUMT00000002844.2        <NA>        <NA> PGO:0000019              <NA>
       [5] OTTHUMG00000000958.1             <NA>                 <NA>        <NA>        <NA>        <NA>              <NA>
       ...                  ...              ...                  ...         ...         ...         ...               ...
  [196516]                 <NA> appris_principal                 <NA>        <NA>        <NA>        <NA> ENSP00000354665.2
  [196517]                 <NA>            basic                 <NA>        <NA>        <NA>        <NA>              <NA>
  [196518]                 <NA> appris_principal                 <NA>        <NA>        <NA>        <NA> ENSP00000354554.2
  [196519]                 <NA>            basic                 <NA>        <NA>        <NA>        <NA>              <NA>
  [196520]                 <NA>            basic                 <NA>        <NA>        <NA>        <NA>              <NA>
                ccdsid
           <character>
       [1]        <NA>
       [2]        <NA>
       [3]        <NA>
       [4]        <NA>
       [5]        <NA>
       ...         ...
  [196516]        <NA>
  [196517]        <NA>
  [196518]        <NA>
  [196519]        <NA>
  [196520]        <NA>
  -------
  seqinfo: 25 sequences from an unspecified genome; no seqlengths
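
As a follow-up sketch (not part of the output above): the transcript ranges span from transcript start to end, introns included, so for many purposes you will also want the exons grouped per transcript. The import also leaves the genome unspecified, so it can help to label it. The helper name `gencode_transcripts` and the toy coordinates below are made up for illustration; with real data you would pass the `gr` object from `rtracklayer::import(url)` instead of `toy`.

```r
library(GenomicRanges)  # provides GRanges, split() into GRangesList, and the genome<- setter

# Hypothetical helper: takes a GRanges imported from a GENCODE GTF
# (needs the 'type' and 'transcript_id' metadata columns shown above).
gencode_transcripts <- function(gr, build = "hg19") {
  # transcript-level records: one range per transcript, introns included
  tx <- gr[gr$type == "transcript"]
  # label the genome build, which the GTF import leaves unspecified
  genome(tx) <- build
  # exon-level records, grouped into a GRangesList keyed by transcript ID
  ex <- gr[gr$type == "exon"]
  list(transcripts = tx, exons_by_tx = split(ex, ex$transcript_id))
}

# Toy input with made-up coordinates, standing in for rtracklayer::import(url)
toy <- GRanges(
  seqnames = rep("chr1", 3),
  ranges = IRanges(start = c(100, 100, 300), end = c(400, 200, 400)),
  strand = rep("+", 3),
  type = c("transcript", "exon", "exon"),
  transcript_id = rep("tx1", 3)
)

res <- gencode_transcripts(toy)
res$transcripts   # GRanges with one transcript
res$exons_by_tx   # GRangesList with the two exons of tx1
```

Whether the transcript ranges or the per-transcript exons count as "the transcriptome" depends on what you do downstream; the exon grouping is what you would need to get spliced lengths or sequences.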