Question

Transcriptome in GRanges format

3.5 years ago

alexmondaini ▴ 20

Hello everyone,

I would like some opinions on whether my approach to getting a transcriptome is correct. Perhaps this is a naive question. In R:

library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
transcriptome <- transcripts(txdb)

> transcriptome
GRanges object with 78807 ranges and 2 metadata columns:
          seqnames            ranges strand |     tx_id     tx_name
             <Rle>         <IRanges>  <Rle> | <integer> <character>
      [1]     chr1       11874-14409      + |         1  uc001aaa.3
      [2]     chr1       11874-14409      + |         2  uc010nxq.1
      [3]     chr1       11874-14409      + |         3  uc010nxr.1
      [4]     chr1       69091-70008      + |         4  uc001aal.1
      [5]     chr1     321084-321115      + |         5  uc001aaq.2
      ...      ...               ...    ... .       ...         ...
  [78803]     chrY 27605645-27605678      - |     78803  uc004fwx.1
  [78804]     chrY 27606394-27606421      - |     78804  uc022cpc.1
  [78805]     chrY 27607404-27607432      - |     78805  uc004fwz.3
  [78806]     chrY 27635919-27635954      - |     78806  uc022cpd.1
  [78807]     chrY 59358329-59360854      - |     78807  uc011ncc.1
  -------
  seqinfo: 93 sequences (1 circular) from hg19 genome

Is it safe to call this set of transcripts the whole hg19 transcriptome in GRanges format?

updated 3.5 years ago by ATpoint 88k • written 3.5 years ago by alexmondaini ▴ 20

score 0 · Answer 1 · 2021-12-14

I personally prefer to get annotations directly from GENCODE. That way you do not have to rely on any annotation packages. For human hg19 (GENCODE release 19) that would be:

library(rtracklayer)

#/ URL to GENCODE GTF, from: https://www.gencodegenes.org/human/release_19.html
url <- "https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_19/gencode.v19.annotation.gtf.gz"

#/ load GTF, output directly as GRanges:
gr <- rtracklayer::import(url)

#/ subset to transcripts only:
gr[gr$type=="transcript"]

GRanges object with 196520 ranges and 21 metadata columns:
           seqnames      ranges strand |   source       type     score     phase           gene_id     transcript_id
              <Rle>   <IRanges>  <Rle> | <factor>   <factor> <numeric> <integer>       <character>       <character>
       [1]     chr1 11869-14409      + |  HAVANA  transcript        NA      <NA> ENSG00000223972.4 ENST00000456328.2
       [2]     chr1 11872-14412      + |  ENSEMBL transcript        NA      <NA> ENSG00000223972.4 ENST00000515242.2
       [3]     chr1 11874-14409      + |  ENSEMBL transcript        NA      <NA> ENSG00000223972.4 ENST00000518655.2
       [4]     chr1 12010-13670      + |  HAVANA  transcript        NA      <NA> ENSG00000223972.4 ENST00000450305.2
       [5]     chr1 14363-29370      - |  ENSEMBL transcript        NA      <NA> ENSG00000227232.4 ENST00000438504.2
       ...      ...         ...    ... .      ...        ...       ...       ...               ...               ...
  [196516]     chrM 14149-14673      - |  ENSEMBL transcript        NA      <NA> ENSG00000198695.2 ENST00000361681.2
  [196517]     chrM 14674-14742      - |  ENSEMBL transcript        NA      <NA> ENSG00000210194.1 ENST00000387459.1
  [196518]     chrM 14747-15887      + |  ENSEMBL transcript        NA      <NA> ENSG00000198727.2 ENST00000361789.2
  [196519]     chrM 15888-15953      + |  ENSEMBL transcript        NA      <NA> ENSG00000210195.2 ENST00000387460.2
  [196520]     chrM 15956-16023      - |  ENSEMBL transcript        NA      <NA> ENSG00000210196.2 ENST00000387461.2
                gene_type gene_status   gene_name        transcript_type transcript_status transcript_name       level
              <character> <character> <character>            <character>       <character>     <character> <character>
       [1]     pseudogene       KNOWN     DDX11L1   processed_transcript             KNOWN     DDX11L1-002           2
       [2]     pseudogene       KNOWN     DDX11L1 transcribed_unproces..             KNOWN     DDX11L1-201           3
       [3]     pseudogene       KNOWN     DDX11L1 transcribed_unproces..             KNOWN     DDX11L1-202           3
       [4]     pseudogene       KNOWN     DDX11L1 transcribed_unproces..             KNOWN     DDX11L1-001           2
       [5]     pseudogene       KNOWN      WASH7P unprocessed_pseudogene             KNOWN      WASH7P-202           3
       ...            ...         ...         ...                    ...               ...             ...         ...
  [196516] protein_coding       KNOWN      MT-ND6         protein_coding             KNOWN      MT-ND6-201           3
  [196517]        Mt_tRNA       KNOWN       MT-TE                Mt_tRNA             KNOWN       MT-TE-201           3
  [196518] protein_coding       KNOWN      MT-CYB         protein_coding             KNOWN      MT-CYB-201           3
  [196519]        Mt_tRNA       KNOWN       MT-TT                Mt_tRNA             KNOWN       MT-TT-201           3
  [196520]        Mt_tRNA       KNOWN       MT-TP                Mt_tRNA             KNOWN       MT-TP-201           3
                    havana_gene              tag    havana_transcript exon_number     exon_id         ont        protein_id
                    <character>      <character>          <character> <character> <character> <character>       <character>
       [1] OTTHUMG00000000961.2            basic OTTHUMT00000362751.1        <NA>        <NA>        <NA>              <NA>
       [2] OTTHUMG00000000961.2             <NA>                 <NA>        <NA>        <NA>        <NA>              <NA>
       [3] OTTHUMG00000000961.2             <NA>                 <NA>        <NA>        <NA>        <NA>              <NA>
       [4] OTTHUMG00000000961.2             <NA> OTTHUMT00000002844.2        <NA>        <NA> PGO:0000019              <NA>
       [5] OTTHUMG00000000958.1             <NA>                 <NA>        <NA>        <NA>        <NA>              <NA>
       ...                  ...              ...                  ...         ...         ...         ...               ...
  [196516]                 <NA> appris_principal                 <NA>        <NA>        <NA>        <NA> ENSP00000354665.2
  [196517]                 <NA>            basic                 <NA>        <NA>        <NA>        <NA>              <NA>
  [196518]                 <NA> appris_principal                 <NA>        <NA>        <NA>        <NA> ENSP00000354554.2
  [196519]                 <NA>            basic                 <NA>        <NA>        <NA>        <NA>              <NA>
  [196520]                 <NA>            basic                 <NA>        <NA>        <NA>        <NA>              <NA>
                ccdsid
           <character>
       [1]        <NA>
       [2]        <NA>
       [3]        <NA>
       [4]        <NA>
       [5]        <NA>
       ...         ...
  [196516]        <NA>
  [196517]        <NA>
  [196518]        <NA>
  [196519]        <NA>
  [196520]        <NA>
  -------
  seqinfo: 25 sequences from an unspecified genome; no seqlengths
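
If you want to try the subsetting step without downloading the full GTF, here is a minimal sketch on a toy GRanges that mimics the column layout of the imported object. The gene, transcript and exon rows are a hand-made stand-in with illustrative coordinates, not something read from the GENCODE file, and the object names (`gr_toy`, `tx`, `exons_by_tx`) are just placeholders I made up for the example.

```r
library(GenomicRanges)

# Toy stand-in for the object returned by rtracklayer::import().
# All rows and coordinates are illustrative only.
gr_toy <- GRanges(
  seqnames = rep("chr1", 4),
  ranges   = IRanges(start = c(11869, 11869, 11869, 12613),
                     end   = c(14412, 14409, 12227, 12721)),
  strand   = rep("+", 4),
  type     = factor(c("gene", "transcript", "exon", "exon")),
  gene_id  = rep("ENSG00000223972.4", 4),
  transcript_id = c(NA, rep("ENST00000456328.2", 3))
)

# Same subsetting as in the answer: keep transcript rows only.
tx <- gr_toy[gr_toy$type == "transcript"]

# Name the ranges by transcript ID so they can be looked up directly.
names(tx) <- tx$transcript_id

# Optional: group exon rows by transcript, giving a GRangesList.
exons <- gr_toy[gr_toy$type == "exon"]
exons_by_tx <- split(exons, exons$transcript_id)
```

The same three steps should carry over to the real `gr` from above, since it has the `type` and `transcript_id` columns shown in the printed output.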