How to get geneID and not geneName from txdb made from a gff3 file?
2
0
Entering edit mode
8.5 years ago
Parham ★ 1.6k

Hi,

I need to prepare an object that holds transcripts lengths and geneIDs. In order to do that, I create a TxDb object from a gff3 file and then from TxDb I group transcripts as I show in below. The gff3 file is here.

library(GenomicFeatures)
txdb <- makeTxDbFromGFF("../schizosaccharomyces_pombe.chr.gff3", circ_seqs = "MT", organism = "Schizosaccharomyces pombe", format = "gff3")
txsByGene <- transcriptsBy(txdb, "gene")
tail(txsByGene)
GRangesList object of length 6:
$zrt2 
GRanges object with 1 range and 2 metadata columns:
      seqnames             ranges strand |     tx_id     tx_name
         <Rle>          <IRanges>  <Rle> | <integer> <character>
  [1]        I [5317857, 5319528]      + |      1487        zrt2

$zta1 
GRanges object with 1 range and 2 metadata columns:
      seqnames             ranges strand | tx_id tx_name
  [1]      III [1794719, 1796107]      - |  6779    zta1

$zuo1 
GRanges object with 1 range and 2 metadata columns:
      seqnames             ranges strand | tx_id tx_name
  [1]       II [3094849, 3096383]      - |  5286    zuo1

The problem I have is it fetches geneNames (zrt2, zta1, zuo1) while what I need is geneIDs, e.g, SPAP8A3.03 instead of zrt2. I appreciate any advice on how to overcome this problem. I am slightly familiar with awk as well if someone can show how to do it directly on gff3 file.

Thanks in advance.

gff3 txdb genomicfeatures pombe pombase • 13k views
ADD COMMENT
3
Entering edit mode
8.5 years ago

Use columns(txdb) to see all available columns:

> columns(txdb)
 [1] "CDSCHROM"   "CDSEND"     "CDSID"      "CDSNAME"    "CDSSTART"   "CDSSTRAND"  "EXONCHROM"  "EXONEND"    "EXONID"     "EXONNAME"  
[11] "EXONRANK"   "EXONSTART"  "EXONSTRAND" "GENEID"     "TXCHROM"    "TXEND"      "TXID"       "TXNAME"     "TXSTART"    "TXSTRAND"  
[21] "TXTYPE"

The list may change depending on which files you used to create the TxDB.

You can use transcripts(txdb) to get a dataframe with any of these columns, e.g. GENEID:

> yeast.transcripts.allcolums
GRanges object with 6692 ranges and 2 metadata columns:
         seqnames         ranges strand   |     tx_id     tx_name
            <Rle>      <IRanges>  <Rle>   | <integer> <character>
     [1]     chrI [  335,   649]      +   |         1     YAL069W
     [2]     chrI [  538,   792]      +   |         2   YAL068W-A
     [3]     chrI [ 2480,  2707]      +   |         3   YAL067W-A
     [4]     chrI [10091, 10399]      +   |         4     YAL066W
     [5]     chrI [12046, 12426]      +   |         5   YAL064W-B
     ...      ...            ...    ... ...       ...         ...
  [6688]     chrM [65770, 66174]      +   |      6688       Q0182
  [6689]     chrM [73758, 74513]      +   |      6689       Q0250
  [6690]     chrM [74495, 75984]      +   |      6690       Q0255
  [6691]     chrM [79213, 80022]      +   |      6691       Q0275
  [6692]     chrM [85554, 85709]      +   |      6692       Q0297

(yes, I am using S.cerevisiae as I was too lazy to search and download the S.pombe file :-) )

Now, when you do transcriptsBy, it actually returns you a named list. You can use names() to rewrite the names of the transcripts using any info from the transcripts object.

> yeast.transcripts = transcriptsBy(txdb, "gene")
> yeast.transcripts
GRangesList object of length 6692:
$Q0010 
GRanges object with 1 range and 2 metadata columns:
      seqnames       ranges strand |     tx_id     tx_name
         <Rle>    <IRanges>  <Rle> | <integer> <character>
  [1]     chrM [3952, 4338]      + |      6665       Q0010

$Q0017 
GRanges object with 1 range and 2 metadata columns:
      seqnames       ranges strand | tx_id tx_name
  [1]     chrM [4254, 4415]      + |  6666   Q0017

$Q0032 
GRanges object with 1 range and 2 metadata columns:
      seqnames         ranges strand | tx_id tx_name
  [1]     chrM [11667, 11957]      + |  6667   Q0032

...
<6689 more elements>
-------


### Let's rename the transcripts:
> names(yeast.transcripts) = yeast.transcripts.allcolumns$tx_id
> yeast.transcripts

GRangesList object of length 6692:
$1 
GRanges object with 1 range and 2 metadata columns:
      seqnames       ranges strand |     tx_id     tx_name
         <Rle>    <IRanges>  <Rle> | <integer> <character>
  [1]     chrM [3952, 4338]      + |      6665       Q0010

$2 
GRanges object with 1 range and 2 metadata columns:
      seqnames       ranges strand | tx_id tx_name
  [1]     chrM [4254, 4415]      + |  6666   Q0017

$3 
GRanges object with 1 range and 2 metadata columns:
      seqnames         ranges strand | tx_id tx_name
  [1]     chrM [11667, 11957]      + |  6667   Q0032

...
<6689 more elements>
-------
seqinfo: 17 sequences (1 circular) from an unspecified genome; no seqlengths

This works correctly because transcripts and transcriptsBy return objects of the same size.

If instead of grouping by genes you are grouping by anything else (e.g. exons), you could also use use.names=T in the transcriptsBy call.

ADD COMMENT
0
Entering edit mode

First of all thanks for a very thorough answer. I appreciate that. I ran all the options you recommended on my gff3 file. It seems that there is an issue with the gff3 file that txdb object is not created properly out of it. I have compared a txdb object created from gff3 file to a txdb created from biomaRt. Below are the outputs.

> library(biomaRt)
> listMarts(host="fungi.ensembl.org")
            biomart               version
1       fungal_mart           Fungal Mart
2 fungal_variations Fungal Variation Mart
> txdb<- makeTxDbFromBiomart(
+ ,biomart ="fungal_mart"
+ ,dataset = "spombe_eg_gene"
+ ,host="fungi.ensembl.org"
+ )
Download and preprocess the 'transcripts' data frame ... OK
Download and preprocess the 'splicings' data frame ... OK
Download and preprocess the 'genes' data frame ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... OK
Warning message:
In .normarg_makeTxDb_chrominfo(chrominfo, transcripts$tx_chrom,  :
  chromosome lengths and circularity flags are not available for this TxDb object
> genes(txdb, columns=c("TXID", "TXNAME", "GENEID"))
GRanges object with 7014 ranges and 3 metadata columns:
               seqnames             ranges strand   |          TXID          TXNAME          GENEID
                  <Rle>          <IRanges>  <Rle>   | <IntegerList> <CharacterList> <CharacterList>
   SPAC1002.01        I [1798347, 1799015]      +   |           508   SPAC1002.01.1     SPAC1002.01
   SPAC1002.02        I [1799061, 1800053]      +   |           509   SPAC1002.02.1     SPAC1002.02
  SPAC1002.03c        I [1799915, 1803141]      -   |          2073  SPAC1002.03c.1    SPAC1002.03c
  SPAC1002.04c        I [1803624, 1804491]      -   |          2074  SPAC1002.04c.1    SPAC1002.04c
  SPAC1002.05c        I [1804548, 1806797]      -   |          2075  SPAC1002.05c.1    SPAC1002.05c
           ...      ...                ...    ... ...           ...             ...             ...
    SPSNRNA.03        I [ 987570,  987825]      -   |          1851    SPSNRNA.03.1      SPSNRNA.03
    SPSNRNA.04       II [ 467233,  467361]      -   |          4595    SPSNRNA.04.1      SPSNRNA.04
    SPSNRNA.05       II [3236867, 3236986]      +   |          4106    SPSNRNA.05.1      SPSNRNA.05
    SPSNRNA.06        I [2562276, 2562427]      -   |          2305    SPSNRNA.06.1      SPSNRNA.06
    SPSNRNA.07       II [3958832, 3959086]      -   |          5528    SPSNRNA.07.1      SPSNRNA.07
  -------
  seqinfo: 6 sequences from an unspecified genome; no seqlengths
> length(transcripts(txdb))
[1] 7015
> length(transcriptsBy(txdb, "gene"))
[1] 7014

As you can see TXNAME and GENEIDs are what I expect and they are not gene names. Also, the length of transcripts is only 1 more than transcriptsBy.

Continue below:

ADD REPLY
0
Entering edit mode

Continued from above:

Now if I do the same with gff3 file the output is very different. Compare the TXNAME, GENEID and object lengths:

> txdb <- makeTxDbFromGFF("../schizosaccharomyces_pombe.chr.gff3", organism = "Schizosaccharomyces pombe", format = "gff3")
Import genomic features from the file as a GRanges object ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... OK
> genes(txdb, columns=c("TXID", "TXNAME", "GENEID"))
GRanges object with 5157 ranges and 3 metadata columns:
       seqnames             ranges strand |          TXID          TXNAME          GENEID
          <Rle>          <IRanges>  <Rle> | <IntegerList> <CharacterList> <CharacterList>
  aah2        I [4373039, 4374987]      - |          2805            aah2            aah2
  aah3      III [ 833581,  836015]      - |          6534            aah3            aah3
  aap1        I [4832879, 4834570]      + |          1351            aap1            aap1
  aar2        I [3436000, 3438166]      - |          2544            aar2            aar2
  aat1       II [ 111007,  113472]      - |          4487            aat1            aat1
   ...      ...                ...    ... .           ...             ...             ...
  zta1      III [1794719, 1796107]      - |          6779            zta1            zta1
  zuo1       II [3094849, 3096383]      - |          5286            zuo1            zuo1
  zwf1        I [1454782, 1456870]      + |           413            zwf1            zwf1
  zwf2      III [ 239463,  241999]      - |          6391            zwf2            zwf2
  zym1        I [2381690, 2382200]      + |           663            zym1            zym1
  -------
  seqinfo: 4 sequences (1 circular) from an unspecified genome; no seqlengths
> length(transcripts(txdb))
[1] 6987
> length(transcriptsBy(txdb, "gene"))
[1] 5157

I have the link to gff3 file in the original question. I appreciate if you or anyone else can have a look into the file. Probably it's not only my problem.

ADD REPLY
1
Entering edit mode

Where does your gff file comes from? It seems that the file itself contains gene symbols in the GENEID column, perhaps that's the source of the problem.

I would do a couple of things:

1) mergeByOverlaps of the two dataframes (the one from ensembl, and your file), and check whether the TXIDs are the same for each range. Or simply subset the two dataframes on the same region, to be sure that the TXID correspond. This cannot be seen from the examples you have posted, because the second dataset is not sorted.

2) You can easily get a dataframe with the correct gene names from the ensembl dataframe, and use it to replace the GENEID column in the other (e.g. mcols(genes(txdb.ensembl, columns=c("TXID", "TXNAME", "GENEID"))) etc...

The reason why you get different lengths in the second example is because the GENEID column is not unique, which is not surprising considering that it includes gene names.

ADD REPLY
0
Entering edit mode

I get it from pombase . The reason I insist on using gff3 is that because I used that file for feature counting, therefore I wanted to use the same annotation file for downstream analysis. Thanks for the help.

ADD REPLY
1
Entering edit mode
8.5 years ago
uma ▴ 10

Can you try it with this two version of gff3/gtf from our ensemblgenomes release 20 & release 31 and let me know if it works. there are some changes made in our latest gff format. Just try to check if that is affecting you

ftp://ftp.ensemblgenomes.org/pub/release-31/fungi/gff3/schizosaccharomyces_pombe/

ftp://ftp.ensemblgenomes.org/pub/release-20/fungi/gff3/schizosaccharomyces_pombe/

ADD COMMENT
0
Entering edit mode

Here are the outputs. I ran the same codes for each release. Release 20 shows GeneIDs in the output which is good. However both them seems to not produce proper data when transcriptsBy is used.

> txdb <- makeTxDbFromGFF("D:/Vbox_shared/Annotations_after_scilifelab/gff3_to_compare/Schizosaccharomyces_pombe.ASM294v2.20.gff3")
Import genomic features from the file as a GRanges object ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... OK
Warning messages:
1: In .extract_transcripts_from_GRanges(tx_IDX, gr, type, ID, Name) :
  The following transcripts have multiple parts that were merged: SPAC1F5.11c.1, SPCC737.08.1
2: In makeTxDbFromGRanges(gr, metadata = metadata) :
  The following transcripts were dropped because their exon ranks could not be inferred (either because the exons are not on the same
  chromosome/strand or because they are not separated by introns): SPAC1F5.11c.1, SPCC737.08.1
> genes(txdb, columns=c("TXID", "TXNAME", "GENEID"))
GRanges object with 29 ranges and 3 metadata columns:
                 seqnames             ranges strand   |          TXID          TXNAME          GENEID
                    <Rle>          <IRanges>  <Rle>   | <IntegerList> <CharacterList> <CharacterList>
   SPAC212.05c.1        I [  20824,   21015]      +   |             6   SPAC212.05c.1   SPAC212.05c.1
   SPAC212.07c.1        I [  13665,   14555]      +   |             3   SPAC212.07c.1   SPAC212.07c.1
   SPAC212.09c.1        I [   7619,    9274]      +   |             1   SPAC212.09c.1   SPAC212.09c.1
    SPAC212.10.1        I [   5726,    6331]      -   |          1579    SPAC212.10.1    SPAC212.10.1
  SPAC23D3.05c.1        I [4344572, 4346170]      -   |          2807  SPAC23D3.05c.1  SPAC23D3.05c.1
             ...      ...                ...    ... ...           ...             ...             ...
   SPCC576.16c.1      III [2109181, 2110528]      -   |          6875   SPCC576.16c.1   SPCC576.16c.1
    SPCC622.17.1      III [1435080, 1436649]      +   |          6064    SPCC622.17.1    SPCC622.17.1
   SPCC663.07c.1      III [1646645, 1647396]      -   |          6749   SPCC663.07c.1   SPCC663.07c.1
    SPCC830.02.1      III [2181204, 2182504]      +   |          6274    SPCC830.02.1    SPCC830.02.1
   SPCP20C8.03.1      III [  32624,   33393]      +   |          5683   SPCP20C8.03.1   SPCP20C8.03.1
  -------
  seqinfo: 6 sequences (1 circular) from an unspecified genome; no seqlengths
> length(transcripts(txdb))
[1] 7016
> length(transcriptsBy(txdb, "gene"))
[1] 29
> txdb <- makeTxDbFromGFF("D:/Vbox_shared/Annotations_after_scilifelab/gff3_to_compare/Schizosaccharomyces_pombe.ASM294v2.31.gff3")
Import genomic features from the file as a GRanges object ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... OK
Warning message:
In .local(con, format, text, ...) :
  gff-version directive indicates version is   3, not 3
> genes(txdb, columns=c("TXID", "TXNAME", "GENEID"))
GRanges object with 5170 ranges and 3 metadata columns:
       seqnames             ranges strand   |          TXID          TXNAME          GENEID
          <Rle>          <IRanges>  <Rle>   | <IntegerList> <CharacterList> <CharacterList>
  aah2        I [4373039, 4374987]      -   |          2813            aah2            aah2
  aah3      III [ 833581,  836015]      -   |          6553            aah3            aah3
  aap1        I [4832879, 4834570]      +   |          1355            aap1            aap1
  aar2        I [3436000, 3438166]      -   |          2551            aar2            aar2
  aat1       II [ 111007,  113472]      -   |          4495            aat1            aat1
   ...      ...                ...    ... ...           ...             ...             ...
  zta1      III [1794719, 1796107]      -   |          6798            zta1            zta1
  zuo1       II [3094849, 3096383]      -   |          5300            zuo1            zuo1
  zwf1        I [1454782, 1456870]      +   |           414            zwf1            zwf1
  zwf2      III [ 239463,  241999]      -   |          6409            zwf2            zwf2
  zym1        I [2381690, 2382200]      +   |           664            zym1            zym1
  -------
  seqinfo: 6 sequences (1 circular) from an unspecified genome; no seqlengths
> length(transcripts(txdb))
[1] 7015
> length(transcriptsBy(txdb, "gene"))
[1] 5172
ADD REPLY
0
Entering edit mode

Hi, this works for me:

gtf <- rtracklayer::import("Schizosaccharomyces_pombe.ASM294v2.37.gtf")
my_genes <- gtf[gtf$type == "gene"]
mcols(my_genes) <- mcols(my_genes)[c(5,6,8)]
ADD REPLY

Login before adding your answer.

Traffic: 1964 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6