Use columns(txdb) to see all available columns:
> columns(txdb)
[1] "CDSCHROM" "CDSEND" "CDSID" "CDSNAME" "CDSSTART" "CDSSTRAND" "EXONCHROM" "EXONEND" "EXONID" "EXONNAME"
[11] "EXONRANK" "EXONSTART" "EXONSTRAND" "GENEID" "TXCHROM" "TXEND" "TXID" "TXNAME" "TXSTART" "TXSTRAND"
[21] "TXTYPE"
The list may change depending on which files you used to create the TxDB.
You can use transcripts(txdb) to get a dataframe with any of these columns, e.g. GENEID:
> yeast.transcripts.allcolums
GRanges object with 6692 ranges and 2 metadata columns:
seqnames ranges strand | tx_id tx_name
<Rle> <IRanges> <Rle> | <integer> <character>
[1] chrI [ 335, 649] + | 1 YAL069W
[2] chrI [ 538, 792] + | 2 YAL068W-A
[3] chrI [ 2480, 2707] + | 3 YAL067W-A
[4] chrI [10091, 10399] + | 4 YAL066W
[5] chrI [12046, 12426] + | 5 YAL064W-B
... ... ... ... ... ... ...
[6688] chrM [65770, 66174] + | 6688 Q0182
[6689] chrM [73758, 74513] + | 6689 Q0250
[6690] chrM [74495, 75984] + | 6690 Q0255
[6691] chrM [79213, 80022] + | 6691 Q0275
[6692] chrM [85554, 85709] + | 6692 Q0297
(yes, I am using S.cerevisiae as I was too lazy to search and download the S.pombe file :-) )
Now, when you do transcriptsBy, it actually returns you a named list. You can use names() to rewrite the names of the transcripts using any info from the transcripts object.
> yeast.transcripts = transcriptsBy(txdb, "gene")
> yeast.transcripts
GRangesList object of length 6692:
$Q0010
GRanges object with 1 range and 2 metadata columns:
seqnames ranges strand | tx_id tx_name
<Rle> <IRanges> <Rle> | <integer> <character>
[1] chrM [3952, 4338] + | 6665 Q0010
$Q0017
GRanges object with 1 range and 2 metadata columns:
seqnames ranges strand | tx_id tx_name
[1] chrM [4254, 4415] + | 6666 Q0017
$Q0032
GRanges object with 1 range and 2 metadata columns:
seqnames ranges strand | tx_id tx_name
[1] chrM [11667, 11957] + | 6667 Q0032
...
<6689 more elements>
-------
### Let's rename the transcripts:
> names(yeast.transcripts) = yeast.transcripts.allcolumns$tx_id
> yeast.transcripts
GRangesList object of length 6692:
$1
GRanges object with 1 range and 2 metadata columns:
seqnames ranges strand | tx_id tx_name
<Rle> <IRanges> <Rle> | <integer> <character>
[1] chrM [3952, 4338] + | 6665 Q0010
$2
GRanges object with 1 range and 2 metadata columns:
seqnames ranges strand | tx_id tx_name
[1] chrM [4254, 4415] + | 6666 Q0017
$3
GRanges object with 1 range and 2 metadata columns:
seqnames ranges strand | tx_id tx_name
[1] chrM [11667, 11957] + | 6667 Q0032
...
<6689 more elements>
-------
seqinfo: 17 sequences (1 circular) from an unspecified genome; no seqlengths
This works correctly because transcripts and transcriptsBy return objects of the same size.
If instead of grouping by genes you are grouping by anything else (e.g. exons), you could also use use.names=T in the transcriptsBy call.
First of all thanks for a very thorough answer. I appreciate that. I ran all the options you recommended on my gff3 file. It seems that there is an issue with the gff3 file that txdb object is not created properly out of it. I have compared a txdb object created from gff3 file to a txdb created from biomaRt. Below are the outputs.
As you can see
TXNAME
andGENEIDs
are what I expect and they are not gene names. Also, the length oftranscripts
is only 1 more thantranscriptsBy
.Continue below:
Continued from above:
Now if I do the same with gff3 file the output is very different. Compare the
TXNAME
,GENEID
and objectlengths
:I have the link to gff3 file in the original question. I appreciate if you or anyone else can have a look into the file. Probably it's not only my problem.
Where does your gff file comes from? It seems that the file itself contains gene symbols in the GENEID column, perhaps that's the source of the problem.
I would do a couple of things:
1) mergeByOverlaps of the two dataframes (the one from ensembl, and your file), and check whether the TXIDs are the same for each range. Or simply subset the two dataframes on the same region, to be sure that the TXID correspond. This cannot be seen from the examples you have posted, because the second dataset is not sorted.
2) You can easily get a dataframe with the correct gene names from the ensembl dataframe, and use it to replace the GENEID column in the other (e.g. mcols(genes(txdb.ensembl, columns=c("TXID", "TXNAME", "GENEID"))) etc...
The reason why you get different lengths in the second example is because the GENEID column is not unique, which is not surprising considering that it includes gene names.
I get it from pombase . The reason I insist on using gff3 is that because I used that file for feature counting, therefore I wanted to use the same annotation file for downstream analysis. Thanks for the help.