How tximport work with gencode transcripts?
2
1
Entering edit mode
6.4 years ago
Sharon ▴ 610

Hi All

How to use tximport with gencode?

I used Gencode transcripts for mouse looks like this:

>ENSMUST00000193812.1|ENSMUSG00000102693.1|OTTMUSG00000049935.1|OTTMUST00000127109.1|RP23-271O17.1-001|RP23-271O17.1|1070|TEC|
 AAGGAAAGAGGATAACACTTGAAATGTAAATAAAGAAAATACCTAATAAAAATAAATAAA
 AACATGCTTTCAAAGGAAATAAAAAGTTGGATTCAAAAATTTAACTTTTGCTCATTTGGT

Salmon quantification is like this;

Name    Length  EffectiveLength TPM     NumReads
ENSMUST00000193812.1|ENSMUSG00000102693.1|OTTMUSG00000049935.1|OTTMUST00000127109.1|RP23-271O17.1-001|RP23-271O17.1|1070|TEC|   1070    489.786 0.536463        7.4771
ENSMUST00000082908.1|ENSMUSG00000064842.1|-|-|Gm26206-201|Gm26206|110|snRNA|    110     28      0       0

I use this code to generate tx2gene:

library(TxDb.Mmusculus.UCSC.mm10.ensGene) 
txdb <- TxDb.Mmusculus.UCSC.mm10.ensGene)
columns(txdb)
k <- keys(txdb, keytype = "GENEID")
df <- select(txdb, keys = k,  columns = "TXNAME", keytype = "GENEID") 
tx2gene <- df[, 2:1]
head(tx2gene)
write.csv(tx2gene, 'MouseEnstx2gene.csv')

The MouseEnstx2gene.csv looks like this and this causes problem in further analysis:

"","TXNAME","GENEID"
"1","ENSMUST00000000001","ENSMUSG00000000001"
"2","ENSMUST00000000003","ENSMUSG00000000003"

I feel there is a tiny thing I am missing here that could fix this as it is not the same format as Salmon quantification ! Any hint? I just tried ignoreAfterBar from tximport and it doesn't work !

Error in tximport(files, type = "salmon", tx2gene = tx2gene, ignoreAfterBar = TRUE) : 
  unused argument (ignoreAfterBar = TRUE)

Thanks

RNA-Seq R • 4.9k views
ADD COMMENT
3
Entering edit mode
6.4 years ago

Hi Sharon,

I'm not sure that you even need the ignoreAfterBar parameter. If using the tx2gene parameter, though, then the first column of your tx2gene object has to be the exact transcript name as in the Salmon files. The second column, then, is what you wan to convert these to.

For example, if you save the Salmon transcripts to a character vector (here, salmonCounts is just any file output by Salmon):

transcriptsSalmon <- as.character(salmonCounts$Name)

head(transcriptsSalmon, 10)                                                                                                                                          
ENST00000456328.2|ENSG00000223972.5|OTTHUMG00000000961.2|OTTHUMT00000362751.1|RP11-34P13.1-002|DDX11L1|1657|processed_transcript|
ENST00000450305.2|ENSG00000223972.5|OTTHUMG00000000961.2|OTTHUMT00000002844.2|RP11-34P13.1-001|DDX11L1|632|transcribed_unprocessed_pseudogene|
ENST00000488147.1|ENSG00000227232.5|OTTHUMG00000000958.1|OTTHUMT00000002839.1|RP11-34P13.2-001|WASH7P|1351|unprocessed_pseudogene|
ENST00000619216.1|ENSG00000278267.1|-|-|MIR6859-1-201|MIR6859-1|68|miRNA|
ENST00000473358.1|ENSG00000243485.5|OTTHUMG00000000959.2|OTTHUMT00000002840.1|RP11-34P13.3-001|RP11-34P13.3|712|lincRNA|
ENST00000469289.1|ENSG00000243485.5|OTTHUMG00000000959.2|OTTHUMT00000002841.2|RP11-34P13.3-002|RP11-34P13.3|535|lincRNA|
ENST00000607096.1|ENSG00000284332.1|-|-|MIR1302-2-201|MIR1302-2|138|miRNA|
ENST00000417324.1|ENSG00000237613.2|OTTHUMG00000000960.1|OTTHUMT00000002842.1|RP11-34P13.4-001|FAM138A|1187|lincRNA|
ENST00000461467.1|ENSG00000237613.2|OTTHUMG00000000960.1|OTTHUMT00000002843.1|RP11-34P13.4-002|FAM138A|590|lincRNA|
ENST00000606857.1|ENSG00000268020.3|OTTHUMG00000185779.1|OTTHUMT00000471235.1|RP11-34P13.17-001|OR4G4P|840|unprocessed_pseudogene|

We can then parse this and extract whatever information we want, information which will eventually be used as the rownames of our future txi object created by tximport via tx2gene:

tx2gene <- data.frame(
  transcriptsSalmon,
  do.call(rbind, strsplit(transcriptsSalmon, "\\|"))[,6])
colnames(tx2gene) <- c("transcript_id", "gene_id")
head(tx2gene, 10)

transcript_id                                                                          gene_id
ENST00000456328.2|...|RP11-34P13.1-002|DDX11L1|1657|processed_transcript|              DDX11L1
ENST00000450305.2|...|RP11-34P13.1-001|DDX11L1|632|transcribed_unprocessed_pseudogene| DDX11L1
ENST00000488147.1|...|RP11-34P13.2-001|WASH7P|1351|unprocessed_pseudogene|             WASH7P
ENST00000619216.1|...|MIR6859-1-201|MIR6859-1|68|miRNA|                                MIR6859-1
ENST00000473358.1|...|RP11-34P13.3-001|RP11-34P13.3|712|lincRNA|                       RP11-34P13.3
ENST00000469289.1|..|RP11-34P13.3-002|RP11-34P13.3|535|lincRNA|                        RP11-34P13.3
ENST00000607096.1|...|MIR1302-2-201|MIR1302-2|138|miRNA|                               MIR1302-2
ENST00000417324.1|...|RP11-34P13.4-001|FAM138A|1187|lincRNA|                           FAM138A
ENST00000461467.1|...|RP11-34P13.4-002|FAM138A|590|lincRNA|                            FAM138A
ENST00000606857.1|...|RP11-34P13.17-001|OR4G4P|840|unprocessed_pseudogene|             OR4G4P

The pitfall of using just the gene name is that it is not unique, so, will result in issues. You could also merge fields:

tx2gene <- data.frame(transcriptsSalmon,
  paste(do.call(rbind, strsplit(transcriptsSalmon, "\\|"))[,1],
  do.call(rbind, strsplit(transcriptsSalmon, "\\|"))[,6], sep="_"))
colnames(tx2gene) <- c("transcript_id", "gene_id")
tx2gene$gene_id[1:20]

 [1] ENST00000456328.2_DDX11L1      ENST00000450305.2_DDX11L1     
 [3] ENST00000488147.1_WASH7P       ENST00000619216.1_MIR6859-1   
 [5] ENST00000473358.1_RP11-34P13.3 ENST00000469289.1_RP11-34P13.3
 [7] ENST00000607096.1_MIR1302-2    ENST00000417324.1_FAM138A     
 [9] ENST00000461467.1_FAM138A      ENST00000606857.1_OR4G4P      
[11] ENST00000642116.1_OR4G11P      ENST00000492842.2_OR4G11P     
[13] ENST00000641515.2_OR4F5        ENST00000335137.4_OR4F5       
[15] ENST00000466430.5_RP11-34P13.7 ENST00000477740.5_RP11-34P13.7
[17] ENST00000471248.1_RP11-34P13.7 ENST00000610542.1_RP11-34P13.7
[19] ENST00000453576.2_RP11-34P13.7 ENST00000495576.1_RP11-34P13.8

Does that make sense?

Kevin

ADD COMMENT
1
Entering edit mode

A comprehensive answer from you, as usual. Nice to have you around, Sir!

ADD REPLY
1
Entering edit mode

Thanks bro. Have to empty the contents of my head before I'm too old to remember anything!

ADD REPLY
0
Entering edit mode

Hi Kevin Thanks so much and sorry for the late reply, was away. Yes, this makes sense. I understand where the problem came from but did not how to solve, am still learning R. I will try that. Thanks so much :)

ADD REPLY
1
Entering edit mode

No problem. Trust that you are not working too much this weekend.

ADD REPLY
1
Entering edit mode

work doesn't leave me =D I just switched to something else :) :)

ADD REPLY

Login before adding your answer.

Traffic: 1862 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6