Hi Sharon,
I'm not sure that you even need the ignoreAfterBar parameter. If using the tx2gene parameter, though, then the first column of your tx2gene object has to be the exact transcript name as in the Salmon files. The second column, then, is what you want to convert these to.
For example, if you save the Salmon transcripts to a character vector (here, salmonCounts is any Salmon output file, such as quant.sf, read into a data frame):
transcriptsSalmon <- as.character(salmonCounts$Name)
head(transcriptsSalmon, 10)
ENST00000456328.2|ENSG00000223972.5|OTTHUMG00000000961.2|OTTHUMT00000362751.1|RP11-34P13.1-002|DDX11L1|1657|processed_transcript|
ENST00000450305.2|ENSG00000223972.5|OTTHUMG00000000961.2|OTTHUMT00000002844.2|RP11-34P13.1-001|DDX11L1|632|transcribed_unprocessed_pseudogene|
ENST00000488147.1|ENSG00000227232.5|OTTHUMG00000000958.1|OTTHUMT00000002839.1|RP11-34P13.2-001|WASH7P|1351|unprocessed_pseudogene|
ENST00000619216.1|ENSG00000278267.1|-|-|MIR6859-1-201|MIR6859-1|68|miRNA|
ENST00000473358.1|ENSG00000243485.5|OTTHUMG00000000959.2|OTTHUMT00000002840.1|RP11-34P13.3-001|RP11-34P13.3|712|lincRNA|
ENST00000469289.1|ENSG00000243485.5|OTTHUMG00000000959.2|OTTHUMT00000002841.2|RP11-34P13.3-002|RP11-34P13.3|535|lincRNA|
ENST00000607096.1|ENSG00000284332.1|-|-|MIR1302-2-201|MIR1302-2|138|miRNA|
ENST00000417324.1|ENSG00000237613.2|OTTHUMG00000000960.1|OTTHUMT00000002842.1|RP11-34P13.4-001|FAM138A|1187|lincRNA|
ENST00000461467.1|ENSG00000237613.2|OTTHUMG00000000960.1|OTTHUMT00000002843.1|RP11-34P13.4-002|FAM138A|590|lincRNA|
ENST00000606857.1|ENSG00000268020.3|OTTHUMG00000185779.1|OTTHUMT00000471235.1|RP11-34P13.17-001|OR4G4P|840|unprocessed_pseudogene|
We can then parse this and extract whatever information we want, information which will eventually be used as the rownames of our future txi object created by tximport via tx2gene:
tx2gene <- data.frame(
transcriptsSalmon,
do.call(rbind, strsplit(transcriptsSalmon, "\\|"))[,6])
colnames(tx2gene) <- c("transcript_id", "gene_id")
head(tx2gene, 10)
transcript_id gene_id
ENST00000456328.2|...|RP11-34P13.1-002|DDX11L1|1657|processed_transcript| DDX11L1
ENST00000450305.2|...|RP11-34P13.1-001|DDX11L1|632|transcribed_unprocessed_pseudogene| DDX11L1
ENST00000488147.1|...|RP11-34P13.2-001|WASH7P|1351|unprocessed_pseudogene| WASH7P
ENST00000619216.1|...|MIR6859-1-201|MIR6859-1|68|miRNA| MIR6859-1
ENST00000473358.1|...|RP11-34P13.3-001|RP11-34P13.3|712|lincRNA| RP11-34P13.3
ENST00000469289.1|...|RP11-34P13.3-002|RP11-34P13.3|535|lincRNA| RP11-34P13.3
ENST00000607096.1|...|MIR1302-2-201|MIR1302-2|138|miRNA| MIR1302-2
ENST00000417324.1|...|RP11-34P13.4-001|FAM138A|1187|lincRNA| FAM138A
ENST00000461467.1|...|RP11-34P13.4-002|FAM138A|590|lincRNA| FAM138A
ENST00000606857.1|...|RP11-34P13.17-001|OR4G4P|840|unprocessed_pseudogene| OR4G4P
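Before handing tx2gene to tximport, it may be worth confirming that every Salmon name really is in the first column, since the match has to be exact. A minimal sketch, assuming two toy names copied from the output above stand in for salmonCounts$Name; the files path is hypothetical, and the import itself only runs if the tximport package and that file are present:

```r
# Toy stand-ins for salmonCounts$Name (two names copied from the output above).
transcriptsSalmon <- c(
  "ENST00000456328.2|ENSG00000223972.5|OTTHUMG00000000961.2|OTTHUMT00000362751.1|RP11-34P13.1-002|DDX11L1|1657|processed_transcript|",
  "ENST00000619216.1|ENSG00000278267.1|-|-|MIR6859-1-201|MIR6859-1|68|miRNA|"
)

# Same parsing as above: split on the literal bar and keep field 6.
fields <- do.call(rbind, strsplit(transcriptsSalmon, "\\|"))
tx2gene <- data.frame(
  transcript_id = transcriptsSalmon,
  gene_id = fields[, 6],
  stringsAsFactors = FALSE
)

# Names present in the Salmon output but absent from tx2gene;
# a length of 0 means every transcript is covered.
missingFromMap <- setdiff(transcriptsSalmon, tx2gene$transcript_id)
length(missingFromMap)

# Hypothetical path, for illustration only. The guard skips the import
# when tximport is not installed or the file does not exist.
files <- c(sample1 = "salmon/sample1/quant.sf")
if (requireNamespace("tximport", quietly = TRUE) && all(file.exists(files))) {
  txi <- tximport::tximport(files, type = "salmon", tx2gene = tx2gene)
}
```

If missingFromMap is not empty, the names in it show which transcripts would fail to match.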
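The pitfall of using just the gene name is that it is not unique: several transcripts share one name (DDX11L1 appears twice above), so they all end up under the same row name. If you want to keep them apart, you could also merge fields: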
# split the names once and reuse the resulting matrix
fields <- do.call(rbind, strsplit(transcriptsSalmon, "\\|"))
tx2gene <- data.frame(transcriptsSalmon,
paste(fields[,1], fields[,6], sep="_"))
colnames(tx2gene) <- c("transcript_id", "gene_id")
tx2gene$gene_id[1:20]
[1] ENST00000456328.2_DDX11L1 ENST00000450305.2_DDX11L1
[3] ENST00000488147.1_WASH7P ENST00000619216.1_MIR6859-1
[5] ENST00000473358.1_RP11-34P13.3 ENST00000469289.1_RP11-34P13.3
[7] ENST00000607096.1_MIR1302-2 ENST00000417324.1_FAM138A
[9] ENST00000461467.1_FAM138A ENST00000606857.1_OR4G4P
[11] ENST00000642116.1_OR4G11P ENST00000492842.2_OR4G11P
[13] ENST00000641515.2_OR4F5 ENST00000335137.4_OR4F5
[15] ENST00000466430.5_RP11-34P13.7 ENST00000477740.5_RP11-34P13.7
[17] ENST00000471248.1_RP11-34P13.7 ENST00000610542.1_RP11-34P13.7
[19] ENST00000453576.2_RP11-34P13.7 ENST00000495576.1_RP11-34P13.8
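To see the difference between the two choices, one could count repeats in each candidate second column. A rough sketch, assuming three names copied from the first output stand in for the full vector; geneName and merged are illustrative variable names, not part of tximport:

```r
# Three Salmon names copied from the first output: two DDX11L1 transcripts
# and one WASH7P transcript.
transcriptsSalmon <- c(
  "ENST00000456328.2|ENSG00000223972.5|OTTHUMG00000000961.2|OTTHUMT00000362751.1|RP11-34P13.1-002|DDX11L1|1657|processed_transcript|",
  "ENST00000450305.2|ENSG00000223972.5|OTTHUMG00000000961.2|OTTHUMT00000002844.2|RP11-34P13.1-001|DDX11L1|632|transcribed_unprocessed_pseudogene|",
  "ENST00000488147.1|ENSG00000227232.5|OTTHUMG00000000958.1|OTTHUMT00000002839.1|RP11-34P13.2-001|WASH7P|1351|unprocessed_pseudogene|"
)

fields <- do.call(rbind, strsplit(transcriptsSalmon, "\\|"))
geneName <- fields[, 6]                                # gene name only
merged <- paste(fields[, 1], fields[, 6], sep = "_")   # transcript ID + gene name

# Gene names shared by more than one transcript.
unique(geneName[duplicated(geneName)])

# TRUE if any merged identifier occurs more than once.
anyDuplicated(merged) > 0
```

Note that with one identifier per transcript, nothing is summed across transcripts, so the rows of the result stay at transcript level.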
Does that make sense?
Kevin
A comprehensive answer from you, as usual. Nice to have you around, Sir!
Thanks bro. Have to empty the contents of my head before I'm too old to remember anything!
Hi Kevin, thanks so much, and sorry for the late reply, I was away. Yes, this makes sense. I understood where the problem came from but did not know how to solve it; I am still learning R. I will try that. Thanks so much :)
No problem. Trust that you are not working too much this weekend.
work doesn't leave me =D I just switched to something else :) :)