error with Tximport when txOut = TRUE
16 months ago
mbansal • 0

Hello everyone,

I am having an issue when trying to aggregate transcript abundances to the gene level (txOut=FALSE), but it works fine with txOut=TRUE.

Here are the steps I followed:

  1. Produced BAM files using the GENCODE transcript FASTA file.
  2. Sorted and indexed them.
  3. Used NanoCount to produce the abundance.tsv file.
  4. Created a data frame from the GENCODE GTF file containing only two columns, target_id and gene_name.
  5. Read the quantification file from its path.

Here is the input TSV file:

read_tsv(path)

     target_id              raw est_count    tpm transcript_length
       <chr>                <dbl>     <dbl>  <dbl>             <dbl>
     1 ENST00000362079.2  0.0109      4264. 10883.               784
     2 ENST00000343262.9  0.0102      3985. 10172.               945
     3 ENST00000567815.5  0.00993     3891.  9931.               712
     4 ENST00000501597.3  0.00902     3533.  9018.               469
     5 ENST00000651669.1  0.00826     3235.  8257.               352
     6 ENST00000361624.2  0.00790     3096.  7902.              1542
     7 ENST00000270625.7  0.00782     3066.  7825.               573
     8 ENST00000222247.10 0.00779     3051.  7788.               634
     9 ENST00000229239.10 0.00753     2949.  7526.              1285
    10 ENST00000530721.5  0.00744     2917.  7445.               796

Here is the tx2gene table for aggregating transcript abundances to the gene level:


tx_gencode_gtf_local
# A tibble: 3,424,907 × 2
   target_id         gene_name
   <chr>             <chr>    
 1 NA                DDX11L2  
 2 ENST00000456328.2 DDX11L2
 3 ENST00000456328.2 DDX11L2
 4 ENST00000456328.2 DDX11L2
 5 ENST00000456328.2 DDX11L2
 6 NA                DDX11L1
 7 ENST00000450305.2 DDX11L1
 8 ENST00000450305.2 DDX11L1
 9 ENST00000450305.2 DDX11L1
10 ENST00000450305.2 DDX11L1

When I run this command:

Txi_gene <- tximport(path, 
                             type = "none", 
                             tx2gene = tx_gencode_gtf_local,
                             txIdCol = "transcript_id",
                             abundanceCol = "tpm",
                             countsCol = "raw",
                             lengthCol = "transcript_length",
                             countsFromAbundance = "lengthScaledTPM",
                             txOut = FALSE, 
                             ignoreTxVersion = TRUE)

I get this error:

Error in .local(object, ...) : 
  None of the transcripts in the quantification files are present
  in the first column of tx2gene. Check to see that you are using
  the same annotation for both.

Example IDs (file): [, ...]

Example IDs (tx2gene): [ENST00000456328.2, ENST00000450305.2, ENST00000488147.1, ...]

  This can sometimes (not always) be fixed using 'ignoreTxVersion' or 'ignoreAfterBar'.

When I tried txIdCol = "target_id", it showed an error with txOut=TRUE, so I tried txIdCol = "transcript_id" instead, but that did not work either. I also tried ignoreTxVersion=FALSE when the ENST IDs in tx_gencode_gtf_local do not have version information, but it failed with the same error.

Thank you

Mohit

tximport R • 2.7k views
ADD COMMENT

when I tried txIdCol = "target_id", it showed error with txOut=TRUE

What was the exact error? This should have worked.

ADD REPLY

Is there a known solution to the issue discussed above? I am also getting the following error:

Error in tximport(files, type = "salmon", txOut = TRUE, importer = my_read_tsv,  : 
  all(c(abundanceCol, countsCol, lengthCol) %in% names(raw)) is not TRUE

Thanks.

ADD REPLY
16 months ago
Michael Love ★ 2.6k

These arguments

txIdCol = "transcript_id",
abundanceCol = "tpm",
countsCol = "raw",
lengthCol = "transcript_length",

have to be correct names of columns of the quantification files.

I would also remove NA from tx2gene table. I don't think it matters but just in case.

You don't need to use ignoreTxVersion=TRUE unless you have mismatched version information across quantification file and tx2gene (that is, the quantification file has versioned transcripts but the tx2gene does not).

You probably want txOut=FALSE (I assume, since you are bothering to provide a tx2gene table).
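
As a rough illustration of the point about column names, the sketch below reads only the header of a quantification file and checks that every name you plan to pass to tximport is present. The toy file and its values are made up for the example, and the column names are assumed to match the NanoCount output shown in the question.

```r
# Toy stand-in for a NanoCount abundance.tsv (values are illustrative only)
quant_file <- tempfile(fileext = ".tsv")
toy <- data.frame(
  target_id = c("ENST00000362079.2", "ENST00000343262.9"),
  raw = c(0.0109, 0.0102),
  est_count = c(4264, 3985),
  tpm = c(10883, 10172),
  transcript_length = c(784, 945),
  stringsAsFactors = FALSE
)
write.table(toy, quant_file, sep = "\t", quote = FALSE, row.names = FALSE)

# Column names we intend to pass to tximport()
wanted <- c(txIdCol = "target_id", abundanceCol = "tpm",
            countsCol = "est_count", lengthCol = "transcript_length")

# Read just the header (plus one data row) and compare
header <- names(read.delim(quant_file, nrows = 1))
missing_cols <- setdiff(wanted, header)
if (length(missing_cols) > 0) {
  stop("Not found in file: ", paste(missing_cols, collapse = ", "))
}
```

If the check stops with a message, the listed names are the ones to correct before calling tximport.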

ADD COMMENT

Oops, I didn't read OP's question clearly; they do mention "trying to aggregate transcript abundances to the gene level". They seem to have lost track of their original requirement in their trial-and-error troubleshooting journey.

ADD REPLY

Here are the steps I followed, but they give an error:

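library(dplyr)  # provides select(), rename(), distinct(), %>%, as_tibble()

# Import the GENCODE annotation and flatten it to a data frame
gtf <- rtracklayer::import('gencode.v43.annotation.gtf')
gtf_df <- as.data.frame(gtf)
# Keep the transcript ID and gene name, renaming the ID column to target_id
tx_gencode_gtf <- dplyr::select(gtf_df, "transcript_id", "gene_name")
tx_gencode_gtf <- dplyr::rename(tx_gencode_gtf, target_id = transcript_id)
tx_gencode_gtf <- as_tibble(tx_gencode_gtf)
# Drop rows containing NA, then keep one row per transcript
clean_tx_gencode_gtf <- na.omit(tx_gencode_gtf)
clean_tx_gencode_gtf_unique <- clean_tx_gencode_gtf %>%
  distinct(target_id, .keep_all = TRUE)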

Followed by:

Txi_gene <- tximport(
  path,
  type = "none",
  tx2gene = "clean_tx_gencode_gtf_unique",
  txOut = FALSE,
  countsFromAbundance = "lengthScaledTPM",
  txIdCol = "transcript_name",  # Specify the correct column name for transcript IDs
  abundanceCol = "tpm",
  countsCol = "est_count",
  lengthCol = "transcript_length",
  ignoreTxVersion = FALSE,
  ignoreAfterBar = TRUE,
  importer = read_tsv
)

Error:

chr (1): transcript_name
dbl (4): raw, est_count, tpm, transcript_length

Error in tximport(path, type = "none", tx2gene = "clean_tx_gencode_gtf_unique",  : 
  all(txId == raw[[txIdCol]]) is not TRUE

Note: I tried with both clean_tx_gencode_gtf and clean_tx_gencode_gtf_unique, but both give the same error. The abundance.tsv files were all generated using the same method.

Here is the structure of the abundance.tsv file: [image not included]

ADD REPLY

Do you not see that the transcript_name field has a bunch of stuff that's not just the transcript identifier? Just go back to your previous code and change txIdCol to "target_id" and txOut to FALSE.

ADD REPLY

Yes. Earlier I removed everything after | and used the transcript IDs to aggregate to the gene level, but it did not work. Now, with the transcript names shown above, I used ignoreAfterBar = TRUE to ignore everything after |, but as you can see it gave an error. I will remove everything after | in Excel, change txIdCol to "target_id" and txOut to FALSE, and test it again.

ADD REPLY

Don't use Excel. It's a bad tool for processing large plain text content. Use sed/awk instead.

If ignoreAfterBar takes care of the messy data in the column, the error probably happens because there's some transcript ID that is only present in the GTF file (tx2gene object) or the input files. Check if that's the case.
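
A minimal sketch of that check in base R, in case it helps: strip everything after the first "|" and then list the IDs found on only one side. The pipe-delimited IDs below are hypothetical stand-ins for what a GENCODE-derived quantification file might contain; replace them with the ID column from your file and the first column of your tx2gene table.

```r
# Hypothetical IDs as they might appear in the quantification file
file_ids <- c(
  "ENST00000456328.2|GENE_A|",
  "ENST00000450305.2|GENE_B|",
  "ENST00000999999.1|MADE_UP|"   # invented ID, absent from tx2gene
)
# First column of the tx2gene table
tx2gene_ids <- c("ENST00000456328.2", "ENST00000450305.2", "ENST00000488147.1")

# Keep only the text before the first "|"
clean_ids <- sub("\\|.*$", "", file_ids)

# IDs present on one side only
only_in_file <- setdiff(clean_ids, tx2gene_ids)
only_in_tx2gene <- setdiff(tx2gene_ids, clean_ids)

print(only_in_file)
print(only_in_tx2gene)
```

If both results are empty, the two ID sets agree and the mismatch lies elsewhere.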

ADD REPLY

Here is my updated abundance.tsv file:

    transcript_name        raw est_count    tpm transcript_length
   <chr>                <dbl>     <dbl>  <dbl>             <dbl>
 1 ENST00000362079.2  0.0109      4264. 10883.               784
 2 ENST00000343262.9  0.0102      3985. 10172.               945
 3 ENST00000567815.5  0.00993     3891.  9931.               712
 4 ENST00000501597.3  0.00902     3533.  9018.               469
 5 ENST00000651669.1  0.00826     3235.  8257.               352
 6 ENST00000361624.2  0.00790     3096.  7902.              1542
 7 ENST00000270625.7  0.00782     3066.  7825.               573
 8 ENST00000222247.10 0.00779     3051.  7788.               634
 9 ENST00000229239.10 0.00753     2949.  7526.              1285
10 ENST00000530721.5  0.00744     2917.  7445.               796

Here is the two-column data.frame with transcript IDs in the first column and gene names in the second:

clean_tx_gencode_gtf_unique
# A tibble: 252,913 × 2
   target_id         gene_name  
   <chr>             <chr>      
 1 ENST00000456328.2 DDX11L2    
 2 ENST00000450305.2 DDX11L1    
 3 ENST00000488147.1 WASH7P     
 4 ENST00000619216.1 MIR6859-1  
 5 ENST00000473358.1 MIR1302-2HG
 6 ENST00000469289.1 MIR1302-2HG
 7 ENST00000607096.1 MIR1302-2  
 8 ENST00000417324.1 FAM138A    
 9 ENST00000461467.1 FAM138A    
10 ENST00000606857.1 OR4G4P

tximport:

Txi_gene <- tximport(
  path,
  type = "none",
  tx2gene = "clean_tx_gencode_gtf_unique",
  txOut = FALSE,
  countsFromAbundance = "lengthScaledTPM",
  txIdCol = "target_id",  # Specify the correct column name for transcript IDs
  abundanceCol = "tpm",
  countsCol = "est_count",
  lengthCol = "transcript_length",
  ignoreTxVersion = FALSE,
  importer = read_tsv
)

It gives this error:

Error in `colnames<-`(`*tmp*`, value = c("tx", "gene")) : 
  attempt to set 'colnames' on an object with less than two dimensions
ADD REPLY

tx2gene takes an object, not a string. It might technically be programmed to work but try passing just clean_tx_gencode_gtf_unique and not "clean_tx_gencode_gtf_unique".
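
To illustrate the difference, here is a small base R sketch (the table contents are toy values). A quoted name is just a length-one character vector with no dimensions, so trying to set two column names on it fails with the same message as in the error above, whereas the data frame itself accepts them.

```r
tx2gene_tbl <- data.frame(
  target_id = c("ENST00000456328.2", "ENST00000450305.2"),
  gene_name = c("DDX11L2", "DDX11L1"),
  stringsAsFactors = FALSE
)

as_string <- "tx2gene_tbl"   # a string that merely spells the object's name

print(dim(tx2gene_tbl))      # two rows, two columns
print(dim(as_string))        # NULL: a plain string has no dimensions

# Renaming columns works on the data frame...
colnames(tx2gene_tbl) <- c("tx", "gene")

# ...but fails on the string; capture the message instead of stopping
msg <- tryCatch(
  {
    colnames(as_string) <- c("tx", "gene")
    "no error"
  },
  error = function(e) conditionMessage(e)
)
print(msg)
```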

ADD REPLY

Hello Ram,

I tried everything I could think of, but it still gives an error.

When I run:

Txi_gene <- tximport(
  path,
  type = "none",
  tx2gene = "clean_tx_gencode_gtf",
  txOut = FALSE,
  countsFromAbundance = "lengthScaledTPM",
  txIdCol = "target_id",  # Specify the correct column name for transcript IDs
  abundanceCol = "tpm",
  countsCol = "est_count",
  lengthCol = "transcript_length",
  ignoreTxVersion = FALSE,
  importer = read_tsv
)

It gives this error:

Error in `colnames<-`(`*tmp*`, value = c("tx", "gene")) : 
  attempt to set 'colnames' on an object with less than two dimensions

And when I use:

Txi_gene <- tximport(
  path,
  type = "none",
  tx2gene = clean_tx_gencode_gtf,
  txOut = FALSE,
  countsFromAbundance = "lengthScaledTPM",
  txIdCol = "target_id",  # Specify the correct column name for transcript IDs
  abundanceCol = "tpm",
  countsCol = "est_count",
  lengthCol = "transcript_length",
  ignoreTxVersion = TRUE,
  importer = read_tsv
)

removing duplicated transcript rows from tx2gene
Error in .local(object, ...) : 
  None of the transcripts in the quantification files are present
  in the first column of tx2gene. Check to see that you are using
  the same annotation for both.

Example IDs (file): [, ...]

Example IDs (tx2gene): [ENST00000456328.2, ENST00000450305.2, ENST00000488147.1, ...]

  This can sometimes (not always) be fixed using 'ignoreTxVersion' or 'ignoreAfterBar'.

Thank you
Mohit

ADD REPLY

The solution is suggested to you in the error: set ignoreAfterBar to TRUE like you did before.

ADD REPLY

Hi Mohit,

Try making the column for the transcripts the same name. Call them both transcript_id.

Also, don't use "lengthScaledTPM" as you're using this with long read data. Perhaps try only inputting the transcript_id and est_count columns from NanoCount.
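
To show what a counts-only gene-level summary amounts to, here is a rough base R sketch that sums est_count per gene. It is a stand-in for illustration, not what tximport does internally, and the transcript IDs, gene names and counts are invented.

```r
# Toy per-transcript counts (invented values)
quant <- data.frame(
  transcript_id = c("T1", "T2", "T3", "T4"),
  est_count = c(10, 5, 7, 2),
  stringsAsFactors = FALSE
)
# Toy transcript-to-gene table; T4 is deliberately missing
tx2gene <- data.frame(
  transcript_id = c("T1", "T2", "T3"),
  gene_name = c("G1", "G1", "G2"),
  stringsAsFactors = FALSE
)

# Look up the gene for each quantified transcript
idx <- match(quant$transcript_id, tx2gene$transcript_id)
keep <- !is.na(idx)   # drop transcripts with no gene assignment

# Sum counts within each gene; the result is a one-column matrix
gene_counts <- rowsum(quant$est_count[keep],
                      group = tx2gene$gene_name[idx[keep]])
print(gene_counts)
```

Transcripts without a match in the tx2gene table (T4 here) are dropped before summing, so it is worth checking how many fall out.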

Josie.

ADD REPLY

Hello Josie,

Here is the tsv file:

A tibble: 1,010,956 × 5
   target_id          raw est_count    tpm transcript_length
   <chr>                <dbl>     <dbl>  <dbl>             <dbl>
 1 ENST00000362079.2  0.0109      4264. 10883.               784
 2 ENST00000343262.9  0.0102      3985. 10172.               945
 3 ENST00000567815.5  0.00993     3891.  9931.               712
 4 ENST00000501597.3  0.00902     3533.  9018.               469
 5 ENST00000651669.1  0.00826     3235.  8257.               352
 6 ENST00000361624.2  0.00790     3096.  7902.              1542

Here is the tx2gene data frame, which has two columns, target_id and gene_name:

tx_gencode_gtf_local
# A tibble: 252,913 × 2

   target_id          gene_name
   <chr>              <chr>    
 1 ENST00000362079.2  MT-CO3   
 2 ENST00000361739.1  MT-CO2   
 3 ENST00000387347.2  MT-RNR2  
 4 ENST00000343262.9  RPS2     
 5 ENST00000229239.10 GAPDH    
 6 ENST00000361624.2  MT-CO1   
 7 ENST00000501597.3  RPL41    
 8 ENST00000567815.5  RPL13    
 9 ENST00000651669.1  RPS27    
10 ENST00000361899.2  MT-ATP6

When I try to run tximport with only est_count (with countsFromAbundance, abundanceCol, and lengthCol commented out), it gives this error:

Error in c(abundanceCol, countsCol, lengthCol) in names(raw) : argument "abundanceCol" is missing, with no default

Txi_gene <- tximport(
+   path,
+   type = "none",
+   tx2gene = "tx_gencode_gtf_local",
+   txOut = FALSE,
+   #countsFromAbundance = "lengthScaledTPM",
+   txIdCol = "target_id",  # Specify the correct column name for transcript IDs
+   #abundanceCol = "tpm",
+   countsCol = "est_count",
+   #lengthCol = "transcript_length",
+   ignoreTxVersion = FALSE,
+   #ignoreAfterBar = TRUE,
+   importer = read_tsv
+ )
Rows: 252739 Columns: 5                                                                                                                
Column specification 

Delimiter: "\t"
chr (1): transcript_id
dbl (4): raw, est_count, tpm, transcript_length

Use spec() to retrieve the full column specification for this data.
Specify the column types or set `show_col_types = FALSE` to quiet this message.
Error in c(abundanceCol, countsCol, lengthCol) in names(raw) : 
  argument "abundanceCol" is missing, with no default
ADD REPLY

What are you doing? You're not following anyone's recommendations but just messing up in new and different ways.

ADD REPLY

I followed your suggestions, but they did not work. I then followed Josie's suggestion and commented out countsFromAbundance, abundanceCol, and lengthCol, but it still gives an error. I am happy to share the abundance file if you want to try it at your end.

ADD REPLY

Did you try the last thing I mentioned:

Txi_gene <- tximport(
  path,
  type = "none",
  tx2gene = clean_tx_gencode_gtf,
  txOut = FALSE,
  countsFromAbundance = "lengthScaledTPM",
  txIdCol = "target_id",  # Specify the correct column name for transcript IDs
  abundanceCol = "tpm",
  countsCol = "est_count",
  lengthCol = "transcript_length",
  ignoreTxVersion = TRUE,
  ignoreAfterBar = TRUE, # the change I asked you to make
  importer = read_tsv
)
ADD REPLY

Yes. I tested it again,

> Txi_gene <- tximport(
+   path,
+   type = "none",
+   tx2gene = clean_tx_gencode_gtf,
+   txOut = FALSE,
+   countsFromAbundance = "lengthScaledTPM",
+   txIdCol = "target_id",  # Specify the correct column name for transcript IDs
+   abundanceCol = "tpm",
+   countsCol = "est_count",
+   lengthCol = "transcript_length",
+   ignoreTxVersion = TRUE,
+   ignoreAfterBar = TRUE,
+   importer = read_tsv
+ )
Rows: 252913 Columns: 5                                                                                                                
Column specification 
Delimiter: "\t"
chr (1): target_id
dbl (4): raw, est_count, tpm, transcript_length
     Use `spec()` to retrieve the full column specification for this data.
 Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 252913 Columns: 5                                                                                                                
Column specification 
Delimiter: "\t"
chr (1): target_id
dbl (4): raw, est_count, tpm, transcript_length

 Use `spec()` to retrieve the full column specification for this data.
Specify the column types or set `show_col_types = FALSE` to quiet this message.
Error in tximport(path, type = "none", tx2gene = clean_tx_gencode_gtf,  : 
  all(txId == raw[[txIdCol]]) is not TRUE
ADD REPLY

Did you also do the other thing I mentioned before:

the error probably happens because there's some transcript ID that is only present in the GTF file (tx2gene object) or the input files. Check if that's the case.
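
A related thing worth ruling out, offered as an assumption rather than something confirmed in this thread: the message `all(txId == raw[[txIdCol]]) is not TRUE` reads like a check that every input file lists the same transcript IDs in the same order, and the column specification printed twice in the log above suggests two files were read. The toy sketch below shows how to test for that and how sorting by ID would give the tables a common row order; the IDs and counts are invented.

```r
# Two toy quantification tables: same transcripts, different row order
q1 <- data.frame(target_id = c("tx1", "tx2", "tx3"),
                 est_count = c(10, 5, 0),
                 stringsAsFactors = FALSE)
q2 <- data.frame(target_id = c("tx2", "tx1", "tx3"),
                 est_count = c(7, 3, 1),
                 stringsAsFactors = FALSE)

# Same set of IDs, regardless of order?
same_set <- setequal(q1$target_id, q2$target_id)
# Same IDs in exactly the same order?
same_order <- identical(q1$target_id, q2$target_id)

# Sorting each table by transcript ID gives a common row order
q1 <- q1[order(q1$target_id), ]
q2 <- q2[order(q2$target_id), ]
same_order_after <- identical(q1$target_id, q2$target_id)

print(c(same_set = same_set, same_order = same_order,
        same_order_after = same_order_after))
```

If the sets themselves differ between files, sorting alone will not help; each file would first need to be expanded or filtered to a shared list of transcripts.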

ADD REPLY

Also, you did not follow Josie's recommendations fully, as you changed the value for the txIdCol parameter but not the corresponding colname of the tsv data. Nor did you include just the transcript_id and est_count columns like they asked you to, so you have not followed anyone's suggestions completely. Doing just a part of what people ask you to do will just give you different errors and annoy people trying to help you.

ADD REPLY

Apologies for bothering you. I appreciate your help with fixing this issue. Josie asked me to use only two columns from NanoCount: (1) target_id and (2) est_count. I did exactly as she suggested, but it did not help. Instead of removing the other columns from the input TSV files, I just commented out the corresponding options to ignore them. Please correct me if I am wrong. It was my mistake to leave transcript_id in the tximport command, but I have changed it to target_id and it still shows an error.

ADD REPLY

You're not bothering me or anyone, but you need to mention that a proposed solution was tried and did not work and also show the exact error. You were doing that earlier but stopped and that got me confused.

ADD REPLY

I think this is a good suggestion but just ended up confusing OP who was one step away from a working solution. Also, this comment should have been added at a different location, as it doesn't contextually follow the discussion that comes before it.

ADD REPLY
