Hello,
I'm working through my first batch of RNA-Seq analysis and unfortunately I don't have an experienced bioinformatician to work with. My question is regarding tximport of my quant.sf files into R. I have been working with the EquCab3.0 reference transcriptome from NCBI to generate these quant.sf files, but I have encountered a problem when importing into R.
Although the NCBI .fasta file has the NM_transcript identifiers, they don't appear to have unique delimiters for the corresponding gene ID's that I need to create the 2 column data.frame. Therefore, I ran the following from the EquCab3.0 gtf file:
library(rtracklayer)
gtf <- rtracklayer::import("GCF_002863925.1_EquCab3.0_genomic.gtf")
Then to generate the mapping table I ran
tx2gene = unique(data.frame(gtf[gtf$type=="exon"])[,c("transcript_id", "gene_id")])
head(tx2gene)
transcript_id gene_id
1 XM_005602135.3 SYCE1
14 XM_005602137.3 SYCE1
26 XM_023636208.1 LOC111772506
33 XM_023636216.1 LOC111772506
39 XM_023636230.1 LOC111772506
45 XM_023636202.1 LOC111772506
For reference, each of my 10 quant.sf files look like this:
head quant.sf
Name Length EffectiveLength TPM NumReads
NM_001081757.1 5648 5649.138 5.559883 1566.000
NM_001081758.1 5984 5841.368 0.028292 8.240
NM_001081759.1 5982 5654.000 0.000000 0.000
NM_001081760.1 5980 5652.000 0.000000 0.000
NM_001081761.1 5505 5177.000 0.000000 0.000
NM_001081762.1 4872 4553.211 0.010932 2.482
NM_001081763.1 4629 4471.577 13.722008 3059.301
NM_001081764.1 4553 4225.000 0.000000 0.000
NM_001081765.1 4125 3676.439 29.286268 5368.281
For the final code I ran the 'ignoreTxVersion = T' which gives an error message
count_data = tximport(files = sample_files,
type = "salmon",
tx2gene = tx2gene,
ignoreTxVersion = T)
reading in files with read_tsv
1 2 3 4 5 6 7 8 9 10
Error in .local(object, ...) :
None of the transcripts in the quantification files are present
in the first column of tx2gene. Check to see that you are using
the same annotation for both.
Example IDs (file): [NM_001081757, NM_001081758, NM_001081759, ...]
Example IDs (tx2gene): [XM_005602135.3, XM_005602137.3, XM_023636208.1, ...]
This can sometimes (not always) be fixed using 'ignoreTxVersion' or 'ignoreAfterBar'.
However, when I set 'ignoreTxVersion = F', it seems to work:
reading in files with read_tsv
1 2 3 4 5 6 7 8 9 10
summarizing abundance
summarizing counts
summarizing length
Is there something obvious I have missed here, or am I ok to move forward with the analysis? From what I've read, it's better to set this option to 'TRUE' for downstream analysis with DeSeq. Any advice is greatly appreciated.
Thanks all!
Cross-posted: https://support.bioconductor.org/p/9139164/