isoformSwithAnalyzeR importRdata error
0
0
Entering edit mode
10 months ago
bsmith030465 ▴ 240

Hi,

I was trying to use IsoformSwitchAnalyzeR (v2.2.0) for differential transcript expression. I used kallisto to create the abundance (h5 & tsv) files. However, I get an error when I try to use importRdata(). I used the same fa file in importRdata(), as the once I used to build the kallisto index. Here is my code and the error:

> kal.quant <- importIsoformExpression(kallisto.dir,addIsofomIdAsColumn=T)
> names(kal.quant)
[1] "abundance"     "counts"        "infReps"       "length"        "importOptions"

> aSwitchList <- importRdata(
+   isoformCountMatrix   = kal.quant$counts,
+   isoformRepExpression = kal.quant$abundance,
+   ignoreAfterBar = TRUE,
+   designMatrix         = mdata.iso,
+   isoformExonAnnoation = paste0(ann.dir, 'gencode.v44.chr_patch_hapl_scaff.annotation.gtf.gz'),
+   isoformNtFasta = paste0(ann.dir,'gencode.v44.pc_transcripts.fa.gz')
+ )
Step 1 of 10: Checking data...
Step 2 of 10: Obtaining annotation...
importing GTF (this may take a while)...
Error in importRdata(isoformCountMatrix = kal.quant$counts, isoformRepExpression =     kal.quant$abundance,  : 
The annotation and quantification (count/abundance matrix and isoform annotation) seems to     be different (Jaccard similarity < 0.925). 
Either isforoms found in the annotation are not quantifed or vise versa. 
Specifically:
 110962 isoforms were quantified.
 170712 isoforms are annotated.
 Only 110962 overlap.
 0 isoforms quantifed had no corresponding annoation

This combination cannot be analyzed since it will cause discrepencies between quantification     and annotation thereby skewing all analysis.

If there is no overlap (as in zero or close) there are two options:
.
.

In previous posts on similar errors, it was mentioned that this error sometimes comes up because the fa and gtf files are not fully compatable(?), but not sure if it's the case here.

I have also tried to remove the duplicate annotations in the h5 abundance files manually (thanks sudarshan), but his doesn't seem to effect the error I get. I used:

files <- list.files(kallisto.dir, pattern=".h5", recursive=TRUE, full.names=TRUE)

for (currentFile in files) {
  oldids <- h5read(currentFile, "/aux/ids")
  newids <- gsub("\\|.*", "", oldids)
  h5write(newids, currentFile, "/aux/ids")

}

What should I be doing differently?

Thanks for your help!

Brian

My sessionInfo:

> sessionInfo()

`R version 4.3.0 (2023-04-21) Platform: x86_64-apple-darwin20 (64-bit) Running under: macOS 14.1.1

Matrix products: default BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib LAPACK: /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRlapack.dylib; LAPACK version 3.11.0

locale: [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: Asia/Kolkata tzcode source: internal

attached base packages: [1] stats4 stats graphics grDevices utils datasets methods base

other attached packages: [1] rhdf5_2.46.1 IsoformSwitchAnalyzeR_2.2.0 pfamAnalyzeR_1.2.0 dplyr_1.1.4
[5] stringr_1.5.1 readr_2.1.4 sva_3.50.0 genefilter_1.84.0
[9] mgcv_1.9-1 nlme_3.1-164 satuRn_1.10.0 DEXSeq_1.48.0
[13] RColorBrewer_1.1-3 AnnotationDbi_1.64.1 DESeq2_1.42.0 SummarizedExperiment_1.32.0 [17] GenomicRanges_1.54.1 GenomeInfoDb_1.38.5 IRanges_2.36.0 S4Vectors_0.40.2
[21] MatrixGenerics_1.14.0 matrixStats_1.2.0 Biobase_2.62.0 BiocGenerics_0.48.1
[25] BiocParallel_1.36.0 limma_3.58.1 biomaRt_2.58.0 fastqq_0.1.3
[29] sleuth_0.30.1 tximeta_1.20.1 ggplot2_3.4.4

loaded via a namespace (and not attached): [1] splines_4.3.0 later_1.3.2 BiocIO_1.12.0 bitops_1.0-7
[5] filelock_1.0.3 tibble_3.2.1 XML_3.99-0.16 lifecycle_1.0.4
[9] edgeR_4.0.5 vroom_1.6.5 lattice_0.22-5 ensembldb_2.26.0
[13] magrittr_2.0.3 rmarkdown_2.25 yaml_2.3.8 remotes_2.4.2.1
[17] httpuv_1.6.13 sessioninfo_1.2.2 pkgbuild_1.4.3 pbapply_1.7-2
[21] DBI_1.2.0 abind_1.4-5 pkgload_1.3.3 zlibbioc_1.48.0
[25] purrr_1.0.2 AnnotationFilter_1.26.0 RCurl_1.98-1.13 rappdirs_0.3.3
[29] GenomeInfoDbData_1.2.11 annotate_1.80.0 codetools_0.2-19 DelayedArray_0.28.0
[33] xml2_1.3.6 tidyselect_1.2.0 futile.logger_1.4.3 locfdr_1.1-8
[37] farver_2.1.1 BiocFileCache_2.10.1 GenomicAlignments_1.38.0 jsonlite_1.8.8
[41] ellipsis_0.3.2 survival_3.5-7 tools_4.3.0 progress_1.2.3
[45] Rcpp_1.0.11 glue_1.6.2 gridExtra_2.3 SparseArray_1.2.3
[49] xfun_0.41 usethis_2.2.2 withr_2.5.2 formatR_1.14
[53] BiocManager_1.30.22 fastmap_1.1.1 boot_1.3-28.1 rhdf5filters_1.14.1
[57] fansi_1.0.6 digest_0.6.33 R6_2.5.1 mime_0.12
[61] colorspace_2.1-0 networkD3_0.4 RSQLite_2.3.4 tidyr_1.3.0
[65] utf8_1.2.4 generics_0.1.3 data.table_1.14.10 rtracklayer_1.62.0
[69] prettyunits_1.2.0 httr_1.4.7 htmlwidgets_1.6.4 S4Arrays_1.2.0
[73] pkgconfig_2.0.3 gtable_0.3.4 blob_1.2.4 hwriter_1.3.2.1
[77] XVector_0.42.0 htmltools_0.5.7 profvis_0.3.8 geneplotter_1.80.0
[81] ProtGenerics_1.34.0 scales_1.3.0 png_0.1-8 lambda.r_1.2.4
[85] knitr_1.45 rstudioapi_0.15.0 tzdb_0.4.0 reshape2_1.4.4
[89] rjson_0.2.21 curl_5.2.0 cachem_1.0.8 BiocVersion_3.18.1
[93] parallel_4.3.0 miniUI_0.1.1.1 restfulr_0.0.15 pillar_1.9.0
[97] grid_4.3.0 vctrs_0.6.5 urlchecker_1.0.1 promises_1.2.1
[101] dbplyr_2.4.0 xtable_1.8-4 tximport_1.30.0 evaluate_0.23
[105] VennDiagram_1.7.3 GenomicFeatures_1.54.1 futile.options_1.0.1 cli_3.6.2
[109] locfit_1.5-9.8 compiler_4.3.0 Rsamtools_2.18.0 rlang_1.1.2
[113] crayon_1.5.2 labeling_0.4.3 plyr_1.8.9 fs_1.6.3
[117] stringi_1.8.3 munsell_0.5.0 Biostrings_2.70.1 lazyeval_0.2.2
[121] devtools_2.4.5 Matrix_1.6-4 BSgenome_1.70.1 hms_1.1.3
[125] bit64_4.0.5 Rhdf5lib_1.24.1 KEGGREST_1.42.0 statmod_1.5.0
[129] shiny_1.8.0 interactiveDisplayBase_1.40.0 AnnotationHub_3.10.0 igraph_1.6.0
[133] memoise_2.0.1 bit_4.0.5 `

tximport h5 isoformSwitchAnalyzeR kallisto • 611 views
ADD COMMENT
0
Entering edit mode

Seems to me that the problem is that the annotation used for the quantification is not the same as the annotation used to pass to aSwitch List. Kalisto has quantified ~110,000 transcripts, but your annotation has 170,000 transcripts in it.

I wonder if its possible that you have quantified only against protein coding transcripts, but the annotation GTF you have passed to aSwitchList has all transcripts?

ADD REPLY
0
Entering edit mode

As the creator of IsoformSwitchAnalyzeR I approve of this solution :-)

ADD REPLY
0
Entering edit mode

Thanks for the reply!

Yes, that is possible. From the gencode website, I picked:

  1. "Protein-coding transcript sequences" for the fasta file (gencode.v44.pc_transcripts.fa.gz)
  2. "Comprehensive gene annotation" (Region: ALL) (gencode.v44.chr_patch_hapl_scaff.annotation.gtf.gz) for the gtf file.

How can I make out if this gene annotation is compatible with the fasta file? Is there a generic way that is used to match up the annotation and fasta files on gencode?

ADD REPLY
0
Entering edit mode

We tend to build our own fasta files from the GTFs. We have our own tool for this (gtf2fasta from cgat-apps), but I believe that gffread from stringtie should also be able to do it. Then you are certain that your fasta and gtf are identical!

An alternative, if you didn't want to to do that, would be to filter the GTF. It might be more fiddly to do this way round, but does avoid having to re-quantify.

The first thing to try would just be grepping out the protein_coding transcripts from the GTF:

$ zcat gencode.v44.chr_patch_hapl_scaff.annotation.gtf.gz | grep "protein_coding" | gzip > gencode.v44.chr_path_hpl_scaff.pc_transcripts.gtf.gz

This might well get you close enough to get past the isoformSwitchAnalyzeR check.

ADD REPLY

Login before adding your answer.

Traffic: 2057 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6