Dear Colleagues,
I have recently been analyzing my CAGE data with the CAGEr package and have run into some difficulties. My project consists of 10 libraries; nine of the .bam files are up to 940 MB, and one .bam is 1.5 GB. At the getCTSS(cage_bam) step, for this one huge file I get the following error:
getCTSS(cage_bam)
Reading in file: ./MCF7_xx.bam...
-> Filtering out low quality reads...
Error in .Call2("XStringSet_unlist", x, PACKAGE = "Biostrings") : negative length vectors are not allowed
As far as I can tell, the issue is caused by the size of this huge .bam file; my other .bams work perfectly fine. Moreover, I used SAMtools to extract subsets of this huge .bam, and they worked perfectly up to a size of 1.1 GB. Unfortunately, this is not a solution to my problem, because the .bams are sorted, so extracting a subset simply cuts out part of the data.
My PC has plenty of computing power, so I suspect the issue lies elsewhere. I am using the latest versions of R, RStudio, CAGEr, and the related packages on Ubuntu (Unity).
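For what it is worth, the "negative length vectors" message in R typically signals an integer overflow, so one guess, which I have not confirmed, is that the total number of sequenced bases in this file exceeds .Machine$integer.max (about 2.1 billion), which Biostrings would hit while concatenating all read sequences. A minimal sketch to check this with Rsamtools, using the file name from above:

library(Rsamtools)
cnt <- countBam("./MCF7_xx.bam")             # counts records and bases in the bam
cnt$records                                  # number of reads
cnt$nucleotides                              # total number of sequenced bases
cnt$nucleotides > .Machine$integer.max       # TRUE would be consistent with an overflow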
I also append my sessionInfo():
sessionInfo()
R version 3.4.1 (2017-06-30)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.2 LTS
Matrix products: default
BLAS: /usr/lib/libblas/libblas.so.3.6.0
LAPACK: /usr/lib/lapack/liblapack.so.3.6.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=pl_PL.UTF-8 LC_COLLATE=en_US.UTF-8 LC_MONETARY=pl_PL.UTF-8
[6] LC_MESSAGES=en_US.UTF-8 LC_PAPER=pl_PL.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=pl_PL.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets methods base
other attached packages:
[1] BSgenome.Hsapiens.UCSC.hg38_1.4.1 CAGEr_1.18.0 BSgenome_1.44.0 rtracklayer_1.36.3
[5] Biostrings_2.44.1 XVector_0.16.0 GenomicRanges_1.28.3 GenomeInfoDb_1.12.2
[9] IRanges_2.10.2 S4Vectors_0.14.3 BiocGenerics_0.22.0
loaded via a namespace (and not attached):
[1] splines_3.4.1 zlibbioc_1.22.0 GenomicAlignments_1.12.1 beanplot_1.2 BiocParallel_1.10.1
[6] som_0.3-5.1 lattice_0.20-35 tools_3.4.1 SummarizedExperiment_1.6.3 grid_3.4.1
[11] data.table_1.10.4 Biobase_2.36.2 matrixStats_0.52.2 Matrix_1.2-10 GenomeInfoDbData_0.99.0
[16] bitops_1.0-6 RCurl_1.95-4.8 VGAM_1.0-3 DelayedArray_0.2.7 compiler_3.4.1
[21] Rsamtools_1.28.0 XML_3.98-1.9
I would very much appreciate your help in solving this issue, or any suggestions for working around it.
Thank you in advance and best regards,
Magda
Not exactly a solution to your problem, but you could randomly subsample the 'huge' bam down to the size of the 'acceptable' bams.
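For example, a minimal sketch with Rsamtools (the 40% fraction and the output file name are placeholders, and I have not run this on your data; samtools view -b -s 0.4 in.bam > out.bam does roughly the same on the command line):

library(Rsamtools)
# Keep roughly 40% of the reads at random; adjust the fraction to match your other libraries.
keep <- FilterRules(list(subsample = function(x) runif(nrow(x)) < 0.4))
filterBam("./MCF7_xx.bam", "./MCF7_xx_sub.bam",
          filter = keep, param = ScanBamParam(what = "qname"))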
Thanks for the suggestion. That would be a workaround of sorts, but I think it would be better to fix the bug that limits the file size.
Definitely.
You mentioned that you removed the first part of your bam and that everything was fine up to a size of 1.1 GB. Have you considered that one (or more) faulty reads might be throwing this error, and that they just happen to be near the end of the bam?
By the way, what's the reason this bam is larger? Just sequenced more deeply?
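In case it is useful: one way to test the faulty-read idea is to stream the bam in chunks and see whether the sequences of any particular slice fail to parse. A rough sketch (the chunk size and file name are placeholders):

library(Rsamtools)
bf <- BamFile("./MCF7_xx.bam", yieldSize = 5e6)    # read 5 million records per chunk
open(bf)
chunk <- 0
repeat {
    res <- scanBam(bf, param = ScanBamParam(what = "seq"))[[1]]
    if (length(res$seq) == 0) break                 # end of file reached
    chunk <- chunk + 1
    message("chunk ", chunk, ": ", length(res$seq), " reads parsed without error")
}
close(bf)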
I checked and tested this file in many ways (reversing it, extracting different parts of the file, etc.), and the only cause I could find is its size. In our CAGE project, each library had from ~2 to ~20 million reads; this one file has 45 million reads, which is why it is so big.
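(As a back-of-the-envelope check: if the reads were, say, 50 bp, 45 million reads would be 45e6 * 50 = 2.25 billion bases, just above .Machine$integer.max of roughly 2.147 billion, which would fit the overflow guess above; the 50 bp figure is only an assumption about the read length.)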
Dear Colleagues,
Via Bioconductor support I received the following reply:
Does anyone know anything about this error and how to work around it?
I would very much appreciate any information.
Best regards!
M.