Dear Colleagues,
I have recently been analyzing my CAGE data with the CAGEr package and have run into some difficulties. My project consists of 10 libraries; nine of the .bam files are up to 940 MB, and one .bam is 1.5 GB. At the getCTSS(cage_bam) step, for this one huge file I get the following error:
getCTSS(cage_bam)
Reading in file: ./MCF7_xx.bam...
-> Filtering out low quality reads...
Error in .Call2("XStringSet_unlist", x, PACKAGE = "Biostrings") : negative length vectors are not allowed
As far as I can tell, the issue is caused by the size of this huge .bam file; my other .bams work perfectly fine. Moreover, I used SAMtools to extract subsets of this huge .bam, and they worked perfectly up to a size of 1.1 GB. Unfortunately, this is not a solution to my problem, because the .bams are sorted, so extracting a subset simply cuts out part of the data.
My PC has plenty of computing power, so I suspect the issue lies elsewhere. I am using the latest versions of R, RStudio, CAGEr, and the related packages on Ubuntu (Unity).
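For what it is worth, the "negative length vectors" message in R typically signals an integer overflow, so one guess, which I have not confirmed, is that the total number of sequenced bases in this file exceeds .Machine$integer.max (about 2.1 billion), which Biostrings would hit while concatenating all read sequences. A minimal sketch to check this with Rsamtools, using the file name from above:

library(Rsamtools)
cnt <- countBam("./MCF7_xx.bam")             # counts records and bases in the bam
cnt$records                                  # number of reads
cnt$nucleotides                              # total number of sequenced bases
cnt$nucleotides > .Machine$integer.max       # TRUE would be consistent with an overflow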
I also append my sessionInfo():
sessionInfo()
R version 3.4.1 (2017-06-30)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.2 LTS
Matrix products: default
BLAS: /usr/lib/libblas/libblas.so.3.6.0
LAPACK: /usr/lib/lapack/liblapack.so.3.6.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=pl_PL.UTF-8 LC_COLLATE=en_US.UTF-8 LC_MONETARY=pl_PL.UTF-8
[6] LC_MESSAGES=en_US.UTF-8 LC_PAPER=pl_PL.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=pl_PL.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets methods base
other attached packages:
[1] BSgenome.Hsapiens.UCSC.hg38_1.4.1 CAGEr_1.18.0 BSgenome_1.44.0 rtracklayer_1.36.3
[5] Biostrings_2.44.1 XVector_0.16.0 GenomicRanges_1.28.3 GenomeInfoDb_1.12.2
[9] IRanges_2.10.2 S4Vectors_0.14.3 BiocGenerics_0.22.0
loaded via a namespace (and not attached):
[1] splines_3.4.1 zlibbioc_1.22.0 GenomicAlignments_1.12.1 beanplot_1.2 BiocParallel_1.10.1
[6] som_0.3-5.1 lattice_0.20-35 tools_3.4.1 SummarizedExperiment_1.6.3 grid_3.4.1
[11] data.table_1.10.4 Biobase_2.36.2 matrixStats_0.52.2 Matrix_1.2-10 GenomeInfoDbData_0.99.0
[16] bitops_1.0-6 RCurl_1.95-4.8 VGAM_1.0-3 DelayedArray_0.2.7 compiler_3.4.1
[21] Rsamtools_1.28.0 XML_3.98-1.9
I would very much appreciate your help in solving this issue, or any suggestions for working around it.
Thank you in advance and best regards,
Magda
Not exactly a solution to your problem, but you could randomly subsample the 'huge' bam down to the size of the 'acceptable' bams.
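For example, a minimal sketch with Rsamtools (the 40% fraction and the output file name are placeholders, and I have not run this on your data; samtools view -b -s 0.4 in.bam > out.bam does roughly the same on the command line):

library(Rsamtools)
# Keep roughly 40% of the reads at random; adjust the fraction to match your other libraries.
keep <- FilterRules(list(subsample = function(x) runif(nrow(x)) < 0.4))
filterBam("./MCF7_xx.bam", "./MCF7_xx_sub.bam",
          filter = keep, param = ScanBamParam(what = "qname"))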
Thanks for the suggestion. That would be a workaround of sorts, but I think it would be better to fix the bug that limits the file size.
Definitely.
You mentioned that you removed the first part of your bam and that everything was fine up to a size of 1.1 GB. Have you considered that one (or more) faulty reads might be throwing this error, and that they just happen to be near the end of the bam?
By the way, what's the reason this bam is larger? Just sequenced more deeply?
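In case it is useful: one way to test the faulty-read idea is to stream the bam in chunks and see whether the sequences of any particular slice fail to parse. A rough sketch (the chunk size and file name are placeholders):

library(Rsamtools)
bf <- BamFile("./MCF7_xx.bam", yieldSize = 5e6)    # read 5 million records per chunk
open(bf)
chunk <- 0
repeat {
    res <- scanBam(bf, param = ScanBamParam(what = "seq"))[[1]]
    if (length(res$seq) == 0) break                 # end of file reached
    chunk <- chunk + 1
    message("chunk ", chunk, ": ", length(res$seq), " reads parsed without error")
}
close(bf)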
I checked and tested this file in many ways (reversing it, extracting different parts of the file, etc.), and the only cause I could find is its size. In our CAGE project, each library had from ~2 to ~20 million reads; this one file has 45 million reads, which is why it is so big.
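(As a back-of-the-envelope check: if the reads were, say, 50 bp, 45 million reads would be 45e6 * 50 = 2.25 billion bases, just above .Machine$integer.max of roughly 2.147 billion, which would fit the overflow guess above; the 50 bp figure is only an assumption about the read length.)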
Dear Colleagues,
Via Bioconductor support I received the following reply:
Does anyone know anything about this error and how to work around it?
I would very much appreciate any information.
Best regards!
M.