Duplicate removal for RNA-seq FASTQ files during annotation
4 weeks ago
Abieskawa • 0

Hello everyone, I am running structural annotation for my species with GINGER. My RNA-seq data set is very large, and this causes problems for Oases and Trinity. It is large because it combines around 30 paired-end data sets; some came from our collaborators and most came from NCBI. The original Read 1 file is around 200 GB. GINGER seems to accept only one combined set of FASTQ files, so I cannot run the samples separately. Should I perform duplicate removal? If so, are there any recommended modules or programs?

Ginger https://github.com/i10labtitech/GINGER

Error log retrieved from Nextflow (GINGER's pipeline is built on it):

Nov-04 14:06:33.072 [Task monitor] DEBUG n.processor.TaskPollingMonitor - Task completed > TaskHandler[id: 3; name: oases; status: COMPLETED; exit: 1; error: -; workDir: /output/marine_tilapia/work/cf/f61a216236dba2edda73721bb5c089]
Nov-04 14:06:33.348 [Task monitor] ERROR nextflow.processor.TaskProcessor - Error executing process > 'oases'

Caused by: Process oases terminated with an error exit status (1)

Command executed:

/root/anaconda3/bin/velveth ginger 31 -fastq -short -separate combined_1.trim.fastq combined_2.trim.fastq
/root/anaconda3/bin/velvetg ginger -read_trkg yes
/root/anaconda3/bin/oases ginger

Command exit status: 1

Command output:

[14469.334582] === Sequences loaded in 2814.502856 s
[14469.334654] Done inputting sequences
[14469.334659] Destroying splay table
[14518.165976] Splay table destroyed
[0.000000] Reading roadmap file ginger/Roadmaps

Command error:

velvetg: Can't calloc 18446744072378670449 Annotations totalling 18446744047091928276 bytes: Cannot allocate memory
[0.000000] Reading roadmap file ginger/Roadmaps
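The two very large numbers in the velvetg message can be decoded with a little arithmetic. The following is a minimal sketch, assuming (this is not stated in the log) that they are negative values printed as unsigned 64-bit integers; the helper name `as_signed_64` is made up for illustration.

```python
def as_signed_64(value: int) -> int:
    """Reinterpret an unsigned 64-bit integer as a two's-complement signed one."""
    return value - (1 << 64) if value >= (1 << 63) else value


# Numbers copied from the "Can't calloc" message above.
count = as_signed_64(18446744072378670449)
total_bytes = as_signed_64(18446744047091928276)

print(count)                 # requested number of Annotations, read as signed
print(total_bytes)           # requested number of bytes, read as signed
print(total_bytes // count)  # implied bytes per Annotation
```

Under that assumption both values come out negative and the byte total is an exact multiple of the count. That would be consistent with a counter overflowing on a very large input rather than the machine simply running out of RAM, but this is only an interpretation of the message and has not been checked against the Velvet source.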

Can you try randomly sampling about 20% of the reads from the 200 GB R1 and R2 files? You can repeat this a few times to see how much the choice of subsample affects the results, but that is a lot of reads!

Also, how are you treating each set of paired-end reads? Are they all from the same sample, or are you handling them as individual samples?
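The subsampling suggested above could be sketched roughly as follows. This is an illustration only: it assumes uncompressed FASTQ with exactly four lines per record and R1/R2 files that list mates in the same order, and the function names `read_records` and `subsample_pairs` are hypothetical. Each pair is kept or dropped with a single random draw so the two output files stay in sync.

```python
import random
from itertools import islice


def read_records(handle):
    """Yield FASTQ records as 4-line tuples (assumes unwrapped 4-line records)."""
    while True:
        record = tuple(islice(handle, 4))
        if not record:
            return
        if len(record) < 4:
            raise ValueError("truncated FASTQ record")
        yield record


def subsample_pairs(r1_in, r2_in, r1_out, r2_out, fraction=0.2, seed=1):
    """Keep each read pair with probability `fraction`; return the number kept."""
    rng = random.Random(seed)  # fixed seed makes a given subsample reproducible
    kept = 0
    with open(r1_in) as f1, open(r2_in) as f2, \
            open(r1_out, "w") as o1, open(r2_out, "w") as o2:
        # zip stops at the shorter file, so mismatched inputs are not detected here.
        for rec1, rec2 in zip(read_records(f1), read_records(f2)):
            if rng.random() < fraction:  # one draw decides both mates
                o1.writelines(rec1)
                o2.writelines(rec2)
                kept += 1
    return kept
```

For the files in the log this would be called as `subsample_pairs("combined_1.trim.fastq", "combined_2.trim.fastq", "sub_1.fastq", "sub_2.fastq", fraction=0.2, seed=1)`, where the output names are placeholders. Changing `seed` gives a different subsample for the repeat runs. Because the files are streamed, memory use stays small, although a pure-Python pass over 200 GB will be slow.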


No, they are from different tissues and developmental stages (gills, brain, etc.), 27 samples in total (54 files because they are paired-end). I handle them as different samples. After receiving 6 samples from colleagues and downloading the others from NCBI SRA, I ran cutadapt on all of them to remove adapters, discard reads shorter than 5 bp, and trim the first 10 bp. I ask because I suspect duplicates do not need attention during genome annotation, since I do not care whether one gene is expressed more than another, but I am not 100% sure, and some of the samples are deeply sequenced (over 30x).
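To make concrete what duplicate removal would do here, below is a minimal sketch that drops read pairs whose R1 and R2 sequences are both identical to a pair already seen. It is not what GINGER does and not a recommendation of a specific tool; the function names are hypothetical, and it assumes uncompressed 4-line FASTQ with mates in the same order in both files.

```python
import hashlib
from itertools import islice


def read_records(handle):
    """Yield FASTQ records as 4-line tuples (assumes unwrapped 4-line records)."""
    while True:
        record = tuple(islice(handle, 4))
        if not record:
            return
        if len(record) < 4:
            raise ValueError("truncated FASTQ record")
        yield record


def dedup_pairs(r1_in, r2_in, r1_out, r2_out):
    """Write only the first occurrence of each (R1, R2) sequence pair; return count kept."""
    seen = set()
    kept = 0
    with open(r1_in) as f1, open(r2_in) as f2, \
            open(r1_out, "w") as o1, open(r2_out, "w") as o2:
        for rec1, rec2 in zip(read_records(f1), read_records(f2)):
            # Key on the sequence lines only; headers and qualities are ignored.
            pair = rec1[1].strip() + "\t" + rec2[1].strip()
            key = hashlib.blake2b(pair.encode(), digest_size=16).digest()
            if key in seen:
                continue
            seen.add(key)
            o1.writelines(rec1)
            o2.writelines(rec2)
            kept += 1
    return kept
```

The set of digests grows with the number of distinct pairs, so on input of this size the sketch may itself run out of memory; it is meant to show the logic, and a dedicated tool would be the practical choice. It also only removes exact duplicates, so reads that differ by a single sequencing error are kept.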

Besides duplicate removal, another idea is to split the original file, annotate each part with another RNA-seq program, and then incorporate the results into GINGER, but it would take time to explore how to use those programs.

Should I list the programs that GINGER uses and that are causing the problems? They are set in the configuration file:

/** RNA-Seq denovo based **/

PDIR_PREP_DENOVO         = "${PDIR_PREP}/denovo" // *** No need to edit ***
PDIR_PREP_DENOVO_TRINITY = "${PDIR_PREP_DENOVO}/trinity" // *** No need to edit ***
PDIR_PREP_DENOVO_OASES   = "${PDIR_PREP_DENOVO}/oases" // *** No need to edit ***
UTILPATH_DENOVO          = "${GINGER_UTIL}/denovo" // *** No need to edit ***

// --- Tools for denovo ---
DENOVO_PYTHON   = "/root/anaconda3/bin/python" // a path to Python "python"
VELVETH         = "/root/anaconda3/bin/velveth" // a path to Velvet command "velveth"
VELVETG         = "/root/anaconda3/bin/velvetg" // a path to Velvet command "velvetg"
OASES           = "/root/anaconda3/bin/oases" // a path to Oases command "oases"
TRINITY         = "/root/anaconda3/bin/Trinity" // a path to Trinity command "Trinity"
GMAP_BUILD      = "/root/anaconda3/bin/gmap_build" // a path to GMAP command "gmap_build"
GMAP            = "/root/anaconda3/bin/gmap" // a path to GMAP command "gmap"
CD_HIT_EST      = "/root/anaconda3/bin/cd-hit-est" // a path to CD-HIT command "cd-hit-est"

// --- Options related to trinity ---
OPREFIX_TRINITY = "trinityGinger" // *** No need to edit ***
SRA_FLAG        = 1 // 1: if the RNA-Seq data was obtained from SRA, 0: if not