Question

How To Keep The Raw .Fastq.Gz Files For Rnaseq Data

10.7 years ago

shirley0818 ▴ 110

Hello, I have 75 bp paired-end RNASeq data generated on an Illumina HiSeq 2000. The protocol pooled 7 samples per lane, across lanes 1-7 of each flowcell, and each sample has a 6 bp index associated with it. With this protocol, each sample ends up with ~50 small .fastq.gz files for the left reads and ~50 small .fastq.gz files for the right reads. These small files are generated automatically by the sequencer. This raises my question about how to combine and keep the raw .fastq.gz files.

I used the command “cat” to combine these 50 small .fastq.gz files into one large .fastq.gz file per read direction, as follows for sample “2894” (is this the right way?):

cat 2894_CCTTCA_L00_R1. .fastq.gz > 2894_R1.fastq.gz
cat 2894_CCTTCA_L00_R2. .fastq.gz > 2894_R2.fastq.gz
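One way to check that a plain `cat` merge is safe is to compare read counts before and after. The sketch below relies on the fact that concatenated gzip files form a valid multi-member gzip file. The file names and the two toy reads are made up for illustration; they are not the real sequencer output.

```shell
#!/bin/sh
set -eu

# Scratch directory with two tiny stand-in chunks (hypothetical names and reads).
workdir=$(mktemp -d)
cd "$workdir"
printf '@read1\nACGT\n+\nIIII\n' | gzip > sample_R1_001.fastq.gz
printf '@read2\nTTGA\n+\nIIII\n' | gzip > sample_R1_002.fastq.gz

# Concatenate the compressed chunks directly; no decompression is needed.
cat sample_R1_001.fastq.gz sample_R1_002.fastq.gz > sample_R1.fastq.gz

# Integrity check: exits non-zero if the merged file is not valid gzip.
gzip -t sample_R1.fastq.gz

# A FASTQ record is 4 lines, so reads = lines / 4. Sum over the chunks...
chunk_reads=0
for f in sample_R1_0*.fastq.gz; do
    n=$(gzip -dc "$f" | wc -l)
    chunk_reads=$((chunk_reads + n / 4))
done

# ...and compare with the merged file.
merged_reads=$(( $(gzip -dc sample_R1.fastq.gz | wc -l) / 4 ))
echo "chunks: $chunk_reads reads, merged: $merged_reads reads"
```

If the two numbers agree for both the R1 and R2 files of a sample, the merged pair should hold the same reads as the small files.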

After this, I have two .fastq.gz files for each sample. I think these are the files I want for analysis (TopHat), and also for uploading to a public repository (SRA) when I publish my results.

However, the support staff in our sequencing core suggested that it is better to keep the original small .fastq.gz files, for two reasons:

1. They are truly raw, that is, they are the files generated automatically by the machine.
2. Bowtie2/TopHat2 can take these small files as input directly.
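The second point works because both aligners accept comma-separated lists of read files. A minimal sketch of building those lists follows; the chunk names, `genome_index`, and the output names are placeholders, and the aligner commands are only printed, not run.

```shell
#!/bin/sh
set -eu

# Empty stand-in files with hypothetical chunk names.
workdir=$(mktemp -d)
cd "$workdir"
touch 2894_R1_001.fastq.gz 2894_R1_002.fastq.gz \
      2894_R2_001.fastq.gz 2894_R2_002.fastq.gz

# Join the file names with commas. Glob expansion is sorted, so the
# R1 and R2 lists come out in the same chunk order, which paired-end
# alignment requires.
r1_list=$(printf '%s\n' 2894_R1_*.fastq.gz | paste -sd, -)
r2_list=$(printf '%s\n' 2894_R2_*.fastq.gz | paste -sd, -)

# Print the commands instead of running them (genome_index is a placeholder).
echo "tophat2 -o 2894_out genome_index $r1_list $r2_list"
echo "bowtie2 -x genome_index -1 $r1_list -2 $r2_list -S 2894.sam"
```

With ~50 chunks per read direction the command lines get long, but no merged copy has to be stored.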

Keep in mind that our RNASeq project is big, and we cannot afford to keep both the small .fastq.gz files and the combined .fastq.gz files for each sample. So I would like to ask for your suggestions. If you could keep only one copy of the raw .fastq.gz files, which would you routinely keep for each sample:

- the combined big .fastq.gz file, or
- the original 50 small .fastq.gz files generated by the machine?

Many thanks, Shirley

rnaseq data • 4.9k views

updated 10.7 years ago by Istvan Albert 101k • written 10.7 years ago by shirley0818 ▴ 110

We usually combine into one big file, but I think it would depend on your infrastructure, etc.

10.7 years ago by Madelaine Gogol 5.3k

Having the original smaller files will be helpful in troubleshooting QC problems. If a lane in the sequencer is misbehaving, all the samples run on that lane will give erratic results. A problem like this is easier to find if you maintain the granularity of the data. Also, tools like GATK can perform base quality recalibration, but only if you supply enough information, such as which reads originated from the same lane. I prefer to keep the files in their original form. I don't think storing the files individually rather than merged will make much, if any, difference.
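Whether a merged file still carries lane information depends on the read headers. The sketch below assumes Casava 1.8-style headers, where the lane is the fourth colon-separated field (`@instrument:run:flowcell:lane:tile:x:y`); that format is an assumption about this data set, and the three reads are invented.

```shell
#!/bin/sh
set -eu

workdir=$(mktemp -d)
cd "$workdir"

# Made-up reads: two from lane 1, one from lane 2 (hypothetical header values).
{
    printf '@HWI-ST1:1:FLOWCELL:1:1101:1000:2000 1:N:0:CCTTCA\nACGT\n+\nIIII\n'
    printf '@HWI-ST1:1:FLOWCELL:1:1101:1005:2010 1:N:0:CCTTCA\nTTGA\n+\nIIII\n'
    printf '@HWI-ST1:1:FLOWCELL:2:1101:1010:2020 1:N:0:CCTTCA\nGGCA\n+\nIIII\n'
} | gzip > merged_R1.fastq.gz

# Header lines are every 4th line starting at line 1; field 4 is the lane.
lane_counts=$(gzip -dc merged_R1.fastq.gz \
    | awk -F: 'NR % 4 == 1 { n[$4]++ } END { for (l in n) print l, n[l] }' \
    | sort)

# One line per lane: "<lane> <number of reads>".
echo "$lane_counts"
```

If the headers do look like this, per-lane QC can still be done on a merged file by splitting on that field. If they do not, the lane is only recorded in the file names, and merging across lanes discards it.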

10.7 years ago by Ashutosh Pandey 12k

I definitely wouldn't combine different samples or different lane data into one file. I just meant the initial multiple files that the Illumina primary analysis pipeline produces...

10.7 years ago by Madelaine Gogol 5.3k

score 1 · Answer 1 · 2014-03-25

I think you have two different things going on here.

One is that for a single sample the instrument may create multiple files if these samples were distributed over different lanes. This is very annoying to handle. In that case you should concatenate all the files that belong to the same sample into a single file.

But you should not concatenate different samples into one for convenience. That is just asking for trouble later on.

Thus in your case for 7 samples you should end up with 14 files (the paired end reads) where each file corresponds to a sample and is named by the sample.
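A loop along these lines would produce one R1 and one R2 file per sample. It is a sketch under assumed names: the sample IDs, index strings, and chunk naming scheme are hypothetical, and only two samples are simulated to keep it short.

```shell
#!/bin/sh
set -eu

workdir=$(mktemp -d)
cd "$workdir"

# Simulate chunks for two hypothetical samples, each with its own index.
for pair in 2894_CCTTCA 2895_GGAACT; do
    for r in R1 R2; do
        for c in 001 002; do
            printf '@%s_%s_%s\nACGT\n+\nIIII\n' "$pair" "$r" "$c" \
                | gzip > "${pair}_L001_${r}_${c}.fastq.gz"
        done
    done
done

# Merge per sample and per read direction, never across samples.
# Output goes to a separate directory so it cannot match the input glob.
mkdir merged
for s in 2894 2895; do
    for r in R1 R2; do
        # Sorted glob order keeps R1 and R2 chunks in the same sequence.
        cat "${s}"_*_"${r}"_*.fastq.gz > "merged/${s}_${r}.fastq.gz"
    done
done

# Two files per sample are expected.
n_merged=$(ls merged | wc -l | tr -d ' ')
echo "merged files: $n_merged"
```

With all 7 sample IDs in the loop, the same pattern gives the 14 files described above.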