Hello, I have 75bp paired-end RNASeq data generated on an Illumina HiSeq 2000, with 7 samples multiplexed per lane in lanes 1-7 of each flowcell. Each sample has a 6bp index associated with it. With this protocol, each sample ends up with ~50 small .fastq.gz files for the left reads and ~50 small .fastq.gz files for the right reads. These small files are generated automatically by the sequencer. My question is how to combine and store the raw .fastq.gz files.
I used the command "cat" to combine these 50 small .fastq.gz files into one large .fastq.gz, like the following for sample "2894" (is this the right way?):
cat 2894_CCTTCA_L00_R1. .fastq.gz > 2894_R1.fastq.gz
cat 2894_CCTTCA_L00_R2. .fastq.gz > 2894_R2.fastq.gz
After this, I have two .fastq.gz files for each sample. I think these are the files I want for analysis (TopHat), and also for uploading to a public repository (SRA) when I publish my results.
However, the support staff in our sequencing core suggested that it is better to keep the original small .fastq.gz files for two reasons. 1. They are truly raw, that is to say, they are files generated automatically by the machine. 2. Bowtie2/tophat2 can take these small files as input directly.
Keep in mind that our RNASeq project is big, and we cannot afford to keep both the small .fastq.gz files and the combined .fastq.gz files for each sample. So I would like to ask for your suggestions. If you could only keep one copy of the raw .fastq.gz files, which would you routinely keep for each sample:
the combined big .fastq.gz file, or the original 50 small .fastq.gz files generated by the machine?
Many thanks, Shirley
We usually combine into one big file, but I think it would depend on your infrastructure, etc.
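On the question of whether cat is the right way to do it: the gzip format allows several compressed members back to back in one file, so concatenating .gz files directly should give a valid .gz file. Here is a minimal sketch to check that on toy data. The file names and the FASTQ records are made up for illustration.

```shell
workdir=$(mktemp -d)

# Two toy FASTQ chunks standing in for the sequencer's small files (hypothetical names).
printf '@read1\nACGT\n+\nIIII\n' | gzip > "$workdir/toy_R1_001.fastq.gz"
printf '@read2\nTTGA\n+\nIIII\n' | gzip > "$workdir/toy_R1_002.fastq.gz"

# Concatenate the compressed chunks directly, without decompressing them first.
cat "$workdir"/toy_R1_*.fastq.gz > "$workdir/toy_R1.fastq.gz"

# gzip -t tests the integrity of the merged file.
gzip -t "$workdir/toy_R1.fastq.gz" && echo "merged file is a valid gzip stream"

# Checksum the decompressed content of the merged file and of the chunks.
merged_sum=$(gzip -dc "$workdir/toy_R1.fastq.gz" | cksum)
chunks_sum=$(gzip -dc "$workdir/toy_R1_001.fastq.gz" "$workdir/toy_R1_002.fastq.gz" | cksum)
echo "merged: $merged_sum"
echo "chunks: $chunks_sum"
```

If the two checksums match, the merged file decompresses to exactly the reads of the chunks, in order. The same check can be run on real files before deleting either copy.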
Having the original smaller files is helpful when troubleshooting QC problems. If a lane in the sequencer is misbehaving, all the samples run on that lane will give erratic results. A problem like this is easier to track down if you maintain the granularity of the data. Tools like GATK can also perform base quality recalibration, but only if you supply enough information, such as which reads originated from the same lane. I prefer to keep the files in their original form. I don't think it makes much, if any, difference whether you store the files individually or merge them before storing.
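One caveat on granularity: if the read headers follow the CASAVA 1.8-style layout (instrument:run:flowcell:lane:tile:x:y), the lane number is also recorded in every read name, so it can be recovered even from a merged file. That header layout is an assumption about this data, so check a few headers first. A rough sketch on toy data, with made-up instrument, run and flowcell names:

```shell
workdir=$(mktemp -d)

# Toy merged file holding reads from lanes 1 and 2 (all identifiers are invented).
{
  printf '@HWI-ST123:45:C0ABCACXX:1:1101:1000:2000 1:N:0:CCTTCA\nACGT\n+\nIIII\n'
  printf '@HWI-ST123:45:C0ABCACXX:1:1101:1001:2001 1:N:0:CCTTCA\nTTGA\n+\nIIII\n'
  printf '@HWI-ST123:45:C0ABCACXX:2:1101:1002:2002 1:N:0:CCTTCA\nGGCC\n+\nIIII\n'
} | gzip > "$workdir/toy_merged_R1.fastq.gz"

# Header lines are lines 1, 5, 9, ...; the lane is the 4th colon-separated field
# under the assumed header layout.
lane_counts=$(gzip -dc "$workdir/toy_merged_R1.fastq.gz" \
  | awk -F: 'NR % 4 == 1 { n[$4]++ } END { for (lane in n) print lane, n[lane] }' \
  | sort)
echo "$lane_counts"
```

This prints one line per lane with its read count. It only helps if the headers really carry the lane; if they don't, the per-lane files are the only record.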
I definitely wouldn't combine different samples or data from different lanes into one file. I just meant the initial multiple files that the Illumina primary analysis pipeline produces...
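A middle ground along those lines is to merge only the chunks that belong to the same sample, lane and read, which gives one file per lane instead of ~50 chunks. A minimal sketch on toy data follows. The naming pattern sample_index_L00N_R1_NNN.fastq.gz is an assumption for illustration, so adjust the globs to the real file names.

```shell
workdir=$(mktemp -d)
sample=toy_CCTTCA   # hypothetical sample_index prefix

# Toy chunks: two for lane 1 and one for lane 2 (R1 only, to keep the sketch short).
printf '@a\nACGT\n+\nIIII\n' | gzip > "$workdir/${sample}_L001_R1_001.fastq.gz"
printf '@b\nTTGA\n+\nIIII\n' | gzip > "$workdir/${sample}_L001_R1_002.fastq.gz"
printf '@c\nGGCC\n+\nIIII\n' | gzip > "$workdir/${sample}_L002_R1_001.fastq.gz"

mkdir "$workdir/merged"
for lane in L001 L002; do
  # The shell expands the glob in sorted order, so chunks are joined as 001, 002, ...
  cat "$workdir/${sample}_${lane}"_R1_*.fastq.gz \
    > "$workdir/merged/${sample}_${lane}_R1.fastq.gz"
done

# Count reads per merged lane file (4 lines per FASTQ record).
lane1_reads=$(( $(gzip -dc "$workdir/merged/${sample}_L001_R1.fastq.gz" | wc -l) / 4 ))
lane2_reads=$(( $(gzip -dc "$workdir/merged/${sample}_L002_R1.fastq.gz" | wc -l) / 4 ))
echo "lane 1: $lane1_reads reads, lane 2: $lane2_reads reads"
```

For paired-end data the same loop would be run for R2, and because both globs expand in the same sorted order, the left and right reads stay in matching order.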