Running and Analyzing fastqc on multiple fastq files
9.5 years ago
ravi.uhdnis ▴ 220

Hi Everyone,

I am working on whole-genome sequencing and analysis of human genomes from the Illumina HiSeq platform at about 30X coverage. Each sample (one human genome) has about 250-300 fastq.gz files, which I am running through 'fastqc' for quality checks using the following command:

/usr/local/bin/fastqc -t 8 -f fastq -o OUT/ --casava --noextract *.gz

It runs fine and generates an equal number of "fastqc.zip" files, which I unzipped using unzip '*.zip'. I have two questions:

  1. Can I merge two or more fastq files and then run fastqc on the merged files? If yes, how should I merge them?
  2. I have to manually check 250-300 fastqc folders to assess the quality by opening each .html page. Is there a way to get a summary of the overall quality of the fastq files in a flowcell?

Please let me know your comments. I would be very grateful.

Best,
Ravi

next-gen RNA-Seq genome • 41k views
9.5 years ago

We have a script that will run fastqc and generate a summary report with the images from all the fastq files it was run on. You may also find it useful to systematically parse the fastqc_data.txt files from each run and combine the results that way.

The script is here, though it may not be the most useful or best-documented thing ever. It depends on ImageMagick to generate thumbnails.

https://github.com/metalhelix/illuminati/blob/cluster/scripts/fastqc.pl

Also uses this script:

https://github.com/metalhelix/illuminati/blob/cluster/scripts/thumbs.sh
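For the "parse the fastqc_data.txt files" route mentioned above, here is a minimal sketch in plain shell. It is not the linked script. It assumes the reports were unzipped into folders named `<sample>_fastqc`, each holding a `fastqc_data.txt` in which module headers look like `>>Module name<TAB>status`. The function name `summarize_fastqc` is made up for illustration.

```shell
# Print one line per FastQC module: sample <TAB> module <TAB> status.
# Assumption: each unzipped report lives in <dir>/<sample>_fastqc/fastqc_data.txt.
summarize_fastqc() {
    dir=$1
    for f in "$dir"/*_fastqc/fastqc_data.txt; do
        # Skip the literal pattern when nothing matches.
        [ -f "$f" ] || continue
        sample=$(basename "$(dirname "$f")" _fastqc)
        # Module headers start with ">>"; ">>END_MODULE" only closes a module.
        awk -F'\t' -v s="$sample" \
            '/^>>/ && $1 != ">>END_MODULE" { print s "\t" substr($1, 3) "\t" $2 }' "$f"
    done
}
```

Something like `summarize_fastqc OUT | awk -F'\t' '$3 != "pass"'` would then list only the modules that warned or failed, which is usually quicker than opening every report.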


Visiting from the future... If anyone is dealing with this issue, they may also want to check out MultiQC.


@Madelaine Gogol Just what I needed!


@Madelaine Gogol,

I have 10 fastq.gz files (including R2 reads). I made a sample_name.txt with the names of the fastq.gz files (10 files) and ran fastqc.pl with the following command: fastqc.pl -name sample_Name.txt. But I got no output. Could you please help me run your program?


I don't usually run it with that option, but it looks like it's expecting sampleName[tab]adapter sequence in the file. That is just to get the names of the samples. You would still have to pass in the fastq files as an argument, like --files *.fastq.gz (if you were in the same directory).


@Madelaine Gogol, we have been looking for a long time for such a nice solution to merge all the fastqc reports into a single html file. However, the script runs only on the first file in the folder and then stops. Do you have any idea why? We are using the following command:

perl '/home/Desktop/fastqc_summarizer.pl' --name '/home/Desktop/fastq/names' --out '/home/Desktop/fastq/fastqc' --files  '/home/Desktop/fastq/*.fastq'

Not really... What is the format of the "names" file? Did you make any changes to the fastqc script besides changing the name? Maybe try it with fewer arguments at first to see if it runs that way, e.g. from inside the directory of fastqs with just --files "*.fastq".

9.5 years ago

Not only can you merge the fastq files, but your life might be easier if you do. For merging them, a simple cat will suffice. I should note that you don't have to be delivered 300-odd files; you can request that whoever is doing the sequencing just give you two files (assuming paired-end) per sample/library (the bcl2fastq program that they use to process the bcl files produced by the sequencer can trivially do this).

If you don't want to wait until all of the files are merged, you can likely just use a named pipe as input to fastqc. Something like:

mkfifo foo.fastq.gz
# Run the writer in the background: opening a FIFO for writing blocks until a reader opens it.
cat sample_L1_R1_???.fastq.gz > foo.fastq.gz &
fastqc foo.fastq.gz

Given that fastqc is written in Java, I can't guarantee that it'll properly handle concatenated (multi-block) gzipped files like that (the Java gzip library has been broken for years). You can always use zcat instead of cat, so that fastqc receives uncompressed data. I should note that the only reason process substitution likely wouldn't work is that fastqc names the output files after the input file name.
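A minimal sketch of the decompress-and-stream variant, with the caveat that whether fastqc reads happily from a pipe is untested here. The function name `stream_fastq` and the file names in the usage note are made up. The pipe gets a plain `.fastq` name because the data flowing through it is already decompressed. The writer must run in the background, since opening a FIFO for writing blocks until something opens it for reading.

```shell
# Stream several gzipped FASTQ files through a named pipe as one plain FASTQ.
# Usage: stream_fastq <fifo_path> <file.gz>...
stream_fastq() {
    fifo=$1
    shift
    mkfifo "$fifo" || return 1
    # The writer blocks until a reader opens the pipe, so it is sent to the
    # background; otherwise the calling script would never reach the reader.
    gzip -dc "$@" > "$fifo" &
}
```

Hypothetical usage: `stream_fastq sample_L1_R1.fastq sample_L1_R1_???.fastq.gz`, then `fastqc sample_L1_R1.fastq`, then `wait` and `rm sample_L1_R1.fastq`. The report would be named after the pipe.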

For question 2, it depends a bit on what you want. The sequencing facility actually has an idea about that already (it's produced by the machine). It's easy enough to just ask them (they can also give you a breakdown of how many reads per sample, their average quality (also per sample), etc.). For our internal pipeline, I have a pdf produced with that sort of information, since it's a bit quicker to look first at a single table like that than to trudge through all of the fastqc files. BTW, fastqc also produces an HTML file with the images included. When I QC flowcells before sending results to our local groups, those are what I personally look at; it's quicker than dealing with the zip files.


Thank you very much, Dr Ryan, for the comment. Actually, we run an Illumina HiSeq platform in our lab, and I joined recently to handle and analyze the output data. I am running the bcl2fastq script, but our current version, CASAVA_v1.8.1, does not support the option --fastq-cluster-count 0, which would produce just one fastq file per sample.

Anyway, I simply concatenated the fastq files using cat (for each lane of each sample of a flowcell, for R1 as well as R2), e.g.:

cat ETH001100_CGATGT_L001_R1_00* > ETH001100_CGATGT_L001_R1_1-8.fastq.gz

This way I got 16 fastq files for each sample per flowcell, 32 in total for each sample across both flowcells. Then I ran fastqc on each of these 32*5=160 files for 5 samples. Is this approach correct? Please correct me if I am missing something or doing anything incorrectly. Thank you. Regards, Ravi


At least the most recent 1.8.X supports setting the cluster count to 0 (it's what I use in my pipeline), so you might consider upgrading.

Regarding the concatenation, why not merge the lanes within at least each flow cell as well? If you're only using one library per sample then that'd make sense. You'd then have 4 files per sample (one forward and one reverse per flow cell). You could also just merge them across flow cells. That's an annoying thing to try and automate, but for single projects it's easy enough.
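A rough sketch of the per-flow-cell merge, assuming file names follow the pattern seen in this thread (`<prefix>_L<lane>_R<1|2>_<chunk>.fastq.gz`). The function name `merge_sample` and the output names are made up. The one thing to be careful about is that R1 and R2 get concatenated in the same order, otherwise the read pairs no longer line up.

```shell
# Merge every lane/chunk of one sample into a single file per read direction.
# Usage: merge_sample <sample_prefix> [<input_dir> [<output_dir>]]
# Assumption: inputs are named <prefix>_L<lane>_R<1|2>_<chunk>.fastq.gz
merge_sample() {
    prefix=$1
    in=${2:-.}
    out=${3:-.}
    for mate in R1 R2; do
        # The glob expands in sorted order, so R1 and R2 are concatenated
        # lane by lane and chunk by chunk in the same order; pairs stay in sync.
        cat "$in/${prefix}"_L???_"${mate}"_???.fastq.gz > "$out/${prefix}_${mate}.fastq.gz"
    done
}
```

For example, `merge_sample ETH001100_CGATGT` run inside one flow cell's directory would leave `ETH001100_CGATGT_R1.fastq.gz` and `ETH001100_CGATGT_R2.fastq.gz` there.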


Thank you for the response, Dr Ryan. I'll ask to have our CASAVA version upgraded so that I can use the '--fastq-cluster-count' parameter. Yes, merging would be very helpful for me, but I am not sure whether the lane information will be required by downstream analysis pipelines/software, so I was keeping the files as they are. If no such information is required, I'll simply merge them within each flowcell and then merge the matching files from both flowcells, in order to have just 2 files per sample. Thank you, Ravi


Dr Ryan, I have one more question; please advise.

If I concatenate with cat like this:

cat ETH001100_CGATGT_L001_R1_00* > ETH001100_CGATGT_L001_R1_1-8.fastq.gz

the output file size is 2.1GB

whereas if I do it like:

zcat ETH001100_CGATGT_L001_R1_00* | gzip > ETH001100_CGATGT_L001_R1_1-8.fastq.gz

the file size is 1.7GB. Why do the final .gz files differ in size, and which is the correct way to merge the files?


That's somewhat expected. If you concatenate two smaller files then the resulting file's size will be the sum of its components. If you instead compress the decompressed concatenation then you'll get a somewhat smaller file, since it has more to work with when doing the compression (after all, the larger file has more redundancies than each of the smaller files).
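Either way of merging should give the same reads once decompressed, and that is easy to check. A small sketch follows; the helper names `content_sum` and `same_content` are made up for illustration.

```shell
# Print a checksum of the decompressed contents of a gzip file.
content_sum() {
    gzip -dc "$1" | cksum
}

# Succeed if two gzip files decompress to identical data,
# regardless of how large each compressed file is.
same_content() {
    [ "$(content_sum "$1")" = "$(content_sum "$2")" ]
}
```

Running `same_content cat_merged.fastq.gz recompressed.fastq.gz && echo identical` (file names hypothetical) should print `identical` if both merges hold the same data, even though the files differ in size on disk.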


I am concatenating a large set of data (around 45 GB of fastq files in total). When I follow the script by Dr. Ryan, the process hangs, i.e., nothing gets written to the FIFO file. Is there any fix for this situation?
