Hi, I am working with whole genome sequencing (WGS) data from human samples, generated on the HiSeq 2500 platform at approximately 30X coverage. It is getting harder every day to handle the huge files: raw read data (.bcl/.fastq.gz), alignment files (.sam/.bam), the subsequent .bam files produced after "Mark Duplicates", "Local realignment around indels" and "Base Quality Score Recalibration", and finally the variants called with GATK's HaplotypeCaller. What are the best practices for handling such huge data files? Should I delete the older file once I have the next stage's file? It would be great if someone could suggest best practices for handling datasets/files in human WGS. Thank you.
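For the "should I delete the previous stage's file" part, a minimal shell sketch, assuming samtools is installed and using placeholder file names, of one way to do it safely: remove a stage's input only after its output passes a basic integrity check.

    #!/usr/bin/env bash
    set -euo pipefail

    # Placeholder file names; substitute your own naming scheme.
    PREV=sample.sam            # output of the previous stage
    NEXT=sample.sorted.bam     # output of the stage just completed

    # Delete the previous file only if the new BAM has a valid header
    # and an intact EOF block (i.e. it is not truncated).
    if samtools quickcheck "$NEXT"; then
        rm -f "$PREV"
    else
        echo "ERROR: $NEXT failed samtools quickcheck; keeping $PREV" >&2
        exit 1
    fi

The same check-then-delete pattern can be repeated after MarkDuplicates, indel realignment and BQSR, so that at any moment only one .bam per sample needs to stay on disk alongside the raw fastq.gz and the final VCF.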
Usually I merge all the fastq.gz files of the forward and reverse reads from all lanes of a sample to form one final forward (R1) and one reverse (R2) read file. I then use BWA-MEM to align them to the human reference genome (GRCh38); the output .sam file is usually 250-350 GB in size and eventually gets converted (by Picard) into a .bam file of 45-50 GB. The next 3-4 steps generate similar .bam files of 40-50 GB each, and the final variant calling file (.vcf) is normally 120-135 GB. So overall I get 500-700 GB of data from start to end for one human WGS sample. It would be great if you could let me know how to map each fastq.gz file in parallel. Usually I get
[8 lanes × 2 (forward and reverse) × 8 (files of each type)] × 2 = 256 fastq.gz
files for a sample. Also, I have access to a SUN cluster with 16 nodes of 8 cores each. Thank you.

See https://github.com/lindenb/ngsxml and the -j option of GNU make. Cluster managers like SGE also have a parallelized version of make.
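On the 250-350 GB intermediate .sam: one common way to avoid writing it at all is to pipe bwa mem straight into samtools sort, so only the coordinate-sorted .bam ever touches disk. A minimal sketch, with placeholder file names, read-group string and thread counts:

    #!/usr/bin/env bash
    set -euo pipefail

    REF=GRCh38.fa                  # reference FASTA, already indexed with 'bwa index'
    R1=sample_L001_R1.fastq.gz     # placeholder lane-level read files
    R2=sample_L001_R2.fastq.gz

    # Align and sort in one pipeline; no .sam is written to disk.
    bwa mem -t 8 -R '@RG\tID:L001\tSM:sample\tPL:ILLUMINA' "$REF" "$R1" "$R2" \
      | samtools sort -@ 4 -o sample_L001.sorted.bam -
    samtools index sample_L001.sorted.bam

Doing this per lane, rather than concatenating everything into one giant R1/R2 pair first, also gives you independent jobs that can run in parallel and be merged afterwards, which leads directly to the make-based approach below.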
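Along those lines, a minimal hand-written Makefile sketch in the spirit of the ngsxml/make suggestion above, assuming one R1/R2 pair per lane and placeholder names throughout: each lane has its own rule, the per-lane BAMs are merged at the end, and make -j N runs up to N lane alignments at once (recipe lines must start with a tab).

    REF    = GRCh38.fa
    SAMPLE = sample
    LANES  = L001 L002 L003 L004 L005 L006 L007 L008
    BAMS   = $(LANES:%=$(SAMPLE).%.sorted.bam)

    .PHONY: all
    all: $(SAMPLE).merged.bam

    # One pattern rule covers every lane: align and sort without an intermediate .sam.
    $(SAMPLE).%.sorted.bam: $(SAMPLE)_%_R1.fastq.gz $(SAMPLE)_%_R2.fastq.gz
    	bwa mem -t 8 -R '@RG\tID:$*\tSM:$(SAMPLE)\tPL:ILLUMINA' $(REF) $^ \
    	  | samtools sort -@ 4 -o $@ -

    # Merge the per-lane BAMs once they all exist, then index the result.
    $(SAMPLE).merged.bam: $(BAMS)
    	samtools merge -f $@ $^
    	samtools index $@

With bwa mem -t 8 each lane alignment already occupies a full 8-core node, so on a single node a small -j is enough; across the 16-node cluster, SGE's distributed make (qmake), or one qsub job per lane rule, can spread the lanes over the nodes, which is the parallelized make mentioned above.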