How to filter huge VCF files for datamining?
0
0
Entering edit mode
7.6 years ago
jespinoz ▴ 20

I have 88 human samples from a NextSeq run. I generated BCF files by first running HISAT2 and the preconfigured Ensembl hg38 database available on their documentation. After that, I converted the sam files to sorted bam files and then individual bcf files. All together, these 88 BCF files total up to 5.7 TB (5.8 TB below b/c I copied 2 to try to merge as a test) which is gigantic. I need to substantially reduce the file size and the SNPs analysis is only a secondary addition to one of our studies. I planned to look at cooccurrence of SNPs between the 88 samples. I understand SNP studies tend to focus on a rare SNPs but sparse matrices are not useful for the data mining algorithms I wish to use. My question is how I can filter these BCF files to combine them into a single data matrix that is manageable on a 4 GB laptop? What kind of filtering is typically done on SNPs to dramatically reduce the complexity while preserving a safe much information content as possible? I understand this seems contradictory but prototyping is my main goal with this analysis. How should they be filtered and what would be good cutoffs to use?

We have 88 samples whose reads together total to about 746 G in size. (symbollic links below which is why file size is smaller)

We used HISAT2 for the mapping using human assembly hg38. HISAT2 supplies preindexed files that we used located at ftp://ftp.ccb.jhu.edu/pub/infphilo/hisat2/data/grch38.tar.gz.

The assembly for the genome used for the indexing was retrieved from ftp://ftp.ensembl.org/pub/release-84/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz.

$ du -sh *
5.8T    bcf_files
8.5G    grch38
4.7G    grch38.tar.gz
435K    reads
34K run_tmp.sh
176G    sam_files
38G sorted_bam_files

My pipeline:

Create the sam file

hisat2 -q -p 2 --fast -x ./grch38/genome -1 {r1_path} -2 {r2_path} -S ./sam_files/{lib}.kneaddata.paired.human.bowtie2.R1-R2.sam

Sam => Sorted-bam

samtools view -bS ./sam_files/{lib}.kneaddata.paired.human.bowtie2.R1-R2.sam | samtools sort -@ 16 -o ./sorted_bam_files/{lib}.kneaddata.paired.human.bowtie2.R1-R2.sam.bam.sorted

Sorted-bam => BCF samtools mpileup -uf ./grch38/Homo_sapiens.GRCh38.dna.primary_assembly.fa -C 50 --BCF -o ./bcf_files/{lib}.kneaddata.paired.human.bowtie2.R1-R2.sam.bam.sorted.bcf ./sorted_bam_files/{lib}.kneaddata.paired.human.bowtie2.R1-R2.sam.bam.sorted

vcf bcf genome genotype snps • 4.0k views
ADD COMMENT
3
Entering edit mode

You are probably producing so-called all-site VCFs. Such VCFs can be helpful to popgen analyses, but rarely used for other applications. Most of time we produce variant-only multiple-sample VCFs, which are by far smaller. What is your target application?

ADD REPLY
2
Entering edit mode

Pretty sure your VCF isn't the largest file - excluding the index files, it would be the smallest, in fact. With 88 samples, it might be just over a gig in size. BAM files, OTOH are around 60G in size for a 30X depth run.

VCF filtration is generally ad hoc and depends on the kind of analysis you wish to perform, but if you want to work on a 4 GB laptop (I'm assuming 4GB RAM, which is not good for most bioinformatics analyses), you're going to put up with lower speeds as you use software like plink to analyze your VCF file.

ADD REPLY
1
Entering edit mode

I am also surprised here, even 88 multiplied times the size of the human genome does not equal 4TB

ADD REPLY
0
Entering edit mode

These are human reads extract from metagenomic reads so the read depth is all over the place.

ADD REPLY
0
Entering edit mode

I've repeated this using ftp://ftp.ccb.jhu.edu/pub/infphilo/hisat2/data/grch38.tar.gz and I'm creating individual bcf files. It's 71.5% done and the directory is 4.2 TB.

ADD REPLY
0
Entering edit mode

Are you creating so-called gVCF files (I think this is what @lh3 referred to above)? It just doesn't seem like a normal VCF can conceivably have that much data. Also, have you considered doing joint genotyping?

ADD REPLY
0
Entering edit mode

I'm running this command for each of the 88 samples: samtools mpileup -uf ./grch38/Homo_sapiens.GRCh38.dna.primary_assembly.fa -C 50 --BCF -o ./bcf_files/1054.2_RD1.kneaddata.paired.human.bowtie2.R1-R2.sam.bam.sorted.bcf ./sorted_bam_files/1054.2_RD1.kneaddata.paired.human.bowtie2.R1-R2.sam.bam.sorted

ADD REPLY
0
Entering edit mode

gVCFs are not super large either - I have a 250+ sample gVCF and it's less ~10G in size. gVCF does contain all sites, but it compresses contiguous variant-free sites into regions, unless you force it to emit all sites individually. See http://gatkforums.broadinstitute.org/gatk/discussion/4017/what-is-a-gvcf-and-how-is-it-different-from-a-regular-vcf

ADD REPLY
0
Entering edit mode

Thanks for your patience. I've been traveling so haven't gotten a chance to respond to the comments. @lh3, I think you are right. I used ftp://ftp.ccb.jhu.edu/pub/infphilo/hisat2/data/grch38_snp.tar.gz (4.6 GB) instead of ftp://ftp.ccb.jhu.edu/pub/infphilo/hisat2/data/grch38.tar.gz (3.9 GB) which I think is creating the large files and being overly comprehensive for my desired analysis. Do you know if the SNP db I used may have generated the "all-site VCF" you were referring to? My main goal is to look at groups of SNPs that correlate with a specific trait and I want to do this in Python using Pandas. @Ram, my work computer is 8GB RAM and my home Macbook Air is 4GB RAM which I may use from time to time. I have access to the Grid (which is how I'm generating these files) but I don't want to prototype new ideas using this giant dataset. @cmdcolin, I've updated my question to include my directory size.

ADD REPLY

Login before adding your answer.

Traffic: 2441 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6