I used jellyfish to count kmers and the obtained histograms were analyzed using Genome scope (http://qb.cshl.edu/genomescope/) in order to estimate effective genome size. However the resulting bp estimation of 61,351 bps is unexpectedly small for human data. The paired-end data was obtained from a Chip-seq experiment on Geo (GSE72141). Additionally, the model only converged on the input fastq files not the actual chip files. Any thoughts and/or suggestions are welcome.
My paired end processing workflow of raw data from GEO consisted of
trimming with bbduk and filtering for minlen=36 (readlength)
mapping to hg38 with bowtie2 PE
removing duplicates with samtools rmdup
filtering out unmapped and MAPQ<10 with samtools -F 4 -q 10
name sorting with samtools sort -n
and converting back to fastq with samtools fastq bam>-1.fq.gz -2.fq.gz
I used this code for counting kmers
jellyfish count -t 30 \
-C -m 31 \
-s 5G \
-o "$line"_out \
--min-qual-char=? <(zcat "$Folder"/"$line"_1.fastq.gz) <(zcat "$Folder"/"$line"_2.fastq.gz)
done<"$INPUT"
and this for histo generation
for file in /align/jelly/*_out; do jellyfish histo -t 10 "$file" >"$file".histo; done
I uploaded the obtained plots into genome scope to estimate genome size with kmer size=31 read length =36 max kmer coverage =1000
Results of which are:
GenomeScope version 1.0
k = 31
property min max
Heterozygosity 2.14024% 2.22375%
Genome Haploid Length 59,018 bp 61,351 bp
Genome Repeat Length 29,755 bp 30,931 bp
Genome Unique Length 29,263 bp 30,420 bp
Model Fit 90.075% 94.9174%
Read Error Rate 11.7095% 11.7095%
Thank you. I figured as much. However, is it possible to use the input fastq files for estimation because these were not biased by selection?
Additionally, do you by chance know of any other approaches for estimating genome size from reads raw or aligned?
I think the input should be ok'ish (if sequenced to sufficient depth).
I assume you want to get a feeling for the mappable genome size (rather than the actual number of non-N bp in the latest release). Maybe the details from the deepTools documentation are helpful.
Do let us know what the ultimate goal of these endeavors is, it may help us to help you more.
I do not understand what you mean by input fastq files. If the fastq files are the result of ChIP-seq they are already biased.
Yes, using unaligned whole genome re-sequencing data in the same way you did. I just do not understand why you want to get a size estimate for a human genome. It's not going to be better than the actual assembly size.
The input fastq files I am referring to are simply the sequenced input DNA but without any antibody selection. So I would think they could be utilized for size estimation.
The reason I am interested in estimating genome size is because I filtered out my aligned reads based on MAPQ score and blacklisted regions. Which to my understanding will alter downstream analysis (e.g. peak calling and visualization) unless I re-estimate the genome size with respect to those considerations. If I am incorrect regarding this please let me know as I am still new to this analysis.
I have examined the deeptools documentation and found they recommend unique-khmer.py for this sort of task however, I am finding navigating the khmer documentation to be difficult and time consuming.
I think you are seriously overcomplicating this task. The genome size of humans is known, there is no need to estimate it. Filtering reads does not influence this, nor does it alter any visualization tasks. Do yourself a favor and skip whatever you are trying to do here and go on with standard analysis unless you find a reputable source that suggests do modify standard workflows.
I can relate to where the impetus is coming from, after all, if one wanted to generate very exact coverage files, it wouldn't be wrong to focus on the part of the genome that was actually covered in the individual files. So, in principle you're not wrong. In practice, however, you will find that no one goes to these lengths because the experiments themselves are imperfect and imprecise and the coverage normalization that uses the entire genome is mostly utilized for visualization purposes, which are subject to additional loss of precision. For the vast majority of the samples, it will therefore not matter to get the exact portion of the genome that was covered, just as ATpoint remarked, go with the standard values for the human genome (you can always subtract the number of bp covered in the Blacklisted regions to clear your conscience). But, to be clear, this mostly demonstrates how we've accepted the noisiness and lack of true quantification of these assays rather than your misunderstanding of the bioinformatics concepts. The trade-off of trying to find a seemingly precise number for the mappable genome for each sample is simply not worth it in the grand scheme of things, but that's something you cannot know until you've dealt with that type of data a bit.