Genome Size Estimation with Jellyfish and Genome Scope is Unexpectedly Small
1
1
Entering edit mode
3.0 years ago
dk0319 ▴ 70

I used jellyfish to count kmers and the obtained histograms were analyzed using Genome scope (http://qb.cshl.edu/genomescope/) in order to estimate effective genome size. However the resulting bp estimation of 61,351 bps is unexpectedly small for human data. The paired-end data was obtained from a Chip-seq experiment on Geo (GSE72141). Additionally, the model only converged on the input fastq files not the actual chip files. Any thoughts and/or suggestions are welcome.

My paired end processing workflow of raw data from GEO consisted of 
trimming with bbduk and filtering for minlen=36 (readlength)
mapping to hg38 with bowtie2 PE 
removing duplicates with samtools rmdup
filtering out unmapped and MAPQ<10 with samtools -F 4 -q 10
name sorting with samtools sort -n
and converting back to fastq with samtools fastq bam>-1.fq.gz -2.fq.gz

I used this code for counting kmers

jellyfish count -t 30 \
-C -m 31 \
-s 5G \
-o "$line"_out \
--min-qual-char=? <(zcat "$Folder"/"$line"_1.fastq.gz) <(zcat "$Folder"/"$line"_2.fastq.gz)
done<"$INPUT"

and this for histo generation

for file in /align/jelly/*_out; do jellyfish histo -t 10 "$file" >"$file".histo; done

I uploaded the obtained plots into genome scope to estimate genome size with kmer size=31 read length =36 max kmer coverage =1000

Results of which are:

GenomeScope version 1.0
k = 31

property                      min               max               
Heterozygosity                2.14024%          2.22375%          
Genome Haploid Length         59,018 bp         61,351 bp         
Genome Repeat Length          29,755 bp         30,931 bp         
Genome Unique Length          29,263 bp         30,420 bp         
Model Fit                     90.075%           94.9174%          
Read Error Rate               11.7095%          11.7095%       
genome ChIP-Seq jellyfish • 2.6k views
ADD COMMENT
1
Entering edit mode
3.0 years ago
Michael 55k

You cannot estimate genome size from ChIP-seq data because the whole point of ChIP is to identify and sequence only the DNA bound to a DNA-binding protein. Therefore, your size estimate represents at best the proportion of DNA bound by protein. However, if the binding motifs are very similar you might even underestimate this size because you get less unique k-mers than by random sampling. In conclusion, in order to estimate genome size, you need whole genome sequencing data.

ADD COMMENT
0
Entering edit mode

Thank you. I figured as much. However, is it possible to use the input fastq files for estimation because these were not biased by selection?

Additionally, do you by chance know of any other approaches for estimating genome size from reads raw or aligned?

ADD REPLY
1
Entering edit mode

I think the input should be ok'ish (if sequenced to sufficient depth).

I assume you want to get a feeling for the mappable genome size (rather than the actual number of non-N bp in the latest release). Maybe the details from the deepTools documentation are helpful.

Do let us know what the ultimate goal of these endeavors is, it may help us to help you more.

ADD REPLY
0
Entering edit mode

However, is it possible to use the input fastq files for estimation because these were not biased by selection?

I do not understand what you mean by input fastq files. If the fastq files are the result of ChIP-seq they are already biased.

Additionally, do you by chance know of any other approaches for estimating genome size from reads raw or aligned?

Yes, using unaligned whole genome re-sequencing data in the same way you did. I just do not understand why you want to get a size estimate for a human genome. It's not going to be better than the actual assembly size.

ADD REPLY
0
Entering edit mode

The input fastq files I am referring to are simply the sequenced input DNA but without any antibody selection. So I would think they could be utilized for size estimation.

The reason I am interested in estimating genome size is because I filtered out my aligned reads based on MAPQ score and blacklisted regions. Which to my understanding will alter downstream analysis (e.g. peak calling and visualization) unless I re-estimate the genome size with respect to those considerations. If I am incorrect regarding this please let me know as I am still new to this analysis.

I have examined the deeptools documentation and found they recommend unique-khmer.py for this sort of task however, I am finding navigating the khmer documentation to be difficult and time consuming.

ADD REPLY
2
Entering edit mode

I think you are seriously overcomplicating this task. The genome size of humans is known, there is no need to estimate it. Filtering reads does not influence this, nor does it alter any visualization tasks. Do yourself a favor and skip whatever you are trying to do here and go on with standard analysis unless you find a reputable source that suggests do modify standard workflows.

ADD REPLY
1
Entering edit mode

I can relate to where the impetus is coming from, after all, if one wanted to generate very exact coverage files, it wouldn't be wrong to focus on the part of the genome that was actually covered in the individual files. So, in principle you're not wrong. In practice, however, you will find that no one goes to these lengths because the experiments themselves are imperfect and imprecise and the coverage normalization that uses the entire genome is mostly utilized for visualization purposes, which are subject to additional loss of precision. For the vast majority of the samples, it will therefore not matter to get the exact portion of the genome that was covered, just as ATpoint remarked, go with the standard values for the human genome (you can always subtract the number of bp covered in the Blacklisted regions to clear your conscience). But, to be clear, this mostly demonstrates how we've accepted the noisiness and lack of true quantification of these assays rather than your misunderstanding of the bioinformatics concepts. The trade-off of trying to find a seemingly precise number for the mappable genome for each sample is simply not worth it in the grand scheme of things, but that's something you cannot know until you've dealt with that type of data a bit.

ADD REPLY

Login before adding your answer.

Traffic: 1703 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6