Hello, I am using GenomeScope to characterize (determine size, ploidy, quality etc) of a de novo green algae genome assembly. My first question is regarding the input to GenomeScope, because the input is using Illumina reads there are both Forward and Reverse sequencing reads but all examples online seem to only be using one input file. Does GenomeScope require both F and R read files as input? Or can you just select one to use? Or do the reads need to be paired as one input together? My second question is if there are any useful tutorials or manuals online on how to interpret the outputted graph profile? Specifically, what does the number of peaks mean? and what is considered a bad amount of error?
Thank you in advance for any assistance to these questions!
Thank you for your response. I did as suggested and combined my F and R reads to calculate the K-mer frequencies (following the Jellyfish and GenomeScope pipeline on GitHub). I expected my organism to be haploid however I received 2 peaks in the outputted graph. The first bump is noticeably smaller at 20X coverage and the second is at 40X coverage. If I understand correctly, other reasons for the first smaller peak could include 1) sequencing errors, 2) duplication in the genome, or 3) a contaminant organism? GenomeScope also gave me the value uniq:49.9% and dup:0.808%. I received a similar output in BUSCO: C: 94.3%[S:19.8%, D:74.5%] showing I have more duplicated copy of genes than single copies, and my genome size is larger than expected. Trying to figure out if this duplication is worrisome because of technical errors or if it is plausible biological duplication in the genome. Any further advice is appreciated, thank you!
The peak at 20x is definitely not because of sequencing errors. The peak of k-mers that originate from sequencing errors is always at 1x-2x.
I advise to do the following things:
To check whether the reason is contamination, you can align contigs that have different coverage to the NCBI nt database by Blastn or Megablast and examine the best matches. Maybe you'll find that contigs that have 20x coverage belong to one species and contigs that have 40x coverage to another.
Also, you can try other genome assemblers. I always assemble genomes using at least 5 assemblers and then find which one assembled the genome in the best way.
Thank you again for a great response and for a useful solution! I aligned my reads to the genome however I am having a hard time plotting a histogram to view the entire genome - I tried using 'samtools depth' on my sorted bam files and then used R to plot but it uses a lot of memory and once a plot was generated the file was too large to open. Using BAMstats I am able to view each scaffold separately as a histogram of coverage distribution. A few new questions for you:
Thank you again!
You can quickly calculate the coverage histogram for the whole genome with the command
samtools stats is a useful tool to calculate various statistics for an alignment.