Hi all! Is a result like this ( https://imgur.com/DGxsFN7 ) concerning if the overall goal is de novo genome assembly?
I continued with the data as is, assembled the genome, and predicted proteins. I found that some proteins were duplicated (I'm not sure whether this is caused by what we see above). I then used a software tool for assembling heterozygous genomes and saw some improvement in the number of duplicated proteins, but I'm unsure whether this is the proper fix for this failed FastQC module. Any insight is greatly appreciated.
Additional info: I have 200 bp paired-end reads and high coverage (~1000X).
Since you have extremely high coverage, you should normalize your data before assembly. You can use bbnorm.sh from the BBMap suite to do that (a sketch of a typical run is further below); there is a guide available here.

Thanks for your suggestion. I tried this out. The only problem is that my assembly statistics are worse after normalization (the N50 decreased by more than half), so I am tempted to skip this step. Any other suggestions for dealing with this high level of duplication?
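For reference, a minimal bbnorm.sh run along the lines suggested above might look like the following; the file names and parameter values are assumptions for illustration, not the exact command used in this thread:

# normalize paired reads to ~100X depth, discarding reads with apparent depth under 5 (illustrative values)
bbnorm.sh in=reads_R1.fastq.gz in2=reads_R2.fastq.gz \
  out=norm_R1.fastq.gz out2=norm_R2.fastq.gz target=100 min=5

Here target=100 caps the very high-coverage regions at roughly 100X, and min=5 discards reads whose apparent depth is so low that they are presumed to be errors; both values match the example in the BBNorm guide and can be tuned.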
Is there a related (or same) genome available in public databases? You could try using it to guide your assembly (one possible way to do that is sketched below).
As for the other result, even if the N50 decreased by half, did it remove the duplications you were concerned about?
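For example, one common approach (using RagTag here purely as an illustration; the file names are placeholders) is to assemble de novo first and then order and orient the contigs against the related reference:

# scaffold de novo contigs against a related reference genome (placeholder file names)
ragtag.py scaffold related_reference.fasta my_contigs.fasta -o ragtag_scaffolds

Note that reference guidance like this only affects contig ordering and orientation; it will not by itself remove duplicated content caused by heterozygosity.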
Thanks! Yes, it took out the duplications, although I went from ~19 million (paired) reads to ~2 million after normalizing. The command I used was:
So it sounds like read normalization worked.
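A quick back-of-envelope check on the depth that is left, taking the ~1000X figure and the read counts above at face value:

# if ~19M read pairs corresponded to ~1000X, then keeping ~2M pairs leaves roughly:
echo "$(( 1000 * 2000000 / 19000000 ))X"    # prints 105X, still far more depth than a de novo assembly needs

So losing most of the reads is expected and not a problem in itself.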
The plot indicates that if you removed the duplicated sequences from your data, only 42.31% of the original data would remain. Compare it with the FastQC example report for bad Illumina data. Please attach your whole FastQC report.
Please do not delete posts. The purpose of this site is two-fold: more immediately, to help people with their questions; but in the long run, to serve as a repository of knowledge. The second purpose is defeated if people delete their questions.