Dear all,
I recently started studying and trying to analyze NGS data. I believe that it would not be worth analyzing my NGS data due to the low depth (1x) and the large number of contigs in the assembly file (142). Although I was encouraged to try to perform the analyses, I am concerned about the quality of the results. I would like to know if there is any possibility of performing reliable analyses with this data or if it would be better to sequence the sample again. The data are from a complete eukaryotic genome and my goal is to annotate them, find possible secondary metabolites, biosynthetic gene clusters (BGCs) and other clusters. Would it be worth it?
I'm sure your coverage is not 1x if the assembly is that good. Try mapping the short reads to the assembly using a tool like bwa mem, then calculate the real read depth with a tool like samtools depth or samtools coverage.
How did you get to this point? What kind of data is this? What organism is it from? Having 142 contigs may not be so bad, if the contigs are long. How did you come up with the 1x number (by aligning reads back to the assembly)?
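To make the depth check suggested above a bit more concrete, here is a minimal Python sketch. It assumes the reads have already been mapped back to the assembly (for example with bwa mem) and that per-position depth was written out with samtools depth -a, which prints tab-separated contig, position and depth. The function name summarize_depth and the demo rows are made up for illustration; they are not from the original data. It reports mean depth and breadth of coverage (fraction of positions with at least one read) per contig, since the two are easy to mix up.

```python
from collections import defaultdict


def summarize_depth(lines):
    """Return {contig: (mean_depth, breadth)} from samtools-depth-style rows.

    Each row is expected to be "contig<TAB>position<TAB>depth".
    Breadth is the fraction of reported positions with depth > 0, so it is
    only meaningful if zero-depth positions are present in the input
    (e.g. output produced with `samtools depth -a`).
    """
    total = defaultdict(int)      # summed depth per contig
    positions = defaultdict(int)  # number of reported positions per contig
    covered = defaultdict(int)    # positions with at least one read
    for line in lines:
        line = line.strip()
        if not line:
            continue
        contig, _pos, depth = line.split("\t")[:3]
        depth = int(depth)
        total[contig] += depth
        positions[contig] += 1
        if depth > 0:
            covered[contig] += 1
    return {
        contig: (total[contig] / positions[contig],
                 covered[contig] / positions[contig])
        for contig in positions
    }


# Hypothetical rows, only to show the input format.
demo_rows = ["contig_1\t1\t2", "contig_1\t2\t0", "contig_2\t1\t30"]
for contig, (mean_depth, breadth) in summarize_depth(demo_rows).items():
    print(f"{contig}\tmean_depth={mean_depth:.2f}\tbreadth={breadth:.2f}")
```

For a real file you would pass the open file handle instead of demo_rows. samtools coverage reports similar per-contig numbers directly, so this is mainly useful for seeing what those numbers mean.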
This is the genome of a fungus. The vast majority of contigs are long (length>2000000 and length>900). 10 contigs have length<=200. As for the other questions, I can't say. The sequencing was done by someone else and I'm still waiting for them to tell me some information.
If this is a fungal genome then those stats are not bad. What is the expected genome size? The smallest contigs may not be usable, but you can start analyzing the data using the large contigs. Compare them to a closely related genome available in the databases.
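If it helps, the contig statistics discussed above can be pulled from the assembly FASTA with a few lines of Python. This is only a sketch: the function names, the 1000 bp cutoff and the toy sequences are assumptions for illustration, not values from this thread.

```python
def contig_lengths(fasta_lines):
    """Return {contig_name: length} from the lines of a FASTA file."""
    lengths = {}
    name = None
    for line in fasta_lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first word of the header as the contig name.
            parts = line[1:].split()
            name = parts[0] if parts else ""
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths


def n50(lengths):
    """Length L such that contigs of length >= L hold at least half the assembly."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0


def keep_long(lengths, min_len):
    """Names of contigs at least min_len bases long."""
    return [name for name, length in lengths.items() if length >= min_len]


# Toy input only; for a real assembly pass the open FASTA file handle.
demo_fasta = [">a first contig", "ACGTACGT", "ACGT", ">b", "ACG", ">c", "ACGTA"]
lengths = contig_lengths(demo_fasta)
print("contigs:", len(lengths))
print("total bases:", sum(lengths.values()))
print("N50:", n50(lengths.values()))
# 1000 bp is an arbitrary example cutoff for "large" contigs.
print("contigs >= 1000 bp:", keep_long(lengths, 1000))
```

Comparing the total length against the expected genome size, and looking at how much of it sits in the largest contigs, is a quick first sanity check before annotation.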
The expected size would be 33.19 Mb. The complete sequenced, cleaned and assembled genome is 32 Mb. As for depth, some contigs are 1x and others 0.33x. Only the contig related to a circular DNA has depth = 30x.
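For reference, average depth is usually estimated as total sequenced bases divided by genome size. The small Python sketch below works through that arithmetic for a 32 Mb genome. The 150 bp read length is purely hypothetical, since the read length and platform are not stated in this thread, and the function names are made up for illustration.

```python
import math


def expected_depth(n_reads, read_len, genome_size):
    """Average depth = total sequenced bases / genome size."""
    return n_reads * read_len / genome_size


def reads_needed(target_depth, read_len, genome_size):
    """Smallest number of reads giving at least target_depth on average."""
    return math.ceil(target_depth * genome_size / read_len)


GENOME_SIZE = 32_000_000  # assembled size mentioned above (32 Mb)
READ_LEN = 150            # hypothetical read length, not given in the thread

for depth in (1, 30):
    n = reads_needed(depth, READ_LEN, GENOME_SIZE)
    print(f"{depth}x needs about {n:,} reads of {READ_LEN} bp")
```

Plugging in the actual read count and read length from the sequencing report would show quickly whether a genome-wide depth of about 1x is plausible for this dataset.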
Thank you very much colindaven and GenoMax!! You have no idea how much you helped me! I was deciding whether to continue with this to do my undergraduate thesis. I wanted to get away from proteomics for a bit and try my hand at genomics.
Thank you very much!!
It's not possible to create a good assembly from 1x coverage of short reads. It's good that you are working through this, but please do some basic introductory bioinformatics tutorials (e.g., try the Galaxy Training site) before ploughing ahead. You'll learn much more quickly from a course than by confusing specialists with incorrect statements.
Best of luck with your project.
I will second colindaven. Do a first-pass analysis on the largest contigs (especially comparing them to other similar genomes available in databases). If things look strange then it would not be worth proceeding further. Granted, this is an undergrad thesis, but you want to start with data that is clean.
Thanks for your advice, I will follow it. But I'm not referring to coverage, but to depth.