Hi,
I have data from two permafrost samples: 250 bp paired-end shotgun sequencing from a NovaSeq. FastQC showed poly-G tails, which I removed with fastp, and I have tried a few things to analyse this data.
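For context, poly-G tails on NovaSeq data come from the two-color chemistry, where "no signal" is read as G, so degraded short fragments get padded with G runs. fastp's poly-G trimming is quality-aware and mismatch-tolerant; `trim_poly_g` below is a hypothetical, simplified sketch of the basic idea only:

```python
def trim_poly_g(seq, qual, min_run=10):
    """Strip a trailing run of G's of at least min_run bases.

    Simplified sketch: fastp's real poly-G detection also tolerates
    occasional non-G bases and considers base qualities.
    """
    n = len(seq)
    i = n
    while i > 0 and seq[i - 1] == "G":
        i -= 1
    if n - i >= min_run:  # run long enough to call a poly-G artifact
        return seq[:i], qual[:i]
    return seq, qual

# Example: a 20 bp read ending in a 12-base G run
seq = "ACGTACGT" + "G" * 12
qual = "I" * 20
trimmed_seq, trimmed_qual = trim_poly_g(seq, qual)
print(trimmed_seq)  # ACGTACGT
```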
First, after poly-G removal, I assembled contigs with SPAdes and MEGAHIT, then tried binning with MetaBAT, MaxBin, and CONCOCT.
I didn't get any MAGs; almost all bins had low completeness and high contamination (CheckM output). I checked taxonomy with Kraken2 and the Standard-16 database (downloaded from https://benlangmead.github.io/aws-indexes/k2): 83-87% of contigs had no hits. Then I ran Kraken2 on the raw reads, trimmed and untrimmed, and the picture was worse: 98% no hits.
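The "% no hits" figure can be checked directly from Kraken2's per-read output, where the first tab-separated column is "C" (classified) or "U" (unclassified). A small stdlib sketch with mock output lines (`unclassified_fraction` is a hypothetical helper, not part of Kraken2):

```python
def unclassified_fraction(lines):
    """Fraction of reads Kraken2 left unclassified.

    Each line of Kraken2's per-read output starts with 'C' or 'U',
    followed by read ID, taxid, length, and k-mer mappings.
    """
    total = unclassified = 0
    for line in lines:
        if not line.strip():
            continue
        total += 1
        if line.split("\t", 1)[0] == "U":
            unclassified += 1
    return unclassified / total if total else 0.0

# Example with three mock output lines
mock = [
    "C\tread1\t562\t150\t562:10",
    "U\tread2\t0\t150\t0:10",
    "U\tread3\t0\t150\t0:10",
]
print(unclassified_fraction(mock))  # 0.6666666666666666
```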
So I was advised to blastn 100 reads taken from the single-end reads. BLAST returned a few dozen bacterial species with medium bit scores (hits of ~60 bp, but usually with 12-15 mismatches); Kraken2 on the same 100 reads reported only 4 bacterial species, and most reads again had no hits.
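Pulling a random subset of reads for a blastn sanity check is usually done with something like `seqtk sample`; the same idea in a stdlib sketch over FASTQ 4-line records (`sample_reads` and the mock data are illustrative, not a real tool):

```python
import random

def sample_reads(fastq_lines, n, seed=42):
    """Randomly pick n reads (4-line FASTQ records) from a list of lines.

    A fixed seed makes the subsample reproducible, the same way
    seqtk sample's -s option does.
    """
    records = [fastq_lines[i:i + 4] for i in range(0, len(fastq_lines), 4)]
    rng = random.Random(seed)
    return rng.sample(records, min(n, len(records)))

# Mock FASTQ with 5 reads; sample 2 of them
fastq = []
for i in range(5):
    fastq += [f"@read{i}", "ACGT", "+", "IIII"]
subset = sample_reads(fastq, 2)
print(len(subset))  # 2
```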
Then I mapped the paired and single reads with Bowtie2 against the reference genomes I had used for blastn: less than 1% of the data mapped, with little difference between paired and single reads.
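That <1% mapping rate is normally read off `samtools flagstat`; for completeness, the same number can be derived from the SAM FLAG field, where bit 0x4 marks an unmapped record. A minimal sketch (`mapped_fraction` and the mock SAM lines are hypothetical):

```python
def mapped_fraction(sam_lines):
    """Fraction of SAM records with the 'unmapped' flag (0x4) unset.

    Header lines starting with '@' are skipped; in practice
    `samtools flagstat` reports this directly.
    """
    total = mapped = 0
    for line in sam_lines:
        if line.startswith("@") or not line.strip():
            continue
        flag = int(line.split("\t")[2 - 1])  # FLAG is the second column
        total += 1
        if not flag & 0x4:
            mapped += 1
    return mapped / total if total else 0.0

# Mock SAM: one mapped read (flag 0), one unmapped (flag 4)
mock_sam = [
    "@HD\tVN:1.6",
    "r1\t0\tref\t1\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "r2\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
]
print(mapped_fraction(mock_sam))  # 0.5
```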
I am kind of stuck and not sure what to do next. Has anyone encountered such low data quality? What can I do with it?
The main goal for this dataset is to reconstruct MAGs and metabolic pathways. During library preparation we saw that our DNA was heavily fragmented; the DNA quality was probably very poor.
I would appreciate any help.
Thanks,
Alla
So basically the DNA is degraded? What read length is left after removal of adapters (which likely preceded the poly-G tails)? It is possible that your data will not be usable (not what you want to hear) for the analysis at hand.
The poly-G tails started around position 150, so reads are about 150 bp after trimming; sorry, I forgot to mention that. Yeah, I hope I can get something from it, but I understand the chances are small if everything mentioned above didn't work.
I wonder if you can use methods developed for degraded DNA (e.g. from bones/paleontology). What do you ultimately hope to get from this analysis? Someone more knowledgeable in the area may have more to add later.
What are the N50 and total length of the assembly after removing contigs with a length below 1500 bp?
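The N50 asked about here is the contig length at which half of the (length-filtered) assembly is contained in contigs that long or longer; QUAST reports it, but it is easy to compute from contig lengths. A sketch with a hypothetical `n50` helper and illustrative lengths:

```python
def n50(lengths, min_len=1500):
    """N50 of contigs at least min_len long.

    Sort surviving lengths descending and walk down until the running
    sum reaches half of the filtered total; that contig's length is N50.
    """
    kept = sorted((l for l in lengths if l >= min_len), reverse=True)
    total = sum(kept)
    running = 0
    for l in kept:
        running += l
        if running * 2 >= total:
            return l
    return 0

# Example: the two shortest contigs fall below the 1500 bp cutoff
lengths = [5000, 3000, 2000, 1600, 800, 400]
print(sum(l for l in lengths if l >= 1500))  # 11600 (filtered total length)
print(n50(lengths))  # 3000
```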
Regarding the binning: MaxBin, CONCOCT, and MetaBAT have been outperformed by more recent tools. You might want to try SemiBin2 (https://semibin.readthedocs.io/en/latest/semibin2/) or COMEBin (https://github.com/ziyewang/COMEBin).
Thank you so much, I will try them :)
We had:
Total length (>= 5000 bp): 192,967,623
Total length (>= 10000 bp): 114,074,737
Total length (>= 25000 bp): 55,645,877
Total length (>= 50000 bp): 30,758,772
N50: 1116
I have also found the CarpeDeam assembly tool: https://www.biorxiv.org/content/10.1101/2024.08.09.607291v1
I will write an update if something works out with this data.
Thanks for the advice!
An N50 of 1116 is not great, but not terrible.
Older binners like MaxBin and MetaBAT perform best with contigs having a minimum length of 2.5 Kbp, but this largely depends on the complexity of the microbial community.
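In practice that 2.5 kbp cutoff is applied by filtering the assembly before binning, e.g. with `seqkit seq -m 2500`. The same filter as a stdlib sketch over FASTA lines (`filter_contigs` and the mock records are illustrative):

```python
def filter_contigs(fasta_lines, min_len=2500):
    """Return (header, sequence) pairs for contigs >= min_len bp.

    Sketch of the pre-binning length filter; a trailing '>' sentinel
    flushes the final record.
    """
    records, header, seq = [], None, []
    for line in list(fasta_lines) + [">"]:
        if line.startswith(">"):
            if header is not None and len("".join(seq)) >= min_len:
                records.append((header, "".join(seq)))
            header, seq = line.rstrip(), []
        else:
            seq.append(line.strip())
    return records

# Mock assembly: one 3000 bp contig kept, one 1000 bp contig dropped
mock_fasta = [">c1", "A" * 3000, ">c2", "A" * 1000]
print([h for h, s in filter_contigs(mock_fasta)])  # ['>c1']
```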
SemiBin2 and COMEBin can work with shorter contigs; however, I would not expect a dramatic improvement. Maybe you will get a handful of MAGs, but nothing more.