Uncertainty of how to proceed with Metagenomic Analysis of WGS data
2
0
Entering edit mode
14 months ago
DonPhager • 0

Hello guys, I am relatively new to metagenomic data analysis and even bioinformatics in general. To maybe give you a brief idea of what my issue is, I will just explain what my data is about: the data generated derives from ~ 60 samples of phage enrichments, where I have multiple biological replicates (3-5) of each sample. Therefore the data is not strictly "metagenomic" because they have been enriched once but the phage in the samples are not known and there should be more than just one phage for example because they derive from an environmental sample. Also host dna has been removed in the extraction step.

I have already finished the QC, Trimming and Assembly and I am now left with a lot of contigs in each sample and I am unsure on how to proceed. I think the next step would be to bin my configs but I have already tried multiple programs like CDHit and Metabat2 and the results do not really convince me. CD hit appeared a bit to "easy" because of the heuristic approach but with metabat2 I also get around 1-2 bins with only one config which is also mostly classified as complete in CheckV but the rest of the bins contains 2-7 contigs each and I am unsure how to proceed with this. Before binning I removed all contigs <1kb.

The goal is to know which phages are in the samples (I would have used BLASTn for this?) and how abundant they are and how this differs between the sample groups.

Another thing is the k-mer size in metaSPAdes which I used to assemble. In most programs I focused on keeping most of the default settings because I am not as experienced but in SPAdes I noticed that it automatically only assembled with k=21,33,55. Online I read that some people also assemble with k=77, I tried this but mostly the results were equal.

I really appreciate any help, let me know if there might be information I need to give you.

Update - 17.11.23:

Hey guys, I am sorry to bother so much but I am still confused and not sure how to proceed further. I have completed QC, Trimming and Assembly. I am now set with all my contigs which I also checked with CheckV and VirSorter2. The biggest contigs >20-30 kbp in my 30 samples are mostly even marked as complete but I am also having a lot of small contigs (up to 9000 in some samples).

My plan was to set a length cutoff of 5000 bp and then assign Taxonomy to these with. BLASTn and then manually checking the samples and assign the contigs to vOTUs.

All my efforts with binning the contigs do not really work since MetaBat sometimes gives me bins of only 1 config (the longest) and then another 2nd bin with random contigs from one sample but does not bin the 2nd, 3rd or 4th longest config for example although they do not have bad quality in CheckV. What also makes the binning difficult is that I have 30 samples but really they are just 6 samples but with 5 biological replicates each so I am not sure if one should concatenate the contigs from these groups or if I can move on with all samples separately. Until now I did QC, Trimming and Assembly for every sample separately.

PS: Sorry for posting this as a reply first!

Sequencing Viral Binning Assembly • 2.1k views
ADD COMMENT
1
Entering edit mode
ADD REPLY
2
Entering edit mode
14 months ago

Kmer lengths depend on read lengths; normally I go up to ~93 for 150bp reads. If you have overlapping paired end reads you can merge them and use even longer kmers which will yield a better assembly. You can look at the /docs/assemblyPipeline.sh in BBTools for more details. Also, I generally consider Spades to be a much better assembler in most situations (microbial isolates, metagenomes). But specifically for viruses, I've gotten better assemblies with Tadpole. I think it might be partly because viruses have a high mutation rate which Spades misinterprets as inexact repeats within a larger genome; basically, Spades seems to try to assemble all variants of the virus, leading to, for example, 50kbp of contigs for a 20kbp virus.

Binning viruses sounds difficult and I don't think I can help you with that, but I think someone recently wrote a binning tool that works by exploring the assembly graph... that sounds to me like it would work better on viruses than traditional taxonomy or kmer-frequency based approaches. Also, I don't think it's a good idea to remove short contigs prior to binning (especially if the binning tool needs to do graph traversal).

You can also try a different scaffolding tool. Spades does scaffolding but in practice I usually find that its contigs.fa and scaffolds.fa are almost identical. BBTools has a scaffolding tool, Lilypad, that works using paired reads aligned to the assembly. Of course in that case you should use ALL contigs, without throwing any (including very short ones) away.

ADD COMMENT
1
Entering edit mode

I've gotten better assemblies with Tadpole

Do you have specific recommendations for using tadpole with virome data?

ADD REPLY
2
Entering edit mode

It depends on the data, but unlike Spades, Tadpole only uses a single kmer length so it should be as long as possible; typically ~2/3rds of read length. If a high percentage of reads merge (including running BBMerge with the "extend2" flag, which allows merging of nonoverlapping reads) you can even use kmer lengths longer than read length. Error-correction is important too; generally, I follow the protocol in assemblyPipeline.sh.

ADD REPLY
0
Entering edit mode

Okay, thank you very much for trying to help! :)

I had a good look at all your points and tried a bit out: it seems to me that tadpole produces less contigs but most of the time the largest contig is shorter than the one I would get from metaSPAdes. I guess it is more favorable for me to stick to metaSPAdes and it auto-picking the k-mer since using 21,33,55,77,93 also produces a little shorter reads for me.

I also noticed that in SPAdes contigs and scaffolds are mostly the same. However, Lilypad also did not change much. It scaffolded a few of the short reads <1kbp but did not improve much in the longer reads.

I think I might have a look into the Cenote-Taker2 thing and maybe Vcontact2 is worth a look?

I have another question: I noticed when doing FastQC on my raw reads that I have really high duplication rates with ~79% at max in some samples. However, this remains the same in reports I made on adapter and q-trimmed reads. I thought this might be just because I already had a limited spectrum of phages already in my samples so there are many "copies" in there. What do you think? Also the blue and red line in the diagram are the same.

I used bbduk for trimming with these commands - I was unsure about the high trimq of 30 but apparently most of my reads are already > PHRED 30 according to FastQC. The rest should be fine. adapter: ktrim=r hdist=1 tpe tbo k=23 mink=11 q: qtrim=rl trimq=30 minlen=50

Should I deduplicate the reads? I feared this will have bad impact of the abundance calculation later on.

enter image description here

ADD REPLY
1
Entering edit mode

If these are paired reads, I would typically recommend deduplicating them before assembly. I don't generally recommend deduplicating single-end reads though. You can run Clumpify with and without the "optical" flag to try to determine whether these are library PCR duplicates or something that happened on the flowcell; for abundance calculations, removing nearby duplicates should be fine, while removing all duplicates is much more iffy in quantitative situations where you have short genomes (or transcripts) with super-high coverage since some duplicates will occur naturally by chance. You can always assemble from deduplicated reads and quantify with raw reads.

If you calculate the coverage you can determine approximately what fraction of reads are expected to be duplicates by chance, given the insert size distribution, so it's possible that these are in fact not artifacts, in which case they should not be removed since deduplication does enrich for reads with errors.

ADD REPLY
0
Entering edit mode

Thank you very much for the help up to this point. I did some adaptions to my protocol including reducing the high PHRED of 30 for trimming to 20. So I gained about 10% more sequences for the de novo assembly and I saw a lot of threads on the forum that did not advise to have such a high PHRED.

My plan before I think about binning again would now be to identify the viral sequences with some certainty. Do you have any specific recommendations for that? I saw this protocol https://www.protocols.io/view/viral-sequence-identification-sop-with-virsorter2-5qpvoyqebg4o/v3 but I am also kind of cautious with what I use in my analysis because in the end I want to write a paper so I need quite reliable sources with some kind of backup. Do not get me wrong, the guys that did this have a million times more expertise than me and are really good at what they do (it is also from Simon Roux from the DOE Joint Genome Institute, where you also work @Brian Bushnell) but I don't know about citing protocols.io in a published paper...

I however planned to use CheckV and VirSorter2 even before seeing the protocol and compare both results to each other and set some "custom/manual" thresholds to keep a sequence for further analysis or to discard it. One thing that I noticed is that in the contigs I have a lot of very small contigs like < 1kbp that still show a high coverage in its name. I know that this is the k-mer coverage but I noticed it is comparable (at least relatively) to the coverage in the covstats file you get from BBMap. Do you think it is a problem to discard these from the further analysis? I mean they are probably too short to really use but the high coverage makes me unsure. I also do not know about the right cutoff in my case. I thought about 1 or 5 kbp to cut off. The default cut off in VirSorter is already 1500 bp and I read it works better with cutoff 2,5-5kbp.

Am I right, when thinking that these are just smaller fragments of my phages but did just not fit into an assembly with the larger contigs? Therefore when calculating relative abundance that should not be an issue to discard them now, right? The abundance of all fragments of one species should, ideally, be the same for all fragments of this species.

ADD REPLY
0
Entering edit mode
14 months ago
DonPhager • 0

Hey guys, I am sorry to bother so much but I am still confused and not sure how to proceed further. I have completed QC, Trimming and Assembly. I am now set with all my contigs which I also checked with CheckV and VirSorter2. The biggest contigs >20-30 kbp in my 30 samples are mostly even marked as complete but I am also having a lot of small contigs (up to 9000 in some samples).

My plan was to set a length cutoff of 5000 bp and then assign Taxonomy to these with. BLASTn and then manually checking the samples and assign the contigs to vOTUs.

All my efforts with binning the contigs do not really work since MetaBat sometimes gives me bins of only 1 config (the longest) and then another 2nd bin with random contigs from one sample but does not bin the 2nd, 3rd or 4th longest config for example although they do not have bad quality in CheckV. What also makes the binning difficult is that I have 30 samples but really they are just 6 samples but with 5 biological replicates each so I am not sure if one should concatenate the contigs from these groups or if I can move on with all samples separately. Until now I did QC, Trimming and Assembly for every sample separately.

PS: Sorry for posting this as a reply first!

ADD COMMENT
0
Entering edit mode

DonPhager : If you wish to add more information you can edit the original post and add this information there. A new answer (submitted using Add answer dialog) should only contain an answer for the original question.

ADD REPLY

Login before adding your answer.

Traffic: 3430 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6