Hi, we have 17 Gb of Illumina NGS shotgun metagenome data (40 million reads, 101 bp paired-end). I am planning to assemble the metagenome with MetaVelvet and Ray, separately. I have already found very large numbers of bacterial transposable elements (e.g. transposases and integrases) in the metagenome. My main goals are to locate those elements in my contigs, see which other genes lie on the same contigs, and identify the genes within the transposable elements. What is the best strategy for this? Would masking the transposable elements first help the assembly produce longer contigs?
This sounds like an interesting study. I would not mask: it likely won't improve the assembly, it will be extremely slow, and the process will be inefficient unless you have a repeat library from closely related species. I would try the assembly approach first.
Another option is to take a clustering approach followed by assembly, which may be faster. I developed Transposome for identifying TEs from unassembled reads using a clustering approach but there is no assembly step. It would be straightforward to assemble the separate clusters, but I have not tested this with metagenomics data. I have combined data from 15 different plant species and found that it generated species-specific clusters, but this may or may not work as well for this data. I'd be interested to know actually, and it wouldn't be hard to test with a small sample of reads. Alternatively, you may want to use a thoroughly tested method for identifying metagenomic clusters and assemble those separately. That would probably be my second choice unless the method turns out to be impractical in terms of analysis time or resource usage.
Thanks for your suggestions. I just had a look at the abstract of your Transposome paper. It sounds like it will be an important tool for people dealing with transposable elements in eukaryotes. Anyway, I will play with this tool to check how it performs on metagenomic data.
Yes, it was intended for eukaryotes, but I'd be interested to know what you find. You'll get the best results if you can somehow construct a repeat library of bacterial TEs; that will be important. Also, I would start with a couple hundred thousand reads to see how it performs (results and memory usage), and then go from there. You won't need to (or be able to) use all that data, so generate random samples and work on those. You can email me or report issues on GitHub if you have specific questions about that tool.
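To make the "generate random samples" suggestion concrete, here is a minimal sketch of drawing a fixed number of read pairs at random while keeping mates together. It uses reservoir sampling so the full data set never has to sit in memory. The function names (`parse_fastq`, `sample_pairs`) and the seed are illustrative assumptions, not part of Transposome, and the parser assumes plain four-line FASTQ records with both files in the same read order.

```python
import random

def parse_fastq(lines):
    """Yield (header, seq, plus, qual) tuples from four-line FASTQ records.

    Assumes unwrapped FASTQ; an incomplete trailing record is dropped.
    """
    it = iter(lines)
    # zip over the same iterator four times groups consecutive lines by four
    for record in zip(it, it, it, it):
        yield record

def sample_pairs(pairs, n, seed=42):
    """Reservoir-sample n read pairs from an iterable of (mate1, mate2)."""
    rng = random.Random(seed)
    reservoir = []
    for i, pair in enumerate(pairs):
        if i < n:
            reservoir.append(pair)
        else:
            # Keep the new pair with probability n / (i + 1)
            j = rng.randint(0, i)
            if j < n:
                reservoir[j] = pair
    return reservoir

def write_pairs(pairs, out1_path, out2_path):
    """Write sampled pairs back out as two FASTQ files."""
    with open(out1_path, "w") as o1, open(out2_path, "w") as o2:
        for mate1, mate2 in pairs:
            o1.writelines(mate1)
            o2.writelines(mate2)
```

With two open file handles `r1` and `r2`, a call such as `sample_pairs(zip(parse_fastq(r1), parse_fastq(r2)), n=200000)` would give a sample of roughly the size suggested above. Changing the seed gives independent samples for repeated test runs.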
"Will masking the transposable elements first help the assembly to get longer contigs?"
I would recommend removing any reads that map to sequences of known or relevant transposases prior to assembly. This way you will get shorter but much more reliable contigs. Transposase genes with identical sequences may be present in bacteria of different species, even distantly related ones, due to horizontal gene transfer. Even if you assemble reads from a single isolate, you will frequently get misassembled, chimeric contigs around transposase genes.
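As a rough sketch of what that pre-filtering step could look like: assume you already have a set of read IDs that mapped to transposase references (produced by whatever aligner you use; that step is not shown and is an assumption here). The code below splits the read pairs into those kept for assembly and those set aside, rather than discarding anything, so the transposase-associated reads remain available for later analysis. `read_id` and `partition_pairs` are hypothetical helper names.

```python
def read_id(header):
    """Return the pair-level ID from a FASTQ header line.

    Drops the leading '@', any description after whitespace,
    and a trailing /1 or /2 mate suffix.
    """
    name = header.strip().lstrip("@").split()[0]
    if name.endswith(("/1", "/2")):
        name = name[:-2]
    return name

def partition_pairs(pairs, flagged_ids):
    """Split read pairs into (kept, set_aside).

    pairs: iterable of (mate1, mate2), each a (header, seq, plus, qual) tuple.
    flagged_ids: set of pair-level IDs where at least one mate mapped to a
    transposase reference (assumed to be computed beforehand).
    """
    kept, set_aside = [], []
    for mate1, mate2 in pairs:
        if read_id(mate1[0]) in flagged_ids:
            set_aside.append((mate1, mate2))
        else:
            kept.append((mate1, mate2))
    return kept, set_aside
```

Both mates are moved together, so the kept set stays properly paired for the assembler.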
(written 9.8 years ago by piet; updated 2.6 years ago by Ram)
How will removing TEs prior to assembly help in identifying TEs? You are creating artifacts by removing TEs before doing the assembly, so I don't think this is more reliable.
Removing reads that map to repeats will help the assembler avoid misassemblies. The real bias comes from the fact that Illumina reads are much shorter than the expected repeats (transposase genes are typically 500 to 1500 nt). To get the most out of your limited data, you may run a second assembly after you have identified sites of unusually high coverage in the first assembly.
Decision making in the assembler is heavily based on coverage. The assembler does not know whether a particular region with (apparently) high coverage encodes a transposase. If you identify that region as a transposase, you add knowledge to the assembly process. That is the basic idea. How to accomplish this in practice will depend on many details of your project. palc explicitly wants to analyze synteny around transposase genes in metagenomic data. I would not trust the results of standard assembly procedures in that case.
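The "sites of unusually high coverage" step could be approximated at the contig level as follows. This is only a sketch under stated assumptions: it takes a mapping of contig name to mean read coverage (computed elsewhere from a first-pass assembly) and flags contigs whose coverage exceeds a multiple of the median. The function name and the default factor of 3 are arbitrary illustrations, and in a metagenome, coverage also varies with organism abundance, so flagged contigs are candidates to inspect rather than confirmed repeats.

```python
from statistics import median

def flag_high_coverage(contig_coverage, factor=3.0):
    """Return names of contigs whose mean coverage exceeds factor * median.

    contig_coverage: dict mapping contig name -> mean read coverage.
    factor: illustrative cutoff multiplier (an assumption; tune per data set).
    """
    if not contig_coverage:
        return []
    threshold = factor * median(contig_coverage.values())
    return sorted(
        name for name, cov in contig_coverage.items() if cov > threshold
    )
```

For example, with coverages of 10, 12, 11 and 95 for four contigs, the median is 11.5, the threshold is 34.5, and only the contig at 95 is flagged.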
(written 9.7 years ago by piet; updated 2.6 years ago by Ram)
I agree that coverage is important, and it is easy to find the best assembly by sampling at different levels of coverage. I don't agree that systematically removing reads that map to transposases is the best approach, especially when the goal of the study is to look at TEs. I'm afraid that procedure would create technical artifacts for the study, but either way, all of these things could be tested.