The title is pretty self-explanatory: can I filter down a de novo metagenomic assembly to make downstream analyses like blastn run faster?
Depending on the tool, you get coverage information in some format; sometimes it is embedded in the contig name, sometimes it comes in separate files. You would need to find out where the coverage information is, then write a FASTA parser to keep only the contigs of interest.
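As a minimal sketch of that parser, assuming SPAdes-style contig names that embed coverage (e.g. `>NODE_1_length_5000_cov_12.3`; the regex would need adjusting for other assemblers):

```python
import re

# Assumes SPAdes-style headers such as >NODE_1_length_5000_cov_12.3
COV_RE = re.compile(r"_cov_([0-9.]+)")

def read_fasta(lines):
    """Yield (header, sequence) records from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line, []
        elif line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

def filter_by_coverage(lines, min_cov):
    """Keep only contigs whose header embeds a coverage value >= min_cov."""
    for header, seq in read_fasta(lines):
        m = COV_RE.search(header)
        if m and float(m.group(1)) >= min_cov:
            yield header, seq

fasta = [">NODE_1_length_10_cov_12.5", "ACGTACGTAC",
         ">NODE_2_length_8_cov_0.8", "ACGTACGT"]
kept = list(filter_by_coverage(fasta, 5.0))  # drops the low-coverage contig
```

In practice you would pass an open file handle instead of the list of strings used here for illustration.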
Another solution would be to align your reads (or a subset of them) to the contigs, then compute per-contig coverage with a tool like bedtools genomecov.
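Once you have the genomecov output, summarizing it per contig is straightforward. A sketch, assuming bedGraph output from `bedtools genomecov -bga` on a sorted BAM of reads mapped to the contigs (`-bga` also reports zero-depth intervals, so the interval lengths sum to each contig's length):

```python
from collections import defaultdict

def mean_coverage_per_contig(bedgraph_lines):
    """Compute mean depth per contig from `bedtools genomecov -bga` output.

    Each line is tab-separated: contig, start, end, depth. Mean depth is
    the depth-weighted interval length divided by the contig length.
    """
    depth_sum = defaultdict(int)
    length = defaultdict(int)
    for line in bedgraph_lines:
        contig, start, end, depth = line.rstrip("\n").split("\t")
        span = int(end) - int(start)
        depth_sum[contig] += span * int(depth)
        length[contig] += span
    return {c: depth_sum[c] / length[c] for c in length}

bedgraph = ["c1\t0\t5\t10", "c1\t5\t10\t0", "c2\t0\t4\t3"]
means = mean_coverage_per_contig(bedgraph)  # {'c1': 5.0, 'c2': 3.0}
```

You could then feed the resulting dictionary into a FASTA filter to drop contigs below whatever coverage cutoff you settle on.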
I get the rationale for removing short contigs, since they are sometimes difficult to bin properly. Many tools, seqtk for example, can remove FASTA sequences below a given length cutoff.
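The length filter itself is trivial; this sketch does what `seqtk seq -L <cutoff>` does, operating on (header, sequence) tuples such as those produced by a FASTA parser:

```python
def filter_by_length(records, min_len):
    """Drop sequences shorter than min_len, mirroring `seqtk seq -L`.

    `records` is an iterable of (header, sequence) tuples.
    """
    for header, seq in records:
        if len(seq) >= min_len:
            yield header, seq

records = [(">long_contig", "ACGT" * 200),   # 800 bp, kept
           (">short_contig", "ACGT")]        # 4 bp, dropped
kept = list(filter_by_length(records, 500))
```

For real data, seqtk itself (`seqtk seq -L 500 contigs.fa > filtered.fa`) is the simpler choice; the snippet is only to show how little is involved.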
I wouldn't remove anything based on low coverage, especially if the only reason is "to make downstream analysis like blastn run faster." But if you really want to do it, Istvan Albert already gave you the general approach: all the reads are mapped to the assembly, and coverage can be calculated for each contig. An automated way of doing it with a single Python script - and with some dependencies installed - is described here. As for why removing contigs based on coverage may not be a good idea: many MAGs contain a mix of coverages, which can be seen here.