Question

How can I filter a metagenomic assembly (from MEGAHIT or metaSPAdes) to remove low-coverage or short contigs and speed up downstream analysis (e.g. blastn)?

20 months ago

Lucas R.F. ▴ 20

The title is pretty self-explanatory: can I filter down a de novo metagenomic assembly to make downstream analyses like blastn run faster?

metagenomics blast spades megahit assembly • 1.2k views

updated 20 months ago by Mensur Dlakic ★ 28k • written 20 months ago by Lucas R.F. ▴ 20

20 months ago

Mensur Dlakic ★ 28k

I get the rationale for removing short contigs, because it is sometimes difficult to bin them properly. Many tools, for example seqtk, can remove fasta sequences based on a given length cutoff.
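If you would rather not install anything, a length cutoff is easy to sketch in plain Python. This is only a minimal illustration of the idea (roughly what `seqtk seq -L` does), not a replacement for a tested tool; the function names and the demo contigs are made up for the example.

```python
import io

def read_fasta(handle):
    """Yield (header, sequence) tuples from an open FASTA file handle."""
    header, chunks = None, []
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)

def filter_by_length(records, min_len):
    """Keep only records whose sequence is at least min_len bases long."""
    for header, seq in records:
        if len(seq) >= min_len:
            yield header, seq

if __name__ == "__main__":
    # Hypothetical toy input; in practice pass open("contigs.fa") instead.
    demo = io.StringIO(">contig_1\nACGTACGTAC\n>contig_2\nACG\n>contig_3\nACGTA\nCGT\n")
    for header, seq in filter_by_length(read_fasta(demo), min_len=8):
        print(f">{header}\n{seq}")
```

The cutoff itself is a judgment call; pick it based on what your binning or search step can actually use.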

I wouldn't remove anything based on low coverage, especially if the only reason is "to make downstream analysis like blastn run faster." But if you really want to do it, Istvan Albert already gave you a general explanation: all the reads are mapped to the assembly, and coverage can be calculated for each contig. An automated way of doing it with a single Python script (with some dependencies installed) is described here. As to why removing contigs based on coverage may not be a good idea: many MAGs contain contigs with a mix of coverages, which can be seen here, so a coverage cutoff can discard parts of genomes you would otherwise want to keep.

score 3 · Accepted Answer · 2023-04-25

Depending on the tool, you get coverage information in some format: sometimes it is embedded in the contig name, and sometimes you get separate files that contain it.

You would need to find out where that information is, then write a FASTA parser to keep only the contigs of interest.
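As a rough sketch of the "embedded in the contig name" case: the two header layouts below are what I would expect from metaSPAdes and MEGAHIT, but treat them as assumptions and check your own file first. Note also that these values are k-mer based (`cov` in SPAdes, `multi` in MEGAHIT), not read depth. The thresholds and function names are hypothetical.

```python
import re

# Assumed header layouts (verify against your own assembly):
#   SPAdes/metaSPAdes: NODE_1_length_5000_cov_12.5
#   MEGAHIT:           k141_1 flag=1 multi=3.0000 len=500
SPADES_RE = re.compile(r"_length_(\d+)_cov_([0-9.]+)")
MEGAHIT_RE = re.compile(r"multi=([0-9.]+)\s+len=(\d+)")

def parse_header(header):
    """Return (length, coverage) from a contig header, or None if unrecognized."""
    m = SPADES_RE.search(header)
    if m:
        return int(m.group(1)), float(m.group(2))
    m = MEGAHIT_RE.search(header)
    if m:
        return int(m.group(2)), float(m.group(1))
    return None

def keep_contig(header, min_len=1000, min_cov=2.0):
    """Decide whether a contig passes both cutoffs (defaults are arbitrary)."""
    info = parse_header(header)
    if info is None:
        # Unknown header format: keep the contig rather than silently drop it.
        return True
    length, cov = info
    return length >= min_len and cov >= min_cov

if __name__ == "__main__":
    for h in ["NODE_1_length_5000_cov_12.5",
              "NODE_2_length_300_cov_50.0",
              "k141_1 flag=1 multi=3.0000 len=500"]:
        print(h, keep_contig(h))
```

You would call `keep_contig` on each header while streaming through the FASTA file and write out only the records that pass.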

Another solution would be to align your reads (or a subset of them) to the contigs and then compute the coverage with a tool like bedtools genomecov to determine how much coverage each contig has.
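To turn per-base depth into one number per contig, something like the following could work. It assumes you have already run `bedtools genomecov -d -ibam sorted.bam > depth.tsv` (the file names are placeholders), which should give three tab-separated columns: contig, 1-based position, depth. The function name and the cutoff of 2.0 are only for illustration.

```python
from collections import defaultdict

def mean_depth_per_contig(lines):
    """Average per-base depth for each contig from 'contig<TAB>pos<TAB>depth' lines."""
    total = defaultdict(int)   # summed depth per contig
    bases = defaultdict(int)   # number of positions seen per contig
    for line in lines:
        if not line.strip():
            continue
        contig, _pos, depth = line.rstrip("\n").split("\t")
        total[contig] += int(depth)
        bases[contig] += 1
    return {c: total[c] / bases[c] for c in total}

if __name__ == "__main__":
    # Hypothetical toy input; in practice iterate over open("depth.tsv").
    demo = ["c1\t1\t4\n", "c1\t2\t6\n", "c2\t1\t0\n", "c2\t2\t1\n"]
    depths = mean_depth_per_contig(demo)
    keep = sorted(c for c, d in depths.items() if d >= 2.0)
    print(keep)
```

The resulting list of contig names can then drive the FASTA filtering step. The `-d` output has one line per base, so for a large assembly the file gets big; streaming it line by line as above keeps memory use low.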