How could I filter a metagenomic assembly file (from MEGAHIT or metaSPAdes) for low-coverage/short contigs to enable faster downstream analysis (e.g. blastn)?
2
0
19 months ago
Lucas R.F. ▴ 20

The title is pretty self-explanatory: can I filter down a de novo metagenomic assembly to make downstream analysis like blastn run faster?

metagenomics blast spades megahit assembly • 1.1k views
3
19 months ago

Depending on the assembler, you get coverage information in some format: sometimes it is embedded in the contig name, sometimes it comes in separate files.

You would need to find out where the coverage information is, then write a FASTA parser that keeps only the contigs of interest.
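As a rough illustration of the parser idea, here is a minimal Python sketch. It assumes the coverage sits in the contig header, either in the SPAdes style (`NODE_1_length_1000_cov_12.5`) or the MEGAHIT style (`k141_1 flag=1 multi=3.0000 len=500`, where `multi` is treated as an approximate coverage). The function names and the default cutoffs are made up for the example, so check your own headers and pick thresholds that fit your data.

```python
import re

# Header patterns are assumptions about typical assembler output.
SPADES_RE = re.compile(r"length_(\d+)_cov_([\d.]+)")
MEGAHIT_RE = re.compile(r"multi=([\d.]+)")


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, seq = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(seq)
            header, seq = line[1:], []
        elif header is not None:
            seq.append(line)
    if header is not None:
        yield header, "".join(seq)


def header_coverage(header):
    """Return the coverage found in the header, or None if there is none."""
    match = SPADES_RE.search(header)
    if match:
        return float(match.group(2))
    match = MEGAHIT_RE.search(header)
    if match:
        return float(match.group(1))
    return None


def filter_contigs(lines, min_len=1000, min_cov=2.0):
    """Yield contigs at least min_len long whose header coverage is >= min_cov.

    Contigs without a recognisable coverage value are filtered on length only.
    """
    for header, seq in read_fasta(lines):
        if len(seq) < min_len:
            continue
        cov = header_coverage(header)
        if cov is not None and cov < min_cov:
            continue
        yield header, seq
```

To use it, open the assembly FASTA, pass the file handle to `filter_contigs`, and write each returned pair back out as `>header` followed by the sequence.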

Another solution would be to align your reads (or a subset of them) to the contigs and then compute the coverage with a tool like bedtools genomecov to determine how much coverage each contig has.
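A sketch of the second route, in Python. It assumes you have already mapped the reads, sorted the BAM file, and produced per-base depth with something like `bedtools genomecov -ibam sorted.bam -d`, which writes three tab-separated columns (contig, position, depth) for every position. The function names and the cutoff are hypothetical. If your depth file omits zero-depth positions, the mean will be overestimated.

```python
from collections import defaultdict


def mean_coverage(depth_lines):
    """Return {contig: mean depth} from tab-separated contig/position/depth lines."""
    total = defaultdict(int)      # summed depth per contig
    positions = defaultdict(int)  # number of reported positions per contig
    for line in depth_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        contig, depth = fields[0], int(fields[2])
        total[contig] += depth
        positions[contig] += 1
    return {contig: total[contig] / positions[contig] for contig in total}


def contigs_to_keep(depth_lines, min_cov=2.0):
    """Return the names of contigs whose mean depth is at least min_cov."""
    coverage = mean_coverage(depth_lines)
    return {contig for contig, cov in coverage.items() if cov >= min_cov}
```

The resulting set of names can then be used to pull the matching records out of the assembly FASTA.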

0
19 months ago
Mensur Dlakic ★ 28k

I get the rationale for removing short contigs, because it is sometimes difficult to bin them properly. Many tools, for example seqtk, can remove fasta sequences based on a given length cutoff.

I wouldn't remove anything based on low coverage, especially if the only reason is "to make downstream analysis like blastn run faster." But if you really want to do it, Istvan Albert already gave you a general explanation: all the reads are mapped to the assembly, and coverage is calculated for each contig. An automated way of doing it with a single Python script, with some dependencies installed, is described here. As to why removing contigs based on coverage may not be a good idea: many MAGs contain contigs with a mix of coverages, which can be seen here.
