Question

Filter out reads from multiple hosts

9 weeks ago

Luca Arbore • 0

Hi,

I have paired-end (PE) sequenced a batch of ticks that were attached to and had fed on mammals. I am trying to assemble and classify virus genomes from these data and also to detect non-viral pathogens, mainly bacteria. As a first step, after trimming and deduplication, I would like to filter out reads from the tick genomes before de novo assembly.

Because my dataset contains several species of ticks (hosts), my main question is how to automate host read removal. Should I compile a list of host genomes and map each sample to the whole list, or should the data be grouped according to tick species?

My other question is how to set parameters for detecting virus reads derived from endogenous elements. For example, >50% query cover and >80% identity are common minimum thresholds used to filter out reads associated with viral elements integrated into host genomes.

Many thanks

genome metagenome virus host • 431 views

score 0 · Answer 1 · 2024-07-13

You can use bbsplit.sh from the BBMap suite for this purpose, as it allows binning reads to multiple references. I have an example of how to do this here: "BBSplit syntax for generating builds for the reference genome and how to call different builds".

Because my dataset contains several species of ticks (hosts)

Depending on how similar those species are, the reads in your dataset are going to multi-map. bbsplit has options that allow you to handle these in specific ways:

ambiguous2=<best>   Set behavior only for reads that map ambiguously to multiple different references.
                    Normal 'ambiguous=' controls behavior on all ambiguous reads;
                    Ambiguous2 excludes reads that map ambiguously within a single reference.
                       best   (use the first best site)
                       toss   (consider unmapped)
                       all   (write a copy to the output for each reference to which it maps)
                       split   (write a copy to the AMBIGUOUS_ output for each reference to which it maps)
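
To automate this over many samples, something along these lines might work as a starting point. It is only a sketch: the function names, file names and output layout are made up for illustration, and it only builds the bbsplit.sh command lines rather than running them. The in1/in2, ref, basename, outu1/outu2 and ambiguous2 parameters are bbsplit.sh options, but please check the exact basename pattern (% for the reference name, # for the read number) against the help text of your installed version. The unmapped (outu) files would be the reads to carry forward to de novo assembly.

```python
from pathlib import Path


def build_bbsplit_cmd(r1, r2, refs, outdir, ambiguous2="best"):
    """Build a bbsplit.sh argument list for one paired-end sample.

    All file names here are hypothetical placeholders.
    """
    outdir = Path(outdir)
    return [
        "bbsplit.sh",
        f"in1={r1}",
        f"in2={r2}",
        # All candidate tick genomes, comma-separated, so each sample is
        # binned against the whole host list in a single run.
        "ref=" + ",".join(str(r) for r in refs),
        # One output per reference; assumed pattern, verify with your version.
        f"basename={outdir / 'host_%_R#.fq.gz'}",
        # Reads that map to none of the hosts: input for assembly.
        f"outu1={outdir / 'nonhost_R1.fq.gz'}",
        f"outu2={outdir / 'nonhost_R2.fq.gz'}",
        f"ambiguous2={ambiguous2}",
    ]


def commands_for_samples(samples, refs, out_root):
    """Map each sample name to its command; samples is {name: (r1, r2)}."""
    return {
        name: build_bbsplit_cmd(r1, r2, refs, Path(out_root) / name)
        for name, (r1, r2) in samples.items()
    }


# To actually run a command (output directories must exist beforehand):
#   import subprocess
#   subprocess.run(cmd, check=True)
```

Note that, going by the help text above, ambiguous2=toss treats reads that map to more than one reference as unmapped, so they would presumably end up with the non-host reads. For host removal, best or all is probably the safer choice.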