Question

Remove specific reads from a metagenomic sequencing run

0

4 months ago

SushiRoll ▴ 140

Hi everyone!

I'm processing the sequencing data from an experiment that involved spiking an unknown community with a known and sequenced E. coli. The DNA was short read sequenced and now we would be interested in analysing the community without considering the E. coli that was artificially introduced. I was thinking of running bbduk to get rid of the E. coli reads as follows:

in=reads_1.fq in2=reads_2.fq ref=E_coli_reference.fasta out=filtered_reads.fq

Will this do? I was also wondering about bbsplit. My concern is that our reference is not a single contig but a multi-FASTA file, and I'm not certain whether one tool has an advantage over the other.

Many many thanks!

metagenomics • 306 views

updated 4 months ago by colindaven 7.0k • written 4 months ago by SushiRoll ▴ 140

score 1 · Answer 1 · 2024-08-15

1

4 months ago

GenoMax 148k

The DNA was short read sequenced and now we would be interested in analysing the community without considering the E. coli that was artificially introduced.

You probably appreciate that this process will not be perfect because of the nature of the technology, i.e. short reads. You are likely to find sequence similarities by chance (unless your community is drastically different from E. coli). I am assuming the E. coli library is not molecule tagged since you are using these options.

Be careful about whether you use outu= or outm= with bbduk.sh; which one you need depends on how you want to collect the filtered reads.
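As a rough sketch of the difference: in bbduk.sh, outu= receives the reads that do not match the reference, while outm= receives the reads that do. For removing the spiked-in E. coli, the community reads would therefore be the outu= output. The output file names below are hypothetical, and k=31 is an assumed k-mer length, not something stated in this thread.

```shell
# Sketch only: output names are hypothetical and k=31 is an assumed value.
# outu=/outu2= collect pairs with no k-mer match to the reference (the community);
# outm=/outm2= collect pairs that match the reference (the spiked-in E. coli).
BBDUK_ARGS="in=reads_1.fq in2=reads_2.fq ref=E_coli_reference.fasta outu=community_1.fq outu2=community_2.fq outm=ecoli_1.fq outm2=ecoli_2.fq k=31"

# Show the command that would be run.
echo "bbduk.sh $BBDUK_ARGS"

# Run it only when bbduk.sh is installed and the input file exists.
if command -v bbduk.sh >/dev/null 2>&1 && [ -f reads_1.fq ]; then
  # Word splitting of $BBDUK_ARGS is intentional here.
  bbduk.sh $BBDUK_ARGS
fi
```

Writing both outputs makes it easy to check afterwards how many reads were assigned to E. coli before discarding them.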

Using bbsplit.sh will allow you to keep reads that multi-map or discard them altogether (ambiguous2= option). With bbduk you are going to filter out all reads that match E. coli.
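A minimal sketch of the bbsplit.sh route, assuming the same input files as in the question. The output names are hypothetical. As far as I understand it, ambiguous2= only affects reads that map to more than one reference, so it matters most when several references are supplied; toss is shown here as one possible setting, not a recommendation.

```shell
# Sketch only: output names are hypothetical; ambiguous2=toss is one possible choice.
# basename= names the per-reference output (% = reference name, # = read number);
# outu1=/outu2= collect pairs that did not map to any reference (the community).
BBSPLIT_ARGS="in=reads_1.fq in2=reads_2.fq ref=E_coli_reference.fasta basename=mapped_%_#.fq outu1=community_1.fq outu2=community_2.fq ambiguous2=toss"

# Show the command that would be run.
echo "bbsplit.sh $BBSPLIT_ARGS"

# Run it only when bbsplit.sh is installed and the input file exists.
if command -v bbsplit.sh >/dev/null 2>&1 && [ -f reads_1.fq ]; then
  # Word splitting of $BBSPLIT_ARGS is intentional here.
  bbsplit.sh $BBSPLIT_ARGS
fi
```

Unlike bbduk.sh, which filters on shared k-mers, bbsplit.sh aligns the reads, so the two tools may not remove exactly the same set of reads.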

0

Hi GenoMax

Thanks for the reply. Yes, I'm aware that I will falsely discard some reads, but as you say that's inherent to short reads, and I don't think there's much I can do to avoid it other than sequencing with a different approach. I still think this is the best option I have at the moment, so I'll stick with bbduk and keep the risk in mind. Regarding the reference: would a multi-FASTA file work, or would it be better to merge the contigs with NNN spacers to get a FASTA file with just one sequence? Thanks for the outu= and outm= tip, by the way.

4 months ago by SushiRoll ▴ 140