Hello all,
I'm working on a project and would appreciate some guidance. We have sent samples for RNA-seq, and we want to determine whether host tissues are infected, and at what level, at various developmental stages. I have a host genome with annotation files (both GTF and GFF), and I have mapped data to it many times. But I'm not sure what the best approach is for detecting reads from the (bacterial) pathogen, for which I have only a single scaffold and no annotation file. Does anyone know if it would be best to:
A) combine the pathogen genome with the host genome and add a simple annotation entry to the host genome GTF/GFF file to account for the pathogen
B) map the reads to the host and then map the unmapped reads to the pathogen
C) utilize software that is capable of mapping to multiple genomes at once, such as BBSplit
D) use some approach that I haven't thought of
Any guidance that anyone could provide would be greatly appreciated. Thanks so much!
Try bbsplit.sh as a first pass, since you have both genomes. That said, is the host a eukaryote? This matters because total RNA-seq may be the way to go here: mRNA-seq (depletion or capture) is likely to miss bacterial RNA in the sample.
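For anyone who wants to script that first pass, here is a minimal sketch in Python that only assembles the command line. It assumes single-end reads and uses the in=, ref=, basename= and refstats= parameters as I understand them from the BBSplit help text; the file names and the helper function are made up for illustration, so check everything against the usage message printed by bbsplit.sh on your install.

```python
import shlex


def build_bbsplit_command(reads, host_fasta, pathogen_fasta,
                          out_pattern="split_%.fq",
                          stats_file="refstats.txt"):
    """Return an argv list for a single-end BBSplit run against two references.

    Hypothetical helper: it only builds the command, it does not run it.
    """
    return [
        "bbsplit.sh",
        f"in={reads}",
        # Comma-separated list of reference FASTA files.
        f"ref={host_fasta},{pathogen_fasta}",
        # '%' in the pattern is replaced by the reference name,
        # giving one output file per genome.
        f"basename={out_pattern}",
        # Per-reference summary of how many reads were assigned.
        f"refstats={stats_file}",
    ]


if __name__ == "__main__":
    # File names below are placeholders.
    cmd = build_bbsplit_command("sample.fq.gz", "host.fa", "pathogen.fa")
    print(shlex.join(cmd))
```

The refstats file is probably the quickest way to compare infection levels across samples, since it reports read counts per reference.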
Yes, host is a eukaryote. I'll give BBSplit a try. I haven't used it before but I've used BBMap and BBDuk quite a bit so I'm somewhat familiar with the software. Thank you!
The issue with approach B is that most aligners will try to force reads to align to whatever reference they are given, so many pathogen reads will align to the host instead of ending up unmapped. It is better to include your best-guess pathogen reference so the aligner has it to work with.
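If you go with approach A, the "simple annotation entry" can be generated rather than typed by hand. Below is a rough sketch in Python that reads the pathogen FASTA, gets the scaffold length, and builds one GTF exon line spanning the whole scaffold. The function names, the "custom" source field and the "pathogen_" ID prefix are all invented for illustration. It also assumes the scaffold name does not collide with any host sequence name, and that your counting tool only needs an exon feature with gene_id and transcript_id, so check what your counter expects.

```python
def read_fasta_lengths(path):
    """Return {sequence_name: length} for every record in a FASTA file."""
    lengths = {}
    name = None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # Keep only the first word of the header, as aligners do.
                name = line[1:].split()[0]
                lengths[name] = 0
            elif name is not None:
                lengths[name] += len(line)
    return lengths


def whole_scaffold_gtf(lengths, source="custom", prefix="pathogen"):
    """Build one GTF exon line per sequence, covering positions 1..length."""
    lines = []
    for seqname, length in lengths.items():
        gene_id = f"{prefix}_{seqname}"
        attrs = f'gene_id "{gene_id}"; transcript_id "{gene_id}_t1";'
        fields = [seqname, source, "exon", "1", str(length),
                  ".", "+", ".", attrs]
        lines.append("\t".join(fields))
    return lines


if __name__ == "__main__":
    # "pathogen.fa" is a placeholder file name.
    for gtf_line in whole_scaffold_gtf(read_fasta_lengths("pathogen.fa")):
        print(gtf_line)
```

Append the printed line(s) to a copy of the host GTF and concatenate the two FASTA files before building the index. One caveat: the entry sits on the + strand only, so with strand-specific counting roughly half of the bacterial reads could be dropped; unstranded counting for this feature, or a second entry on the - strand, would get around that.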