Hello,
My lab recently finished an experiment in which a community of three bacterial species was grown together and then their RNA was Illumina sequenced. From my perspective, this means that I have .fastq files containing reads from all three species. The plan is thus to separate the species during alignment. I have quality reference genomes for each of the three species.
I have used STAR previously to align to a single bacterial genome, which worked great. However, while STAR can take multiple .fasta files for the reference, it can only accept a single .gtf annotation. I'm wondering what to do. I see two main possibilities:
1. Do three separate alignments of each dataset, each against a different reference genome. The downside here is that many of the genes in these species are well (but not perfectly) conserved, so I think it is likely that this will result in many false positives, where reads are assigned to the wrong species. I guess this could be dealt with downstream, by finding all reads that mapped to more than one reference and 'giving' them to the species with the highest alignment confidence, but this seems messy.
2. Combine the gtfs from my three reference species into a single gtf. I think this is the better option, but I have never done it before. cellranger (https://github.com/10XGenomics/cellranger) seems to do this, but I can't find reviews of the package. Is it as simple as doing a cat command, then scrolling through the result to delete the headers of the second and third gtf?
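To make the second option concrete, here is a minimal Python sketch of the "cat and drop the extra headers" idea. It is only an illustration under assumptions, not a tested pipeline: the function name `merge_gtf_lines` and the species tags are hypothetical, and it assumes standard tab-separated gtf records with `#` comment lines as headers. It also prefixes each sequence name with a species tag, on the assumption that contig names might collide between genomes.

```python
def merge_gtf_lines(gtfs_by_species):
    """Merge several gtf files given as {species_tag: list_of_lines}.

    Hypothetical helper for illustration. Comment/header lines ('#...') are
    kept only from the first species; every feature line gets its sequence
    name (column 1) prefixed with the species tag.
    """
    merged = []
    for index, (tag, lines) in enumerate(gtfs_by_species.items()):
        for line in lines:
            line = line.rstrip("\n")
            if not line:
                continue  # skip blank lines
            if line.startswith("#"):
                if index == 0:
                    merged.append(line)  # keep headers of the first file only
                continue
            fields = line.split("\t")
            fields[0] = f"{tag}_{fields[0]}"  # e.g. chr1 -> spA_chr1
            merged.append("\t".join(fields))
    return merged


if __name__ == "__main__":
    # Tiny made-up records, just to show the shape of the input and output.
    species_a = ['#!genome-build A', 'chr1\tsrc\tgene\t1\t100\t.\t+\t.\tgene_id "g1";']
    species_b = ['#!genome-build B', 'chr1\tsrc\tgene\t5\t50\t.\t-\t.\tgene_id "g1";']
    for row in merge_gtf_lines({"spA": species_a, "spB": species_b}):
        print(row)
```

If the sequence names are prefixed like this, the headers in the combined .fasta would need the same prefixes, since the names in column 1 of the annotation have to match the sequence names in the reference. The sketch does not touch `gene_id` values, so identical gene IDs across species would still need handling.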
Has anyone else combined gtfs successfully for this purpose: to align community-derived RNA-Seq reads against multiple prokaryote reference genomes? I see there is a similar question here from five years ago, which has no answers: Combining Gtf Files
Thank you!
Isn't that (technically speaking) the same as a meta-transcriptomic analysis (in this case with 3 species)? I would check how people in this field typically align their RNA-seq data.
That's a good point. I will look into that as well.
Hi Chaco001,
How did you perform the analysis? Can you please explain it?
I have RNA-Seq data for a host and a parasite together, and I want to eliminate reads that map to both genomes to avoid over-quantification.
Technically it's called dual RNA-seq analysis. Here is a link to the reference paper I am following:
https://stm.sciencemag.org/content/scitransmed/suppl/2018/06/25/10.447.eaar3619.DC1/aar3619_SM.pdf
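For what it's worth, the "eliminate reads mapping to both genomes" step could be sketched roughly as below. This assumes the reads were aligned to each genome separately and that the IDs of the aligned reads have already been collected into two sets; the function name `split_unique_reads` and the read IDs are made up for illustration, and this is not taken from the linked paper.

```python
def split_unique_reads(host_read_ids, parasite_read_ids):
    """Separate reads that align to only one genome from reads that align to both.

    Hypothetical helper: each argument is a set of read IDs with at least one
    alignment to that genome. Returns (host_only, parasite_only, shared).
    """
    shared = host_read_ids & parasite_read_ids    # ambiguous reads, hit both genomes
    host_only = host_read_ids - shared            # safe to count for the host
    parasite_only = parasite_read_ids - shared    # safe to count for the parasite
    return host_only, parasite_only, shared


if __name__ == "__main__":
    host = {"read1", "read2", "read3"}
    parasite = {"read3", "read4"}
    host_only, parasite_only, shared = split_unique_reads(host, parasite)
    print(len(host_only), len(parasite_only), len(shared))
```

Discarding every shared read is the strictest choice; an alternative would be to keep each shared read for whichever genome gives the better alignment, as suggested in the original question.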