Heyya,
I have some bacterial RNA-seq data to analyse. The goal is to get a list of all expressed genes together with their read depth. I have 5 different experimental conditions, each in triplicate.
Although the bacterial strain is the same throughout, a pET system has been used, so each experimental condition has a different plasmid sequence (one per construct). I'm going to create a custom genome (with two "chromosomes") for each of them in order to perform the alignment. Would this be a good approach? In the end, I'm going to compare gene expression between samples (after I get the list and read depth of all expressed genes). Would that work, given that the reference genome is going to change slightly every time (due to the different plasmid construct used)?
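To make the two-"chromosome" idea concrete, here is a minimal Python sketch of how one such custom reference could be assembled: it simply concatenates the host chromosome FASTA and one plasmid FASTA into a single multi-record FASTA. The function names (`read_fasta`, `build_reference`) and the sequence names in the example are hypothetical, and this is only one way of doing it, not a recommended pipeline.

```python
def read_fasta(text):
    """Parse FASTA text into a list of (name, sequence) tuples.

    Only the first word of each header is kept as the name, because
    aligners typically use that word as the reference name.
    """
    records, name, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:].split()[0], []
        else:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def build_reference(chromosome_fasta, plasmid_fasta, width=60):
    """Return one FASTA string holding the chromosome and plasmid records."""
    records = read_fasta(chromosome_fasta) + read_fasta(plasmid_fasta)
    names = [name for name, _ in records]
    # Duplicate names would make per-reference counts ambiguous later on.
    if len(set(names)) != len(names):
        raise ValueError("duplicate sequence names in combined reference")
    lines = []
    for name, seq in records:
        lines.append(">" + name)
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(lines) + "\n"


# Toy example with made-up sequences and names.
combined = build_reference(">host_chr some description\nACGT\nAC\n",
                           ">pET_construct1\nGGCC\n")
print(combined)
```

In practice the two inputs would be read from files and the result written out once per construct, so each condition gets its own reference with an identical chromosome record.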
What alignment tool should I use? HISAT2 or STAR would work only if I restrict the maximum splice length to something fairly short (bacterial transcripts are essentially unspliced). However, I have read that some prokaryote-specific tools are available. I was wondering, what about just using Bowtie2, which is not a splice-aware aligner? Would that be alright?
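For reference, a plain Bowtie2 run against such a custom reference could look like the sketch below. It only builds the two command lines (`bowtie2-build` to index the reference, then `bowtie2` to align) without executing them. The helper name `bowtie2_commands` and all file names are hypothetical, and whether the reads are single-end or paired-end is an assumption that has to match the actual library.

```python
import shlex


def bowtie2_commands(reference_fasta, index_prefix, out_sam,
                     reads_1, reads_2=None, threads=4):
    """Return (build_cmd, align_cmd) as argument lists for Bowtie2.

    If reads_2 is None the reads are treated as single-end (-U),
    otherwise as a mate pair (-1 / -2).
    """
    build_cmd = ["bowtie2-build", reference_fasta, index_prefix]
    align_cmd = ["bowtie2", "-p", str(threads), "-x", index_prefix]
    if reads_2 is None:
        align_cmd += ["-U", reads_1]
    else:
        align_cmd += ["-1", reads_1, "-2", reads_2]
    align_cmd += ["-S", out_sam]
    return build_cmd, align_cmd


# Hypothetical file names, printed as shell-ready commands.
build_cmd, align_cmd = bowtie2_commands(
    "construct1_ref.fa", "construct1_idx", "sample1.sam",
    "sample1_R1.fastq.gz", "sample1_R2.fastq.gz")
print(shlex.join(build_cmd))
print(shlex.join(align_cmd))
```

The lists can be passed to `subprocess.run(cmd, check=True)` once Bowtie2 is installed. Using the same options for every sample keeps any mapping bias consistent across conditions.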
This question is really specific, but I'm going to ask anyway: I work with ClearColi, which is an E. coli mutant. However, I can't find the ClearColi genome. Would it be good enough to use the E. coli (parent strain) genome? My guess is that I have to read the ClearColi paper and decide whether there are significant changes in the genome or just a couple of mutations. However, assuming I don't have that information, will it be alright to use the parent strain (E. coli) genome?
Thanks. As usual, any input is welcome.
Hi Joe. Thank you for your response.
I am aware the comparison might be tricky based on those experimental conditions. However, personally, I am more interested in the list and read depth of all expressed genes for each bacterial construct. This can't be that noisy, can it?
Yes, the plasmid has the same backbone but different inserts. Each construct expresses a different class of the same protein. We are interested in which class is more abundantly transcribed.
Regarding differential expression, I will deal with that issue separately, once I get there.
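One point about comparing read depth between constructs: if the inserts differ in length, raw read counts are not directly comparable, so some length and library-size normalisation is usually applied first. The sketch below computes TPM (transcripts per million) from per-gene counts and lengths. TPM is a technique brought in here for illustration and is not named anywhere in this thread; the function name `tpm` and the toy numbers are made up.

```python
def tpm(counts, lengths):
    """Convert raw read counts to TPM.

    counts  -- dict mapping gene name to read count
    lengths -- dict mapping gene name to gene length in bases
    """
    # Reads per base for each gene (length normalisation).
    rates = {gene: counts[gene] / lengths[gene] for gene in counts}
    total = sum(rates.values())
    if total == 0:
        return {gene: 0.0 for gene in counts}
    # Scale so that all values in one sample sum to one million.
    return {gene: rate / total * 1e6 for gene, rate in rates.items()}


# Toy example: equal counts, but insert_b is twice as long as insert_a,
# so insert_a ends up with twice the TPM of insert_b.
values = tpm({"insert_a": 100, "insert_b": 100},
             {"insert_a": 1000, "insert_b": 2000})
print(values)
```

Values like these describe relative abundance within one sample only; comparing across conditions is the differential expression problem and needs the replicates.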
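Something else you might be able to do, if your inserts are suitably diverse, would be to create one artificial reference containing all the sequences in question for all conditions. So you could have something like a single reference with the genome plus insert #1, insert #2, insert #3 and so on, each as its own sequence.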
If you map everything against everything, you should see no signal for insert #3 in your #1 and #2 samples, for example. I think this will work, but it will depend heavily on the inserts being diverse enough that reads from one construct don't map to another construct's insert.
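The "no signal for the wrong insert" check can be expressed as a small calculation. The Python sketch below assumes you already have per-reference mapped read counts for each sample (for example, as reported by `samtools idxstats` on a sorted, indexed BAM) and computes, per sample, the fraction of insert-mapped reads that landed on an insert the sample should not contain. The function name, sample names and numbers are all hypothetical.

```python
def off_target_fraction(counts, expected_insert, insert_names):
    """Fraction of insert-mapped reads hitting the wrong insert, per sample.

    counts          -- {sample: {reference_name: mapped_read_count}}
    expected_insert -- {sample: name of the insert that sample carries}
    insert_names    -- names of all insert sequences in the reference
    """
    result = {}
    for sample, per_ref in counts.items():
        insert_total = sum(per_ref.get(name, 0) for name in insert_names)
        on_target = per_ref.get(expected_insert[sample], 0)
        if insert_total == 0:
            result[sample] = 0.0
        else:
            result[sample] = (insert_total - on_target) / insert_total
    return result


# Made-up counts: sample1 carries insert1, sample2 carries insert2.
counts = {
    "sample1": {"host_chr": 1000, "insert1": 90, "insert2": 10, "insert3": 0},
    "sample2": {"host_chr": 1200, "insert1": 0, "insert2": 50, "insert3": 0},
}
expected = {"sample1": "insert1", "sample2": "insert2"}
print(off_target_fraction(counts, expected, ["insert1", "insert2", "insert3"]))
```

A fraction near zero suggests the inserts are distinct enough for this approach; a noticeably higher value points to reads cross-mapping between similar inserts.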
All of your conditions are going to have noise. There will be fluctuations in transcription regardless of what you do, plus subtle differences between library preps, RNA isolations and so on; that's just the nature of sensitive techniques like this. The problem is that you will be confounding this by using references which are not 100% identical, so some of your reads may not map if the sequence identity is off. Now, if you are using the same reference for them all, and mapping with the same parameters, your noise/error will at least be systematic, so it might not matter, but it's something to bear in mind.
You will need good controls for this, so I would ensure you have an empty vector control for all of your conditions.