Hi Biostars,
I have some bacterial genes containing phase variable short sequence repeats in their coding regions. Because these genes were sequenced with short reads, I cannot fully trust that the predicted number of repeats in each phase variable (PV) tract is accurate.
I do, however, have genome-wide RNAseq reads for my isolates, so I would like to check whether those reads confirm the DNA sequence assembled from the short DNA reads.
I was thinking of just running some form of alignment or BLAST search with my gene sequence against the full set of RNAseq reads and seeing what comes out, but I imagine there's probably a tool out there that would be more helpful.
If anyone has any recommendations or advice I'd really appreciate your time, as I'm not confident that my alignment idea is appropriate or whether there's some flaw in it.
Thank you!
Can you describe this in a bit more detail? Are the copies next to each other? What is the length of these repeats?
Thinking out loud, this could be tricky since bacterial RNAs may not be expressed as discrete transcripts. The best solution may be to do nanopore sequencing of the DNA to identify the structure of the genome instead of trying to use RNAseq.
I have the DNA sequences of every phase variable locus in the genomes of 8 bacterial isolates.
Repeats range from polyG tracts of around 9-20 bases up to pentanucleotide tracts of 3-10 repeat units. The repeat units are directly adjacent to each other.
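For illustration, tracts like these can be located in a locus sequence with a regular expression. This is only a minimal sketch, not an established tool: the function name `find_repeat_tracts` and its thresholds (units up to 5 bp, at least 9 bases for a homopolymer, at least 3 copies for longer units) are assumptions taken from the ranges described here.

```python
import re

def find_repeat_tracts(seq, max_unit=5, min_copies=3, min_homopolymer=9):
    """Return a list of (start, unit, copies) for tandem repeat tracts in seq.

    Thresholds are illustrative assumptions, not fixed rules.
    """
    seq = seq.upper()
    tracts = []
    for unit_len in range(1, max_unit + 1):
        min_n = min_homopolymer if unit_len == 1 else min_copies
        # One captured unit followed by at least (min_n - 1) further copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, min_n - 1))
        for m in pattern.finditer(seq):
            unit = m.group(1)
            # Skip units that are themselves built from a shorter repeat
            # (e.g. "GG" or "CACA"), so each tract is reported once.
            if unit in (unit + unit)[1:-1]:
                continue
            tracts.append((m.start(), unit, len(m.group(0)) // unit_len))
    return tracts

# Hypothetical locus with a polyG tract and a pentanucleotide tract.
locus = "ATGC" + "G" * 12 + "TTACC" + "CAGTA" * 4 + "GGCT"
for start, unit, copies in find_repeat_tracts(locus):
    print(start, unit, copies)
```

The reported start is 0-based, and the repeat unit is reported in whichever phase the scan meets first, so a tract may be listed with a rotated unit (for example ACAGT instead of CAGTA) depending on the flanking bases.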
My hope is there's some easy way to align RNAseq reads against the DNA sequence of a given phase variable locus and check whether the sequences match well over the repeat tract. The tracts aren't so long that they'd cover an entire short read and make accurate overlap impossible. I'm wondering if some kind of assembly or mapping tool could take a given gene as the reference and show what maps to it, and I would then repeat that for every gene I'm interested in.
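One way to sketch that check without a full aligner is to look for reads that contain both flanks of the tract and count the repeat units between them. The code below is a rough illustration under several assumptions: the reads are already loaded as plain strings, the flanking sequences are unique to the locus, and the flanks must match exactly (so reads with a sequencing error in a flank are simply ignored). The names `count_tract_lengths`, `left_flank`, and `right_flank` are hypothetical.

```python
import re
from collections import Counter

def revcomp(seq):
    """Reverse complement of a DNA string (A, C, G, T, N only)."""
    return seq.translate(str.maketrans("ACGTN", "TGCAN"))[::-1]

def count_tract_lengths(reads, left_flank, unit, right_flank):
    """Tally repeat copy numbers seen in reads that span the whole tract.

    Only reads containing left_flank + unit*n + right_flank are counted,
    on either strand. Exact matching is an assumption of this sketch.
    """
    pattern = re.compile(
        re.escape(left_flank) + "((?:" + re.escape(unit) + ")+)" + re.escape(right_flank)
    )
    counts = Counter()
    for read in reads:
        read = read.upper()
        # Try the read as given, then its reverse complement.
        for seq in (read, revcomp(read)):
            m = pattern.search(seq)
            if m:
                counts[len(m.group(1)) // len(unit)] += 1
                break
    return counts

# Hypothetical reads around a polyG tract flanked by TTACC and ATCGA.
reads = [
    "AAC" + "TTACC" + "G" * 11 + "ATCGA" + "TT",
    revcomp("TTACC" + "G" * 12 + "ATCGA"),
    "ACGTACGT",  # does not span the tract, so it is not counted
]
for copies, n_reads in sorted(count_tract_lengths(reads, "TTACC", "G", "ATCGA").items()):
    print(copies, n_reads)
```

If most spanning reads agree with the copy number in the assembly, that supports the assembled tract length. A spread of copy numbers is harder to interpret, since it could come from real phase variation in the population or from sequencing errors in the repeat, and this sketch cannot tell those apart.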
We don't have a huge amount of funding, and the number of genes x isolates (~40 * 8) is large enough that doing all that sequencing would be time and cost prohibitive, especially when we already have the RNAseq data.