I recently ran Prokka on a bacterial genome and the results showed 1 16S rRNA. I had moderately expected two 16S rRNA in this genome (just because a similar genome had two 16S rRNA). As a result, I want to double check that there really is only one copy in these genome (and not that there are two similar copies and Prokka could only find one).
What I have done:
1) Mapping the raw reads of my genome to the 16S rRNA sequence detected by Prokka as the reference genome. 2) I have resulting .sam, .bam, and .bed file. 3) It seems 893 reads in my genome mapped to the reference 16S rRNA.
What I would like to do now:
I would like to view these 893 reads against the reference genome and determine if there are any nucleotides that have "variants". If I see certain nucleotides in the reference rRNA genome having multiple bases aligning to it from the raw reads, then I will assume this is evidence that perhaps there are multiple (but similar) 16S rRNA in my genome and that perhaps only the consensus was picked up by Prokka.
My questions:
1) Does my thinking about this process seem sound conceptually? (That variants in the mapped reads to the 16S rRNA reference sequence may be evidence for the presence of more than 1 16S rRNA in my genome sequence?)
2) What is a simple step for someone using MacOSX with only basic Linux skills to indeed view these 893 reads against the reference genome and check for possible variants?
What is the origin of this genome? Was it assembled from short reads alone? If yes, then it is indeed possible several 16S rRNA genes may have been assembled into just one gene. I would look at coverage of 16S rRNA gene versus average coverage of other regions, more than looking at variants at the 16S rRNA.
It was assembled from reads each about 150 base pairs. I obtained coverage looking at the results of running metaspades (which gave contigs.fasta). This resulted in 81 contigs. The header for each contig looked like
>NODE_81_length_78_cov_21.000000
. So, I considered this contig to have coverage of 21. I took the average of all 81 contigs using this method and it turned out to be 37.5. My questions then are:1) Is this a correct way to determine the average coverage of the overall genome?
2) How can I determine the average coverage of the 16S rRNA gene? I only have the 16S rRNA information from running a separate program (Prokka). Is it possible to elegantly see which contig this 16S rRNA sequence lies on (to get its coverage from the same metaspades contig.fasta file)?
3) I will definitely report your suggestion about coverage differences, but I am still curious to know if there is a simple way (i.e. for beginners with MacOSX and basic Linux skills) to determine variants of these 893 reads against the reference 16S rRNA genome?