Hi,
I am trying to run some differential expression experiments on my bacterial strain, and I am very new to the field.
I aligned my (paired-end) reads with STAR to both a genome and plasmid (using 2 separate fasta files + 1 combined gff file, which was checked for identical annotation format). Afterwards I used featureCounts, but unfortunately, I couldn't detect some of the essential genes of the plasmid (without these genes the bacteria would not grow). I can find the reads in the STAR output (Aligned.out.sam) so they must be filtered out by featureCounts. I tried to run featureCounts including multimapping reads but no luck. So now I am back to looking at the STAR output and have a few questions.
As you can see in picture 1, I took the read sequence of one of the essential genes and searched for it in the SAM file. It shows that there are several alignments, all starting at position 878 (which is the exact beginning of the gene). ChatGPT tells me the following about columns 8 and 9:
Mate Position (column 8): The 1-based leftmost mapping position of the mate of the read on the reference sequence.
Inferred Insert Size (column 9): The inferred size of the DNA fragment from which the read was sequenced, based on the alignment of the read pair. This column is only applicable for paired-end reads.
- I understand that column 8 shows me the 1-based leftmost mapping position of the second read. Is that correct? And column 9 shows me the size of the DNA fragment based on the start of the first read and the end of the second read. Is that correct?
- I do not understand how there can be a negative inferred size (column 9). What is the explanation for that?
- How can the inferred size be larger than 150? As I understand paired-end sequencing, it should be exactly 150 bp long, since the primer attaches to the P7 region when sequencing read 2, so it must have the same length as read 1.
- Might this be causing my issues with counting reads?
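To make the column 8 and 9 questions concrete, here is a minimal Python sketch of how those two fields can be read from a SAM record. It follows the SAM specification, where column 8 (PNEXT) is the 1-based position of the mate and column 9 (TLEN) is the signed template length: positive for the leftmost read of a pair and negative for the rightmost one. The two records are invented for illustration (the names `read1` and `plasmid1`, the mate position 1000 and the length 272 are assumptions, not values from my data), and the sequence and quality columns are shortened to `*`.

```python
def parse_sam_mate_fields(line):
    """Extract the position-related columns from one SAM alignment line."""
    fields = line.rstrip("\n").split("\t")
    return {
        "qname": fields[0],
        "flag": int(fields[1]),
        "rname": fields[2],
        "pos": int(fields[3]),    # column 4: 1-based leftmost position of this read
        "pnext": int(fields[7]),  # column 8: 1-based leftmost position of the mate
        "tlen": int(fields[8]),   # column 9: signed template (fragment) length
    }


def describe_tlen(tlen):
    """Interpret the sign of TLEN as defined in the SAM specification."""
    if tlen > 0:
        return "leftmost read of the pair"
    if tlen < 0:
        return "rightmost read of the pair"
    return "no template length available"


# Hypothetical pair: two 150 bp reads from one 272 bp fragment.
example_records = [
    "read1\t99\tplasmid1\t878\t255\t150M\t=\t1000\t272\t*\t*",
    "read1\t147\tplasmid1\t1000\t255\t150M\t=\t878\t-272\t*\t*",
]

for record in example_records:
    parsed = parse_sam_mate_fields(record)
    print(parsed["pos"], parsed["pnext"], parsed["tlen"], describe_tlen(parsed["tlen"]))
```

In this made-up pair the fragment spans positions 878 to 1149, so the template length (272) is larger than either 150 bp read, and the two mates carry the same value with opposite signs.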
What is also confusing to me is that some of the reads are only found once (see picture 2) but still don't show up after featureCounts. But as you can see, column 9 shows both negative values and values larger than 150, which seems to support my suspicion.
Lastly, can you maybe think of another way to check why these genes are not taken into account by featureCounts?
Thank you for your help!
What do the alignments look like in IGV?
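Alongside IGV, it may be worth checking how the gene is described in the GFF file. featureCounts only counts features whose type (column 3) matches its `-t` option, which defaults to `exon`, and it groups them by the attribute given with `-g`, which defaults to `gene_id`. The sequence name in column 1 also has to match the reference name in the SAM file. Below is a minimal Python sketch that lists the feature types covering a given position. The GFF lines, the sequence name `plasmid1`, the coordinates and the locus tag are all hypothetical, so substitute the real file and names.

```python
from collections import Counter


def feature_types_at(gff_lines, seqid, position):
    """Count GFF feature types (column 3) that overlap a position on seqid."""
    counts = Counter()
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, comments and directives
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a complete feature line
        start, end = int(fields[3]), int(fields[4])
        if fields[0] == seqid and start <= position <= end:
            counts[fields[2]] += 1
    return counts


# Hypothetical annotation lines, for illustration only.
example_gff = [
    "##gff-version 3",
    "plasmid1\tsource\tgene\t878\t1500\t.\t+\t.\tID=gene1;locus_tag=P_0001",
    "plasmid1\tsource\tCDS\t878\t1500\t.\t+\t0\tID=cds1;Parent=gene1",
    "chromosome\tsource\tgene\t100\t900\t.\t+\t.\tID=gene2;locus_tag=C_0001",
]

types_found = feature_types_at(example_gff, "plasmid1", 878)
print(dict(types_found))
```

If the real file shows only `gene` and `CDS` entries at that position and no `exon` entries, the default `-t exon` setting would skip the gene entirely, which could explain the missing counts.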