Question

Understanding STAR output (Aligned.out.sam file)

0

Entering edit mode

6 months ago

heelpPlease • 0

Hi,

I am trying to do some differential expression experiments on my bacteria strain and I am very new to the field.

I aligned my (paired-end) reads with STAR to both a genome and plasmid (using 2 separate fasta files + 1 combined gff file, which was checked for identical annotation format). Afterwards I used featureCounts, but unfortunately, I couldn't detect some of the essential genes of the plasmid (without these genes the bacteria would not grow). I can find the reads in the STAR output (Aligned.out.sam) so they must be filtered out by featureCounts. I tried to run featureCounts including multimapping reads but no luck. So now I am back to looking at the STAR output and have a few questions.

alignment of a specific sequence

As you can see in picture 1, I took the read sequence of one of the essential genes and searched for it in the sam file. And it shows that there are several alignments all starting at position 878 (which is the exact beginning of the gene). ChatGPT tells me the following about column 8 and 9: Mate Position (column 8): The 1-based leftmost mapping position of the mate of the read on the reference sequence. Inferred Insert Size (column 9): The inferred size of the DNA fragment from which the read was sequenced, based on the alignment of the read pair. This column is only applicable for paired-end reads.

I understand that column 8 shows me the 1-based leftmost mapping position of the second read. is that correct? And column 9 shows me the size of the DNA fragment based on the start of the first read and the end of the second read. Is that correct?
I do not understand how there can be a negative inferred size (column 9). What is the explanation for that?
How can the inferred size be larger than 150? As I understand paired-end reading, it should be exactly 150bp long as the primer attaches to the p7 region when sequencing read 2. so It must have the same length as read 1.
Might this be causing my issues with counting reads?

single entry

What is also confusing to me is that some of the reads are only found once (see picture 2) but still dont show up after featureCounts. But as you can see, column 9 shows both negative and larger than 150 values.. So this actually confirms my suspicions.

Lastly, can you maybe think of another way how to check why these genes are not taken in account by featurecounts?

Thank you for your help!

STAR paired-end read • 1.1k views

ADD COMMENT • link 6 months ago by heelpPlease • 0

0

Entering edit mode

What do the alignments look like in IGV?

ADD REPLY • link 6 months ago by swbarnes2 14k

score 1 · Answer 1 · 2024-05-07

The insert size is measured from outermost edge of this read, to the outer most edge of its mate. Like, literally, start_second - start_first. If the first read is on the negative strand, and the second read on the positive strand, then the start coordinate of the first read will be higher than the second, and thus the insert size negative.

Example:

Genome Coordinate 10000            10150                        10350               10500
                  |                  |         Fragment          |                  |
                  -------------------------------------------------------------------
                  |>>>>>>>>>>>>>>>>>>|                           |<<<<<<<<<<<<<<<<<<|
                     Second in pair                                  First in pair

Here the outer coordinate of the first read is 10500, and the outer coordinate of the second read is 10000, so the insert size is -500. This also answers your second question - fragments are often longer than reads. For a 500bp fragment, we only sequence 150bp in from each end.

This is unlikely to be causing your problems with featureCounts. More likely bacterial genes are expressed from longer transcription units than the gene. Thus reads that cover the gene are likely to run outside the boundaries of the gene, and thus not be counted by featureCounts. This can be change in the featureCount settings.

score 0 · Answer 2 · 2024-05-07

After thoroughly reviewing all the documents again, I've identified the root cause of my initial issue: the failure to detect essential genes after using featureCounts. It appears that the problem stemmed from a mixture of tabs and spaces after the concatenation of the two GFF files. Consequently, this oversight (the formating looked the same when looking at it by eye) resulted in the exclusion of the plasmid annotations.