Question

Unannotated reads through STAR

4 months ago

rajdeepboral00 ▴ 70

Hello all,

I have run an RNA-seq analysis of mouse spinal cord, 24 samples in total, using STAR for alignment and for the gene counts. All samples have an alignment rate of around 90%, but with the GTF file from GENCODE the median percentage of annotated reads is only around 50%. Is this what I should expect?

I have tried extracting the unannotated reads using the following code

samtools view -h -F 4 CC1_Aligned.sortedByCoord.out.bam > mappedreads_CC1.sam
bedtools intersect -v -abam CC1_Aligned.sortedByCoord.out.bam -b /data/sata_data/home/rajdeep/GRCm39/annotation.bed > unannotated_reads_CC1.bam

I then loaded this BAM file in IGV to visualize the reads, and I see that no reads really map to the genes in the GTF file. So what more can I do? Or should I proceed with these low-annotation files anyway? Is 50% okay?

STAR IGV Unannotated-reads • 660 views

score 0 · Answer 1 · 2025-01-28

This depends on how the library was prepared. In my experience, with polyA-selected data you might expect around three quarters of reads to map to exons. With total RNA (ribodepleted) you would expect fewer; 50% would not be unusual, and you will see many of the reads mapping to intronic sequence when you look at the alignments. If your samples are neither ribodepleted nor polyA-selected, I would expect a large amount of sequence to map to ribosomal RNA. However, I don't think the GENCODE annotation contains entries for ribosomal RNA, so I would expect to see these reads accumulating to very high levels over the locations of the ribosomal RNA transcripts.
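One quick way to see where the unassigned half of the reads went, assuming STAR was run with `--quantMode GeneCounts`, is to summarise each `ReadsPerGene.out.tab` file. The sketch below is only illustrative: the function name and the toy table are made up, and the column order (unstranded, read-1 strand, read-2 strand) is the one described in the STAR manual. If one stranded column assigns far more reads than the other, the library is probably stranded, and the unstranded column may not be the right one to judge by.

```python
import io

# Column order as described in the STAR manual for ReadsPerGene.out.tab.
COLUMNS = ("unstranded", "read1_strand", "read2_strand")
SKIPPED_ROWS = ("N_unmapped", "N_multimapping")   # not uniquely mapped
UNASSIGNED_ROWS = ("N_noFeature", "N_ambiguous")  # uniquely mapped, no gene

def summarise_reads_per_gene(handle):
    """Per counting column, return the fraction of uniquely mapped reads that
    were assigned to a gene, overlapped no feature, or were ambiguous."""
    totals = {c: {"assigned": 0, "N_noFeature": 0, "N_ambiguous": 0} for c in COLUMNS}
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 4 or fields[0] in SKIPPED_ROWS:
            continue
        # Every row that is not a summary row is a gene row.
        key = fields[0] if fields[0] in UNASSIGNED_ROWS else "assigned"
        for col, value in zip(COLUMNS, fields[1:4]):
            totals[col][key] += int(value)
    fractions = {}
    for col, t in totals.items():
        denom = sum(t.values())
        fractions[col] = {k: (v / denom if denom else 0.0) for k, v in t.items()}
    return fractions

# Toy table (invented numbers) standing in for a real ReadsPerGene.out.tab.
toy_table = (
    "N_unmapped\t100\t100\t100\n"
    "N_multimapping\t50\t50\t50\n"
    "N_noFeature\t400\t900\t420\n"
    "N_ambiguous\t20\t0\t20\n"
    "GeneA\t300\t50\t290\n"
    "GeneB\t280\t50\t270\n"
)
fractions = summarise_reads_per_gene(io.StringIO(toy_table))
for col in COLUMNS:
    print(col, {k: round(v, 2) for k, v in fractions[col].items()})
```

For a real sample you would pass an open file instead of the toy string, for example `open("CC1_ReadsPerGene.out.tab")`; that file name is a guess based on the BAM prefix in the question.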

Finally, I would worry if you see a fairly even level of reads across almost the entire genomic sequence, i.e. a uniform background. This probably suggests your samples were contaminated with genomic DNA.
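As a rough illustration of what "even background" means, the sketch below bins read start positions over a stretch with no annotated genes and reports how many bins contain reads and how concentrated the reads are. Everything here is hypothetical (the function name, the 1 kb bin size, the toy positions), and no particular cut-off is implied; it only shows the contrast between reads spread thinly everywhere and reads piled up in one place.

```python
from collections import Counter

def background_profile(read_starts, region_start, region_end, bin_size=1000):
    """Bin read starts over [region_start, region_end) and report how widely
    and how evenly the reads are spread across the bins."""
    n_bins = -(-(region_end - region_start) // bin_size)  # ceiling division
    bins = Counter()
    for pos in read_starts:
        if region_start <= pos < region_end:
            bins[(pos - region_start) // bin_size] += 1
    total = sum(bins.values())
    return {
        # Share of bins that contain at least one read.
        "fraction_bins_covered": len(bins) / n_bins if n_bins else 0.0,
        # Share of all reads that fall in the single busiest bin.
        "top_bin_share": max(bins.values()) / total if total else 0.0,
    }

# Toy data: ten reads spread evenly vs. ten reads stacked in one spot.
even_reads = list(range(500, 10000, 1000))
piled_reads = list(range(4200, 4300, 10))

even = background_profile(even_reads, 0, 10000)
piled = background_profile(piled_reads, 0, 10000)
print("even: ", even)
print("piled:", piled)
```

High bin coverage with a low top-bin share is the pattern described above as a background; low bin coverage with a high top-bin share looks more like a discrete transcript.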

score 0 · Answer 2 · 2025-01-28

4 months ago

GenoMax 151k

then i used this bam file in IGV to visualize the reads and seeing that no reads really is mapping to the genes of the gtf file.

Are you sure about that? A 90% mapping rate would not mean much if the majority of your reads are mapping outside gene models. That may indicate DNA contamination if the alignments are distributed all over the genome.

Unannotated reads

I assume you mean reads that do not map to gene model regions. There is no annotation for reads per se.

Yes, the STAR overall alignment % shows around 90% for all samples.

Yes, I mean the reads that did not overlap any annotation in the GTF file.

Reply • 4 months ago by rajdeepboral00 ▴ 70

When you say that the reads don't align to genic regions, do you mean they are entirely outside any gene annotation, or do you mean they do not overlap any exons? By default, featureCounts only counts reads that overlap exons, so purely intronic reads are not counted. However, with ribodepleted libraries you would expect lots of intronic reads, as you will sequence pre-mRNA after it is transcribed but before it is spliced. You will also sequence many non-polyadenylated non-coding transcripts that may not be present in some annotations.

Consider:


Annotation:            |>>>>>>>>>>>>>|--------------------|>>>>>>>>>>|

Sample 1        -----   -----     ------ ------       ---    ----    ----  ----        ----        ----
                         ---- ----  ----                  ----   ---- 
________________________________________________________________________________________________________
Sample 2:              -----     ------ ------  ----     ---    ----    ----             ----          
                         ---- ----  ----       ----         ----   ---- 
                               ----                           ----
________________________________________________________________________________________________________
Sample 3:              -----     ------ ------       ---    ----    ----            ---- ---- ---- ----        
                         ---- ----  ----                  ----   ----                  ---- ---- ----
                               ----                          ----                        ---- ----
                                                                                            ----

In sample 1 there is a continuous background across all locations. It's not that the background is particularly strong, and background reads may be present only at intervals, but the level is even. This is indicative of DNA contamination in your RNA samples.

In sample 2 you see strong signal in the introns, but not outside genes (although a few reads might be present). This is expected in ribodepleted samples.

In sample 3 you see a strong but localised pile of reads that doesn't seem to align to any annotation. This suggests that there is an unannotated transcript at that location. It could be that your ribodepletion hasn't worked that well and this is a ribosomal RNA gene (these tend not to be in annotations like RefSeq, the default annotation in IGV, or GENCODE, which you say you used for counting).
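To put numbers on the three patterns, one could tally reads as exonic, intronic or intergenic. The following is a minimal sketch with made-up coordinates and hypothetical function names; it assumes a single chromosome, unspliced reads given as half-open (start, end) intervals, and it ignores strand. A real analysis would take the intervals from the BAM and the GTF.

```python
from collections import Counter

def overlaps(a_start, a_end, b_start, b_end):
    """True if half-open intervals [a_start, a_end) and [b_start, b_end) share a base."""
    return a_start < b_end and b_start < a_end

def classify_read(read, exons, genes):
    """Label a read by the most specific annotation it touches."""
    start, end = read
    if any(overlaps(start, end, s, e) for s, e in exons):
        return "exonic"
    if any(overlaps(start, end, s, e) for s, e in genes):
        return "intronic"      # inside a gene body but touching no exon
    return "intergenic"        # outside every annotated gene

# Toy annotation: one gene with two exons and a single intron.
genes = [(100, 1000)]
exons = [(100, 300), (800, 1000)]

# Toy reads (invented): two exonic, two intronic, two intergenic.
reads = [(150, 250), (280, 380), (400, 500), (600, 700), (50, 90), (1500, 1600)]

tally = Counter(classify_read(r, exons, genes) for r in reads)
print(dict(tally))
```

Mostly intronic reads among the unassigned ones would point towards the sample 2 picture, while mostly intergenic reads would point towards sample 1 or sample 3, which the IGV view can then tell apart.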