Why I have too many reads non uniquely aligned?
2
0
Entering edit mode
5.2 years ago
zizigolu ★ 4.3k

Hi

By this code I have obtained my raw read counts from RNA-seq

Alignment by STAR

STAR --genomeDir ./STAR_hg38_Genome --readFilesIn $fastq1 $fastq2

Sorting sam

samtools view -h -o aligned.sam sort.bam

Bam 2 Sam

samtools view -h -o aligned.sam ligned.sorted.bam

GeneCount

htseq-count --stranded=no -q aligned.sam ./gtf > counts.txt

But I have this results; Too many reads non uniquely mapped

__no_feature    427159
__ambiguous 1250237
__too_low_aQual 0
__not_aligned   0
__alignment_not_unique  10648052

I tried to summarized raw read counts by Featurecounts I saw rate of summarization was low

Load annotation file Homo_sapiens.GRCh38.dna.primary_assembly.gtf ...      ||
||    Features : 1262162                                                      ||
||    Meta-features : 58735                                                   ||
||    Chromosomes/contigs : 47                                                ||
||                                                                            ||
|| Process SAM file aligned.sam...                                            ||
||    Paired-end reads are included.                                          ||
||    Assign alignments to features...                                        ||
||    Total alignments : 47361817                                             ||
||    Successfully assigned alignments : 23092843 (48.8%)                     ||
||    Running time : 1.34 minutes

Should I worry about htseq results? You think problem comes from where? When I tried to sort sam by picard I saw less alignment_not_unique, Does it make any sense?

RNA-Seq alignment • 2.9k views
ADD COMMENT
0
Entering edit mode

What is the read length?

ADD REPLY
0
Entering edit mode

Sorry how I should know the read length? Yes my ultimate goal is differential expression analysis

ADD REPLY
0
Entering edit mode

Just do head your.fastq and count how long the read in line 1 is. If the fastq is compressed do zcat your.fq.gz | head

ADD REPLY
0
Entering edit mode

Thank you

[fi1d18@cyan01 ~]$ head ./1.fastq
@NB501007:194:HNKMVBGXB:1:11101:20881:1030 1:N:0:GTCCGC
GGGGCNGTGTTGCCATTATTAACTCTTTTGCTTTCATTTAATGCTCTCGCCCTGCCGCCGCCGGTGCTCCGTC
+
AAAAA#EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
@NB501007:194:HNKMVBGXB:1:11101:11582:1032 1:N:0:GTCCGC
GTTGGNGGCCAGTGCCCTCCTAATTGGGGGGTAGGGGCTAGGCTGGAGTGGTAAAAGGCTCAGAAAAATCCTGCG
+
AAAAA#EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
@NB501007:194:HNKMVBGXB:1:11101:20558:1032 1:N:0:GTCCGC
CCGCANGTGGTGGTTCTCCCCGAGGTCGTGGTTGTGCAGGTCAATGAGGGCCTGGACCGCCTCCTCCACGGAGCC
[fi1d18@cyan01 ~]$
ADD REPLY
0
Entering edit mode

Ok, and the reverse read, so 2.fastq?

ADD REPLY
0
Entering edit mode
[fi1d18@cyan01 ~]$ head ./2.fastq
@NB501007:194:HNKMVBGXB:1:11101:20881:1030 2:N:0:GTCCGC
NACGGAGCACCGGCGGCGGCAGGGCGAGAGCATTAAATGAAAGCAAAAGAGTTAATAATGGCAACACGGCCCC
+
#AAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
@NB501007:194:HNKMVBGXB:1:11101:11582:1032 2:N:0:GTCCGC
NAAGGCCTTCGATACGGGATAATCCTATTTATTACCTCAGAAGTTTTTTTCTTCGCAGGATTTTTCTGAGCCTTT
+
#AAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
@NB501007:194:HNKMVBGXB:1:11101:20558:1032 2:N:0:GTCCGC
NCTGTTTTCCAGCAATGGGGGCGTCGTCAAAGGATTCAAGTTCTTCCAGAAGGACCGCAAGATGGCACTGATCC
[fi1d18@cyan01 ~]$
ADD REPLY
2
Entering edit mode

Ok, so normal 2x75bp. I was asking because overly short reads like 2x25 often lead to multimappers but in this case it probably comes down to what swbarnes2 says, having rRNA or similar contaminations. I would not worry too much about this. Proceed with the normal analysis, use something like PCA to check for potential batch effects (DESeq2 vignette has good manual on how to do that) and then see if results are reasonable after DEG.

ADD REPLY
0
Entering edit mode

For dif. gene expression analysis (I assume you do it because you count the reads), multimappers are usually not counted, though having high proportion of multimapped reads is not desired. What is the read length of your data? STAR has several options (e.g. --outFilterMismatchNmax) to restrict the number of mismatches to the reference, which can influence the mapping rates.

ADD REPLY
0
Entering edit mode

Additionally, the overlap resolution mode from htseq can influence your final read count stats, chech the docs to see which mode better fits your data.

ADD REPLY
3
Entering edit mode
5.2 years ago

Reads that align to rRNA are typically multimappers. Your sample might for some reason be enriched for those kinds of reads. This would be a sample type/library prep issue.

RSEM is a program like feature counts that works on STAR output ( in this case, you have to tell STAR to make a transcriptome output bam) that is smarter about proportionately assigning multimappers to genes.

ADD COMMENT
0
Entering edit mode

Thank you so much both; Sorry @swbarnes2 what you mean by transcriptome output bam? Should I add additional command to STAR alignment code for that?

You think if I just use feature counts instead, the problem would solve?

ADD REPLY
0
Entering edit mode

6 Output in transcript coordinates. With --quantMode TranscriptomeSAM option STAR will outputs alignments translated into transcript coordinates in the Aligned.toTranscriptome.out.bam file (in addition to alignments in genomic coordinates in Aligned.*.sam/bam files). These transcriptomic alignments can be used with various transcript quantification software that require reads to be mapped to transcriptome, such as RSEM or eXpress. For example, RSEM command line would look as follows: rsem-calculate-expression ... --bam Aligned.toTranscriptome.out.bam /path/to/RSEM/reference RSEM

ADD REPLY
2
Entering edit mode
5.2 years ago
boaty ▴ 220

Hi,

I think it's fine, depends on what kind of analysis you want to perform? Here the main question is, do you have enough read depth or statistical power to do a GDE?

you have 23M reads out of aligner and 10M of them are not unique so 13M unique reads for analysis. I think it's OK for GDE.

RNA-seq normally has issue with rRNA, so we treat them with ribo-zero kit. And one of the important indicator for a good RNA-seq is ratio of number of rRNA over number of total RNA. this is called annotation-based quality control.

Even your ribo-zero faild, having 90% of rRNA overall, but you have > 10M mapped unique reads, still you can get something out of water in GDE

ADD COMMENT

Login before adding your answer.

Traffic: 1995 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6