Question

Human transcriptome STAR multi-mapped problem

23 months ago

Валентин • 0

Hi, Biostars!

I'm fairly new to RNA-seq analysis and have run into the following problem. I'm trying to prepare human transcriptome sequencing data for downstream differential expression analysis. I have 100 bp paired-end reads from an MGISEQ-2000; the FastQC reports look good, so I didn't see a need for trimming. Next, I'm trying to align the reads to the human genome with STAR, using the reference FASTA and GTF taken from the STAR manual (manual link).

I'm constantly getting results with a huge number of reads in the "Number of reads mapped to multiple loci" category, and I assume these will not be counted in downstream analysis. Here are the results for one of the samples as an example. STAR settings:

/data/SOFT/STAR-2.7.9a/bin/Linux_x86_64/STAR --runThreadN 52 \
    --genomeDir /data/covid_expression/reference/STAR \
    --readFilesCommand zcat \
    --readFilesIn /data/V350096731_new/L03/V350096731_L03_A12_1.fq.gz /data/V350096731_new/L03/V350096731_L03_A12_1.fq.gz \
    --outFileNamePrefix A12_L03_multimap_filt \
    --sjdbOverhang 100 \
    --outFilterScoreMinOverLread 0 \
    --outFilterMatchNminOverLread 0 \
    --outFilterMatchNmin 0 \
    --outFilterMultimapScoreRange 1 \
    --outFilterMultimapNmax 20 \
    --outFilterMismatchNmax 2

and result:

[Image: STAR mapping statistics for this sample]

I've already checked for rRNA with bbduk, here are the results

Input: 28370078 reads 2837007800 bases.
Contaminants: 2658046 reads (9.37%) 265804600 bases (9.37%)
Total Removed: 2658046 reads (9.37%) 265804600 bases (9.37%)
Result: 25712032 reads (90.63%) 2571203200 bases (90.63%)

So, if I'm not mistaken, only about 9% of the reads are rRNA, which my searching suggests is the usual cause of multi-mapping, yet more than 90% of my reads are multi-mapped. I also tried adjusting several settings, but none of that helped. Can somebody please advise on what I'm doing wrong, or on how to find out whether this is a sequencing problem?

Thanks in advance, Valentin

rnaseq reads multi-mapped STAR • 1.2k views

ADD COMMENT • link updated 23 months ago by Soumajit ▴ 50 • written 23 months ago by Валентин • 0

You should look at where your reads are aligning. What types of genes are they? Do they all come from the same small number of genes, etc.?

ADD REPLY • link 23 months ago by benformatics 4.0k

Thanks! Is there an easy way to automate this process?

ADD REPLY • link 23 months ago by Валентин • 0
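One way to automate the check suggested above is to tally where the multi-mapped alignments land. The following is a minimal sketch, not a definitive method: it assumes the alignments are available as SAM text (for example piped from `samtools view`) and that each record carries the `NH:i:` tag giving the number of reported loci, which STAR writes with its standard SAM attributes. The function name `multimapper_hotspots` and the `bin_size` parameter are hypothetical names introduced for illustration.

```python
from collections import Counter


def multimapper_hotspots(sam_lines, bin_size=100_000):
    """Count multi-mapped alignments (NH > 1) per (reference, bin start).

    sam_lines: iterable of SAM text lines (header lines are skipped).
    bin_size:  width of the genomic bins in bases (illustrative default).
    """
    counts = Counter()
    for line in sam_lines:
        if line.startswith("@"):
            continue  # SAM header line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # not a complete alignment record
        if int(fields[1]) & 0x4:
            continue  # FLAG bit 0x4: read is unmapped
        nh = 1  # assume a single locus when the NH tag is absent
        for tag in fields[11:]:
            if tag.startswith("NH:i:"):
                nh = int(tag[5:])
                break
        if nh > 1:
            chrom, pos = fields[2], int(fields[3])
            # POS is 1-based; convert to the 0-based start of its bin
            counts[(chrom, (pos - 1) // bin_size * bin_size)] += 1
    return counts
```

Calling `multimapper_hotspots(sys.stdin).most_common(20)` on the SAM stream would list the bins that attract the most multi-mapped alignments; if a handful of bins dominate, those regions can then be compared against the GTF to see which genes or repeats they contain.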

Hey Valentin,

I am also fairly new to RNA-seq analysis. I was looking at the command you passed to STAR and noticed something. I might be wrong, though (again, I am an amateur as well).

When you tell STAR where to find the input reads, aren't you passing the same fq.gz file twice? Does the experimental setup have single-end or paired-end reads?

--readFilesIn /data/V350096731_new/L03/V350096731_L03_A12_1.fq.gz /data/V350096731_new/L03/V350096731_L03_A12_1.fq.gz

Soumo

ADD REPLY • link 23 months ago by Soumajit ▴ 50
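A duplicated mate file like the one pointed out above is easy to catch before launching the aligner. The sketch below is only an illustration: the helper names `guess_mate` and `check_paired_inputs` are hypothetical, and the `_1.fq.gz` / `_2.fq.gz` suffix convention is an assumption based on the file name in the post, so it should be adjusted to the actual naming of the data.

```python
import os


def guess_mate(read1, r1_suffix="_1.fq.gz", r2_suffix="_2.fq.gz"):
    """Derive the mate-2 path from the mate-1 path (assumed naming scheme)."""
    if not read1.endswith(r1_suffix):
        raise ValueError(f"{read1!r} does not end with {r1_suffix!r}")
    return read1[: -len(r1_suffix)] + r2_suffix


def check_paired_inputs(read1, read2):
    """Refuse to continue when both mates point at the same file."""
    if os.path.abspath(read1) == os.path.abspath(read2):
        raise ValueError(f"read 1 and read 2 are the same file: {read1!r}")
    return read1, read2
```

Running `check_paired_inputs` on the two paths given to `--readFilesIn` would have raised an error for the command in the question, since both arguments are identical.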

Thanks, that was my mistake, I will update my post :( I still have 45+% non-unique reads, and I don't know whether that is usual for a human transcriptome.

ADD REPLY • link 23 months ago by Валентин • 0

I usually get about 80% unique when aligning TruSeq preps with STAR to a human genome.

ADD REPLY • link 23 months ago by swbarnes2 14k

In all my samples (large datasets prepared with NovaSeq), I got a minimum of 91% uniquely mapped alignments. Hope my samples are alright :D :D

ADD REPLY • link 23 months ago by Soumajit ▴ 50
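To compare unique and multi-mapped rates across samples, as in the replies above, the percentages can be pulled out of STAR's `Log.final.out` summary instead of being read off by eye. This is a rough sketch that assumes the usual layout of that file, with one `name | value` pair per line; the function name `mapping_percentages` is hypothetical.

```python
def mapping_percentages(log_text):
    """Return {statistic name: percentage as float} from Log.final.out text.

    Only lines of the form 'name | value%' are kept; section headers and
    non-percentage values (counts, dates, lengths) are ignored.
    """
    percentages = {}
    for line in log_text.splitlines():
        if "|" not in line:
            continue  # section header or blank line
        name, value = line.split("|", 1)
        value = value.strip()
        if value.endswith("%"):
            percentages[name.strip()] = float(value[:-1])
    return percentages
```

Applied to the text of each sample's log, the entries "Uniquely mapped reads %" and "% of reads mapped to multiple loci" give the two figures being compared in this thread, so outlier samples are easy to spot in a table.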