Question

RNASeq data: STAR low alignment score

1

Entering edit mode

4.2 years ago

Kumar ▴ 170

Hello, I am working on RNA seq analysis with the human genome. I have a total of 180 samples (90 control and 90 mutant). Initially, I am trying with a total of 4 samples (2 control and 2 mutant). I am trying to compare genes deferentially expressed between these two groups. Let me explain my data structure and process:

I got paired-end samples R1 and R2. These were in four parts (.fastq files) of each pair (R1 and R2 separately) so I concatenated these four parts (.fastq) using $cat to make a single R1 and R2 paired-end. Subsequently, I RUN trimGalore to improve the reads with the below command. Then I used STAR for alignment. However, I am getting a very low score of alignment. Please see the multiQC report. In this case, do I need to use Kallisto or any other way to improve alignment score?

$trim_galore --illumina --paired --fastqc -o /DataAnalysis/mutant/

alignment RNA-Seq STAR genome • 3.0k views

ADD COMMENT • link updated 4.2 years ago by Biostar 20 • written 4.2 years ago by Kumar ▴ 170

1

Entering edit mode

@Manoj this is not a matter of using kallisto or some other program. Doing that is not going to improve your alignments, if your input data has problems.

It does looks like your alignment % are low, assuming you have done everything right (aligned against genome, not transcriptome etc). Now you need to do some investigative work. Take a selection of reads from each file that are unmapped (you can extract them easily from BAM) and blast them at NCBI (you will need to change them to fasta format) to make sure you don't have a contamination problem with your data. If there is a problem with your data it is not going to clear itself just by using a different program/technique.

I concatenated these four parts (.fastq) using $cat to make a single R1 and R2 paired-end.

Did you make sure that the order of pieces of files in your cat command was identical for R1/R2 files. It not this would cause low/discordant alignments.

cat Piece1_R1.fq.fz Piece2_R1.fq.gz .. Piece4_R1.fq.gz > total_R1.fq.gz
cat Piece1_R2.fq.fz Piece2_R2.fq.gz .. Piece4_R2.fq.gz > total_R2.fq.gz

ADD REPLY • link 4.2 years ago by GenoMax 147k

0

Entering edit mode

I am using kallisto for alignment. How I can get BAM file. I am still getting low alignment score.

ADD REPLY • link 4.2 years ago by Kumar ▴ 170

0

Entering edit mode

Read the manual. As genomax already said, changing your alignment method/tool is not going to magically solve your problem. Did you investigate more closely as suggested?

ADD REPLY • link 4.2 years ago by jared.andrews07 ★ 18k

1

Entering edit mode

Yes! I checked my reads using bbduk using the following command. However, it seems it has high rRNA contamination. Please let me know how to remove these contaminants.

./bbduk.sh -Xmx64g in1=/DataAnalysis/1000_R1.fastq in2=/DataAnalysis/1000_R2.fastq out=/DataAnalysis/ribo.fa 
 ref=/DataAnalysis/Homo_sapiens.GRCh38.dna.primary_assembly.fa

Results:

Input:                      17550820 reads      2267836215 bases.
Contaminants:               17518736 reads (99.82%)     2263769594 bases (99.82%)
 Total Removed:             17518736 reads (99.82%)     2263769594 bases (99.82%)
  Result:                   32084 reads (0.18%)     4066621 bases (0.18%)

   Time:                            273.196 seconds.
    Reads Processed:      17550k    64.24k reads/sec
    Bases Processed:       2267m    8.30m bases/sec

ADD REPLY • link 4.2 years ago by Kumar ▴ 170

0

Entering edit mode

This is NOT contamination. You are using the actual human genome sequence as the reference in ref=. So no wonder bbduk.sh is able to match 99%+ of your data to it.

ADD REPLY • link 4.2 years ago by GenoMax 147k

0

Entering edit mode

Ok! Actually, I am new for RNASeq analysis. Could you elaborate your answer, how I can proceed? Should I use rRNA sequences file that is downloaded from BioMart to see if there are rRNA contamination?

I used tagdust to improve my reads files. In its output it generates output_READ1.fastq, output_READ2.fastq, output_un_READ1.fastq, output_un_READ1.fastq.

./tagdust -t 6 -1 R:N -o /DataAnalysis/tagduest-analysis/output -ref /DataAnalysis/tagduest- 
analysis/mart_export.txt /DataAnalysis/tagduest-analysis/1000_R1.fastq /DataAnalysis/tagduest- 
analysis/1000_R2.fastq

ADD REPLY • link 4.2 years ago by Kumar ▴ 170

0

Entering edit mode

Should I use rRNA sequences file that is downloaded from BioMart to see if there are rRNA contamination?

Since you know that why did you use the entire human genome as "contaminant reference" to search against in your bbduk.sh analysis? You seem to have good data (99% reads match to the human reference, which may contain rRNA, which you need to account for). If you don't use a tool in the way it is meant to be used you are creating additional work (and confusion) for yourself for no reason. If you used tagdust in the way it is meant to be used then your data has already been cleaned of extraneous sequences and is now ready for further analysis. No need to do bbduk analysis again.

ADD REPLY • link 4.2 years ago by GenoMax 147k

0

Entering edit mode

However, I am trying STAR aligner after all the contamination removing process. But, the alignment score is still low. Here is my multiQC result.

ADD REPLY • link 4.2 years ago by Kumar ▴ 170

0

Entering edit mode

So, if you did everything right, you need to investigate more deeply... That happened to me once and I discovered that, unfortunately, the problem was on the library prep. If you did this work will be easier for you to find the issue but, if not, investigate all metrics. Second, make sure that you have concatenated everything the right way. Third, be careful when trimming your reads so, make sure that everything was right on that task.

ADD REPLY • link 4.2 years ago by brunobsouzaa ▴ 830

0

Entering edit mode

I checked all my process and concatenated files. To remove contamination I first I RUN Trimgalore with the following command. Furthermore, I tagdust with the following command to remove rRNA contamination. However, I am still getting low alignment score using Kallisto. Here is the Kallisto statistical result.

"n_targets": 190522,
"n_bootstraps": 0,
"n_processed": 34341063,
"n_pseudoaligned": 11756474,
"n_unique": 3750824,
"p_pseudoaligned": 34.2,
"p_unique": 10.9,
"kallisto_version": "0.46.1",
"index_version": 10,
"start_time": "Tue Sep  1 10:19:18 2020",
"call": "./kallisto quant -i transcripts.idx -o output 270_R1.fastq.gz 270_R2.fastq.gz"

 TrimGalore-0.6.5/trim_galore --paired --fastqc -q 20 -o /DataAnalysis/test-trim/

./tagdust -t 6 -1 R:N -o /DataAnalysis/output -ref /DataAnalysis/mart_export.txt /DataAnalysis/278C_R1.fasta.gz 
 /DataAnalysis/278C_R2.fastq.gz

ADD REPLY • link 4.2 years ago by Kumar ▴ 170

0

Entering edit mode

Changing your align method is not going to solve your problem! Try running FastQC on all your R1 and R2 files before alignment and check data quality.

ADD REPLY • link 4.2 years ago by brunobsouzaa ▴ 830

0

Entering edit mode

If you want another method of checking for rRNA to double check, you can try split_bam.py from RSeQC package. You can also try read_distribution.py from the same package to check for potential gDNA contamination.

ADD REPLY • link 4.2 years ago by newbio17 ▴ 360