Hello, I am working on RNA-seq analysis with the human genome. I have a total of 180 samples (90 control and 90 mutant). Initially, I am testing with 4 samples (2 control and 2 mutant). I want to find genes that are differentially expressed between these two groups. Let me explain my data structure and process:
I have paired-end samples (R1 and R2). Each of R1 and R2 came as four separate .fastq files, so I concatenated the four parts with cat to make a single R1 file and a single R2 file. I then ran Trim Galore to clean the reads with the command below, and used STAR for alignment. However, I am getting a very low alignment rate. Please see the MultiQC report. In this case, do I need to use Kallisto or some other approach to improve the alignment rate?
$trim_galore --illumina --paired --fastqc -o /DataAnalysis/mutant/
@Manoj this is not a matter of using kallisto or some other program. Doing that is not going to improve your alignments if your input data has problems. It does look like your alignment percentages are low, assuming you have done everything right (aligned against the genome, not the transcriptome, etc.). Now you need to do some investigative work. Take a selection of unmapped reads from each file (you can extract them easily from the BAM) and BLAST them at NCBI (you will need to convert them to FASTA format) to make sure you don't have a contamination problem with your data. If there is a problem with your data, it is not going to clear itself just by using a different program or technique.
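A rough sketch of the FASTA conversion step mentioned above: the function below turns the first few records of a FASTQ file of unmapped reads into FASTA that can be pasted into BLAST. The name fastq_to_fasta and all file names are made up for illustration, and it assumes a plain, uncompressed FASTQ with four lines per record.

```shell
# One common way to pull unmapped reads out of a BAM (file names are hypothetical):
#   samtools fastq -f 4 Aligned.bam > unmapped.fastq

# Convert the first N records of a 4-line-per-record FASTQ to FASTA.
# $1 = FASTQ file, $2 = number of reads to keep (default 20)
fastq_to_fasta() {
    awk -v max="${2:-20}" '
        # Header line: stop once the limit is reached, else swap "@" for ">"
        NR % 4 == 1 { n++; if (n > max) exit; print ">" substr($0, 2) }
        # Sequence line: print unchanged
        NR % 4 == 2 { print }
    ' "$1"
}
```

For example, fastq_to_fasta unmapped.fastq 50 > unmapped_sample.fasta would write 50 sequences to a file for BLAST.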
Did you make sure that the order of the file pieces in your cat command was identical for the R1 and R2 files? If not, this would cause low/discordant alignments.
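One way to test the pairing, as a minimal sketch: walk through both concatenated files together and compare read names record by record. The function name check_pair_order is hypothetical, and the sketch assumes uncompressed four-line FASTQ records whose names are identical apart from an optional /1 or /2 suffix or a trailing comment after a space.

```shell
# Compare read names of R1 and R2 record by record.
# $1 = R1 FASTQ, $2 = R2 FASTQ, $3 = max reads to check (default 100000)
# Exit status is 0 if all checked names match, 1 otherwise.
check_pair_order() {
    awk -v r2="$2" -v max="${3:-100000}" '
        BEGIN { bad = 0; n = 0 }
        NR % 4 == 1 {
            # Read the matching header line from R2
            if ((getline mate < r2) <= 0) {
                print "R2 ended early at read " (n + 1); bad = 1; exit
            }
            # Skip the other three lines of the R2 record
            for (i = 0; i < 3; i++) getline junk < r2
            a = $1; b = mate
            sub(/ .*/, "", b)          # drop any comment after the name
            sub(/\/[12]$/, "", a)      # drop /1 or /2 mate suffix
            sub(/\/[12]$/, "", b)
            n++
            if (a != b) { print "mismatch at read " n ": " a " vs " b; bad = 1; exit }
            if (n >= max) exit
        }
        END { if (!bad) print "OK: first " n " read names match"; exit bad }
    ' "$1"
}
```

A mismatch early in the file would suggest the four parts were concatenated in a different order for R1 and R2. Note that this sketch does not detect R2 containing extra reads beyond the end of R1.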
I am using kallisto for alignment. How can I get a BAM file? I am still getting a low alignment rate.
Read the manual. As genomax already said, changing your alignment method/tool is not going to magically solve your problem. Did you investigate more closely as suggested?
Yes! I checked my reads with bbduk using the following command. It seems there is high rRNA contamination. Please let me know how to remove these contaminants.
Results:
This is NOT contamination. You are using the actual human genome sequence as the reference in ref=. So no wonder bbduk.sh is able to match 99%+ of your data to it.
OK! Actually, I am new to RNA-seq analysis. Could you elaborate on your answer and on how I should proceed? Should I use an rRNA sequence file downloaded from BioMart to see whether there is rRNA contamination?
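To illustrate the point about ref=, here is a sketch of how an rRNA-only screen with bbduk.sh might look: the reference is a FASTA containing only rRNA sequences, not the genome. The function name screen_rrna, the file names, and k=31 are assumptions for illustration; the in=, in2=, ref=, outm=, outm2=, out=, out2=, and stats= parameters are standard BBDuk options.

```shell
# Screen paired reads against an rRNA-only reference with BBDuk.
# $1 = R1 FASTQ, $2 = R2 FASTQ, $3 = rRNA-only FASTA, $4 = output prefix
screen_rrna() {
    # outm/outm2 receive read pairs that match the rRNA reference,
    # out/out2 receive the remaining pairs, stats= gets a per-reference summary.
    bbduk.sh in="$1" in2="$2" ref="$3" k=31 \
        outm="${4}_rRNA_R1.fastq" outm2="${4}_rRNA_R2.fastq" \
        out="${4}_clean_R1.fastq" out2="${4}_clean_R2.fastq" \
        stats="${4}_rRNA_stats.txt"
}
```

With a reference like this, the matched percentage reported by BBDuk would reflect rRNA content rather than how much of the data is human.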
I used tagdust to clean my read files. Its output consists of output_READ1.fastq, output_READ2.fastq, output_un_READ1.fastq, output_un_READ1.fastq.
Since you know that, why did you use the entire human genome as the "contaminant reference" to search against in your bbduk.sh analysis? You seem to have good data (99% of reads match the human reference, which may contain rRNA, which you need to account for). If you don't use a tool in the way it is meant to be used, you are creating additional work (and confusion) for yourself for no reason. If you used tagdust in the way it is meant to be used, then your data has already been cleaned of extraneous sequences and is ready for further analysis. There is no need to do the bbduk analysis again.
However, I tried the STAR aligner after the whole contamination-removal process, and the alignment rate is still low. Here is my MultiQC result.
So, if you did everything right, you need to investigate more deeply... That happened to me once, and I discovered that, unfortunately, the problem was in the library prep. If you did the library prep yourself, it will be easier for you to find the issue; if not, investigate all the metrics. Second, make sure that you have concatenated everything the right way. Third, be careful when trimming your reads, and make sure that everything went right in that step.
I checked my whole process and the concatenated files. To remove contamination, I first ran Trim Galore with the following command. I then ran tagdust with the following command to remove rRNA contamination. However, I am still getting a low alignment rate with Kallisto. Here are the Kallisto statistics.
Changing your alignment method is not going to solve your problem! Try running FastQC on all your R1 and R2 files before alignment and check the data quality.
If you want another method of checking for rRNA to double-check, you can try split_bam.py from the RSeQC package. You can also try read_distribution.py from the same package to check for potential gDNA contamination.
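A hedged sketch of how those two RSeQC checks might be invoked. The function name rseqc_checks and all file names are hypothetical; it assumes you have a BED file of rRNA regions and a BED12 gene model for the same genome build as the BAM. The -i, -r, and -o options are the ones these scripts document.

```shell
# Run two RSeQC checks on an aligned BAM.
# $1 = aligned BAM, $2 = rRNA regions BED, $3 = gene model BED12, $4 = output prefix
rseqc_checks() {
    # Splits reads into those overlapping the rRNA regions (prefix.in.bam),
    # those that do not (prefix.ex.bam), and the rest (prefix.junk.bam),
    # and prints the read counts for each group.
    split_bam.py -i "$1" -r "$2" -o "$4"

    # Reports how reads are distributed over exons, introns, and
    # intergenic regions; the table goes to stdout, saved here to a file.
    read_distribution.py -i "$1" -r "$3" > "${4}.read_distribution.txt"
}
```

An unusually large share of intronic and intergenic reads in the read_distribution.py table is the kind of signal that would point toward gDNA contamination.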