Hello, I am a bioinformatics student and I am learning how to do rnaseq so I am new and I have doubts that I don't know if it is my fault or because of the quality of the sequencer. I have to do an Rna-seq of chromohalobacter salexigens DM3043. They are paired readings. I follow the next algoritm for each sample:
bwa mem ../../Genoma_referencia_gtf/GCF_000055785.1_ASM5578v1_genomic.fna combinedRpoS25C_R1.fastq combinedRpoS25C_R2.fastq > aligned_reads_RpoS25C.sam
samtools view -Sb aligned_reads_RpoS25C.sam | samtools sort -o aligned_reads_RpoS25C_sorted.bam
samtools flagstat aligned_reads_RpoS25C_sorted.bam > aligned_stats_RpoS25C.txt
samtools stats aligned_reads_RpoS25C_sorted.bam > alignment_stats_RpoS25C_full.txt
featureCounts -a ../../Genoma_referencia_gtf/genomic.gtf -o counts_bwa_RpoS25C.txt -t gene -g gene_id -p aligned_reads_RpoS25C_sorted.bam
I also tried trimming the data although it did not vary much and with hisat2 and bowtie and in both cases in some samples I got a very low alignment % of 20 and then in 4 samples when I do the FeatureCounts I got very low counts with featureCounts and in another 4 samples if I get a high number of accounts applying the same protocol. Also when I do FeatureCounts I dont know if I should use --countReadPairs because it takes too much time to perform de analyisis.
What could be the reason for this difference when I apply the same protocol and everyone in the previous FASTQC has good qualities.
If this is paired end data then you have to use that option with
-p
in order not to double count reads.When a situation like this arises (low % mapping), take a sampling of unmapped reads and blast them at NCBI to make sure they are aligning to correct genome i.e. you don't have unexpected contamination.
Are the original reads being aligned as is or did you do any trimming/scanning? If yes, then were the reads trimmed together?
Hi. At first I had each sample with 4 or 5 .fastq R1 files and 4 or 5 .fastq R2 files.
Then I did this to combine them: (for example with fordward(R1))
And then the FASTQC:
But I am working with the untrimmed samples because the number of counts does not vary too much and it is more direct.With the original combined R1 and R2.
With the bwa I do sample by sample, not all at once.Another thing I can say is that in some samples I had 30 million counts and then in other samples I had 5 million which doesn't make much sense.
Here I show a summary of my data.
Looking at the
001/002 ..
in file names this must be old data. Those kind of file pieces have not been in use for quite some time.As long as the order of the files
catted
was exactly identical for the two reads (R1/R2) thing should be OK. You can use a tool likerepair.sh
from BBMap suite to confirm that read identifiers are matching.Are you referring to variable number of reads across samples or file pieces for individual samples?
For example when I do: featureCounts -a ../../Genoma_referencia_gtf/genomic.gtf -o counts_bwa_RpoS25C.txt -t gene -g gene_id -p aligned_reads_RpoS25C_sorted.bam
I got an excel where I get the counts and in some samples I have 30373011 counts with a lot of MQ0 and in others I have the same with low MQ0. The brown bars in the image.
And when I use hisat2 to align I have counts with 30373011 total counts and others with 5373011 of a same replicate. I also have 6 samples with SOLID samples that were already preprocessed and I have to work with the rest of Illumina and it is difficult because those already processed give a very high number of counts and reads and I am applying another protocol to them.
It would be best not to try and pool illumina/SOLiD data.
If you are referring to mapping quality there then
MQ0
means those reads are multi-mapping.featureCounts
will not count those reads unless you add-M
option. Multi-mapping reads have special considerations in RNAseq.If you have overlapping genes then that could also be an issue.
Have you considered using a program like https://cs.wellesley.edu/~btjaden/Rockhopper/ that is targetted for bacterial RNAseq.
Yes I used Rockhopper and I some samples It appear like 20-10% protein coding and the rest 80% RNAr. Its a university project I need to use both data.
But If I want to study differential expression I shouldnt use Multimapping reads no? Is there any way I could contact to you to show how Im doing the analysis?
That is the answer. Not much you can do about data you did not generate yourself. If this is a project assigned to you then show what you did and the issues you ran into.
And what should I do with the data? I also tried to give more or less missmatches but it varies very little.
I will try to use the countReadPairs of FeatureCounts although it takes too long for a single sample because until now I used only -p.
And when I normalize the data and do a pca, the results come out a little different.And there are some samples that are grouped in another group. I used the DESEQ2 normalizer. Could this disparity be due to the lack of counts and the use of SOLID and Illumina data?
My teacher told me that the samples were depleted, so why do I get those results in RockHopper?
Experiments don't work as expected every time. If your result is accurate then it is possible that the depletion did not work in some samples.
Part of learning data analysis involves being able to provide logical explanation for results you are observing (may not be possible 100% of time). Perhaps your teacher is expecting something similar from all of you.