Dear Community, I need some help with an analysis I need to do for the reviewers before my manuscript can be published. My Bioinformatics level is beginner. See the case, pipeline and the questions below.
Case:
I did an RNA-seq differential gene expression study using two yeast strains (strain A and B) and three media conditions (M1, M2, M3) in Triplicates (I-III) resulting in the following samples:
StrainA_M1_I
StrainA_M1_II
StrainA_M1_III
StrainA_M2_I
StrainA_M2_II
StrainA_M2_III
StrainA_M3_I
StrainA_M3_II
StrainA_M3_III
StrainB_M1_I
StrainB_M1_II
StrainB_M1_III
StrainB_M2_I
StrainB_M2_II
StrainB_M2_III
StrainB_M3_I
StrainB_M3_II
StrainB_M3_III
My Pipeline was:
Raw Reads -> [FastQC] -> [Trimmomatic] -> trimmed reads -> [SubRead_align] -> SAM/BAM -> [SubRead_featurecount] -> count_matrix.
I needed to build an index for the reference genome prior read mapping.
Questions:
My manuscript was accepted but the reviewers want some additional analysis:
1.) Quantification of the number of SNPs identified by RNA seq: Similarity of the strains at the nucleotide level?
2.) Unmapped reads: Are there any transcripts that don't map to the reference?
3.) Which of the unmapped reads are involved in the metabolic pathways I am investigating?
Concerning 1.):
How would a pipeline look like quantifying the number of SNPs identified by RNA seq?
I heard exactSNP from the Subread package could do the job.
Does it make sense to use all samples for this kind of analysis or only a subset?
Concerning 2.):
How can I identify unmapped reads?
I guess the results are different for each sample.
Concerning 3.):
What is a good way to check, if unmapped reads belong to a subset of genes of interest?
For the analyses, I have the reference genome + the above mentioned RNA-Seq data of the samples.
Thanks a lot
Surprised reviewers are asking you to do that. SNP calling using RNAseq data is not a norm. If
subread
has SNP calling functionality then try it first. If strains are closely related there would be a small number of SNP's, IF they are present.GATK has a pipeline available for RNAseq SNP calling.
No familiar with
subread
but it may have an option to collect upmapped reads in a separate file. Otherwise you can filter your alignment file to collect unmapped reads usingsamtools
, providedsubread
includes them in the alignment file.You will need to try and assemble unmapped reads to first identify new transcripts. Then annotate them and finally try to map on to pathways.
If they did originate from genes in reference, then the reads should not remain unmapped (as long as the aligner did its job well)?
Having said all this, with yeast the number unmapped reads in your samples should be relatively small (< 10%) so I am not sure if you are going to get a lot out of this whole lot of work. Have you tried to push back against reviewers suggestion?
Did
subread
provide you with stats about the % of reads that remained unaligned?@genomax: Thanks for your answer. So far we have not pushed back against the reviewers. They do not expect an in-depth analysis but want at least some statements about similarity of the strains on nucleotide level - based on the RNA-seq reads. We used subreads exactSNP on the BAM files and the samples have around 2000 SNPs compared to the reference genome.
Related to question 2.) Unfortunately, subread seems to have no function to collect unmapped reads, it can just show the number of unmapped reads. I will try samtools or STAR aligner.
Please use
ADD REPLY/ADD COMMENT
to keep the threads logically organized.SUBMIT ANSWER
is only for new answers to original question.What % of reads remained unaligned in your samples?
2000 SNP's seems reasonable. I assume that is the total across all strains?