I have RNA-seq data from 10 samples across two genotypes (5 samples per genotype: two control replicates and three treated replicates). I aligned the reads with both STAR and HISAT2. For one sample, STAR reports only 38.78% uniquely mapped reads and 46.86% of reads mapped to multiple loci, and HISAT2 reports that 48.79% of reads for this sample aligned concordantly >1 times. When I ran featureCounts, the successfully assigned alignment rate was 70-80% for all other samples, but only 26% for this one. The FastQC report for this sample showed many overrepresented sequences. I BLASTed those sequences and they turned out to be rRNA. I then used RiboDetector to remove rRNA reads from this sample, which removed 56% of its reads and left me with 9M reads, while all other samples have around 22M reads. In the PCA, the clustering pattern looks fine whether or not I use RiboDetector. I think the multi-mapping reads in this sample come from rRNA, as the overrepresented sequences in the FastQC report suggest. Should I continue with this sample as it is and remove the rRNA genes from my GTF/count file, since I am not interested in those genes? Or should I remove the reads coming from rRNA first and then proceed? If I remove about 50% of the data from one sample, will that cause any discrepancy in the downstream analysis?
I am working on rice data and my goal is to identify DEGs using DESeq2.
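One of the two options in the question, dropping rRNA genes from the count table before differential expression, can be sketched as follows. This is only a minimal illustration under stated assumptions: the table is tab-separated with a `gene_id` column followed by one column per sample, and the gene IDs, sample names, and counts are made-up placeholders, not real rice annotations or real featureCounts output.

```python
import csv
import io


def drop_rrna_genes(count_rows, rrna_ids):
    """Return the count rows whose gene ID is not in the rRNA ID set."""
    return [row for row in count_rows if row["gene_id"] not in rrna_ids]


def library_sizes(count_rows, samples):
    """Sum the counts per sample over the given rows."""
    return {s: sum(int(row[s]) for row in count_rows) for s in samples}


# Toy count table (hypothetical IDs and counts, for illustration only).
toy_table = (
    "gene_id\tctrl_1\tctrl_2\n"
    "GENE_A\t100\t120\n"
    "RRNA_1\t900\t30\n"
    "GENE_B\t50\t60\n"
)
samples = ["ctrl_1", "ctrl_2"]
rows = list(csv.DictReader(io.StringIO(toy_table), delimiter="\t"))

# Hypothetical set of rRNA gene IDs; in practice this would be taken
# from the annotation (for example, genes with an rRNA biotype).
rrna_ids = {"RRNA_1"}

filtered = drop_rrna_genes(rows, rrna_ids)
sizes_before = library_sizes(rows, samples)
sizes_after = library_sizes(filtered, samples)

print("before:", sizes_before)
print("after: ", sizes_after)
```

Comparing the per-sample totals before and after filtering shows how much of each library was assigned to rRNA genes, which is the quantity at issue for the affected sample.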
Which of those samples is it? If it is one of the treated replicates, you can simply discard it, since you would still have at least 2 replicates for that group (not ideal, but certainly better than having only one replicate left).
It is one of the control replicates.
Then you can keep it just to continue analyzing the data and get an idea of what interesting results you might find. But to publish the work later, you will need to redo the RNA-seq, and I would suggest including 3 replicates for the control group too, so that you avoid such problems. Published data would not be considered reliable if a sample group has only one replicate or if some replicates in the dataset do not meet the minimum recommended number of usable reads.
Does that mean the data from that one sample is of no use?
The clean reads are of course usable, but they certainly do not meet the recommended minimum number of reads for that sample.
Yes, but the sample cannot be included in the differential gene expression analysis.
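The read-depth concern raised in the replies can be made concrete with a small check. The sketch below uses the approximate figures from the question (about 9M reads left in the affected sample, around 22M in the others). The sample names are hypothetical, and `min_reads` is a placeholder threshold chosen only for illustration, not a recommended value; the thread does not state which minimum is meant.

```python
from statistics import median


def flag_low_depth(read_counts, min_reads):
    """Return the names of samples with fewer reads than min_reads, sorted."""
    return sorted(name for name, n in read_counts.items() if n < min_reads)


def relative_depth(read_counts):
    """Return each sample's read count divided by the median across samples."""
    mid = median(read_counts.values())
    return {name: n / mid for name, n in read_counts.items()}


# Approximate depths from the question: nine samples at about 22M reads,
# one sample at about 9M after rRNA removal. Names are hypothetical.
read_counts = {f"sample_{i}": 22_000_000 for i in range(1, 10)}
read_counts["sample_10"] = 9_000_000

# Placeholder threshold (an assumption, not a guideline).
low = flag_low_depth(read_counts, min_reads=10_000_000)
rel = relative_depth(read_counts)

print("below threshold:", low)
print("relative depth of sample_10:", round(rel["sample_10"], 2))
```

With these numbers the affected sample sits at roughly 40% of the median depth, which is the gap the replies are worried about.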