Entering edit mode
7.0 years ago
arvchi
▴
60
I have RNA-seq data for 24 samples that were prepared with smart-seq2 library prep. I have noticed in fastQC that I have many overrepresented reads that are identical, or closely resemble, parts of my oligo primer that I used in reverse transcription (oligo poly-T sequence).
Is there any program for aligning these reads with the oligo sequence, and trim them away if they align well with the oligo sequence? Have googled a bit and asked around but no one has heard of this, which surprises me.
Thankful for any advice.
If I understood your question correctly you could use Trimmomatic, to which you can add specific adapters to trim against.
It's not an adapter though, it's a long oligonucleotide. I have used TrimGalore to trim adapters, but will it work on this additional sequence too? This oligo sequence is longer than the reads, and there could be mismatches, so the reads are not identical to the oligo sequence. It needs some sort of aligning algorithm and trimming based on probability.
I suggest that you not get distracted by the
over-represented sequences
in FastQC. If you have scanned/trimmed the adapters (both Illumina and your kit specific ones) then go on to the actual alignment and analysis of the data. IF there is a downstream problem noted (e.g. with alignments etc) then backtrack and look at this some more.There has been downstream problems (relatively poor alignment, but more worrisome there are a lot of counts assigned as "no feature", which is odd because this is mRNA).
I am not sure if this is related to the oligo being sequenced. But anyway, the oligo is overrepresented and surely doesn't do anything good, so it seems better to trim it away. Moreover, seeing the percentage of trimmed away reads might provide insight too.
If things are not aligning then trimming that oligo may not make a significant difference. You should take a sample of the reads that do not align and then do blast with their sequence at NCBI to see if you have a problem of contamination or some other basic issue. If you have more counts assigned to
no feature
(have you examined alignments in IGV?) and if the aligned reads mapping there are uniformly scattered across the genome then you could potentially have DNA contamination in your original RNA. Also check to see if you have rRNA contamination.Like I said below. You can easily trim thereads by providing sequence of the oligo with the command below and then everything to the right of that sequence can be trimmed by adding
ktrim=r
option forbbduk.sh
. An example belowAdjust value of
k=
depending on length of the sequence you are providing for the oligo.Thanks. Alignment is around 80 %, with the rest being mainly multi mapping (STAR report). However, when counting features, only a few percent is assigned to a feature. 40-70 % is unassigned due to multi-mapping, and the rest is "no feature".
I am not too concerned about the oligo any more, but still have some questions. 1) How come the multi-mapping proportion is higher when counting features with featurecounts compared to the STAR alignment report? 2) Is there any standard way of checking for rRNA and/or genomic DNA? For genomic DNA, I have checked alignments in IGV to look for a uniform distribution, but I don't feel this visual approach is conclusive to me and IGV constantly crashed on my Mac.
Thanks for all the help
See this recent thread with my comments for answers to your rRNA questions: high % of reads mapped to multiple loci after STAR mapping If you don't work with human data then get the rDNA repeat for your organism.
the op in that thread has only 20 % uniquely mapped reads in STAR, I have around 80 %. Still featurecounts reports 40-70% multimapping. How come there can be such a big discrepancy between what STAR and fc reports?
Are you counting the STAR generated BAM file with featureCounts? What options are you using with that?
Yes, I count on the STAR generated BAM files. No additional options really, featureCounts -T 16 -a $gtf -g gene_id -o [path to output] [path-to-BAM].
For instance, in one of my samples, multimapping is 13.5 % in STAR, but 37.4 % in FeatureCounts.
Unless you add
-M
option multi-mapping reads are not counted byfeatureCounts
. Are you referring to multi-overlapping reads perhaps? You would also want to add-p
if you are using a PE dataset so fragments are counted instead of reads.Having said that you should not be counting multi-mapping alignments.
They are not counted, you are right, but featurecounts reports percentages of reads that were not assigned due to multi-mapping, and this percentage is not consistent with what is reported by STAR.
For instance, in one sample FeatureCounts reports that 37 % of the reads were not assigned to any feature because of multimapping, but STAR reports that 13.5 % of the reads were mapped to multiple loci.
bbduk.sh
from BBMap suite using theliteral=sequence1,sequence2
etc.