Hello,
I have 17 samples collected through patch-seq, processed using SMART-Seq HT Plus Kit, and sequenced on the HiSeq X platform. I used STAR v2.7.7a to align my reads to the mouse genome mm10 (my reference genome). I'm relatively new to this field and would like to understand the difference between unmapped reads (flagged by STAR) and duplicate reads (reported by FastQC). We have a high value of duplicate reads which I read is "normal" for single-cell RNA seq (assuming that it applies to patch-seq too) because of overamplification which is inevitable for a small amount of starting material. In addition to this, we have some samples with high unmapped reads too. Are the unmapped reads the same as duplicate reads? Or do they include duplicate reads? I have attached a link for the graph of %unmapped reads and %duplicate reads for all my samples.
Thanks in advance.
Some undoubtedly are but they are not all likely to be duplicates.
SMRT-seq is the full length RNAseq kit correct? Have you taken some of the unmapped reads and blasted them at NCBI to see what they are. Specifically to rule out problem with contamination of some sort.
I am very new to this. I tried blasting only the unmapped reads. I keep getting Server error on NCBI. So I tried the command line version of BLAST which generated a ~274GB text file. I don't know how to visualize this file or even open this file as my Mac keeps crashing whenever I try. Any ideas?
When doing diagnostic testing you should take a few reads (< 10) and then use the web blast interface to do the search. Convert the fastq sequence to fasta format when you paste them into blast web search.
I tried for the top 10 sequences of the unmapped reads for my first Sample. The percent identity hit for Mus Musculus is >99.13% e-value <3e-50. The rest are with rats and other members of the rodent family plus primates. This is very confusing for me now, if the e-value is so low why is STAR assigning it as unmapped?
Are you sure you are selecting unmapped reads? Did you use
--outReadsUnmapped Fastx
option to get reads that are unmapped?Another possibility. If these reads are multi-mapping (I think STAR allows max of 10 locations by default)
STAR
may be marking them as unmapped.Yes, I did use
--outReadsUnmapped Fastx
to get the unmapped reads. The amount of multi-mapped reads is relatively low (4.62 ± 1.46 %).This is puzzling. You have very high % of unmapped reads but the reads when spot checked at NCBI are showing up as aligning to correct genome. Did you run
STAR
with more or less default options? If you did not then could you do that and see if the % unmapped number goes down?I mostly used the default options: For indexing, I did set the
--sjdbOverhang 149
as I have 2*150bp reads. And set the genomeSAsparseD to 2 to save computing power. The code line used:STAR --runThreadN 4 --runMode genomeGenerate --genomeDir {path to genome folder} --genomeFastaFiles Mus_musculus.GRCm38.dna.primary_assembly.fa --sjdbGTFfile Mus_musculus.GRCm38.102.gtf --sjdbOverhang 149 genomeSAsparseD 2
For alignment, I used default settings:
STAR --runThreadN 4 --runMode alignReads --readFilesCommand gunzip -c --genomeDir {path_to_Dir} --outFileNamePrefix Sample1_R1 --readFilesIn {read_1}, {read_2} --outSAMtype BAM SortedByCoordinate --outReadsUnmapped Fastx
Does the above code look alright or what parameters should I play around with to improve my results?
I am not a regular
STAR
user but those are more or less default options for alignments. I think thatis causing these reads to end up in "unmapped files". Even if you are not explicitly changing this parameter it is getting applied when a read maps to more than 10 locations.
Does your data have rRNA contamination? Is it from FFPE type samples where you have degraded short RNA's?
No, we don't have FFPE type samples. We extracted samples from fresh tissue. I'm mapping the samples to rRNA FASTA from NCBI to check for rRNA contamination. I just did Qualimap on the sample and see a relatively high percentage of introns in some of the samples which also points to rRNA contamination.
Let us know how that goes. rRNA contamination seems like the best bet for now.
I'm still running mapping for rRNA fasta. But I noticed something in one of the unmapped reads that there is a sequence, primarily comprising of A's (with N) occurring multiple times. I have attached the image with these sequences in the red box. I think it is crap and I don't get any hits on BLAST. Any idea what does this mean and why this is appearing:
**>MG01HX01:1039:HCTMGCCX2:8:1101:1255:54717/1 AAANAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACAAAACAAACAAAAAAAATTAAAAGAAA
**>MG01HX01:1039:HCTMGCCX2:8:1101:1255:71735/1 AAANAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
Those reads are not useful. Perhaps they are a side effect of SMRTseq. So your
STAR
result may be ok after all.STAR
was able to align whatever reads it could and the rest are .. you know.That's a bummer. Thanks for the reply. Is the data still usable though?
Some samples have very high (70% unmapped) reads so not sure how that is going to affect your downstream analysis. You may have to try that out and see what happens. Your power to detect changes may be limited by amount of data that aligned.
What genome is the data from? How many million reads have aligned for the sample where 70% reads are unmapped.
You could use
bbduk.sh
and scan/remove these polyA/polyT containing reads. My hunch is that remaining reads would likely equate to whatSTAR
was able to align.The genome is the mouse genome (mm10). The sample with ~70% unmapped reads has ~21.5M uniquely aligned reads (number of raw reads >70M), so definitely, it is not the best sample.
I will try to scan and remove the polyA/polyT containing reads as you suggested. Hopefully, it helps in making a call on which sample is usable and which is not.
If you have ~22 M mapped reads in that worst sample then all data should be OK. That is good news. I would say instead of spending time on the unmapped reads go forward with your analysis.
You just made my day. Thanks a ton.