Hello everybody,
I have troubles to obtain reasonable mapping rates for my data. I am familiar with troubles around so of course I did my homework before asking. I have zebrafish samples prepared by SureSelect mRNA kit (Agilent) and sequenced on Illumina 550 highoutput 75 PE and I proceed with standard workflow (quality trimming, adapter removal, rRNA removal using bbduk and silva database). Fastqc reports looked very good. For mapping, I used firstly salmon (1.4.0) with Ensembl latest zebrafish cDNA version and obtain mapping rate around 40%. To explore more, I used genome and STAR with featurecounts where I get 82% of uniquely mapped reads and 35% assigned aligments. These can indicate rRNA, gDNA contamination, etc. These results are the same for my whole dataset (16 samples) First, I run distribution.py from rseqc package. to obtain:
Total Reads 49991974
Total Tags 56921335
Total Assigned Tags 47241859
=====================================================================
Group Total_bases Tag_count Tags/Kb
CDS_Exons 42081061 21781002 517.60
5'UTR_Exons 5556581 1383159 248.92
3'UTR_Exons 22439733 4272270 190.39
Introns 684575852 12781742 18.67
TSS_up_1kb 20797101 1336322 64.26
TSS_up_5kb 79640488 2862248 35.94
TSS_up_10kb 128987557 3879716 30.08
TES_down_1kb 20864195 737402 35.34
TES_down_5kb 79723162 2224013 27.90
TES_down_10kb 127725349 3143970 24.62
=====================================================================
So I was thinking I was kinda unlucky having high number of non-splice transcripts together with some DNA contamination. However, checking output from SeqMonk I got different results.
Percent in Gene
78.03321932747787
Percent in exons
66.4255989593314
Genes Measured
95.04024621212122
Percentage of max data size
100.0
Percent on sense strand
3.9634962173630424
So here comes my first questions: How is possible such a high percentage in exons? I would expect something around 40%.
To continue, I asked my more experienced colleague to run one of my samples in his zebrafish pipeline. He used trimmomatic and sortmerna but basically got similar number of reads as me. However, when mapping using salmon version 1.2.1 , he got mapping rate 72%. I examined his salmon code, and I find out that his version with parameter -lib A
guess library as IU
while my version guess is ISR
(which is correct). First I tried to run my files as -lib IU
but there was again cca 40%. I tried run his fastqc proccesed files with my salmon version as -lib IU
and -lib ISR
but again no change. Only thing which I did not try was using older version of salmon.
But I am heavily confused, how is that even possible? Strandness vs nonstrandess should not be such a big deal if my data are heavily stranded, and they are. Moreover, if I have some kind of contamination it should not be mapped at all no matter the version. Should I used older version of salmon for my data?
I would like to hear somebody opinion about that.
I can provide all codes and outputs if necessary but I have no idea what can be helpful in this case.
Thank you
Thomas
EDIT: I will show salmon codes and outputs Salmon version 1.4.0 Code:
{
"salmon_version": "1.4.0",
"index": "zebra_index",
"libType": "A",
"mates1": "A1_rRNAsequclean1.fastq",
"mates2": "A1_rRNAsequclean2.fastq",
"threads": "20",
"output": "A1_test",
"gcBias": [],
"auxDir": "aux_info"
}
Output:
[2021-02-16 18:43:03.874] [jointLog] [info] setting maxHashResizeThreads to 20
[2021-02-16 18:43:03.874] [jointLog] [info] Fragment incompatibility prior below threshold. Incompatible fragments will be ignored.
[2021-02-16 18:43:03.874] [jointLog] [info] Usage of --validateMappings implies use of minScoreFraction. Since not explicitly specified, it is being set to 0.65
[2021-02-16 18:43:03.875] [jointLog] [info] Setting consensusSlack to selective-alignment default of 0.35.
[2021-02-16 18:43:03.875] [jointLog] [info] parsing read library format
[2021-02-16 18:43:03.875] [jointLog] [info] There is 1 library.
[2021-02-16 18:43:03.936] [jointLog] [info] Loading pufferfish index
[2021-02-16 18:43:03.936] [jointLog] [info] Loading dense pufferfish index.
[2021-02-16 18:43:04.376] [jointLog] [info] done
[2021-02-16 18:43:04.376] [jointLog] [info] Index contained 57,192 targets
[2021-02-16 18:43:05.726] [jointLog] [info] Number of decoys : 0
[2021-02-16 18:43:06.173] [jointLog] [info] Automatically detected most likely library type as ISR
[2021-02-16 18:44:10.562] [jointLog] [info] Thread saw mini-batch with a maximum of 4.74% zero probability fragments
[2021-02-16 18:44:10.615] [jointLog] [info] Thread saw mini-batch with a maximum of 4.62% zero probability fragments
[2021-02-16 18:44:10.617] [jointLog] [info] Thread saw mini-batch with a maximum of 4.64% zero probability fragments
[2021-02-16 18:44:10.633] [jointLog] [info] Thread saw mini-batch with a maximum of 4.78% zero probability fragments
[2021-02-16 18:44:10.639] [jointLog] [info] Thread saw mini-batch with a maximum of 4.64% zero probability fragments
[2021-02-16 18:44:10.641] [jointLog] [info] Thread saw mini-batch with a maximum of 4.78% zero probability fragments
[2021-02-16 18:44:10.653] [jointLog] [info] Thread saw mini-batch with a maximum of 4.60% zero probability fragments
[2021-02-16 18:44:10.655] [jointLog] [info] Thread saw mini-batch with a maximum of 4.72% zero probability fragments
[2021-02-16 18:44:10.659] [jointLog] [info] Thread saw mini-batch with a maximum of 4.48% zero probability fragments
[2021-02-16 18:44:10.664] [jointLog] [info] Thread saw mini-batch with a maximum of 4.60% zero probability fragments
[2021-02-16 18:44:10.675] [jointLog] [info] Thread saw mini-batch with a maximum of 4.80% zero probability fragments
[2021-02-16 18:44:10.687] [jointLog] [info] Thread saw mini-batch with a maximum of 4.94% zero probability fragments
[2021-02-16 18:44:10.692] [jointLog] [info] Thread saw mini-batch with a maximum of 4.56% zero probability fragments
[2021-02-16 18:44:10.697] [jointLog] [info] Thread saw mini-batch with a maximum of 4.64% zero probability fragments
[2021-02-16 18:44:10.704] [jointLog] [info] Thread saw mini-batch with a maximum of 4.84% zero probability fragments
[2021-02-16 18:44:10.712] [jointLog] [info] Thread saw mini-batch with a maximum of 4.76% zero probability fragments
[2021-02-16 18:44:10.717] [jointLog] [info] Thread saw mini-batch with a maximum of 4.90% zero probability fragments
[2021-02-16 18:44:10.732] [jointLog] [info] Thread saw mini-batch with a maximum of 4.56% zero probability fragments
[2021-02-16 18:44:10.747] [jointLog] [info] Thread saw mini-batch with a maximum of 4.54% zero probability fragments
[2021-02-16 18:44:10.774] [jointLog] [info] Thread saw mini-batch with a maximum of 4.68% zero probability fragments
[2021-02-16 18:44:11.090] [jointLog] [info] Computed 264,671 rich equivalence classes for further processing
[2021-02-16 18:44:11.090] [jointLog] [info] Counted 13,824,703 total reads in the equivalence classes
[2021-02-16 18:44:11.105] [jointLog] [info] Number of mappings discarded because of alignment score : 28,212,971
[2021-02-16 18:44:11.105] [jointLog] [info] Number of fragments entirely discarded because of alignment score : 1,485,310
[2021-02-16 18:44:11.105] [jointLog] [info] Number of fragments discarded because they are best-mapped to decoys : 0
[2021-02-16 18:44:11.105] [jointLog] [info] Number of fragments discarded because they have only dovetail (discordant) mappings to valid targets : 206,125
[2021-02-16 18:44:11.105] [jointLog] [info] Mapping rate = 43.3618%
[2021-02-16 18:44:11.105] [jointLog] [info] finished quantifyLibrary()
[2021-02-16 18:44:11.106] [jointLog] [info] Starting optimizer
[2021-02-16 18:44:11.184] [jointLog] [info] Marked 0 weighted equivalence classes as degenerate
[2021-02-16 18:44:11.202] [jointLog] [info] iteration = 0 | max rel diff. = 16011
[2021-02-16 18:44:11.389] [jointLog] [info] iteration 11, adjusting effective lengths to account for biases
[2021-02-16 18:44:11.092] [fileLog] [info]
At end of round 0
==================
Observed 31882241 total fragments (31882241 in most recent round)
Salmon version 1.2.1 Code:
{
"salmon_version": "1.2.1",
"index": "./Indices/ZF11_transcripts_index",
"libType": "A",
"mates1": "./03_SortMeRNA_v2.1/A1_S1_join_R1_paired_trimmomatic_rRNAFiltered.fastq.gz",
"mates2": "./03_SortMeRNA_v2.1/A1_S1_join_R2_paired_trimmomatic_rRNAFiltered.fastq.gz",
"gcBias": [],
"threads": "7",
"output": "./04_Salmon_quant/A1_S1_join_quant.gz",
"auxDir": "aux_info"
}
Output:
[2021-02-15 16:00:45.373] [jointLog] [info] setting maxHashResizeThreads to 7
[2021-02-15 16:00:45.373] [jointLog] [info] Fragment incompatibility prior below threshold. Incompatible fragments will be ignored.
[2021-02-15 16:00:45.373] [jointLog] [info] Usage of --validateMappings implies use of minScoreFraction. Since not explicitly specified, it is being set to 0.65
[2021-02-15 16:00:45.373] [jointLog] [info] Usage of --validateMappings implies a default consensus slack of 0.2. Setting consensusSlack to 0.35.
[2021-02-15 16:00:45.373] [jointLog] [info] parsing read library format
[2021-02-15 16:00:45.373] [jointLog] [info] There is 1 library.
[2021-02-15 16:00:45.854] [jointLog] [info] Loading pufferfish index
[2021-02-15 16:00:45.858] [jointLog] [info] Loading dense pufferfish index.
[2021-02-15 16:00:47.331] [jointLog] [info] done
[2021-02-15 16:00:47.331] [jointLog] [info] Index contained 57,192 targets
[2021-02-15 16:00:48.603] [jointLog] [info] Number of decoys : 0
[2021-02-15 16:05:39.728] [jointLog] [info] Computed 4,256,242 rich equivalence classes for further processing
[2021-02-15 16:05:39.728] [jointLog] [info] Counted 19,896,162 total reads in the equivalence classes
[2021-02-15 16:05:39.746] [jointLog] [info] Number of mappings discarded because of alignment score : 46,176,663
[2021-02-15 16:05:39.746] [jointLog] [info] Number of fragments entirely discarded because of alignment score : 1,395,266
[2021-02-15 16:05:39.746] [jointLog] [info] Number of fragments discarded because they are best-mapped to decoys : 0
[2021-02-15 16:05:39.746] [jointLog] [info] Number of fragments discarded because they have only dovetail (discordant) mappings to valid targets : 2,493
[2021-02-15 16:05:39.746] [jointLog] [info] Mapping rate = 72.3414%
[2021-02-15 16:05:39.746] [jointLog] [info] finished quantifyLibrary()
[2021-02-15 16:05:39.760] [jointLog] [info] Starting optimizer
[2021-02-15 16:05:39.729] [fileLog] [info]
At end of round 0
==================
Observed 27503155 total fragments (27503155 in most recent round)
[2021-02-15 16:14:20.747] [jointLog] [warning] NOTE: Read Lib [[ ./03_SortMeRNA_v2.1/A1_S1_join_R1_paired_trimmomatic_rRNAFiltered.fastq.gz, ./03_SortMeRNA_v2.1/A1_S1_join_R2_paired_trimmomatic_rRNAFiltered.fastq.gz]] :
Detected a *potential* strand bias > 1% in an unstranded protocol check the file: ./04_Salmon_quant/A1_S1_join_quant.gz/lib_format_counts.json for details
Note that I deleted info about bias reduction and iteration because I could not fit all text in quesiton (it seems there was no information).
Although, I think this problem is beyond salmon, because also SeqMonk shows higher number of exons.
You should post all of your salmon code, otherwise it might be difficult to find the problem. I also applaud you for the effort you took troubleshooting before posting the question.
Thank you for your response, I will edit my question as soon as possible.
For the part about having high exon coverage, your kit is selecting for mature polyadenylated transcripts, so you would expect most reads to be over exons.
Are you running this code through some sort of pipeline, or from the command line? It's not clear how it's being invoked.
I am sorry I forget to mention, that we used our library kit with one exception (instead of poly-A, we used rRNA depletion). But anyway, if there would be higher or small number of rRNA, it still should not make such a big difference in mapping. Moreover, I am glad that I have high number of exons of course, but my question was more about. why there is such high number of exons and yet they are not mapped by salmon or collected by feature counts.
I use command line, my colleague with higher mapping rate has some kind of pipeline includes fastqc, trimmomatic, sortmerna and salmon.
One thing I don't quite understand; are the read files here identical? They have different names (of course that's not important), but you mention your colleague has a "pipeline". It's quite possible that one of the trimming or quality control steps your colleague is doing is leading to the improved mapping rate. I would suggest you try with his exact read sequences if you have not already. Also, if you are able to share read files, we might be able to help take a look over on the salmon GitHub page.
Hello Rob, sorry for late reply I did some testing because with every other analysis, I am more confused. The salmon version is not a problem so sorry for misleading information and question. These files are not identical because they were run via different pipelines. I find out at which step make salmon output behave differently. For some reason, removing rRNA and tRNA via bbduk.sh makes salmon mapped circa 43% (from total 28340001) while using sortmerna (version 2.1) makes salmon produce circa 70 % of mapped reads (from total 27503155). Not using sortmerna makes salmon behave again circa 43 %. Total amout of rRNA and tRNA in samples is cca 2%-3% so it is not big bias. Therefore, I conculed that part of sortmerna process when one needs to use merge and unmerge script on paired reads makes some heavy changes in fastq composition; however, I cannot explain this and I have never ever seen this problem before. Thank you for any advice or comment.