I have a dataset of human samples which I have processed both with salmon (quantification) and with STAR (standard genome mapping).

Looking at the results, the salmon mapping rate is a lot lower than STAR's. On average I get ~90% mapping with STAR, but only ~60% with salmon. The most interesting thing is the last sample, where only ~3% of the reads are assigned (see table below).
sample  reads      STAR_uniq_mapped  STAR_unmapped  STAR_%total_mapped  salmon_%mapped
1       104736525  90449978          5532361.28     94.72               66.36%
2       85549527   74153554          4745912.55     94.45               71.43%
3       82846983   68374449          5751113.94     93.06               70.27%
4       93190747   79607891          4592829.92     95.07               70.50%
5       187095419  127045161         18491637.89    90.11               54.72%
6       128349390  111761579         7158095.58     94.43               62.78%
7       179986447  157822483         8252445.58     95.42               67.83%
8       168862755  155255503         5564534.29     96.70               72.82%
9       68120928   52917367          9064366.31     86.69               64.79%
10      91055123   81982113          3763139.13     95.87               68.26%
11      102762099  87200096          7685804.48     92.52               73.37%
12      54279758   47416487          4163073.67     92.33               70.03%
13      77098401   56630953          13580101.62    82.38               71.51%
14      43196895   35167151          4397392.18     89.82               62.50%
15      47153275   40438052          2346857.98     95.02               63.87%
16      38437377   1867650           8585679.34     77.66               2.75%
I have attached the log file from the salmon run for this sample at the bottom. Is there a way to understand this vast difference between the two runs? The salmon run was done against an indexed transcriptome (GENCODE v32), while the STAR mapping was done against the Ensembl human genome build GRCh38. Could this be the cause of such a difference?
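For reference, the salmon calls were presumably along these lines (a minimal sketch only; the exact commands weren't included in the post, so paths and sample names are placeholders, and --validateMappings is inferred from the log below):

# build the index from the GENCODE v32 transcriptome
# (--gencode strips the long GENCODE FASTA headers)
$ salmon index -t gencode.v32.transcripts.fa.gz -i gencode_v32_index --gencode
# quantify one paired-end sample; -l A auto-detects the library type
# (reported as IU in the log below)
$ salmon quant -i gencode_v32_index -l A \
    -1 sample_R1.fastq.gz -2 sample_R2.fastq.gz \
    --validateMappings -o quants/sample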
thanks,
Assa
$ less ./16/logs/salmon_quant.log
[2019-12-10 13:22:38.684] [jointLog] [info] Fragment incompatibility prior below threshold. Incompatible fragments will be ignored.
[2019-12-10 13:22:38.702] [jointLog] [info] Usage of --validateMappings implies use of minScoreFraction. Since not explicitly specified, it is being set to 0.65
[2019-12-10 13:22:38.702] [jointLog] [info] Usage of --validateMappings implies a default consensus slack of 0.2. Setting consensusSlack to 0.35.
[2019-12-10 13:22:38.702] [jointLog] [info] parsing read library format
[2019-12-10 13:22:38.702] [jointLog] [info] There is 1 library.
[2019-12-10 13:22:38.848] [jointLog] [info] Loading pufferfish index
[2019-12-10 13:22:38.848] [jointLog] [info] Loading dense pufferfish index.
[2019-12-10 13:22:42.666] [jointLog] [info] done
[2019-12-10 13:22:42.670] [jointLog] [info] Index contained 226,608 targets
[2019-12-10 13:22:43.647] [jointLog] [warning] len : 25, but txp.RefLenght : 25
[2019-12-10 13:22:43.714] [jointLog] [warning] len : 23, but txp.RefLenght : 23
[2019-12-10 13:22:44.470] [jointLog] [warning] len : 12, but txp.RefLenght : 12
[2019-12-10 13:22:45.457] [jointLog] [warning] len : 28, but txp.RefLenght : 28
[2019-12-10 13:22:45.575] [jointLog] [warning] len : 8, but txp.RefLenght : 8
[2019-12-10 13:22:45.575] [jointLog] [warning] len : 9, but txp.RefLenght : 9
[2019-12-10 13:22:45.575] [jointLog] [warning] len : 13, but txp.RefLenght : 13
[2019-12-10 13:22:45.710] [jointLog] [warning] len : 11, but txp.RefLenght : 11
[2019-12-10 13:22:45.710] [jointLog] [warning] len : 20, but txp.RefLenght : 20
[2019-12-10 13:22:45.710] [jointLog] [warning] len : 18, but txp.RefLenght : 18
[2019-12-10 13:22:45.710] [jointLog] [warning] len : 20, but txp.RefLenght : 20
[2019-12-10 13:22:45.710] [jointLog] [warning] len : 19, but txp.RefLenght : 19
[2019-12-10 13:22:45.710] [jointLog] [warning] len : 31, but txp.RefLenght : 31
[2019-12-10 13:22:45.710] [jointLog] [warning] len : 28, but txp.RefLenght : 28
[2019-12-10 13:22:45.710] [jointLog] [warning] len : 17, but txp.RefLenght : 17
[2019-12-10 13:22:45.710] [jointLog] [warning] len : 21, but txp.RefLenght : 21
[2019-12-10 13:22:45.710] [jointLog] [warning] len : 20, but txp.RefLenght : 20
[2019-12-10 13:22:45.710] [jointLog] [warning] len : 16, but txp.RefLenght : 16
[2019-12-10 13:22:45.710] [jointLog] [warning] len : 31, but txp.RefLenght : 31
[2019-12-10 13:22:45.710] [jointLog] [warning] len : 17, but txp.RefLenght : 17
[2019-12-10 13:22:45.710] [jointLog] [warning] len : 21, but txp.RefLenght : 21
[2019-12-10 13:22:45.710] [jointLog] [warning] len : 23, but txp.RefLenght : 23
[2019-12-10 13:22:45.710] [jointLog] [warning] len : 16, but txp.RefLenght : 16
[2019-12-10 13:22:45.710] [jointLog] [warning] len : 31, but txp.RefLenght : 31
[2019-12-10 13:22:45.710] [jointLog] [warning] len : 31, but txp.RefLenght : 31
[2019-12-10 13:22:45.710] [jointLog] [warning] len : 31, but txp.RefLenght : 31
[2019-12-10 13:22:45.710] [jointLog] [warning] len : 17, but txp.RefLenght : 17
[2019-12-10 13:22:45.710] [jointLog] [warning] len : 18, but txp.RefLenght : 18
[2019-12-10 13:22:45.710] [jointLog] [warning] len : 31, but txp.RefLenght : 31
[2019-12-10 13:22:45.711] [jointLog] [warning] len : 31, but txp.RefLenght : 31
[2019-12-10 13:22:45.711] [jointLog] [warning] len : 17, but txp.RefLenght : 17
[2019-12-10 13:22:45.711] [jointLog] [warning] len : 28, but txp.RefLenght : 28
[2019-12-10 13:22:45.711] [jointLog] [warning] len : 23, but txp.RefLenght : 23
[2019-12-10 13:22:45.711] [jointLog] [warning] len : 19, but txp.RefLenght : 19
[2019-12-10 13:22:45.711] [jointLog] [warning] len : 31, but txp.RefLenght : 31
[2019-12-10 13:22:45.711] [jointLog] [warning] len : 31, but txp.RefLenght : 31
[2019-12-10 13:22:45.711] [jointLog] [warning] len : 17, but txp.RefLenght : 17
[2019-12-10 13:22:46.147] [jointLog] [warning] len : 28, but txp.RefLenght : 28
[2019-12-10 13:22:46.494] [jointLog] [warning] len : 28, but txp.RefLenght : 28
[2019-12-10 13:22:47.043] [jointLog] [info] Number of decoys : 0
[2019-12-10 13:22:47.043] [jointLog] [info] First decoy index : 18,446,744,073,709,551,576
[2019-12-10 13:22:50.936] [jointLog] [info] Automatically detected most likely library type as IU
[2019-12-10 13:29:10.048] [jointLog] [info] Computed 427,077 rich equivalence classes for further processing
[2019-12-10 13:29:10.048] [jointLog] [info] Counted 1,056,715 total reads in the equivalence classes
[2019-12-10 13:29:10.054] [jointLog] [info] Number of mappings discarded because of alignment score : 119,095,129
[2019-12-10 13:29:10.054] [jointLog] [info] Number of fragments entirely discarded because of alignment score : 611,432
[2019-12-10 13:29:10.054] [jointLog] [info] Number of fragments discarded because they are best-mapped to decoys : 0
[2019-12-10 13:29:10.054] [jointLog] [info] Number of fragments discarded because they have only dovetail (discordant) mappings to valid targets : 540,160
[2019-12-10 13:29:10.094] [jointLog] [warning] Only 1056715 fragments were mapped, but the number of burn-in fragments was set to 5000000.
The effective lengths have been computed using the observed mappings.
[2019-12-10 13:29:10.094] [jointLog] [info] Mapping rate = 2.74919%
[2019-12-10 13:29:10.094] [jointLog] [info] finished quantifyLibrary()
[2019-12-10 13:29:10.095] [jointLog] [info] Starting optimizer
[2019-12-10 13:29:10.048] [fileLog] [info]
At end of round 0
==================
Observed 38437377 total fragments (38437377 in most recent round)
[2019-12-10 13:29:10.560] [jointLog] [info] Marked 0 weighted equivalence classes as degenerate
[2019-12-10 13:29:10.727] [jointLog] [info] iteration = 0 | max rel diff. = 454.515
[2019-12-10 13:29:11.683] [jointLog] [info] iteration 11, adjusting effective lengths to account for biases
[2019-12-10 13:29:30.368] [jointLog] [info] Computed expected counts (for bias correction)
[2019-12-10 13:29:30.368] [jointLog] [info] processed bias for 0.0% of the transcripts
[2019-12-10 13:29:32.301] [jointLog] [info] processed bias for 10.0% of the transcripts
[2019-12-10 13:29:34.196] [jointLog] [info] processed bias for 20.0% of the transcripts
[2019-12-10 13:29:36.202] [jointLog] [info] processed bias for 30.0% of the transcripts
[2019-12-10 13:29:38.308] [jointLog] [info] processed bias for 40.0% of the transcripts
[2019-12-10 13:29:40.396] [jointLog] [info] processed bias for 50.0% of the transcripts
[2019-12-10 13:29:42.148] [jointLog] [info] processed bias for 60.0% of the transcripts
[2019-12-10 13:29:43.930] [jointLog] [info] processed bias for 70.0% of the transcripts
[2019-12-10 13:29:46.253] [jointLog] [info] processed bias for 80.0% of the transcripts
[2019-12-10 13:29:48.261] [jointLog] [info] processed bias for 90.0% of the transcripts
[2019-12-10 13:29:49.987] [jointLog] [info] processed bias for 100.0% of the transcripts
[2019-12-10 13:29:49.999] [jointLog] [info] processed bias for 100.0% of the transcripts
[2019-12-10 13:29:57.503] [jointLog] [info] iteration = 100 | max rel diff. = 2.82733
[2019-12-10 13:30:05.612] [jointLog] [info] iteration = 200 | max rel diff. = 5.36446
[2019-12-10 13:30:15.810] [jointLog] [info] iteration = 300 | max rel diff. = 0.0958765
[2019-12-10 13:30:26.099] [jointLog] [info] iteration = 400 | max rel diff. = 0.782063
[2019-12-10 13:30:28.606] [jointLog] [info] iteration = 424 | max rel diff. = 0.00927753
[2019-12-10 13:30:28.648] [jointLog] [info] Finished optimizer
[2019-12-10 13:30:28.648] [jointLog] [info] writing output
[2019-12-10 13:30:29.091] [jointLog] [info] Starting Gibbs Sampler
[2019-12-10 13:30:48.888] [jointLog] [info] Finished Gibbs Sampler
[2019-12-10 13:30:48.888] [jointLog] [warning] NOTE: Read Lib [[ rawData/HUDEP2_D3_WT_Band3_Low_R2_1677407_PXU011_WT2_D3_B3_L_R1.fastq.gz, rawData/HUDEP2_D3_WT_Band3_Low_R2_1677407_PXU011_WT2_D3_B3_L_R2.fastq.gz]] :
Detected a *potential* strand bias > 1% in an unstranded protocol check the file: quants/D3_WT_Band3_Low_R2_1677407_PXU011_WT2_D3_B3_L/lib_format_counts.json for details
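As that warning suggests, the strand-bias details can be inspected directly in the JSON file it points to, e.g. (path taken from the log message above; python -m json.tool just pretty-prints it):

# pretty-print salmon's library-format diagnostics for this sample
$ python -m json.tool quants/D3_WT_Band3_Low_R2_1677407_PXU011_WT2_D3_B3_L/lib_format_counts.json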
This is not a fair comparison, since the mapping rate in STAR refers to mapping against the whole genome, while the salmon mapping rate refers to reads assigned to the transcriptome. Check the distribution of reads actually mapping to exons when using STAR, e.g. with check_distribution.py from RSeQC.

This is true, and I didn't expect it to be the same. But as one can see from the table, the salmon rate for the last sample is also a lot lower than for the other samples, so I don't know what to make of it.
[read_distribution.py output attached for sample 16 and for one of the good samples]
How do I interpret these results? What could have gone wrong for the TES_down* values to be so high in the first sample?
** BTW, the command is called read_distribution.py.
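A minimal sketch of that check, assuming a coordinate-sorted BAM from the STAR run and a gene-model BED file (both file names here are placeholders):

# run RSeQC's read_distribution.py on the STAR alignment for sample 16;
# hg38_gene_model.bed is a placeholder gene-model BED file
$ read_distribution.py -i sample16_Aligned.sortedByCoord.out.bam -r hg38_gene_model.bed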