Question

Mapping rate troubles

2

Entering edit mode

4.2 years ago

thomas58 ▴ 20

Hello everybody,

I have troubles to obtain reasonable mapping rates for my data. I am familiar with troubles around so of course I did my homework before asking. I have zebrafish samples prepared by SureSelect mRNA kit (Agilent) and sequenced on Illumina 550 highoutput 75 PE and I proceed with standard workflow (quality trimming, adapter removal, rRNA removal using bbduk and silva database). Fastqc reports looked very good. For mapping, I used firstly salmon (1.4.0) with Ensembl latest zebrafish cDNA version and obtain mapping rate around 40%. To explore more, I used genome and STAR with featurecounts where I get 82% of uniquely mapped reads and 35% assigned aligments. These can indicate rRNA, gDNA contamination, etc. These results are the same for my whole dataset (16 samples) First, I run distribution.py from rseqc package. to obtain:

Total Reads                   49991974
Total Tags                    56921335
Total Assigned Tags           47241859
=====================================================================
Group               Total_bases         Tag_count           Tags/Kb  
CDS_Exons           42081061            21781002            517.60   
5'UTR_Exons         5556581             1383159             248.92   
3'UTR_Exons         22439733            4272270             190.39   
Introns             684575852           12781742            18.67    
TSS_up_1kb          20797101            1336322             64.26    
TSS_up_5kb          79640488            2862248             35.94    
TSS_up_10kb         128987557           3879716             30.08    
TES_down_1kb        20864195            737402              35.34    
TES_down_5kb        79723162            2224013             27.90    
TES_down_10kb       127725349           3143970             24.62    
=====================================================================

So I was thinking I was kinda unlucky having high number of non-splice transcripts together with some DNA contamination. However, checking output from SeqMonk I got different results.

Percent in Gene 
78.03321932747787
Percent in exons
66.4255989593314
Genes Measured  
95.04024621212122
Percentage of max data size 
100.0                   
Percent on sense strand
 3.9634962173630424

So here comes my first questions: How is possible such a high percentage in exons? I would expect something around 40%.

To continue, I asked my more experienced colleague to run one of my samples in his zebrafish pipeline. He used trimmomatic and sortmerna but basically got similar number of reads as me. However, when mapping using salmon version 1.2.1 , he got mapping rate 72%. I examined his salmon code, and I find out that his version with parameter -lib A guess library as IU while my version guess is ISR (which is correct). First I tried to run my files as -lib IU but there was again cca 40%. I tried run his fastqc proccesed files with my salmon version as -lib IU and -lib ISR but again no change. Only thing which I did not try was using older version of salmon. But I am heavily confused, how is that even possible? Strandness vs nonstrandess should not be such a big deal if my data are heavily stranded, and they are. Moreover, if I have some kind of contamination it should not be mapped at all no matter the version. Should I used older version of salmon for my data? I would like to hear somebody opinion about that. I can provide all codes and outputs if necessary but I have no idea what can be helpful in this case.

Thank you

Thomas

EDIT: I will show salmon codes and outputs Salmon version 1.4.0 Code:

{
    "salmon_version": "1.4.0",
    "index": "zebra_index",
    "libType": "A",
    "mates1": "A1_rRNAsequclean1.fastq",
    "mates2": "A1_rRNAsequclean2.fastq",
    "threads": "20",
    "output": "A1_test",
    "gcBias": [],
    "auxDir": "aux_info"
}

Output:

[2021-02-16 18:43:03.874] [jointLog] [info] setting maxHashResizeThreads to 20
[2021-02-16 18:43:03.874] [jointLog] [info] Fragment incompatibility prior below threshold.  Incompatible fragments will be ignored.
[2021-02-16 18:43:03.874] [jointLog] [info] Usage of --validateMappings implies use of minScoreFraction. Since not explicitly specified, it is being set to 0.65
[2021-02-16 18:43:03.875] [jointLog] [info] Setting consensusSlack to selective-alignment default of 0.35.
[2021-02-16 18:43:03.875] [jointLog] [info] parsing read library format
[2021-02-16 18:43:03.875] [jointLog] [info] There is 1 library.
[2021-02-16 18:43:03.936] [jointLog] [info] Loading pufferfish index
[2021-02-16 18:43:03.936] [jointLog] [info] Loading dense pufferfish index.
[2021-02-16 18:43:04.376] [jointLog] [info] done
[2021-02-16 18:43:04.376] [jointLog] [info] Index contained 57,192 targets
[2021-02-16 18:43:05.726] [jointLog] [info] Number of decoys : 0
[2021-02-16 18:43:06.173] [jointLog] [info] Automatically detected most likely library type as ISR

[2021-02-16 18:44:10.562] [jointLog] [info] Thread saw mini-batch with a maximum of 4.74% zero probability fragments
[2021-02-16 18:44:10.615] [jointLog] [info] Thread saw mini-batch with a maximum of 4.62% zero probability fragments
[2021-02-16 18:44:10.617] [jointLog] [info] Thread saw mini-batch with a maximum of 4.64% zero probability fragments
[2021-02-16 18:44:10.633] [jointLog] [info] Thread saw mini-batch with a maximum of 4.78% zero probability fragments
[2021-02-16 18:44:10.639] [jointLog] [info] Thread saw mini-batch with a maximum of 4.64% zero probability fragments
[2021-02-16 18:44:10.641] [jointLog] [info] Thread saw mini-batch with a maximum of 4.78% zero probability fragments
[2021-02-16 18:44:10.653] [jointLog] [info] Thread saw mini-batch with a maximum of 4.60% zero probability fragments
[2021-02-16 18:44:10.655] [jointLog] [info] Thread saw mini-batch with a maximum of 4.72% zero probability fragments
[2021-02-16 18:44:10.659] [jointLog] [info] Thread saw mini-batch with a maximum of 4.48% zero probability fragments
[2021-02-16 18:44:10.664] [jointLog] [info] Thread saw mini-batch with a maximum of 4.60% zero probability fragments
[2021-02-16 18:44:10.675] [jointLog] [info] Thread saw mini-batch with a maximum of 4.80% zero probability fragments
[2021-02-16 18:44:10.687] [jointLog] [info] Thread saw mini-batch with a maximum of 4.94% zero probability fragments
[2021-02-16 18:44:10.692] [jointLog] [info] Thread saw mini-batch with a maximum of 4.56% zero probability fragments
[2021-02-16 18:44:10.697] [jointLog] [info] Thread saw mini-batch with a maximum of 4.64% zero probability fragments
[2021-02-16 18:44:10.704] [jointLog] [info] Thread saw mini-batch with a maximum of 4.84% zero probability fragments
[2021-02-16 18:44:10.712] [jointLog] [info] Thread saw mini-batch with a maximum of 4.76% zero probability fragments
[2021-02-16 18:44:10.717] [jointLog] [info] Thread saw mini-batch with a maximum of 4.90% zero probability fragments
[2021-02-16 18:44:10.732] [jointLog] [info] Thread saw mini-batch with a maximum of 4.56% zero probability fragments
[2021-02-16 18:44:10.747] [jointLog] [info] Thread saw mini-batch with a maximum of 4.54% zero probability fragments
[2021-02-16 18:44:10.774] [jointLog] [info] Thread saw mini-batch with a maximum of 4.68% zero probability fragments
[2021-02-16 18:44:11.090] [jointLog] [info] Computed 264,671 rich equivalence classes for further processing
[2021-02-16 18:44:11.090] [jointLog] [info] Counted 13,824,703 total reads in the equivalence classes 
[2021-02-16 18:44:11.105] [jointLog] [info] Number of mappings discarded because of alignment score : 28,212,971
[2021-02-16 18:44:11.105] [jointLog] [info] Number of fragments entirely discarded because of alignment score : 1,485,310
[2021-02-16 18:44:11.105] [jointLog] [info] Number of fragments discarded because they are best-mapped to decoys : 0
[2021-02-16 18:44:11.105] [jointLog] [info] Number of fragments discarded because they have only dovetail (discordant) mappings to valid targets : 206,125
[2021-02-16 18:44:11.105] [jointLog] [info] Mapping rate = 43.3618%

[2021-02-16 18:44:11.105] [jointLog] [info] finished quantifyLibrary()
[2021-02-16 18:44:11.106] [jointLog] [info] Starting optimizer
[2021-02-16 18:44:11.184] [jointLog] [info] Marked 0 weighted equivalence classes as degenerate
[2021-02-16 18:44:11.202] [jointLog] [info] iteration = 0 | max rel diff. = 16011
[2021-02-16 18:44:11.389] [jointLog] [info] iteration 11, adjusting effective lengths to account for biases
[2021-02-16 18:44:11.092] [fileLog] [info] 
At end of round 0
==================
Observed 31882241 total fragments (31882241 in most recent round)

Salmon version 1.2.1 Code:

{
    "salmon_version": "1.2.1",
    "index": "./Indices/ZF11_transcripts_index",
    "libType": "A",
    "mates1": "./03_SortMeRNA_v2.1/A1_S1_join_R1_paired_trimmomatic_rRNAFiltered.fastq.gz",
    "mates2": "./03_SortMeRNA_v2.1/A1_S1_join_R2_paired_trimmomatic_rRNAFiltered.fastq.gz",
    "gcBias": [],
    "threads": "7",
    "output": "./04_Salmon_quant/A1_S1_join_quant.gz",
    "auxDir": "aux_info"
}

Output:

[2021-02-15 16:00:45.373] [jointLog] [info] setting maxHashResizeThreads to 7
[2021-02-15 16:00:45.373] [jointLog] [info] Fragment incompatibility prior below threshold.  Incompatible fragments will be ignored.
[2021-02-15 16:00:45.373] [jointLog] [info] Usage of --validateMappings implies use of minScoreFraction. Since not explicitly specified, it is being set to 0.65
[2021-02-15 16:00:45.373] [jointLog] [info] Usage of --validateMappings implies a default consensus slack of 0.2. Setting consensusSlack to 0.35.
[2021-02-15 16:00:45.373] [jointLog] [info] parsing read library format
[2021-02-15 16:00:45.373] [jointLog] [info] There is 1 library.
[2021-02-15 16:00:45.854] [jointLog] [info] Loading pufferfish index
[2021-02-15 16:00:45.858] [jointLog] [info] Loading dense pufferfish index.
[2021-02-15 16:00:47.331] [jointLog] [info] done
[2021-02-15 16:00:47.331] [jointLog] [info] Index contained 57,192 targets
[2021-02-15 16:00:48.603] [jointLog] [info] Number of decoys : 0
[2021-02-15 16:05:39.728] [jointLog] [info] Computed 4,256,242 rich equivalence classes for further processing
[2021-02-15 16:05:39.728] [jointLog] [info] Counted 19,896,162 total reads in the equivalence classes 
[2021-02-15 16:05:39.746] [jointLog] [info] Number of mappings discarded because of alignment score : 46,176,663
[2021-02-15 16:05:39.746] [jointLog] [info] Number of fragments entirely discarded because of alignment score : 1,395,266
[2021-02-15 16:05:39.746] [jointLog] [info] Number of fragments discarded because they are best-mapped to decoys : 0
[2021-02-15 16:05:39.746] [jointLog] [info] Number of fragments discarded because they have only dovetail (discordant) mappings to valid targets : 2,493
[2021-02-15 16:05:39.746] [jointLog] [info] Mapping rate = 72.3414%

[2021-02-15 16:05:39.746] [jointLog] [info] finished quantifyLibrary()
[2021-02-15 16:05:39.760] [jointLog] [info] Starting optimizer
[2021-02-15 16:05:39.729] [fileLog] [info] 
At end of round 0
==================
Observed 27503155 total fragments (27503155 in most recent round)


[2021-02-15 16:14:20.747] [jointLog] [warning] NOTE: Read Lib [[ ./03_SortMeRNA_v2.1/A1_S1_join_R1_paired_trimmomatic_rRNAFiltered.fastq.gz, ./03_SortMeRNA_v2.1/A1_S1_join_R2_paired_trimmomatic_rRNAFiltered.fastq.gz]] :

Detected a *potential* strand bias > 1% in an unstranded protocol check the file: ./04_Salmon_quant/A1_S1_join_quant.gz/lib_format_counts.json for details

Note that I deleted info about bias reduction and iteration because I could not fit all text in quesiton (it seems there was no information).

Although, I think this problem is beyond salmon, because also SeqMonk shows higher number of exons.

RNA-Seq alignment salmon STAR SEQmonk • 2.4k views

ADD COMMENT • link 4.2 years ago by thomas58 ▴ 20

1

Entering edit mode

You should post all of your salmon code, otherwise it might be difficult to find the problem. I also applaud you for the effort you took troubleshooting before posting the question.

ADD REPLY • link 4.2 years ago by rpolicastro 13k

0

Entering edit mode

Thank you for your response, I will edit my question as soon as possible.

ADD REPLY • link 4.2 years ago by thomas58 ▴ 20

1

Entering edit mode

For the part about having high exon coverage, your kit is selecting for mature polyadenylated transcripts, so you would expect most reads to be over exons.

Are you running this code through some sort of pipeline, or from the command line? It's not clear how it's being invoked.

ADD REPLY • link 4.2 years ago by rpolicastro 13k

0

Entering edit mode

I am sorry I forget to mention, that we used our library kit with one exception (instead of poly-A, we used rRNA depletion). But anyway, if there would be higher or small number of rRNA, it still should not make such a big difference in mapping. Moreover, I am glad that I have high number of exons of course, but my question was more about. why there is such high number of exons and yet they are not mapped by salmon or collected by feature counts.

I use command line, my colleague with higher mapping rate has some kind of pipeline includes fastqc, trimmomatic, sortmerna and salmon.

ADD REPLY • link 4.2 years ago by thomas58 ▴ 20

0

Entering edit mode

One thing I don't quite understand; are the read files here identical? They have different names (of course that's not important), but you mention your colleague has a "pipeline". It's quite possible that one of the trimming or quality control steps your colleague is doing is leading to the improved mapping rate. I would suggest you try with his exact read sequences if you have not already. Also, if you are able to share read files, we might be able to help take a look over on the salmon GitHub page.

ADD REPLY • link 4.2 years ago by Rob 7.1k

0

Entering edit mode

Hello Rob, sorry for late reply I did some testing because with every other analysis, I am more confused. The salmon version is not a problem so sorry for misleading information and question. These files are not identical because they were run via different pipelines. I find out at which step make salmon output behave differently. For some reason, removing rRNA and tRNA via bbduk.sh makes salmon mapped circa 43% (from total 28340001) while using sortmerna (version 2.1) makes salmon produce circa 70 % of mapped reads (from total 27503155). Not using sortmerna makes salmon behave again circa 43 %. Total amout of rRNA and tRNA in samples is cca 2%-3% so it is not big bias. Therefore, I conculed that part of sortmerna process when one needs to use merge and unmerge script on paired reads makes some heavy changes in fastq composition; however, I cannot explain this and I have never ever seen this problem before. Thank you for any advice or comment.

ADD REPLY • link 4.2 years ago by thomas58 ▴ 20

score 0 · Accepted Answer · 2021-02-19

Hello, after some investigations, I find out cause of this problem. In the pipeline, there was mistake during the sortmerna, my colleague did not apply step for merging paired reads before filtering and they run every file separetly witch caused chaos in fastq files. After applying, repair.sh from bbmap, files were restored and produced similar results. But what is interesting, that while STAR and hisat2 could not process these corrupted files, salmon showed 70% mapping rate. Anyway, thank you for your help. :)