Hello!
I looked for existing answers but did not find a satisfying one. I am running HISAT2 against the mouse genome (gencode.VM25) with paired-end reads and got these results:
HISAT2 summary stats:
Total pairs: 41539152
Aligned concordantly or discordantly 0 time: 4746559 (11.43%)
Aligned concordantly 1 time: 13453617 (32.39%)
Aligned concordantly >1 times: 23240434 (55.95%)
Aligned discordantly 1 time: 98542 (0.24%)
Total unpaired reads: 9493118
Aligned 0 time: 6292703 (66.29%)
Aligned 1 time: 990850 (10.44%)
Aligned >1 times: 2209565 (23.28%)
Overall alignment rate: 92.43%
I am worried about the "Aligned concordantly >1 times: 23240434 (55.95%)" line. It seems awfully high. It makes no difference whether I trim with Trimmomatic or not; I get the same rate either way. The read length distribution (from FastQC) is:
Length	Count
35	47812.0
36	49054.0
37	52457.0
38	55554.0
39	54966.0
40	58943.0
41	62925.0
42	53991.0
43	58224.0
44	63050.0
45	53349.0
46	55182.0
47	51612.0
48	58391.0
49	53000.0
50	57727.0
51	54120.0
52	54592.0
53	63569.0
54	57556.0
55	53580.0
56	58251.0
57	56248.0
58	53003.0
59	56622.0
60	60135.0
61	54936.0
62	57950.0
63	56381.0
64	56827.0
65	62153.0
66	61679.0
67	61212.0
68	66007.0
69	65395.0
70	70038.0
71	120537.0
72	262637.0
73	1021662.0
74	3263828.0
75	9960097.0
76	2.48439E7
Any recommendations? Should I drop all reads shorter than 75 bp or something?
Thank you very much in advance for any tips!
Jean-Michel Fustin
Looks like your reads are multi-mapping. I assume this is RNAseq data? Have you checked to see if you have rRNA contamination in your reads?
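To put a number on the multi-mapping, here is a small sketch that redoes the arithmetic from the summary you posted. The counts are copied from your post; the variable names are mine, and the formula for the overall rate is my assumption about how HISAT2 derives it (it does reproduce your 92.43%).

```python
# Counts copied from the HISAT2 summary in the question.
total_pairs = 41_539_152
unaligned_pairs = 4_746_559      # aligned concordantly or discordantly 0 times
concordant_once = 13_453_617
concordant_multi = 23_240_434
discordant_once = 98_542
mates_unaligned = 6_292_703      # unpaired reads that aligned 0 times

# Pairs that aligned as a pair at least once.
aligned_pairs = concordant_once + concordant_multi + discordant_once

# Multi-mapping pairs as a share of all pairs and of aligned pairs.
multi_share_of_all = concordant_multi / total_pairs
multi_share_of_aligned = concordant_multi / aligned_pairs

# Assumed formula: every mate counts once, and only mates that never
# aligned (neither as a pair nor alone) count as unaligned.
overall_rate = 1 - mates_unaligned / (2 * total_pairs)

print(f"multi-mapping, all pairs:     {multi_share_of_all:.2%}")
print(f"multi-mapping, aligned pairs: {multi_share_of_aligned:.2%}")
print(f"overall alignment rate:       {overall_rate:.2%}")
```

Among the pairs that aligned at all, close to two thirds hit more than one location, which is why I would look at what those reads are before changing the filtering.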
If you drop all the short reads, you will drop a whole lot that mapped fine. Your data is what it is. You probably can't fix it; all you can do is understand it.
My lab is cheap, and we do single-end 50-bp runs all the time on mouse RNA, and I get 70-80% unique reads. So dropping 75-bp reads that have a paired mate is not going to fix anything.
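If you want to see how many reads a length cutoff would actually remove, here is a rough sketch using the FastQC length table from your post. The helper name `fraction_shorter_than` is made up for illustration, the last count (2.48439E7) is FastQC's rounded value, and I am assuming the table covers one FASTQ file.

```python
# Read counts for lengths 35..76, copied from the FastQC table in the question.
counts = [
    47812, 49054, 52457, 55554, 54966, 58943, 62925, 53991, 58224, 63050,
    53349, 55182, 51612, 58391, 53000, 57727, 54120, 54592, 63569, 57556,
    53580, 58251, 56248, 53003, 56622, 60135, 54936, 57950, 56381, 56827,
    62153, 61679, 61212, 66007, 65395, 70038, 120537, 262637, 1021662,
    3263828, 9960097, 24843900,  # last value is 2.48439E7, i.e. rounded
]
length_counts = dict(zip(range(35, 77), counts))


def fraction_shorter_than(length_counts, cutoff):
    """Return the fraction of reads whose length is below `cutoff`."""
    total = sum(length_counts.values())
    short = sum(n for length, n in length_counts.items() if length < cutoff)
    return short / total


print(f"reads shorter than 75 bp: {fraction_shorter_than(length_counts, 75):.1%}")
```

By this count only about 16% of the reads in that file are shorter than 75 bp, so read length alone cannot account for 56% of pairs multi-mapping.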