Hello,
I have a problem using salmon with STAR output aligned to the transcriptome (sorted by name, not by coordinate): the counts reported by salmon are very low.
I looked at the salmon logs and saw this warning 196026 times in total:
WARNING: Detected suspicious pair ---
The names are different:
Looking into the BAM file, I saw that the first problematic read is one that appears 5 times. Salmon takes the 5th record and pairs it with the next one, which is not its mate, hence the warning.
To illustrate the problem, the bam file looks like this:
A00125:488:H2YHYDSX2:2:1120:5376:2440_TTCAGGAAGGGC 339 ENST00000593393.1 2170 1 34M = 1945 -259 GCCCTGCCCGGCCGCCCCTACTGGGAAGTGAGGA FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF NH:i:3 HI:i:2 RG:Z:CRC01_001
A00125:488:H2YHYDSX2:2:1120:5376:2440_TTCAGGAAGGGC 339 ENST00000593393.1 2346 1 34M = 1945 -435 GCCCTGCCCGGCCGCCCCTACTGGGAAGTGAGGA FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF NH:i:3 HI:i:3 RG:Z:CRC01_001
A00125:488:H2YHYDSX2:2:1120:5376:2440_TTCAGGAAGGGC 83 ENST00000444227.2 236 1 34M = 11 -259 GCCCTGCCCGGCCGCCCCTACTGGGAAGTGAGGA FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF NH:i:3 HI:i:1 RG:Z:CRC01_001
A00125:488:H2YHYDSX2:2:1120:5376:2440_TTCAGGAAGGGC 419 ENST00000593393.1 1945 1 46M = 2170 259 GCCCTGCCCGGCCGCCCCTACTGGGAAGTGAGGAGCCCTTCCTGAA FFFFFFFFFFFFFFFFFFFFFFFFF::FFFFFFFFFFFFFFFFFFF NH:i:3 HI:i:2 RG:Z:CRC01_001
A00125:488:H2YHYDSX2:2:1120:5376:2440_TTCAGGAAGGGC 163 ENST00000444227.2 11 1 46M = 236 259 GCCCTGCCCGGCCGCCCCTACTGGGAAGTGAGGAGCCCTTCCTGAA FFFFFFFFFFFFFFFFFFFFFFFFF::FFFFFFFFFFFFFFFFFFF NH:i:3 HI:i:1 RG:Z:CRC01_001
A00125:488:H2YHYDSX2:2:1120:5385:34053_AGATATAAAGTT 99 ENST00000624866.1 111 255 99M = 99 99 TTAAAAAGGTGCCATTCCAGCCCTTTCCAGCTCTCACCTCCCCACTCCCTTATAAGTGACACCGCCTTTCCCCACCAGGCCCTGACTCAGGCCCAGAGA FFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF NH:i:1 HI:i:1 RG:Z:CRC01_001
And the first warning from salmon is:
WARNING: Detected suspicious pair ---
The names are different:
read1 : A00125:488:H2YHYDSX2:2:1120:5376:2440_TTCAGGAAGGGC
read2 : A00125:488:H2YHYDSX2:2:1120:5385:34053_AGATATAAAGTT
Do you know how I could solve this?
Thanks! Lluc
It is also possible that your input fastq files were out of sync (perhaps trimmed independently). You should use repair.sh from BBMap suite to re-sync them to remove singletons. Then realign fixed files.
Hi! The output from STAR to quantify using salmon? What are you trying to do? Why don't you give the fastq files directly to salmon?
I need to use umi deduplication, so I need to use an aligner
I would double check that your fastq files used in STAR alignment are properly formatted and paired. Try running them through seqkit sana to remove or rescue malformed reads, and then seqkit pair to make sure that the R1 and R2 reads are properly paired.
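As a quick sanity check before (or alongside) seqkit, the pairing of R1 and R2 can also be inspected by hand. The following is only a minimal sketch, assuming plain 4-line FASTQ records and well-formed header lines; the function names `read_name` and `first_mismatch` are made up for illustration and are not part of any tool mentioned here.

```python
from itertools import zip_longest

def read_name(header):
    """Return the read name from a FASTQ header line.

    Drops the leading '@', anything after the first whitespace,
    and a trailing /1 or /2 mate suffix if present.
    """
    name = header.strip()[1:].split(None, 1)[0]
    if name.endswith(("/1", "/2")):
        name = name[:-2]
    return name

def first_mismatch(r1_lines, r2_lines):
    """Compare read names record by record.

    Returns (record_index, name_in_r1, name_in_r2) for the first
    disagreement, or None if both inputs stay in sync. A missing
    record in the shorter input is reported as None.
    Assumes exactly four lines per record (no wrapped sequences).
    """
    names1 = (read_name(l) for i, l in enumerate(r1_lines) if i % 4 == 0)
    names2 = (read_name(l) for i, l in enumerate(r2_lines) if i % 4 == 0)
    for idx, (n1, n2) in enumerate(zip_longest(names1, names2)):
        if n1 != n2:
            return idx, n1, n2
    return None
```

The two arguments can be any iterables of lines, for example open file handles; for gzipped files they could come from `gzip.open(path, "rt")`. A result of `None` means the names agree over the whole length of both files.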
You should also include all of the code you ran so we can check whether there were any errors.
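It may also help to check the alignment file itself. The sketch below counts, for each read name, how many records carry the first-in-pair flag bit (0x40) and how many carry the second-in-pair bit (0x80). It assumes name-grouped SAM text (for example the output of samtools view), and the function name `mate_counts` is hypothetical. The example records are abbreviated: only the read name, flag and transcript are taken from the excerpt in the question.

```python
from collections import Counter
from itertools import groupby

FIRST_IN_PAIR = 0x40   # SAM flag bit: first segment in the template
SECOND_IN_PAIR = 0x80  # SAM flag bit: last segment in the template

def mate_counts(sam_lines):
    """Yield (read_name, n_first, n_second) for each run of records
    that share a read name. Header lines (starting with '@') and
    blank lines are skipped. Input must be grouped by read name."""
    records = (
        line.split(None, 2)  # only QNAME and FLAG are needed
        for line in sam_lines
        if line.strip() and not line.startswith("@")
    )
    for name, group in groupby(records, key=lambda fields: fields[0]):
        counts = Counter()
        for fields in group:
            flag = int(fields[1])
            counts["first"] += bool(flag & FIRST_IN_PAIR)
            counts["second"] += bool(flag & SECOND_IN_PAIR)
        yield name, counts["first"], counts["second"]

# Abbreviated copies of the five records posted in the question.
qname = "A00125:488:H2YHYDSX2:2:1120:5376:2440_TTCAGGAAGGGC"
example = [
    f"{qname}\t339\tENST00000593393.1",
    f"{qname}\t339\tENST00000593393.1",
    f"{qname}\t83\tENST00000444227.2",
    f"{qname}\t419\tENST00000593393.1",
    f"{qname}\t163\tENST00000444227.2",
]

# Report read names whose mate counts do not match.
for name, n_first, n_second in mate_counts(example):
    if n_first != n_second:
        print(f"{name}\tfirst={n_first}\tsecond={n_second}")
```

For the five records in the question this reports three first-mate records but only two second-mate records: the mate of the HI:i:3 alignment is not in the excerpt. An odd number of records for one read name would be consistent with a reader that consumes records two at a time falling out of step from that point on, although whether that is what salmon does here is an assumption.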