I got this warning while running Salmon on an aligned BAM file. The BAM file is TopHat output from paired-end FASTQ files.
WARNING: Detected suspicious pair ---
The names are different:
read1 : SRR2103637.12484815
read2 : SRR2103637.12492049
The proper-pair statuses are inconsistent:
read1 [SRR2103637.12484815] : proper-pair; mapped; matemapped
read2 : [SRR2103637.12484815] : no proper-pair; mapped; matemapped
My salmon command is:
salmon quant -t ../idx/Ch38.cdna.all.fa -l IU -a sorted.bam -g ../gene_map.txt -o salmon_out
I tried both the unsorted BAM and the sorted BAM, and both produced the same warning.
My questions are:
- What can I do to fix this, or can I just ignore it?
- How do I choose the correct libtype parameter in Salmon? How can I check whether the library is inward, outward, or matching, and stranded or unstranded? Can I check this from the FASTQ files, or does it have to come from the experiment itself? I downloaded the data from NCBI GEO, which says it was sequenced on an Illumina HiSeq 2500 in 100 bp paired-end mode with the TruSeq Rapid PE Cluster kit and TruSeq Rapid SBS kit (Illumina).
Thank you for your answer and suggestion.
Hi baharata,
One thing that seems strange to me in your description is this:
The fact that the alignments are the output of TopHat suggests that the alignment was against the genome. However, Salmon requires alignments against the transcriptome (where the aligner might be e.g. Bowtie2). It is, of course, possible that you aligned against the genome with TopHat and then converted the alignments into transcriptomic coordinates, but since this is uncommon and you didn't mention it, I assume this isn't the case. Can you clarify whether these alignments are to the genome or to the transcriptome (as Salmon expects)?
I used TopHat to align the FASTQ files to the transcriptome reference from Ensembl (human cDNA reference), so I think it is okay to process them with Salmon directly after that.
That's interesting, as TopHat is generally a split-read aligner that is used to map RNA-seq reads to the genome. When aligning directly to the transcriptome, one wants to avoid split-read mappings. I might see how things differ on one of the samples if you map with Bowtie2 (or something comparable) instead of TopHat.
I am having the same problem. Did you manage to solve it? I have tried,
The major problem I am facing is that I only have BAM files aligned with HISAT. I do not have any of the source FASTQ files.
Hi EagleEye,
If you are using pre-aligned reads with Salmon, the "reference" sequences that you pass to Salmon should be the FASTA file containing your reference transcripts. However, if your reads have been aligned using HISAT, the bigger concern is that the alignment was likely done with respect to the genome rather than the transcriptome. A tool like sam-xlate (mentioned here) should be able to convert from genomic to transcriptomic alignments. It is also important to note that Salmon (like RSEM) requires that (1) the alignment records for a given read, in the case of multimapping, be consecutive in the input SAM/BAM file and (2) the records for mates (i.e. left and right reads) be consecutive. The most common alignment tools (e.g. Bowtie2, BWA, STAR) do this, and I believe HISAT does as well (since it uses much of Bowtie2's infrastructure for file parsing / writing), but that is worth verifying. Finally, in a worst-case scenario, you could always consider trying to recover FASTQ files from the BAM (using e.g. bam2fastq) and then performing your quantification with those.
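To verify the "mates must be consecutive" requirement, something along these lines might help. It is only a rough sketch, not how Salmon itself does the check: the function name `find_suspicious_pairs` is made up for illustration, it assumes plain SAM text lines as input (for example the output of `samtools view`), and it only looks at the read name and FLAG columns, taking paired records two at a time.

```python
from typing import Iterable, List, Tuple

# Bits of the SAM FLAG field, as defined in the SAM specification.
PAIRED = 0x1
PROPER_PAIR = 0x2
FIRST_IN_PAIR = 0x40
SECOND_IN_PAIR = 0x80
MATE_BITS = FIRST_IN_PAIR | SECOND_IN_PAIR


def parse_record(line: str) -> Tuple[str, int]:
    """Return (read name, flag) from one tab-separated SAM alignment line."""
    fields = line.rstrip("\n").split("\t")
    return fields[0], int(fields[1])


def find_suspicious_pairs(sam_lines: Iterable[str]) -> List[str]:
    """Hypothetical helper: report adjacent paired records that do not look like mates.

    Paired records are consumed two at a time, in file order. Header lines
    (starting with '@') and records without the PAIRED bit are skipped.
    """
    problems: List[str] = []
    pending = None  # first record of a pair, waiting for the next paired record
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue
        name, flag = parse_record(line)
        if not flag & PAIRED:
            continue
        if pending is None:
            pending = (name, flag)
            continue
        prev_name, prev_flag = pending
        pending = None
        if name != prev_name:
            problems.append(f"names differ: {prev_name} / {name}")
            continue
        if (prev_flag & PROPER_PAIR) != (flag & PROPER_PAIR):
            problems.append(f"proper-pair statuses are inconsistent: {name}")
        if {prev_flag & MATE_BITS, flag & MATE_BITS} != {FIRST_IN_PAIR, SECOND_IN_PAIR}:
            problems.append(f"records are not one first and one second mate: {name}")
    if pending is not None:
        problems.append(f"trailing record without a mate: {pending[0]}")
    return problems
```

You could feed it the lines of a SAM file (or `sys.stdin` when piping from `samtools view`) and look at the first few messages it returns. If the names differ for most pairs, the file is probably coordinate-sorted or otherwise not grouped by read name, which would match the warning shown at the top of this thread.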
Thanks a lot for your suggestions, I will get back once I try these solutions.