Question

Salmon BAM target length different to FASTA file sequence length

1

Entering edit mode

4.2 years ago

bdy8 ▴ 80

Good afternoon :)

This is more of a theory question. My general pipeline is BBDuk.sh -> STAR align -> salmon quant. I am using 3' RNA-seq (Lexogen QuantSeq FWD prep).

During BBDuk.sh, this does a poly a tail trim, as well as a adapter trim (all happy here).

I then align the trimmed reads to the reference genome using STAR (again, all happy here).

This is where some differences occur with Salmon versions.

Using salmon v0.10.0, i get expected outputs and quantified files.

/nethome/bdy8/programs/salmon-0.10.0_linux_x86_64/bin/salmon \
quant \
-t /nethome/bdy8/apal_genome/version3.0/Apalm_gffread_for_salmon.fasta \
-l SF \
-a /scratch/projects/transcriptomics/ben_young/DHE/tagseq/host/aligned/'"$PALPAL"'/'"$PALPAL"'_Aligned.toTranscriptome.out.bam \
-o /scratch/projects/transcriptomics/ben_young/DHE/tagseq/host/salmon/'"${PALPAL}"'_salmon

However, using salmon 1.40 I experience thew following error message

    /nethome/bdy8/programs/salmon-latest_linux_x86_64/bin/salmon quant \
         -t /nethome/bdy8/apal_genome/Apalm_gffread_for_salmon.fasta \
         -l SF \
         --noLengthCorrection \
         -a /scratch/projects/transcriptomics/ben_young/POR/tagseq/host/aligned_reads/bagnumber-apal-1009/bagnumber-apal-1009_Aligned.toTranscriptome.out.bam \
         -o /scratch/projects/transcriptomics/ben_young/POR/tagseq/host/test

SAM file says target evm.model.Sc0a5M3_402_HRSCAF_756.4.1.5f5b2bc4 has length 536, but the FASTA file contains a sequence of length [538 or 537]

What I am thinking is happening is that this is resulting due to the poly a tail trimming in the BBDuk.sh stage. Apart from this I do not know what is happening and was wondering if anyone else has any ideas on this or a fix to ignore this mismatch of 1.

If any more information is needed please let me know and I will supply it.

salmon Salmon RNA-Seq • 3.7k views

ADD COMMENT • link 4.2 years ago by bdy8 ▴ 80

0

Entering edit mode

This is talking about the reference vm.model.Sc0a5M3_402_HRSCAF_756.4.1.5f5b2bc4 sequence? BBduk.sh should have nothing to do with the reference.

ADD REPLY • link 4.2 years ago by GenoMax 150k

0

Entering edit mode

Hi GenoMax, how I was taking the error was BAM file (-a in salmon) = 536, and the FASTA file (-t in salmon) has 538 or 537. So my thinking is that something has occurred with the STAR or BBDuk.sh processing to cause it to be shorter length. . I am currently running a STAR alignment of the raw read (i.e. no poly a tail trimming or quality trimming) and will then quant with salmon to see if this error is maintained or it runs successfully. Does that make sense ?

ADD REPLY • link 4.2 years ago by bdy8 ▴ 80

0

Entering edit mode

An update, running with no prefiltering (BBduk poly a tail trimming and quality trimming) but still using star alignment causes the same problem.

I am now running using only salmon (indexing the genome using gffread, and then salmon index) and then the subsequent salmon quant with the index.

ADD REPLY • link 4.2 years ago by bdy8 ▴ 80

1

Entering edit mode

Hi bdy8,

The length that salmon will assume for each transcript is the length in the FASTA reference file you provide. It is a stickler about the lengths matching because salmon applies an error model to the alignments that are processed (i.e. it models the conditional probability of each alignment based on a learned model of the likelihood of each CIGAR string under the true read origin). If the lengths of the targets don't match (and, in fact, if the targets themselves don't match precisely in terms of sequence), then this model can be applied incorrectly; for example, a CIGAR string could go off the end of a reference.

It seems that something in the processing of the alignments upstream is causing STAR to believe that a transcript is a different length than in the reference FASTA that you are providing. Salmon determines compatibility by comparing the transcript's length to the length provided in the SAM/BAM header for each target. Can you check if the length for the transcript is the same as what is implied by the GTF file you are providing to STAR? If not, STAR is changing the length somehow / for some reason. If those two sources agree, and the transcriptome FASTA disagrees, then you can perhaps use a tool like gffread or rsem-prepare-reference to produce a transcriptome FASTA that matches the GTF annotation.

ADD REPLY • link 4.2 years ago by Rob 7.1k

0

Entering edit mode

Good morning Rob :)

First of all thank you for the in depth response !

Thankyou for the more in depth on the CIGAR, i get this a bit more now. I am guessing this is a new implementation as using the older version of salmon (v0.10.1) does not yield this error and quantification progresses successfully.

For the question regarding transcript length I have just done this. Please note that the transcritpome fasta I am using in salmon quant mode is using gffread already.

First of all BAM and my transcriptome fasta both have the same number of transcripts (31, 827 transcripts, yay)
Second, I took the BAM transcript lengths (using SAMtools), and then generated the sequence lengths from the transcriptome fasta (I used R, happy to provide workflow if anyone wants to see)
On doing a == for the lengths in r, of the 31,827 transcripts, there are 23 with differences of 1 or 2 (in attached picture).

Why these 23 are different I have no idea, I am now going to inspect these a little closer. I am also going to try removing them from the transcriptome fasta and seeing if salmon quant will run with these mismatches removed.

Let me know if you would like any more info but thankyou again for the help.

![23 mismatch gene lengths transcriptome fasta to BAM file][1][1]: https://ibb.co/r591n3n

ADD REPLY • link 4.2 years ago by bdy8 ▴ 80

0

Entering edit mode

Hi Rob

TX = transcriptome

Just some more information for you, after some more digging and checking I believe it is STAR that is the culprit. I checked all TX lengths in my STAR bam file (B), mrna_tx supplied with genome (M) and gffread generated TX using same inputs for generating STAR index (G).

G to B = 23 mismatches (As stated in above response) M to B = 23 mismatches (the same as the ones identified in G to B) G to M = 0 mismatches

So ye this identifies that STAR is doing something to those 23 transcripts, what it is I currently do not know.

At this point I can hear you probably saying "why are you not using salmon index and quant if you have the reference transcriptome?". Well this is what I have just been doing. It is working well although I am getting some pretty low mapping rates (~30%). I am not to worried about this due to it being a coral species (Acropora palmata) so the genome will not be as high quality as, say, Human, and as well as this I also expect highish percentage of transcripts from the symbiotic Symbiodiniceae. I have also read the other posts on this matter and incorporated suggestions (i.e. softclipping and minscorefraction parameters), (https://github.com/COMBINE-lab/salmon/issues/533 https://github.com/COMBINE-lab/salmon/issues/160) and this is comparable for 3' rna-seq on this type of organism.

Thank you so much for all the help, happy to provide any more info or if you have any other suggestions I would be welcome to hear them :). On your response I will also type up a answer to this thread and close it.

Have a great day !!

ADD REPLY • link 4.2 years ago by bdy8 ▴ 80

1

Entering edit mode

Hi bdy8,

Thanks for your detailed follow up! I am not sure what STAR is doing there, or why, but it may be worth dropping an issue over on the STAR github page. Alex (the lead author and developer of STAR) is very responsive, and if you're observing this issue, I'm certain that others must be as well. As you mention, there are many reasons that low mapping rates might occur, and not all of them are unexpected (especially in species that are not well-annotated). You can always compare the effective mapping rate you are getting from using salmon directly to what you were getting using STAR and projecting to the transcriptome. Obviously, I'd expect the total reads mapped by STAR — including those mapped to the genome but not properly contained within the transcriptome — to be higher, but if the total that is projected to the transcriptome is relatively similar, then I think you can chalk it up to the completeness of the annotation. Other than that, I don't have much else to suggest; you've clearly done your homework here! Please do let me know if you follow up with Alex, as I'd like to track the resolution of that issue (you can tag me as @rob-p over on GitHub). Have a great day yourself!

ADD REPLY • link 4.2 years ago by Rob 7.1k

0

Entering edit mode

Good morning Apologies had to completely switch tack to another project. And yes, I did all the comparisons with aligning to STAR. STAR gets more but then when quanting with salmon it becomes less and what is expected. As you recommend I am going to follow up with Alex today or tomorrow :). Thank you for all the help !! Ben

ADD REPLY • link 4.2 years ago by bdy8 ▴ 80

score 2 · Accepted Answer · 2021-02-04

To provide an answer to this query, it seems STAR is doing something during the indexing step which is causing a slight mismatch for 23 of the transcripts. If you read above replies to the original post you will see the workflow that was done to identify the problem transcripts and the various lengths of them in different programs :).

I will be opening a question with Alex (STAR author) about identifying why this may be happening.