Question

Can you choose minimum transcript coverage in salmon

0

Entering edit mode

5.8 years ago

yryan ▴ 10

Hi folks,

I'm trying to use salmon to count viral transcripts in some clinical samples I have. However when I use salmon to quantify these viruses it's mapping single poly A or poly T regions of transcripts to similar size poly A and T regions in the viral genomes but only these, and registering these as a count. Is there any way to increase the size of the mapping required before it is considered a true count?

I'd like to be able to use a cut-off that say 50% of a viral genome should be present and mapped from the transcripts before it is a count, rather than a very small poly nucleotide region that is likely an artifiact rather than a true count of that virus.

Thanks!

salmon RNA-Seq rna-seq • 1.9k views

ADD COMMENT • link updated 5.8 years ago by Kristoffer Vitting-Seerup ★ 4.2k • written 5.8 years ago by yryan ▴ 10

0

Entering edit mode

Can you give a more "solid" example on such a "bad" mapping? It is at least for me difficult to understand what you mean. Also please share the command line.

ADD REPLY • link 5.8 years ago by ATpoint 88k

0

Entering edit mode

So if you have a look at this image here

Salmon wrongly is counting this transcript (viral genome) due to the presence of just these poly A reads. There is no other reads which align or map to any other section of the transcript but since this is just a poly A section I think I'd be justified in saying it is counted in error. I'd ideally want a way to set a minimum coverage so that I'd need say 20-50% minimum of a transcript to have some coverage before its accepted as a read.

My script is:

for fn in SRR{1..50};
do
echo "Processing sample ${fn}"
salmon quant -i ../indexes/index -l A \
-1 split/${fn}/${fn}_1.fastq.gz \
-2 split/${fn}/${fn}_2.fastq.gz \
--validateMappings \
--writeMappings=./sams/${fn}_viral  \
 -o quants/${fn}_quant
done

edit (couldnt get hyperlink working)

ADD REPLY • link updated 5.8 years ago by ATpoint 88k • written 5.8 years ago by yryan ▴ 10

0

Entering edit mode

Edited the link. You have to paste the full link including the suffix into field popping up when clicking the image button.

Ok I see what you mean. Did you check how the mate reads align in this case? It is paired-end sequencing so the mate would need to align somewhere near that problematic region, and it would need to be a valid alignment to be even considered by salmon from what I understand. I will tag the developer Rob.

ADD REPLY • link 5.8 years ago by ATpoint 88k

0

Entering edit mode

Ahh okay, thanks!

So changing the view type to read type in tablet and all are classed as "Mate unmapped". However the vast majority of these are all the same direction (arrow going to the right, I'm assuming this is read 1?), I'm not sure if that makes any difference to this.

![https://ibb.co/ZdP1tcK][1]

ADD REPLY • link 5.8 years ago by yryan ▴ 10

0

Entering edit mode

I don't know of a way to do what you ask in salmon (doesn't mean it doesn't exist of course). But a different approach might be to mask homopolymers of a certain length from the viral genome because aligning to it.

ADD REPLY • link 5.8 years ago by i.sudbery 21k

0

Entering edit mode

Wouldn't it be "safer" to trim trailing polyA sequences of a certain length from the reads directly? That way one would probably still get alignments where the polyA is flanked by non polyA reads and the true origin of the read is not a polyA-tail (given read length is sufficient which is seems to be here).

ADD REPLY • link 5.8 years ago by ATpoint 88k

0

Entering edit mode

Looking at the reads above, it doesn't look to me like these reads definately come from polyA tails. Specially, there are reads in that pile up above where there are non-A bases flanking the homopolymer run on both size, where non of the non-A reads match, but the read is still aligned.

ADD REPLY • link 5.8 years ago by i.sudbery 21k

score 0 · Answer 1 · 2019-08-21

I don't think any transcript quantification tool (RSEM/Kallisto/Salmon/Cufflinks/StringTie) repports the fraction of transcripts with coverage. The closes you get is StringTie which repports the average coverage - but that can still be driven by such events as you see. To do this you would need to do a standard genome mapping using e.g. STAR and then calculate the gene coverage afterwards with tools such as RNA-SeQC or Rseqc