Hi folks,
I'm trying to use salmon to count viral transcripts in some clinical samples. However, when I quantify these viruses, salmon maps poly A or poly T regions of transcripts to similarly sized poly A and poly T regions in the viral genomes, and only to those regions, and registers these as counts. Is there any way to increase the length of the mapping required before it is considered a true count?
I'd like to be able to use a cut-off so that, say, 50% of a viral genome has to be present and mapped from the transcripts before it is counted, rather than a very small polynucleotide region that is likely an artifact rather than a true count of that virus.
Thanks!
Can you give a more "solid" example of such a "bad" mapping? At least for me it is difficult to understand what you mean. Also, please share the command line.
So if you have a look at this image here
Salmon is wrongly counting this transcript (viral genome) due to the presence of just these poly A reads. There are no other reads that align or map to any other section of the transcript, and since this is just a poly A section I think I'd be justified in saying it is counted in error. Ideally I'd want a way to set a minimum coverage, so that say 20-50% of a transcript needs to have some coverage before it's accepted as a count.
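As far as this thread goes, nobody knows of a salmon option for a minimum-coverage cut-off, so one possibility is to apply it as a post-hoc filter outside salmon. Below is a minimal sketch, assuming you have already extracted the aligned intervals for each transcript (for example from a SAM/BAM file) as 0-based, half-open `(start, end)` pairs. The names `covered_fraction` and `passes_breadth` and the default threshold are made up for illustration and are not part of salmon.

```python
def covered_fraction(intervals, transcript_length):
    """Fraction of transcript positions covered by at least one interval.

    Intervals are assumed to be 0-based, half-open (start, end) pairs.
    """
    covered = 0
    current_end = 0  # right edge of the region already counted
    for start, end in sorted(intervals):
        start = max(start, current_end)   # skip positions counted before
        end = min(end, transcript_length)  # clip to the transcript
        if end > start:
            covered += end - start
            current_end = end
    return covered / transcript_length


def passes_breadth(intervals, transcript_length, min_fraction=0.5):
    """True if enough of the transcript is covered to keep its count."""
    return covered_fraction(intervals, transcript_length) >= min_fraction


# Three alignments on a 100 bp transcript, two of them overlapping
hits = [(0, 10), (5, 20), (50, 60)]
print(covered_fraction(hits, 100))
print(passes_breadth(hits, 100, min_fraction=0.5))
```

A transcript whose only hits sit in one short poly A stretch would get a very small covered fraction and be dropped by a filter like this.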
My script is:
edit (couldn't get hyperlink working)
Edited the link. You have to paste the full link, including the suffix, into the field that pops up when you click the image button.
Ok I see what you mean. Did you check how the mate reads align in this case? It is paired-end sequencing so the mate would need to align somewhere near that problematic region, and it would need to be a valid alignment to be even considered by salmon from what I understand. I will tag the developer Rob.
Ahh okay, thanks!
So I changed the view type to read type in Tablet, and all of them are classed as "Mate unmapped". However, the vast majority of these are in the same direction (arrow going to the right, I'm assuming this is read 1?). I'm not sure if that makes any difference here.
https://ibb.co/ZdP1tcK
I don't know of a way to do what you ask in salmon (doesn't mean it doesn't exist of course). But a different approach might be to mask homopolymers of a certain length in the viral genome before aligning to it.
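To make the masking idea concrete, here is a rough sketch in plain Python that replaces every homopolymer run of at least `min_run` bases with `N` in a FASTA file. The function names and the threshold of 20 are hypothetical choices for illustration, and how salmon's indexer treats the resulting `N` stretches is something you would need to check yourself.

```python
import re


def mask_homopolymers(seq, min_run=20, mask_char="N"):
    """Replace each run of one repeated base (>= min_run long) with mask_char."""
    # ([ACGT])\1{min_run-1,} matches a base followed by at least
    # min_run-1 more copies of that same base.
    pattern = re.compile(r"([ACGT])\1{%d,}" % (min_run - 1), re.IGNORECASE)
    return pattern.sub(lambda m: mask_char * len(m.group(0)), seq)


def mask_fasta(lines, min_run=20):
    """Yield header and masked sequence for each FASTA record.

    Sequence lines of a record are joined first, so runs that span
    line breaks are still detected.
    """
    header, chunks = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header
                yield mask_homopolymers("".join(chunks), min_run)
            header, chunks = line, []
        elif line:
            chunks.append(line)
    if header is not None:
        yield header
        yield mask_homopolymers("".join(chunks), min_run)


# Toy record with a 5 bp poly A run split over two lines
for out in mask_fasta([">virus1\n", "ACAA\n", "AAAG\n"], min_run=5):
    print(out)
```

The masked FASTA would then be used to build the index instead of the original viral genomes.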
Wouldn't it be "safer" to trim trailing polyA sequences of a certain length from the reads directly? That way one would probably still get alignments where the polyA is flanked by non-polyA sequence and the true origin of the read is not a polyA tail (given read length is sufficient, which it seems to be here).
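A minimal sketch of what trimming trailing polyA from the reads could look like, assuming uppercase read sequences and a matching quality string. The function name `trim_poly_tail` and the minimum run length of 10 are illustrative assumptions; in practice a dedicated read trimmer is probably the better tool.

```python
def trim_poly_tail(seq, qual, base="A", min_run=10):
    """Remove a trailing run of `base` if it is at least min_run long.

    Returns the (possibly shortened) sequence and quality string.
    Shorter runs are left alone so normal read ends are not clipped.
    """
    stripped = seq.rstrip(base)
    run_length = len(seq) - len(stripped)
    if run_length >= min_run:
        return stripped, qual[:len(stripped)]
    return seq, qual


# Read ending in 12 A's: the tail is removed, the flanking bases stay
read = "ACGT" + "A" * 12
print(trim_poly_tail(read, "I" * len(read)))

# Read with only 3 trailing A's: returned unchanged
print(trim_poly_tail("ACGTAAA", "IIIIIII"))
```

Reads that are almost entirely polyA end up very short after this step and could be dropped with a minimum-length filter, while reads with real flanking sequence keep it and can still align.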
Looking at the reads above, it doesn't look to me like these reads definitely come from polyA tails. Specifically, there are reads in the pile-up above where there are non-A bases flanking the homopolymer run on both sides, where none of the non-A bases match, but the read is still aligned.