Should I be concerned about the large number of non-ATCG nucleotides that Salmon reports on STDOUT while indexing my transcriptome? The log line (sans timestamp) is:
[puff::index::jointLog] [info] Replaced 8,836,877 non-ATCG nucleotides
I will be doing differential expression and differential alternative splicing analyses after exposing various Brassica species to heat or cold. I have nonredundant StringTie2-reconstructed transcripts, from which I extracted the spliced exon sequences using gffread. I ran the generateDecoyTranscriptome script from SalmonTools and used the resulting gentrome and decoy files for indexing. So far, so good.
I was puzzled by the line quoted above and searched for reports of issues with Salmon that included its non-numeric keywords, but the number in my output seems considerably higher than in the indexing outputs that have been posted for other issues. The genomic FASTA that I used to extract the spliced exonic sequence was soft-masked. Is that the reason for the high number? I did not see any non-[acgtACGT] nucleotides. Should I be concerned? Indexing ran to completion, and I do not see any other obviously disturbing values or messages.
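One way to check where such characters sit is to tally them per FASTA record, which would show whether they are in the transcripts or in the decoy sequences of the gentrome. The following is only a minimal sketch in plain Python, not part of Salmon or SalmonTools; the function name `count_non_acgt` and the file path given on the command line are placeholders chosen for illustration.

```python
import re
import sys

# Anything other than A, C, G, T in either case (soft-masked bases are lowercase acgt).
NON_ACGT = re.compile(r"[^ACGTacgt]")


def count_non_acgt(lines):
    """Return {record_name: number of non-ACGT characters} for FASTA-formatted lines."""
    counts = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first whitespace-delimited token of the header as the record name.
            parts = line[1:].split()
            name = parts[0] if parts else ""
            counts[name] = 0
        elif name is not None:
            counts[name] += len(NON_ACGT.findall(line))
    return counts


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage: python count_non_acgt.py gentrome.fa
    with open(sys.argv[1]) as handle:
        per_record = count_non_acgt(handle)
    # Print only the records that contain non-ACGT characters, then the total.
    for record, n in per_record.items():
        if n:
            print(f"{record}\t{n}")
    print(f"total\t{sum(per_record.values())}")
```

If the total is in the same range as the number in the Salmon log, the per-record counts should make it clear which sequences contribute most of it.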
Thanks for any feedback!
Oops, yes, that must be where they are coming from. For some reason, I was thinking the genome was just soft-masked. How embarrassing! Thanks for the quick reply.