We have some older published RNA-seq data that we would like to quantify with Salmon, but the paired-end read lengths differ quite a bit, ranging from 2x100bp through 75/25bp, 50/50bp, 35/35bp and some other combinations. Should one use the same index for all files, e.g. -k=19 (to handle the 35bp reads), or rather several indices, e.g. -k=31 for the longer reads and -k=19 for the shorter reads? Would it introduce a kind of mappability/quantification bias if one uses different indices for different files, especially if different runs (lane replicates) belong to the same patient? Another possibility would be to trim all reads to a fixed length, which in this case would be the shortest read length in the whole dataset (25bp). Your opinions would be appreciated.
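For concreteness, the "several indices" option would look roughly like this with Salmon's indexing command (transcriptome file and index directory names below are just placeholders):

    # two indices from the same transcriptome, differing only in the k-mer size
    # (k=31 is Salmon's default and works well for reads of ~75bp and longer;
    #  a smaller k such as 19 is the usual suggestion for much shorter reads)
    salmon index -t gencode.v28.transcripts.fa.gz -i salmon_index_k31 -k 31
    salmon index -t gencode.v28.transcripts.fa.gz -i salmon_index_k19 -k 19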
EDIT: Following up on this, I quantified four different samples (varying read lengths, x-axis) with Salmon, default settings plus --validateMappings, against the human gencode_v28 transcriptome (varying k-mer length, legend), and saw that the k-mer length plays a very minor role in the mapping percentage (not discussing now whether these samples are of good or bad quality).
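The quantification calls were essentially of this form (sample file names, thread count and output directory are placeholders):

    salmon quant -i salmon_index_k31 -l A \
        -1 sample_R1.fastq.gz -2 sample_R2.fastq.gz \
        --validateMappings -p 8 -o quants/sample_k31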
As a comparison, the same samples mapped with HISAT2 (defaults, against hg38) have the following mapping rates (% mapped, % properly paired):
36-36: 75%, 63%
51-51: 76%, 69%
75-75: 98%, 95%
76-26: 43%, 0.06%
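The HISAT2 runs were plain default alignments, roughly along these lines (index prefix and file names again placeholders):

    hisat2 -x hg38_index -1 sample_R1.fastq.gz -2 sample_R2.fastq.gz -p 8 -S sample.sam
    # HISAT2 prints the overall alignment summary to stderr at the end of the run;
    # the properly-paired percentage can be taken from e.g. samtools flagstat on the output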
So the k-mer length has only a small influence on the mapping rate. Therefore (as most of my samples have either 50 or 76bp reads) I will probably use something like -k=25 for all samples, categorically discard those that have either a forward or reverse read length of 26bp, and indicate the read length in the DE design.
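In practice that boils down to one k=25 index reused for every retained sample, something like (sample names are placeholders):

    salmon index -t gencode.v28.transcripts.fa.gz -i salmon_index_k25 -k 25
    for s in sampleA sampleB sampleC; do
        salmon quant -i salmon_index_k25 -l A -1 ${s}_R1.fastq.gz -2 ${s}_R2.fastq.gz \
            --validateMappings -p 8 -o quants/${s}
    done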
tagging: Rob