So before mapping, I run trim my files using BBDuk and PRINSEQ - and PRINSEQ outputs separate singleton.fa for each of my paired-ends so that I end up with 4 total files (out1.fa
, singleton1.fa
, out2.fa
, singleton2.fa
)
${BBDuk} -Xmx120g in1="${file1}" in2="${file2}" out1="${file3}" out2="${file4}" ref="${adapter}" trimq=10
paste - - - - < "${file3}" | sort -k1,1 -t " " | tr "\t" "\n" > "${file5}"
paste - - - - < "${file4}" | sort -k1,1 -t " " | tr "\t" "\n" > "${file6}"
perl ${PRINSEQ} -fastq "${file5}" -fastq2 "${file6}" -no_qual_header -trim_right 1 -custom_params "A 75%;T 75%;G 75%;C 75%" min_qual_mean 25 -min_len 40 -out_format 3 -out_good "${file1%_1.fastq}_tsa" -out_bad null -log
With TopHat I fed all 4 files for alignment but from my understanding, STAR is singleton aware (?) and only needs the PE as inputs, but is it still possible to feed all 4 files?
#${STAR} \
--runMode alignReads \
--twopassMode Basic \
--runThreadN 24 \
--outSAMtype BAM SortedByCoordinate \
--sjdbGTFfile "${sjdb}" \
--genomeDir "${reference}" \
--readFilesIn "${file7}" "${file8}"
Lastly, Alex recommends to use a large value for --sjdbOverhang
(length of seq on each side of splice junction) like 100. But knowing that my reads may be very short I would like to find the optimal value (mate_length - 1
).
How could I go about, first, finding mate_length
and then automatically feeding the value without intervention for each of my RNA seq files?
EDIT: I saw in GATK RNA-seq variant calling best practice that they use --sjdbOverhang 75
and was curious if I should just set a hard value for all of my files instead (like 75 or default 100).
To do adapter-trimming with BBDuk, you also need to specify a few other parameters. Your above command will do adapter-filtering, not adapter-trimming. You can also set a min length cutoff and min quality cutoff if you wish with the minlen and maq flags (minlen=40 maq=25) but personally I think a minimum average quality of 25 is extremely high and likely to be detrimental.
When a pair has a read that falls below your minimum length threshold, it's easiest to discard the entire pair than to process both the pairs and singletons. I don't really recommend that in most situations where these singletons are only a very small percentage of your total data.