I work with TruSeq Custom Amplicon 1.5 data.
I have trimmed adapters using bbduk with the parameters recommended in the manual:
bbduk.sh -Xmx1g in1=forward.fastq.gz in2=reverse.fastq.gz out1=forward1.clean.fq out2=forward.clean.fq ref=adapters.fa ktrim=r k=23 mink=11 hdist=1 tpe tbo
I've run FastQC after this and adapter contamination seems to be gone, but now I'm disturbed by the Sequence length distribution. I've noticed in Basic statistics that Sequence length is now 10-251, while before trimming it was uniform and equal to 251.
To check this, I ran the following on the FASTQ files before and after trimming:
awk '{if(NR%4==2) print length($0)}' FILE.fastq | sort -n | uniq -c > read_length.txt
Before: 930813 251
After: 4 10 ... (the long list of numbers counting truncated reads)... 917278 251
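To judge how much of the data is actually affected, the counts in read_length.txt can be summarized against a length cutoff. The snippet below is only a sketch: the function name short_read_fraction and the cutoff of 50 are hypothetical choices for illustration, and it assumes the two-column "count length" layout that uniq -c produces.

```shell
# Summarize `uniq -c` output (columns: count, read length), e.g. read_length.txt.
# Prints how many reads are shorter than the given cutoff.
# The function name and the cutoff are illustrative, not part of BBTools.
short_read_fraction() {
    cutoff="$1"   # length threshold in bases (hypothetical value, pick your own)
    file="$2"     # file produced by the awk | sort -n | uniq -c pipeline
    awk -v cutoff="$cutoff" '
        { total += $1; if ($2 < cutoff) short += $1 }
        END {
            if (total > 0)
                printf "%d of %d reads (%.2f%%) shorter than %d\n",
                       short, total, 100 * short / total, cutoff
        }
    ' "$file"
}

# Example call:
#   short_read_fraction 50 read_length.txt
```

If only a small percentage of reads falls below the cutoff, dropping them should cost very little coverage.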
Should I be worried? What should I do with the truncated reads? Is BWA-MEM aware of such reads? They are going to be discarded because of their low MAPQ, right?
Thank you!
Extremely short reads are not going to be very useful, since they are more likely to multimap. You could have filtered such reads out using the minlen= option when you ran bbduk.sh. Since you are already using BBMap, why not stay with the same package and use bbmap.sh, its aligner, to do the alignments? It can handle reads of varying lengths.

Thank you, genomax. What are the factors to consider when setting the threshold for minlen or mlf?
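For reference, the minlen suggestion could be applied by rerunning the trimming step with one extra parameter. This is a minimal sketch, not a recommended setting: the wrapper name trim_with_minlen and the default cutoff of 50 are hypothetical, the input file names are taken from the command in the question, and the output names are assumed to be forward.clean.fq and reverse.clean.fq.

```shell
# Sketch: the bbduk.sh call from the question plus a minimum-length filter.
# The wrapper name and the default of 50 are placeholders, not recommendations.
trim_with_minlen() {
    minlen="${1:-50}"   # reads shorter than this after trimming are discarded
    bbduk.sh -Xmx1g \
        in1=forward.fastq.gz in2=reverse.fastq.gz \
        out1=forward.clean.fq out2=reverse.clean.fq \
        ref=adapters.fa ktrim=r k=23 mink=11 hdist=1 tpe tbo \
        minlen="$minlen"
}

# Example call:
#   trim_with_minlen 50
```

mlf (minlengthfraction) works on the same principle but expresses the cutoff as a fraction of the original read length instead of an absolute number of bases.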