13 months ago
noodle
Hi all,
Can someone recommend a one-liner or command-line tool to produce a complexity curve from paired-end (PE) FASTQ files? Something like preseq c_curve, but taking the FASTQ files as input instead of an alignment, with flags for the number of mismatches allowed, etc. TIA!
Thanks, I know this tool, but it doesn't seem very flexible when it comes to k-mer length and number of mismatches. Do you think it would be appropriate for PE data with R1 = 61 bp and R2 = 51 bp?
Yes, it will work fine with those read lengths, though whether the tool is appropriate depends on the exact question you wish to answer. Because it requires an exact k-mer match to consider sequences duplicates, sequencing errors make reads look unique. You will, for example, see upward spikes in low-quality areas of the flow cell, and the curve will level off slightly above zero rather than reaching it, depending on the error rate. Determining on the fly whether read pairs are duplicates while allowing for an arbitrary number of mismatches is rather difficult. You could, of course, error-correct the data before measuring complexity; then you wouldn't need to worry about mismatch flags, since the spurious complexity would be eliminated (to the extent possible).
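To make the exact-match behaviour concrete, here is a minimal Python sketch of a k-mer-based complexity curve computed straight from paired FASTQ files. It is not the implementation of the tool discussed above, just an illustration under some assumptions: the function names (`read_seqs`, `complexity_curve`), the choice of the first `k` bases of each mate as the pair's key, and the defaults `k=25` and `step=25000` are all made up for the example. It also assumes plain 4-line FASTQ records with R1 and R2 in the same order.

```python
import gzip
from itertools import islice


def read_seqs(path):
    """Yield the sequence line of each 4-line FASTQ record (plain or .gz)."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        # Sequence is the 2nd line of every 4-line record.
        for line in islice(handle, 1, None, 4):
            yield line.strip()


def complexity_curve(pairs, k=25, step=25000):
    """Return [(pairs_processed, percent_unique_in_bin), ...].

    pairs: iterable of (r1_seq, r2_seq) strings.
    A pair counts as unique if the exact combination of the first k bases
    of R1 and the first k bases of R2 has not been seen before. No
    mismatches are tolerated, so a single sequencing error inside either
    k-mer makes a true duplicate look unique. A trailing partial bin is
    not reported.
    """
    seen = set()
    curve = []
    unique_in_bin = 0
    for n, (r1, r2) in enumerate(pairs, start=1):
        key = (r1[:k], r2[:k])
        if key not in seen:
            seen.add(key)
            unique_in_bin += 1
        if n % step == 0:
            curve.append((n, 100.0 * unique_in_bin / step))
            unique_in_bin = 0
    return curve


# Hypothetical usage with two paired files:
# pairs = zip(read_seqs("sample_R1.fastq.gz"), read_seqs("sample_R2.fastq.gz"))
# for n, pct in complexity_curve(pairs, k=25):
#     print(n, round(pct, 2), sep="\t")
```

With R1 = 61 bp and R2 = 51 bp, any `k` up to 51 keeps both k-mers full length. The set of seen keys grows with the number of distinct pairs, so memory is the limiting factor on large runs.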