I've got a non-trivial amount of adapter contamination in a paired-end Illumina run with reads longer than 100 bp (i.e. the machine is reading technical adapters/primers rather than biological sequence). How would I best go about identifying the contaminated reads?
I know the sequences of the adapters used, but because of sequencing errors a straightforward regular-expression match won't work. The adapter sequences are about 75 bp long and seem to always begin at the very 5' end of the affected reads (though I can't be 100% sure that this always holds). The remaining 3' parts of those reads seem to be nonsense low-complexity sequence with lots of homopolymers.
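To make the kind of error-tolerant matching described above concrete, here is a minimal Python sketch. It computes the smallest edit distance between the adapter and any prefix of the read, so mismatches and small indels from sequencing errors are tolerated. The function names, the 10% error threshold, and the toy sequences are all assumptions made for illustration. This is not how any particular trimming tool works internally, and it only checks the 5' end of the read.

```python
def prefix_edit_distance(adapter: str, read: str) -> int:
    """Smallest edit distance between `adapter` and any prefix of `read`."""
    # prev[j] holds the edit distance between adapter[:i] and read[:j].
    prev = list(range(len(read) + 1))
    for i, a in enumerate(adapter, start=1):
        curr = [i]
        for j, r in enumerate(read, start=1):
            cost = 0 if a == r else 1
            curr.append(min(
                prev[j] + 1,         # adapter base missing from the read
                curr[j - 1] + 1,     # extra base in the read
                prev[j - 1] + cost,  # match or mismatch
            ))
        prev = curr
    # The adapter must be fully aligned; the read prefix may end anywhere.
    return min(prev)


def looks_like_adapter(read: str, adapter: str, max_error_rate: float = 0.1) -> bool:
    """Flag a read whose 5' end matches the adapter within the error budget.

    max_error_rate is an assumed threshold, not a recommended value.
    """
    max_errors = int(max_error_rate * len(adapter))
    # A prefix longer than this cannot be within max_errors edits of the adapter.
    window = read[: len(adapter) + max_errors]
    return prefix_edit_distance(adapter, window) <= max_errors


if __name__ == "__main__":
    adapter = "ACGTACGTAC"              # toy adapter, not a real Illumina sequence
    contaminated = "ACGTTCGTACAAAAAAAA"  # one mismatch, then homopolymer tail
    clean = "TTTTTTTTTTTTTTTTTT"
    print(looks_like_adapter(contaminated, adapter))
    print(looks_like_adapter(clean, adapter))
```

The dynamic programme is quadratic in sequence length, so for a full run a dedicated tool will be far faster. The sketch is only meant to show why an alignment with an error budget succeeds where an exact pattern match fails.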
Thanks - I have now used this program with great success.
New home for cutadapt: https://cutadapt.readthedocs.io/en/stable/