Hi everyone,
We performed long-read amplicon sequencing using Pacbio, targeting a specific gene in Drosophila and we want to identify all variants of the gene and look at their relative proportions. The issue I am having is that while the amplicon is approximately 4 kb for most samples, in one Drosophila line it is actually 6 kb because they had previously inserted a vector of some sort inside this gene. So I have a few samples with 6 kb reads and most others with 4kb reads. I think I should remove this sequence from the middle of the reads before running the amplicon-sequencing analysis, but I am struggling to find a solution for this. I have found the sequence and used cutadapt on my fastq.gz files, which could correctly identify the sequence from most reads and trim it away. However, cutadapt also trims all that is after the sequence, while I would like to keep that and concatenate it with everything that is before this sequence. Could you suggest how to do this?
Thank you very much in advance for your help.
Best, Andrea
I don't think any off-the-shelf trimming programs will do that. As you discovered trimming programs are designed to remove all sequence on 3'-end once a match is made. You may need to write some custom code that would do it after you "filter" out reads that contain the sequence (easily done with
bbduk.sh
infilter
mode).Asking ChatGPT for a solution produced this code. Which can be used as a start:
If your subsequence shows up multiple times in any one sequence, this will produce mangled FASTQ output.
Correct. Above was not meant to be a complete solution and that is why it is posted in a comment. It is meant to be a pointer of how this could be done.