Question

Remove a specific DNA sequence from the middle of sequencing reads

0

Entering edit mode

13 months ago

Andrea.Wall ▴ 10

Hi everyone,

We performed long-read amplicon sequencing using Pacbio, targeting a specific gene in Drosophila and we want to identify all variants of the gene and look at their relative proportions. The issue I am having is that while the amplicon is approximately 4 kb for most samples, in one Drosophila line it is actually 6 kb because they had previously inserted a vector of some sort inside this gene. So I have a few samples with 6 kb reads and most others with 4kb reads. I think I should remove this sequence from the middle of the reads before running the amplicon-sequencing analysis, but I am struggling to find a solution for this. I have found the sequence and used cutadapt on my fastq.gz files, which could correctly identify the sequence from most reads and trim it away. However, cutadapt also trims all that is after the sequence, while I would like to keep that and concatenate it with everything that is before this sequence. Could you suggest how to do this?

Thank you very much in advance for your help.

Best, Andrea

filtering trimming fastq • 834 views

ADD COMMENT • link updated 13 months ago by GenoMax 154k • written 13 months ago by Andrea.Wall ▴ 10

0

Entering edit mode

while I would like to keep that and concatenate it with everything that is before this sequence

I don't think any off-the-shelf trimming programs will do that. As you discovered trimming programs are designed to remove all sequence on 3'-end once a match is made. You may need to write some custom code that would do it after you "filter" out reads that contain the sequence (easily done with bbduk.sh in filter mode).

Asking ChatGPT for a solution produced this code. Which can be used as a start:

def excise_sequence_fastq(fastq_file, subsequence, output_file):
    with open(fastq_file, 'r') as f_in, open(output_file, 'w') as f_out:
        while True:
            # Read four lines (each read in FASTQ format)
            header = f_in.readline().strip()
            if not header:
                break  # End of file
            sequence = f_in.readline().strip()
            separator = f_in.readline().strip()
            quality = f_in.readline().strip()

            # Check if the subsequence is in the sequence
            if subsequence in sequence:
                # Excise the subsequence from the sequence
                start_index = sequence.index(subsequence)
                end_index = start_index + len(subsequence)

                # Remove the corresponding quality scores
                sequence = sequence.replace(subsequence, "")
                quality = quality[:start_index] + quality[end_index:]

            # Write the processed read to the output file
            f_out.write(f"{header}\n")
            f_out.write(f"{sequence}\n")
            f_out.write(f"{separator}\n")
            f_out.write(f"{quality}\n")

# Example usage:
fastq_file = "input_reads.fastq"  # Input FASTQ file
subsequence = "TGAATC"            # Subsequence to excise
output_file = "excised_reads.fastq"  # Output FASTQ file with excised sequences

excise_sequence_fastq(fastq_file, subsequence, output_file)

ADD REPLY • link 13 months ago by GenoMax 154k

0

Entering edit mode

If your subsequence shows up multiple times in any one sequence, this will produce mangled FASTQ output.

ADD REPLY • link 13 months ago by Jesse ▴ 880

0

Entering edit mode

Correct. Above was not meant to be a complete solution and that is why it is posted in a comment. It is meant to be a pointer of how this could be done.

ADD REPLY • link 13 months ago by GenoMax 154k