We are using multiple primers for PCR, and we want to remove any alterations introduced by the primer sequences while keeping the rest of the sequence as intact as possible, since such alterations will affect the structure of the downstream protein product. I have only found tools that mask the primer sequences with Ns, but I want to keep the sequence intact. Is there a way to trim the matched primer sequence and replace it with the closest-matching reference primer sequence?
The input FASTQ could look something like this:
@M03739:62:000000000-JDHFC:1:1101:17064:1807 1:N:0:19
CATTCG**CAGATGCAGCTGGTGCA**GTCTGGGTCTGAGTTGAAGAAGCCTGGGGCCTCAGTGAAGGTTTCCTGCAAGGCTTCTGGATACACCTTCACTAGCTATACTATGAATTGGATACGACAGGCCCCTGGACAAGGGCTTGAGTGGCTGGGATGGATCAACACCAACAGTGGGAACCCAACGTATACCCAGGGCTTCACAGGACGGTTTGTCTTCTCCTTGGACACCTCTGTCAGCACGGCATATCTGCAGATCAGCAGCCTAAAGGCTGAGGACACTGCCGTGTATTACTGTGCGAGGG
+
...
@M03739:62:000000000-JDHFC:1:1101:23479:1823 1:N:0:19
TATTAG**GAGGTGCAGCTGGTGCA**GTGAGCTGCCTTGATGGAGCTAGTACACTTGCTCAACATGGCTGAGTGTTCCCTGTGTTGCACCAGGCACAACACATCCCCCAAGAGCTTCTCATGCTTGCACATGCACTCAGAGTCCACCTTCACACAGCCACAACGACGGCCCAGAGCCGGATCTCTCATCTCCAAGATAAACATAGTGCCCTGGGGAGGGACCACGGTCACCGTCCCCTCACATTTGAGATCGGAAGAGCACACGTCTGAACTCCAGTCACCAATTAACATCTCGTATGCCGTCT
+
...
The primer FASTA could look like this:
>Primer-1
CAGATGCAGCTGGTGCA
>Primer-2
CAGGTGCAGCTGGTGCA
The primer sequence starts at the 7th base, so after primer trimming and replacement, I hope to get an output FASTQ like this:
@M03739:62:000000000-JDHFC:1:1101:17064:1807 1:N:0:19
**CAGATGCAGCTGGTGCA**GTCTGGGTCTGAGTTGAAGAAGCCTGGGGCCTCAGTGAAGGTTTCCTGCAAGGCTTCTGGATACACCTTCACTAGCTATACTATGAATTGGATACGACAGGCCCCTGGACAAGGGCTTGAGTGGCTGGGATGGATCAACACCAACAGTGGGAACCCAACGTATACCCAGGGCTTCACAGGACGGTTTGTCTTCTCCTTGGACACCTCTGTCAGCACGGCATATCTGCAGATCAGCAGCCTAAAGGCTGAGGACACTGCCGTGTATTACTGTGCGAGGG
+
...
@M03739:62:000000000-JDHFC:1:1101:23479:1823 1:N:0:19
**CAGGTGCAGCTGGTGCA**GTGAGCTGCCTTGATGGAGCTAGTACACTTGCTCAACATGGCTGAGTGTTCCCTGTGTTGCACCAGGCACAACACATCCCCCAAGAGCTTCTCATGCTTGCACATGCACTCAGAGTCCACCTTCACACAGCCACAACGACGGCCCAGAGCCGGATCTCTCATCTCCAAGATAAACATAGTGCCCTGGGGAGGGACCACGGTCACCGTCCCCTCACATTTGAGATCGGAAGAGCACACGTCTGAACTCCAGTCACCAATTAACATCTCGTATGCCGTCT
+
...
The sequence preceding the primer should be trimmed, and the altered primer should be replaced by the reference primer sequence.
Please post example input data and expected output data instead of describing the data, and don't post images of the data.
The input data are a primer FASTA file and target-sequence FASTQ files; the output files are modified target-sequence FASTQ files.
Example data has been added to the OP.
Since you provided incomplete FASTQ records, I created a FASTA file with the necessary sequences from the example data and am posting example code (the code also works with FASTQ files):
input:
output:
Either identify the fixed part of the primer sequence to remove, or allow cutadapt (or any other trimming tool) a certain error tolerance (0.2 for the example data) to trim the FASTQ records.
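To illustrate what an error tolerance of 0.2 means for a 17-base primer (at most 3 mismatches), here is a minimal Python sketch that scans a read for the closest primer. It is not how cutadapt works internally: it counts substitutions only (Hamming distance, no indels), and the names `hamming` and `closest_primer` are made up for this example. The reads are shortened versions of the two example records.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def closest_primer(read, primers, max_error_rate=0.2):
    """Return (primer_name, start, mismatches) for the best primer hit.

    Every window of the read is compared with every primer. A hit is
    accepted only if its mismatches do not exceed
    int(max_error_rate * primer_length). The fewest mismatches win and
    ties go to the leftmost position. Returns None if nothing qualifies.
    """
    best = None
    for name, primer in primers.items():
        allowed = int(max_error_rate * len(primer))
        for start in range(len(read) - len(primer) + 1):
            mm = hamming(read[start:start + len(primer)], primer)
            if mm > allowed:
                continue
            if best is None or (mm, start) < (best[2], best[1]):
                best = (name, start, mm)
    return best


# Reference primers from the question
PRIMERS = {
    "Primer-1": "CAGATGCAGCTGGTGCA",
    "Primer-2": "CAGGTGCAGCTGGTGCA",
}

# First 40 bases of the two example reads
READ_1 = "CATTCGCAGATGCAGCTGGTGCAGTCTGGGTCTGAGTTGA"
READ_2 = "TATTAGGAGGTGCAGCTGGTGCAGTGAGCTGCCTTGATGG"

if __name__ == "__main__":
    print(closest_primer(READ_1, PRIMERS))  # ('Primer-1', 6, 0)
    print(closest_primer(READ_2, PRIMERS))  # ('Primer-2', 6, 1)
```

For the second read, Primer-2 differs at one position and Primer-1 at two, so Primer-2 is picked, which matches the expected output in the question.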
If the primer always starts at the 7th base, as in the example data, first remove the 6 bases preceding it, then replace the first 17 characters (the primer length) of the trimmed read with the reference primer sequence.
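The fixed-position approach could be sketched in Python roughly as follows. In the example reads the primer begins at the 7th base, i.e. six bases precede it, so the sketch uses an offset of 6. The function names are hypothetical, and several things are assumptions on my side: all primers have the same length, records are plain 4-line FASTQ, and the replaced bases keep their original quality values.

```python
import io


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def replace_primer(seq, qual, primers, offset=6):
    """Trim `offset` leading bases, then overwrite the primer region with
    the closest reference primer (fewest mismatches).

    Assumes all primers have the same length. Reads too short to contain
    offset + primer are returned unchanged.
    """
    primer_len = len(primers[0])
    if len(seq) < offset + primer_len:
        return seq, qual
    trimmed_seq, trimmed_qual = seq[offset:], qual[offset:]
    best = min(primers, key=lambda p: hamming(trimmed_seq[:primer_len], p))
    # Quality values of the replaced bases are kept as they were.
    return best + trimmed_seq[primer_len:], trimmed_qual


def process_fastq(handle_in, handle_out, primers, offset=6):
    """Stream 4-line FASTQ records from handle_in to handle_out."""
    while True:
        header = handle_in.readline().rstrip("\n")
        if not header:
            break
        seq = handle_in.readline().rstrip("\n")
        plus = handle_in.readline().rstrip("\n")
        qual = handle_in.readline().rstrip("\n")
        new_seq, new_qual = replace_primer(seq, qual, primers, offset)
        handle_out.write(f"{header}\n{new_seq}\n{plus}\n{new_qual}\n")


PRIMERS = ["CAGATGCAGCTGGTGCA", "CAGGTGCAGCTGGTGCA"]

if __name__ == "__main__":
    # Shortened second example read with a dummy quality string
    record = "@read2\nTATTAGGAGGTGCAGCTGGTGCAGTGAGCTGCC\n+\n" + "I" * 33 + "\n"
    out = io.StringIO()
    process_fastq(io.StringIO(record), out, PRIMERS)
    print(out.getvalue(), end="")
```

Note that this always substitutes the closest primer, even when the read carries no recognisable primer at that position, so you may want to add a mismatch cutoff before replacing.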