I am working on a simple fastq -> mutant enrichment score pipeline, but wonder if I'm not thiking to simplistic. This is the idea...
Setup:
- I have an UNSORTED and SORTED sample, 2 fastqs each.. R1 and R2.
Readlenght is 150bp.
- The sequence of interest is a 192bp long sequence.
- R1 has a primer1 indicating the start of sequence of interest
- R2 has a primer2 indicating the start of sequence of interest
Approach:
- Trim raw data using the primers, keeping only the region of interest
- Merge R1 and R2, creating the complete region of interest (discarding all resulting reads not being 192bp and filtering on quality 30). Little of over 80% of reads remain here btw.
- (Use seqtk to) translate DNA sequence to protein sequence (first fastq to fasta, then fasta to protein)
- Calculate frequency of protein mutants/variants (nr of variants divided by total amount) for each sample
- Calculate enrichment using ratios from 4) (freq-SORT/freq-UNSORTED)?
- log2 transform the results from 5)
End result:
Data table with amino acids sequence of interest as cols, amino-acid changes as rows and log2(enrichmentratios) as values which will then be plotted in the form of a heatmap based on enrichment ratios...
Because we are looking at a fixed sized sequence which is entirely within the PE reads no mapping is necessary.
Other:
I have been looking into various options for DMS (enrich2, dms_tools2, mutscan) but if the above is correct then diving into those tools feels a bit much...
I feel like I'm looking at iit too easy though, what am I missing?
UPDATE
After meeting up with the person requesting this it indeed seems this is the proper approach, at least... After processing an earlier datasets the results overlap with the original. (Apart from some details)