Question

Select sequences if contain two specific substrings

0

Entering edit mode

4.0 years ago

Apex92 ▴ 300

I have a big fasta file that I want to select reads if contain any substrings based on two dictionaries (in the structure below).

casp_fw1 ={'P_fw1':'CCAGGTAGTAGTACGTCTGTT','P_fw1_rc':'AACAGACGTACTACTACCTGG',
         'A_fw1':'CGGCCTCCTCCAATTAAAGAA','A_fw1_rc':'TTCTTTAATTGGAGGAGGCCG'}

casp_fw2 ={'P_fw2':'CGTGCTGCGAGAGTATTATCT','P_fw2_rc':'AGATAATACTCTCGCAGCACG',
            'A_fw2':'CGTGAACCGTTATTTGGGTAC','A_fw2_rc':'GTACCCAAATAACGGTTCACG'}

This is my code but I noticed that I miss some of the reads even though the have substrings from the two dictionaries. Could somebody please help to optimize the code? I afraid that 3 for loops cause the problem.

from Bio import SeqIO

count=0
with open('1_S1_L001_R1_001.fasta','r') as f:
    for i in SeqIO.parse(f,'fasta'):
        for k1,v1 in casp_fw1.items():
            for k2,v2 in casp_fw2.items():
                if v1 in i.seq and v2 in i.seq:
                    casp_fw1_pos = i.seq.find(v1) + 1
                    casp_fw2_pos = i.seq.find(v2) + 1
                    if casp_fw1_pos < casp_fw2_pos:
                        distance = casp_fw2_pos - casp_fw1_pos
                    if casp_fw1_pos > casp_fw2_pos:
                        distance = casp_fw1_pos - casp_fw2_pos
                    print("sample_1")
                    printi.id)
                    print(i.seq)
                    print(str(k1) + " : " + str(v1) +  " - The position of this fw 21nt is " + str(casp_fw1_pos))
                    print(str(k2) + " : " + str(v2) +  " - The position of this rv 21nt is " + str(casp_fw2_pos))
                    print("Distance is " + str(distance))
                    print('\n')
                    count+=1
    print("The total number of reads that have both 21nt protein-specific sequences is " + str(count))

sequence RNA-Seq software error python • 856 views

ADD COMMENT • link updated 3.9 years ago by Biostar 20 • written 4.0 years ago by Apex92 ▴ 300

0

Entering edit mode

Curious if you were not able to get bbduk.sh to work as discussed in a prior thread you had posted.

ADD REPLY • link 4.0 years ago by GenoMax 147k

0

Entering edit mode

Hey genomax, I ran it but the output was not exactly as I was looking for - so did the mapping otherway around with making each fasta file as a separate reference and the custom reference file as the input - with this I was able to align each sample with the custom reference

ADD REPLY • link 4.0 years ago by Apex92 ▴ 300

0

Entering edit mode

Is understanding the position of the match necessary - it's not clear to me from your question? if you want to select sequences based on solely containing those subsequences you simply need to do if subseq in sequence

ADD REPLY • link 4.0 years ago by Joe 21k

score 0 · Answer 1 · 2021-01-03

@Biostar bot pushes this post to frontpage ~

It looks like a case of amplicon sequencing data, for retrieving amplicon from SE or merged PE reads. We did this a lot, so I wrote a tool seqkit amplicon (link for usage and examples) for these kind of operations.

It supports retrieving amplicon sequence (or specific region around it) via primer(s).
Mismatch is allowed when matching primers and sequences, but the mismatch position is not controled, e.g., 3' end mismatch is possible in PCR.
Output can be FASTA sequence or BED6 format.
Multiple pairs of primers are also supported.