Select sequences if contain two specific substrings
1
0
Entering edit mode
4.0 years ago
Apex92 ▴ 300

I have a big fasta file that I want to select reads if contain any substrings based on two dictionaries (in the structure below).

casp_fw1 ={'P_fw1':'CCAGGTAGTAGTACGTCTGTT','P_fw1_rc':'AACAGACGTACTACTACCTGG',
         'A_fw1':'CGGCCTCCTCCAATTAAAGAA','A_fw1_rc':'TTCTTTAATTGGAGGAGGCCG'}

casp_fw2 ={'P_fw2':'CGTGCTGCGAGAGTATTATCT','P_fw2_rc':'AGATAATACTCTCGCAGCACG',
            'A_fw2':'CGTGAACCGTTATTTGGGTAC','A_fw2_rc':'GTACCCAAATAACGGTTCACG'}

This is my code but I noticed that I miss some of the reads even though the have substrings from the two dictionaries. Could somebody please help to optimize the code? I afraid that 3 for loops cause the problem.

from Bio import SeqIO

count=0
with open('1_S1_L001_R1_001.fasta','r') as f:
    for i in SeqIO.parse(f,'fasta'):
        for k1,v1 in casp_fw1.items():
            for k2,v2 in casp_fw2.items():
                if v1 in i.seq and v2 in i.seq:
                    casp_fw1_pos = i.seq.find(v1) + 1
                    casp_fw2_pos = i.seq.find(v2) + 1
                    if casp_fw1_pos < casp_fw2_pos:
                        distance = casp_fw2_pos - casp_fw1_pos
                    if casp_fw1_pos > casp_fw2_pos:
                        distance = casp_fw1_pos - casp_fw2_pos
                    print("sample_1")
                    printi.id)
                    print(i.seq)
                    print(str(k1) + " : " + str(v1) +  " - The position of this fw 21nt is " + str(casp_fw1_pos))
                    print(str(k2) + " : " + str(v2) +  " - The position of this rv 21nt is " + str(casp_fw2_pos))
                    print("Distance is " + str(distance))
                    print('\n')
                    count+=1
    print("The total number of reads that have both 21nt protein-specific sequences is " + str(count))
sequence RNA-Seq software error python • 858 views
ADD COMMENT
0
Entering edit mode

Curious if you were not able to get bbduk.sh to work as discussed in a prior thread you had posted.

ADD REPLY
0
Entering edit mode

Hey genomax, I ran it but the output was not exactly as I was looking for - so did the mapping otherway around with making each fasta file as a separate reference and the custom reference file as the input - with this I was able to align each sample with the custom reference

ADD REPLY
0
Entering edit mode

Is understanding the position of the match necessary - it's not clear to me from your question? if you want to select sequences based on solely containing those subsequences you simply need to do if subseq in sequence

ADD REPLY
0
Entering edit mode
3.9 years ago

@Biostar bot pushes this post to frontpage ~

It looks like a case of amplicon sequencing data, for retrieving amplicon from SE or merged PE reads. We did this a lot, so I wrote a tool seqkit amplicon (link for usage and examples) for these kind of operations.

  • It supports retrieving amplicon sequence (or specific region around it) via primer(s).
  • Mismatch is allowed when matching primers and sequences, but the mismatch position is not controled, e.g., 3' end mismatch is possible in PCR.
  • Output can be FASTA sequence or BED6 format.
  • Multiple pairs of primers are also supported.
ADD COMMENT

Login before adding your answer.

Traffic: 1105 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6