Entering edit mode
4.0 years ago
Apex92
▴
300
I have a big fasta file that I want to select reads if contain any substrings based on two dictionaries (in the structure below).
casp_fw1 ={'P_fw1':'CCAGGTAGTAGTACGTCTGTT','P_fw1_rc':'AACAGACGTACTACTACCTGG',
'A_fw1':'CGGCCTCCTCCAATTAAAGAA','A_fw1_rc':'TTCTTTAATTGGAGGAGGCCG'}
casp_fw2 ={'P_fw2':'CGTGCTGCGAGAGTATTATCT','P_fw2_rc':'AGATAATACTCTCGCAGCACG',
'A_fw2':'CGTGAACCGTTATTTGGGTAC','A_fw2_rc':'GTACCCAAATAACGGTTCACG'}
This is my code but I noticed that I miss some of the reads even though the have substrings from the two dictionaries. Could somebody please help to optimize the code? I afraid that 3 for loops cause the problem.
from Bio import SeqIO
count=0
with open('1_S1_L001_R1_001.fasta','r') as f:
for i in SeqIO.parse(f,'fasta'):
for k1,v1 in casp_fw1.items():
for k2,v2 in casp_fw2.items():
if v1 in i.seq and v2 in i.seq:
casp_fw1_pos = i.seq.find(v1) + 1
casp_fw2_pos = i.seq.find(v2) + 1
if casp_fw1_pos < casp_fw2_pos:
distance = casp_fw2_pos - casp_fw1_pos
if casp_fw1_pos > casp_fw2_pos:
distance = casp_fw1_pos - casp_fw2_pos
print("sample_1")
printi.id)
print(i.seq)
print(str(k1) + " : " + str(v1) + " - The position of this fw 21nt is " + str(casp_fw1_pos))
print(str(k2) + " : " + str(v2) + " - The position of this rv 21nt is " + str(casp_fw2_pos))
print("Distance is " + str(distance))
print('\n')
count+=1
print("The total number of reads that have both 21nt protein-specific sequences is " + str(count))
Curious if you were not able to get
bbduk.sh
to work as discussed in a prior thread you had posted.Hey genomax, I ran it but the output was not exactly as I was looking for - so did the mapping otherway around with making each fasta file as a separate reference and the custom reference file as the input - with this I was able to align each sample with the custom reference
Is understanding the position of the match necessary - it's not clear to me from your question? if you want to select sequences based on solely containing those subsequences you simply need to do
if subseq in sequence