Hello,
I have a list of forward and reverse complement barcodes (I only show the first rows, there are 30,269 barcodes) :
kas077-1 AAACCCAAGAAGATCT AGATCTTCTTGGGTTT
kas077-2 AAACCCAAGACACACG CGTGTGTCTTGGGTTT
kas077-3 AAACCCAAGCGAGGAG CTCCTCGCTTGGGTTT
kas077-4 AAACCCAAGCGGACAT ATGTCCGCTTGGGTTT
kas077-5 AAACCCAAGCTTTGTG CACAAAGCTTGGGTTT
kas077-6 AAACCCAAGGATACAT ATGTATCCTTGGGTTT
kas077-7 AAACCCAAGGGCCCTT AAGGGCCCTTGGGTTT
And a multi-fasta file (883,993 sequences with a mean size of 1006pb). I need to split this file according to the barcodes. The specifity is :
- I need to find the forward barcodes into the first 50 pb
- I need to find the reverse barcodes into the last 50 pb
All the fasta sequences are not necessary found. I guess the half will be found.
from Bio import SeqIO
barcodes = open("my_barcodes_file.txt", 'r')
barcodes_fwd = {}
barcodes_rev = {}
for line in lines :
barcodes_fwd[line.split(' ')[0]]=line.split(' ')[1]
barcodes_rev[line.split(' ')[0]]=line.split(' ')[2]
with open("my_fasta_file.fasta") as handle:
for record in SeqIO.parse(handle, "fasta"):
for (k1,v1),(k2,v2) in zip(barcodes_fwd.items(),barcodes_rev.items()):
if v1 in record.seq[0:50] or v2 in record.seq[-50:]:
print(record.id,k1+"\n",record.seq)
For the moment, I have something like that which returns :
>read_ID barcode_ID
fasta_sequence
I will split later the output in multi-files by barcode ID.
The problem is that it is very slow. Do you have an idea to update the code?
Best
This is a perfect use case for
bbduk.sh
from BBmap suite in filter mode (it can do other things to). There is a guide available: https://jgi.doe.gov/data-and-tools/software-tools/bbtools/bb-tools-user-guide/bbduk-guide/While this is not a direct answer for your question if you are worried about performance then this should help.
Thanks, I'll have a look on this tool. Anyway, do you think my code could be improved to get what I expect?
Biopython libraries, as I understand, are slow. Instead of loop on lists, try list comprehension.
Shoud I use :
bbduk.sh -Xmx2g in=my_file.fasta outm=matched.fa ref=barcodes.fasta k=16 stats=stats.txt
?There are
restrictleft=
andrestrictright=
options that you will need to add. You can pipe things from one search into other.An example would be (untested) for first index combo in your example
What this should do is it first finds all reads that have left adapter in first 50 bases and only send those on to the second command that looks for second adapter in right half of reads and then writes out ones that match both instances.
Thanks for your prompt reply. I see the
literal
option which is interesting , but I have 30,269 x2 barcodes sequences. Is there a way to loop over the full set of barcodes file ? Also, a read has either the right or the left instance, never both.