I have a list of 10 base-pair sequences say ACTTTTTTTT,TGGTTACTGA,.. in a textfile called id.txt. Using biopython,I want to iterate through a fastq file (say reads.fastq) and look for reads that contain these sequences and give me a fastq file with these reads only. Any suggestions?
Although I would recommend the grep solution, you asked for biopython.
Perhaps your question fits in a larger project (which might be a reason to go for a larger script).
Since you suggest biopython I assume you already have a basic knowledge and I won't write out your script :-)
So your strategy, and a bit of pseudocode:
parse the fastq (see biopython cookbook), and check for every sequence in your fastq whether you can find a match, easy as pie(thon)
with seqrecord your fastq record
with searchlist your sequences you want to select for (read from file, beware of line terminators)
with open('ids.txt') as inputfile:
searchlist =[item.strip()for item in inputfile.readlines()]
Followed by something like:
for seqrecord in records:
if[item for item in searchlist if seqrecord.seq.count(item)]: #List comprehensions are awesome, empty list = False
outlist.append(seqrecord)
If you need additional help, tell me what doesn't work and I'll be happy to have a look.
Thanks this works. I am trying to modify the seqrecord.id by appending the corresponding id to it after an underscore. Example the read id in the fastq file should look like
I want to know what I should modify in the above code? My current code is
id_array =[]for read_record in SeqIO.parse("in.fastq", "fastq"):
if[item for item in searchlist if read_record.seq.count(item)]:
print >> read1_out_handle, read_record.format("fastq")
id_array.appendread_record.id)
I want to modify read_record.id before it gets appended to id_array.
for read_record in SeqIO.parse("in.fastq", "fastq"):
matchlist =[item for item in searchlist if read_record.seq.count(item)]if matchlist:
read_record.id = read_record.id + '_' + '-'.join(matchlist)#every match in the matchlist gets joined with a '-'
print >> read1_out_handle, read_record.format("fastq")
id_array.appendread_record.id)
Be aware that also read_record.description exists, maybe you have to modify that or set that to an empty string to get the format you want.
I'm not sure what you are doing with print >> read1_out_handle, read_record.format("fastq") but it looks ugly like hell and very unpythonic. assuming you created that handle with e.g. read1_out_handle = open('outputfile.fastq', 'w'), I'd suggest you use read1_out_handle.write(read_record.format('fastq')) Besides looking better and being easier to read, that's going to work on python3 (and probably forever). But that's just my opinion ;-)
I understand your code, but I would like to do one more thing. My ids.txt file has two fields split by an underscore: example :
fastq-id_identifier. In my in.fastq, I would like to choose reads which have the fastq-id (first part of each line in ids.txt without the identifier) and then change the read_id of these chosen reads by appending the corresponding identifier after "_". How can I modify searchlist/matchlist for this. Thanks
example ids.txt :
M00770:337:000000000-AN0WE:1:2105:12937:3123_GCGCTCATACGC
using simple grep you can extract.
text_file should have all your 10bp sequences one per line. if you want to separate reads for every pattern,