Hi Community.
Context: I used MiSeq for sequencing a metagenome with paired end setup. But due to overclustering, my reverse reads can't be matched by the adaptor sequence because the quality isn't good.
Approach: I demultiplexed the forward (R1) reads based on adaptor sequence - which was quite possible - and want to use the header information/Sequence identifier to find the reverse reads (R2) in the second file, because they have similar header names until a certain position. For this i extracted the header information out of R1 to the point, that they are similar in R2.
Concrete: An example for the extracted R1 Header (Let's call it "Forward.txt") is
M05640:9:000000000-B6W7Y:1:1101:15044:1347
M05640:9:000000000-B6W7Y:1:1101:15768:1360
M05640:9:000000000-B6W7Y:1:1101:14907:1368
M05640:9:000000000-B6W7Y:1:1101:14347:1398
...
The R2 file (Let's call it "Reverse.fasta") looks like this
>M05640:9:000000000-B6W7Y:1:1101:14170:1443_2:N:0:0
GGCTAGCATTTTCGGAGTTCCTCATCTGGACAG
>M05640:9:000000000-B6W7Y:1:1101:14362:1443_2:N:0:0
GTCAGCCCGTTCGCCCATCGCCGTCTCAAGAGTGATC
>M05640:9:000000000-B6W7Y:1:1101:14362:1443_2:N:0:0
GTCAGCCCGTTCGCCCATCGCCGTCTCAAGAGTGAT
>M05640:9:000000000-B6W7Y:1:1101:13983:1444_2:N:0:0
TTCCGGCGAGCGTGGACGGCTTGTGCATGGGGAACTT
....
Aim: I want to collect all the sequences + header information from Reverse.fasta based on the header information from forward.txt.
What was done so far: I searched a real long time, and found a python script which deals with that problem, but I can't adapt it perfectly, to solve my problem. (https://stackoverflow.com/questions/15352219/extract-sequences-from-a-fasta-file-based-on-entries-in-a-separate-file) I adapted it:
f2 = open('Forward.txt','r')
f1 = open('Reverse.fasta','r')
f3 = open('FindRV.txt','w')
AI_DICT = {}
for line in f2:
AI_DICT[line[:-1]] = 1
skip = 0
for line in f1:
if line[0] == '>':
_splitline = line.split('_')
accessorIDWithArrow = _splitline[0]
accessorID = accessorIDWithArrow[1:-1]
# print accessorID
if accessorID in AI_DICT:
f3.write(line)
skip = 0
else:
skip = 1
else:
if not skip:
f3.write(line)
f1.close()
f2.close()
f3.close()
Problem: I Have multiple demultiplexed Forward.txt files with absolute same structure but sometimes the output file has quite some sequences (like I hoped for) in it and sometimes there are zero. For the files with zero output, I manually searched for the sequence header in geneious11.1.4 and found sequences, so it's not a problem that there aren't any, it seems to be a problem of the code itself.
Hope: Iam no coder and tried and searched a long time for the right solution but now i'am desperate enough to write here ^^ hope anyone understands my issue and maybe has a solution.
Greetings
Thanks for another option. the internet is so full of tools that one sometimes get lost :) I will try this