Question

Extract Reverse Reads Based on Header Information of Forward Reads

0

Entering edit mode

6.4 years ago

dennis.guettler • 0

Hi Community.

Context: I used MiSeq for sequencing a metagenome with paired end setup. But due to overclustering, my reverse reads can't be matched by the adaptor sequence because the quality isn't good.

Approach: I demultiplexed the forward (R1) reads based on adaptor sequence - which was quite possible - and want to use the header information/Sequence identifier to find the reverse reads (R2) in the second file, because they have similar header names until a certain position. For this i extracted the header information out of R1 to the point, that they are similar in R2.

Concrete: An example for the extracted R1 Header (Let's call it "Forward.txt") is

M05640:9:000000000-B6W7Y:1:1101:15044:1347
M05640:9:000000000-B6W7Y:1:1101:15768:1360
M05640:9:000000000-B6W7Y:1:1101:14907:1368
M05640:9:000000000-B6W7Y:1:1101:14347:1398
...

The R2 file (Let's call it "Reverse.fasta") looks like this

>M05640:9:000000000-B6W7Y:1:1101:14170:1443_2:N:0:0
GGCTAGCATTTTCGGAGTTCCTCATCTGGACAG
>M05640:9:000000000-B6W7Y:1:1101:14362:1443_2:N:0:0
GTCAGCCCGTTCGCCCATCGCCGTCTCAAGAGTGATC
>M05640:9:000000000-B6W7Y:1:1101:14362:1443_2:N:0:0
GTCAGCCCGTTCGCCCATCGCCGTCTCAAGAGTGAT
>M05640:9:000000000-B6W7Y:1:1101:13983:1444_2:N:0:0
TTCCGGCGAGCGTGGACGGCTTGTGCATGGGGAACTT
....

Aim: I want to collect all the sequences + header information from Reverse.fasta based on the header information from forward.txt.

What was done so far: I searched a real long time, and found a python script which deals with that problem, but I can't adapt it perfectly, to solve my problem. (https://stackoverflow.com/questions/15352219/extract-sequences-from-a-fasta-file-based-on-entries-in-a-separate-file) I adapted it:

f2 = open('Forward.txt','r')
f1 = open('Reverse.fasta','r')
f3 = open('FindRV.txt','w')

AI_DICT = {}
for line in f2:
    AI_DICT[line[:-1]] = 1

skip = 0
for line in f1:
    if line[0] == '>':
        _splitline = line.split('_')
        accessorIDWithArrow = _splitline[0]
        accessorID = accessorIDWithArrow[1:-1]
        # print accessorID
        if accessorID in AI_DICT:
            f3.write(line)
            skip = 0
        else:
            skip = 1
    else:
        if not skip:
            f3.write(line)

f1.close()
f2.close()
f3.close()

Problem: I Have multiple demultiplexed Forward.txt files with absolute same structure but sometimes the output file has quite some sequences (like I hoped for) in it and sometimes there are zero. For the files with zero output, I manually searched for the sequence header in geneious11.1.4 and found sequences, so it's not a problem that there aren't any, it seems to be a problem of the code itself.

Hope: Iam no coder and tried and searched a long time for the right solution but now i'am desperate enough to write here ^^ hope anyone understands my issue and maybe has a solution.

Greetings

Miseq sequence identifier header paired reads • 2.3k views

ADD COMMENT • link updated 6.4 years ago by Dattatray Mongad ▴ 380 • written 6.4 years ago by dennis.guettler • 0

score 0 · Answer 1 · 2018-08-16

0

Entering edit mode

6.4 years ago

Dattatray Mongad ▴ 380

use fastq-pair, it gives the only reads which are in forward and reverse files. So, rather than making forward.txt(headers), make forward.fastq(containg reads for which you want reverse reads to be extracted) and run the program.

This will make four files in the directory:

forward.fastq.paired.fq
forward.fastq.single.fq
reverse.fastq.paired.fq
reverse.fastq.single.fq

What your seeking will be in 3rd file.

ADD COMMENT • link 6.4 years ago by Dattatray Mongad ▴ 380

0

Entering edit mode

Thanks for another option. the internet is so full of tools that one sometimes get lost :) I will try this

ADD REPLY • link 6.3 years ago by dennis.guettler • 0