searching for sequences in a fastq file
2
0
Entering edit mode
8.3 years ago

I have a list of 10 base-pair sequences say ACTTTTTTTT,TGGTTACTGA,.. in a textfile called id.txt. Using biopython,I want to iterate through a fastq file (say reads.fastq) and look for reads that contain these sequences and give me a fastq file with these reads only. Any suggestions?

next-gen • 5.7k views
ADD COMMENT
3
Entering edit mode

using simple grep you can extract.

for i in `cat text_file`; do grep -B 1 -A 2 $i fastq_file >>Extracted_reads.fastq; done;

text_file should have all your 10bp sequences one per line. if you want to separate reads for every pattern,

for i in `cat text_file`; do grep -B 1 -A 2 $i fastq_file >$i.fastq; done;
ADD REPLY
1
Entering edit mode
8.3 years ago

Using the BBMap package, you can do this:

bbduk.sh in=reads.fq outm=matched.fastq literal=ACTTTTTTTT,TGGTTACTGA mm=f k=10

Or, if you have the sequences in a fasta file:

bbduk.sh in=reads.fq outm=matched.fastq ref=patterns.fasta mm=f k=10
ADD COMMENT
1
Entering edit mode
8.3 years ago

Although I would recommend the grep solution, you asked for biopython. Perhaps your question fits in a larger project (which might be a reason to go for a larger script). Since you suggest biopython I assume you already have a basic knowledge and I won't write out your script :-) So your strategy, and a bit of pseudocode:

parse the fastq (see biopython cookbook), and check for every sequence in your fastq whether you can find a match, easy as pie(thon) with seqrecord your fastq record

with searchlist your sequences you want to select for (read from file, beware of line terminators)

with open('ids.txt') as inputfile:
    searchlist = [item.strip() for item in inputfile.readlines()]

Followed by something like:

for seqrecord in records:
    if [item for item in searchlist if seqrecord.seq.count(item)]: #List comprehensions are awesome, empty list = False
        outlist.append(seqrecord)

If you need additional help, tell me what doesn't work and I'll be happy to have a look.

ADD COMMENT
0
Entering edit mode

Thanks this works. I am trying to modify the seqrecord.id by appending the corresponding id to it after an underscore. Example the read id in the fastq file should look like

@M00770:337:000000000-AN0WE:1:2105:12937:3123 1:N:0:0CGTTACACAATCC_ACTTTTTTTT

I want to know what I should modify in the above code? My current code is

    id_array = []
    for read_record in SeqIO.parse("in.fastq", "fastq"):
            if [item for item in searchlist if read_record.seq.count(item)]:
                    print >> read1_out_handle, read_record.format("fastq")
                    id_array.appendread_record.id)

I want to modify read_record.id before it gets appended to id_array.

ADD REPLY
0
Entering edit mode

Hi,

you probably want to do something like:

for read_record in SeqIO.parse("in.fastq", "fastq"):
     matchlist = [item for item in searchlist if read_record.seq.count(item)]            
     if matchlist:
          read_record.id = read_record.id + '_' + '-'.join(matchlist) #every match in the matchlist gets joined with a '-'
          print >> read1_out_handle, read_record.format("fastq")
          id_array.appendread_record.id)

Be aware that also read_record.description exists, maybe you have to modify that or set that to an empty string to get the format you want.

I'm not sure what you are doing with print >> read1_out_handle, read_record.format("fastq") but it looks ugly like hell and very unpythonic. assuming you created that handle with e.g. read1_out_handle = open('outputfile.fastq', 'w'), I'd suggest you use read1_out_handle.write(read_record.format('fastq')) Besides looking better and being easier to read, that's going to work on python3 (and probably forever). But that's just my opinion ;-)

ADD REPLY
0
Entering edit mode

I understand your code, but I would like to do one more thing. My ids.txt file has two fields split by an underscore: example : fastq-id_identifier. In my in.fastq, I would like to choose reads which have the fastq-id (first part of each line in ids.txt without the identifier) and then change the read_id of these chosen reads by appending the corresponding identifier after "_". How can I modify searchlist/matchlist for this. Thanks

example ids.txt : M00770:337:000000000-AN0WE:1:2105:12937:3123_GCGCTCATACGC

M00770:337:000000000-AN0WE:1:2105:19458:3137_CTGGCTCCAAGT

M00770:337:000000000-AN0WE:1:2105:14047:3137_TGCGGTAACTAC

M00770:337:000000000-AN0WE:1:2105:12611:3158_CCACCTCCACAG strong text M00770:337:000000000-AN0WE:1:2105:14219:3159_CGTGGCTAGGCC

example in.fastq: @M00770:337:000000000-AN0WE:1:2105:12937:3123 2:N:0:0CGTTACACAATCC NNNTCNNGNNTGTNANNANNCTANNTTATCGTNCCGAAATGGGATCGCGATAAGCTATCTATAACAAAAAAAAAA +

11##>##11A#1##1##0A0##A0B0000#A0////AA10ABAEA/////11AA2BAEF22@1100/>////

@M00770:337:000000000-AN0WE:1:2105:19458:3137 2:N:0:0CGTTACACAATCC NNNTCNNCNNTGCNTNNTNNCTCNNTTCCTCCNCCTTGTCGCGCTCCCGCTCTGCTCTCTATCCCATCACACTAT +

>1##1##111#1##1##0A0##0000010#00///11////AA///////11@11@12@2011101111>01

@M00770:337:000000000-AN0WE:1:2105:14047:3137 2:N:0:0CGTTACACAATCC NNNTCNNGNNTGTNANNANNCTCNNTGCTGAGNTAGCAATTATTGAAGGCATGAAGTTTGATAGAGGATATATTT +

>1##1##111#1##1##0A0##0000010#00///ABB122A2111/B101112DA11D22110001A2DBE

@M00770:337:000000000-AN0WE:1:2105:12611:3158 2:N:0:0CGTTACACAATCC NNNTCNNTNNTGTNTNNCNNCTTNNTAGTCTCNCTCTTCTCGGATCGCGCTCTGCTCTCTCTAACCCCCAACTAC +

>1##1##111#1##1##0A0##A0000DB#000//121///AA////A//1AD1DB11121B0//////011

@M00770:337:000000000-AN0WE:1:2105:14219:3159 2:N:0:0CGTTACACAATCC NNNTCNNGNNTGTNTNNCNNCTCNNCCAACTCNCCACAACTACAACAAACACAACTCCTTTCCCCCCTCCCCCCC +

11##1##111#1##1##000##0000001#00/////111111//0/00///1111121B10//////////

Expected output example :

@M00770:337:000000000-AN0WE:1:2105:12937:3123 2:N:0:0CGTTACACAATCC_GCGCTCATACGC NNNTCNNGNNTGTNANNANNCTANNTTATCGTNCCGAAATGGGATCGCGATAAGCTATCTATAACAAAAAAAAAA +

11##>##11A#1##1##0A0##A0B0000#A0////AA10ABAEA/////11AA2BAEF22@1100/>////

@M00770:337:000000000-AN0WE:1:2105:19458:3137 2:N:0:0CGTTACACAATCC_CTGGCTCCAAGT NNNTCNNCNNTGCNTNNTNNCTCNNTTCCTCCNCCTTGTCGCGCTCCCGCTCTGCTCTCTATCCCATCACACTAT +

>1##1##111#1##1##0A0##0000010#00///11////AA///////11@11@12@2011101111>01

ADD REPLY

Login before adding your answer.

Traffic: 2416 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6