Question

searching for sequences in a fastq file

0

Entering edit mode

8.4 years ago

sumithrasank75 ▴ 140

I have a list of 10 base-pair sequences say ACTTTTTTTT,TGGTTACTGA,.. in a textfile called id.txt. Using biopython,I want to iterate through a fastq file (say reads.fastq) and look for reads that contain these sequences and give me a fastq file with these reads only. Any suggestions?

next-gen • 5.7k views

ADD COMMENT • link updated 8.4 years ago by WouterDeCoster 47k • written 8.4 years ago by sumithrasank75 ▴ 140

3

Entering edit mode

using simple grep you can extract.

for i in `cat text_file`; do grep -B 1 -A 2 $i fastq_file >>Extracted_reads.fastq; done;

text_file should have all your 10bp sequences one per line. if you want to separate reads for every pattern,

for i in `cat text_file`; do grep -B 1 -A 2 $i fastq_file >$i.fastq; done;

ADD REPLY • link 8.4 years ago by Prasad ★ 1.6k

score 1 · Answer 1 · 2016-08-04

1

Entering edit mode

8.4 years ago

Brian Bushnell 20k

Using the BBMap package, you can do this:

bbduk.sh in=reads.fq outm=matched.fastq literal=ACTTTTTTTT,TGGTTACTGA mm=f k=10

Or, if you have the sequences in a fasta file:

bbduk.sh in=reads.fq outm=matched.fastq ref=patterns.fasta mm=f k=10

ADD COMMENT • link 8.4 years ago by Brian Bushnell 20k

score 1 · Answer 2 · 2016-08-04

1

Entering edit mode

8.4 years ago

WouterDeCoster 47k

Although I would recommend the grep solution, you asked for biopython. Perhaps your question fits in a larger project (which might be a reason to go for a larger script). Since you suggest biopython I assume you already have a basic knowledge and I won't write out your script :-) So your strategy, and a bit of pseudocode:

parse the fastq (see biopython cookbook), and check for every sequence in your fastq whether you can find a match, easy as pie(thon) with seqrecord your fastq record

with searchlist your sequences you want to select for (read from file, beware of line terminators)

with open('ids.txt') as inputfile:
    searchlist = [item.strip() for item in inputfile.readlines()]

Followed by something like:

for seqrecord in records:
    if [item for item in searchlist if seqrecord.seq.count(item)]: #List comprehensions are awesome, empty list = False
        outlist.append(seqrecord)

If you need additional help, tell me what doesn't work and I'll be happy to have a look.

ADD COMMENT • link 8.4 years ago by WouterDeCoster 47k

0

Entering edit mode

Thanks this works. I am trying to modify the seqrecord.id by appending the corresponding id to it after an underscore. Example the read id in the fastq file should look like

@M00770:337:000000000-AN0WE:1:2105:12937:3123 1:N:0:0CGTTACACAATCC_ACTTTTTTTT

I want to know what I should modify in the above code? My current code is

    id_array = []
    for read_record in SeqIO.parse("in.fastq", "fastq"):
            if [item for item in searchlist if read_record.seq.count(item)]:
                    print >> read1_out_handle, read_record.format("fastq")
                    id_array.appendread_record.id)

I want to modify read_record.id before it gets appended to id_array.

ADD REPLY • link 8.4 years ago by sumithrasank75 ▴ 140

0

Entering edit mode

Hi,

you probably want to do something like:

for read_record in SeqIO.parse("in.fastq", "fastq"):
     matchlist = [item for item in searchlist if read_record.seq.count(item)]            
     if matchlist:
          read_record.id = read_record.id + '_' + '-'.join(matchlist) #every match in the matchlist gets joined with a '-'
          print >> read1_out_handle, read_record.format("fastq")
          id_array.appendread_record.id)

Be aware that also read_record.description exists, maybe you have to modify that or set that to an empty string to get the format you want.

I'm not sure what you are doing with print >> read1_out_handle, read_record.format("fastq") but it looks ugly like hell and very unpythonic. assuming you created that handle with e.g. read1_out_handle = open('outputfile.fastq', 'w'), I'd suggest you use read1_out_handle.write(read_record.format('fastq')) Besides looking better and being easier to read, that's going to work on python3 (and probably forever). But that's just my opinion ;-)

ADD REPLY • link 8.4 years ago by WouterDeCoster 47k

0

Entering edit mode

I understand your code, but I would like to do one more thing. My ids.txt file has two fields split by an underscore: example : fastq-id_identifier. In my in.fastq, I would like to choose reads which have the fastq-id (first part of each line in ids.txt without the identifier) and then change the read_id of these chosen reads by appending the corresponding identifier after "_". How can I modify searchlist/matchlist for this. Thanks

example ids.txt : M00770:337:000000000-AN0WE:1:2105:12937:3123_GCGCTCATACGC

M00770:337:000000000-AN0WE:1:2105:19458:3137_CTGGCTCCAAGT

M00770:337:000000000-AN0WE:1:2105:14047:3137_TGCGGTAACTAC

M00770:337:000000000-AN0WE:1:2105:12611:3158_CCACCTCCACAG strong text M00770:337:000000000-AN0WE:1:2105:14219:3159_CGTGGCTAGGCC

example in.fastq: @M00770:337:000000000-AN0WE:1:2105:12937:3123 2:N:0:0CGTTACACAATCC NNNTCNNGNNTGTNANNANNCTANNTTATCGTNCCGAAATGGGATCGCGATAAGCTATCTATAACAAAAAAAAAA +

11##>##11A#1##1##0A0##A0B0000#A0////AA10ABAEA/////11AA2BAEF22@1100/>////

@M00770:337:000000000-AN0WE:1:2105:19458:3137 2:N:0:0CGTTACACAATCC NNNTCNNCNNTGCNTNNTNNCTCNNTTCCTCCNCCTTGTCGCGCTCCCGCTCTGCTCTCTATCCCATCACACTAT +

>1##1##111#1##1##0A0##0000010#00///11////AA///////11@11@12@2011101111>01

@M00770:337:000000000-AN0WE:1:2105:14047:3137 2:N:0:0CGTTACACAATCC NNNTCNNGNNTGTNANNANNCTCNNTGCTGAGNTAGCAATTATTGAAGGCATGAAGTTTGATAGAGGATATATTT +

>1##1##111#1##1##0A0##0000010#00///ABB122A2111/B101112DA11D22110001A2DBE

@M00770:337:000000000-AN0WE:1:2105:12611:3158 2:N:0:0CGTTACACAATCC NNNTCNNTNNTGTNTNNCNNCTTNNTAGTCTCNCTCTTCTCGGATCGCGCTCTGCTCTCTCTAACCCCCAACTAC +

>1##1##111#1##1##0A0##A0000DB#000//121///AA////A//1AD1DB11121B0//////011

@M00770:337:000000000-AN0WE:1:2105:14219:3159 2:N:0:0CGTTACACAATCC NNNTCNNGNNTGTNTNNCNNCTCNNCCAACTCNCCACAACTACAACAAACACAACTCCTTTCCCCCCTCCCCCCC +

11##1##111#1##1##000##0000001#00/////111111//0/00///1111121B10//////////

Expected output example :

@M00770:337:000000000-AN0WE:1:2105:12937:3123 2:N:0:0CGTTACACAATCC_GCGCTCATACGC NNNTCNNGNNTGTNANNANNCTANNTTATCGTNCCGAAATGGGATCGCGATAAGCTATCTATAACAAAAAAAAAA +

11##>##11A#1##1##0A0##A0B0000#A0////AA10ABAEA/////11AA2BAEF22@1100/>////

@M00770:337:000000000-AN0WE:1:2105:19458:3137 2:N:0:0CGTTACACAATCC_CTGGCTCCAAGT NNNTCNNCNNTGCNTNNTNNCTCNNTTCCTCCNCCTTGTCGCGCTCCCGCTCTGCTCTCTATCCCATCACACTAT +

>1##1##111#1##1##0A0##0000010#00///11////AA///////11@11@12@2011101111>01

ADD REPLY • link 8.4 years ago by sumithrasank75 ▴ 140