Extract fastq sequences based on date/time (which is in the header)
3
1
Entering edit mode
6.2 years ago
a.b.g ▴ 10

I have a series of fastq files (with up to 4000 reads in each) that I want to parse based on the time of sequencing.

So in the fastq header, date/time is listed as "start_time=2017-10-09T18:54:24z"

If I wanted to extract all sequences between 18:00 hours and 20:00 hours, is there a tool I can use to find and extract them?

sequence fastq • 3.9k views
ADD COMMENT
0
Entering edit mode

Can you post a couple of examples of full/complete headers? Is this nanopore data?

ADD REPLY
0
Entering edit mode

I am not aware of any tools to handle this. If you have some Python-programming experience I would be glad to help you.

ADD REPLY
5
Entering edit mode
6.2 years ago

I wrote a python script now.

Execute as python timefilt.py part1.fastq.gz --time_from 2017-10-09T18:00:00Z --time_to 2017-10-09T20:00:00Z

ADD COMMENT
0
Entering edit mode

That's fantastic WouterDeCoster, thank you very much. Just recently got started on Python so should be able to use it.

Question: how would I use this on files that aren't gzipped?

ADD REPLY
1
Entering edit mode

Well, there is not really a reason to not gzip your fastq files :)

That said, you would just have to edit this line:

for record in SeqIO.parse(gzip.open(args.fastq, 'rt'), "fastq"):

to

for record in SeqIO.parse(args.fastq, "fastq"):
ADD REPLY
0
Entering edit mode

Thanks WouterDeCouster. I got it to work.

One final naive question. This prints to the screen doesn't it? Would there also be an easy way of writing it to a new file? (unless it does it already and I'm missing it)

Thanks again. You've helped a real newbie.

ADD REPLY
2
Entering edit mode

Use redirect python timefilt.py part1.fastq.gz --time_from 2017-10-09T18:00:00Z --time_to 2017-10-09T20:00:00Z > new.fastq or if you want to keep the file compressed python timefilt.py part1.fastq.gz --time_from 2017-10-09T18:00:00Z --time_to 2017-10-09T20:00:00Z | gzip > new.fastq.gz

ADD REPLY
0
Entering edit mode

Should have known that! Thanks, you've both been a great help and I'm using it already.

I'm currently using cat to merge the fastqs to run this on multiple files, I can't think this is the most efficient method. Might there be an easy way to run this on multiple files to produce one fastq output?

ADD REPLY
0
Entering edit mode

Why not process all fastq's independently and then merge the resulting files into one. That would brute force parallelize this process saving time.

ADD REPLY
0
Entering edit mode

Alternatively we could modify the code to make it read from stdin.

ADD REPLY
0
Entering edit mode

Would that be difficult to do?

ADD REPLY
0
Entering edit mode

Oh not at all :) you should read from sys.stdin rather than from the file.

For inspiration you can take a look at https://github.com/wdecoster/nanofilt

ADD REPLY
0
Entering edit mode

Sorry, I've just come back to this. Please excuse my limited programming knowledge (I'm in the process of learning) but what would be the easiest way of using this to loop through multiple files? Unfortunately not fully sure how to use sys.stdin

ADD REPLY
3
Entering edit mode
6.2 years ago
GenoMax 147k

Try this.

grep -A 3 --no-group-separator -E 'start_time=2017-10-09T18|start_time=2017-10-09T19' your.fastq > seq_you_want.fastq

If your files are gzipped then:

zcat your_file.fq.gz | grep -A 3 --no-group-separator -E 'start_time=2017-10-09T18|start_time=2017-10-09T19' > seq_you_want.fastq
ADD COMMENT
0
Entering edit mode

Thanks genomax, it is indeed nanopore data.

Thanks for the simple solution. Worked quite well for looking at only 2 hours of data.

ADD REPLY
1
Entering edit mode
22 months ago
michael ▴ 10

I wrote a tool for this: https://github.com/mbhall88/ontime

You can specify your time range as a date/timestamp, or as duration from the start/end of sequencing.

For example

$ ontime --from 1h30m --to -2h in.fq

will extract reads sequenced after the first hour and a half and before the last two hours

ADD COMMENT

Login before adding your answer.

Traffic: 2472 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6