Extract fastq sequences based on date/time (which is in the header)
3
1
Entering edit mode
6.2 years ago
a.b.g ▴ 10

I have a series of fastq files (with up to 4000 reads in each) that I want to parse based on the time of sequencing.

So in the fastq header, date/time is listed as "start_time=2017-10-09T18:54:24z"

If I wanted to extract all sequences between 18:00 hours and 20:00 hours, is there a tool I can use to find and extract them?

sequence fastq • 3.9k views
ADD COMMENT
0
Entering edit mode

Can you post a couple of examples of full/complete headers? Is this nanopore data?

ADD REPLY
0
Entering edit mode

I am not aware of any tools to handle this. If you have some Python programming experience, I would be glad to help you.

ADD REPLY
5
Entering edit mode
6.2 years ago
WouterDeCoster

I wrote a Python script for this.
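A minimal sketch of such a script (assuming Biopython's SeqIO and the standard datetime module, and nanopore-style headers with a start_time=... field; not necessarily the exact original):

import argparse
import gzip
import sys
from datetime import datetime

from Bio import SeqIO


def get_args():
    parser = argparse.ArgumentParser(
        description="Extract fastq reads whose start_time falls within a given interval.")
    parser.add_argument("fastq", help="gzipped fastq file")
    parser.add_argument("--time_from", required=True,
                        help="earliest start_time to keep, e.g. 2017-10-09T18:00:00Z")
    parser.add_argument("--time_to", required=True,
                        help="latest start_time to keep, e.g. 2017-10-09T20:00:00Z")
    return parser.parse_args()


def parse_time(stamp):
    # timestamps look like 2017-10-09T18:54:24Z (a lowercase trailing z is also accepted)
    return datetime.strptime(stamp.upper(), "%Y-%m-%dT%H:%M:%SZ")


def main():
    args = get_args()
    time_from = parse_time(args.time_from)
    time_to = parse_time(args.time_to)
    for record in SeqIO.parse(gzip.open(args.fastq, 'rt'), "fastq"):
        # the header is a series of key=value fields; pick out start_time
        fields = dict(f.split("=", 1) for f in record.description.split() if "=" in f)
        if "start_time" in fields and time_from <= parse_time(fields["start_time"]) <= time_to:
            sys.stdout.write(record.format("fastq"))


if __name__ == "__main__":
    main()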

Execute it as:

python timefilt.py part1.fastq.gz --time_from 2017-10-09T18:00:00Z --time_to 2017-10-09T20:00:00Z

ADD COMMENT
0
Entering edit mode

That's fantastic WouterDeCoster, thank you very much. I've only recently got started on Python, so I should be able to use it.

Question: how would I use this on files that aren't gzipped?

ADD REPLY
1
Entering edit mode

Well, there isn't really a reason not to gzip your fastq files :)

That said, you would just have to edit this line:

for record in SeqIO.parse(gzip.open(args.fastq, 'rt'), "fastq"):

to

for record in SeqIO.parse(args.fastq, "fastq"):
ADD REPLY
0
Entering edit mode

Thanks WouterDeCoster. I got it to work.

One final naive question: this prints to the screen, doesn't it? Would there also be an easy way of writing the output to a new file? (Unless it does that already and I'm missing it.)

Thanks again. You've helped a real newbie.

ADD REPLY
2
Entering edit mode

Use a redirect:

python timefilt.py part1.fastq.gz --time_from 2017-10-09T18:00:00Z --time_to 2017-10-09T20:00:00Z > new.fastq

or, if you want to keep the file compressed:

python timefilt.py part1.fastq.gz --time_from 2017-10-09T18:00:00Z --time_to 2017-10-09T20:00:00Z | gzip > new.fastq.gz

ADD REPLY
0
Entering edit mode

Should have known that! Thanks, you've both been a great help and I'm using it already.

I'm currently using cat to merge the fastqs so I can run this on multiple files, but I doubt this is the most efficient method. Might there be an easy way to run this on multiple files and produce one fastq output?

ADD REPLY
0
Entering edit mode

Why not process all the fastqs independently and then merge the resulting files into one? That would brute-force parallelize the process and save time.
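For example, in bash (a sketch assuming the timefilt.py script above; the output file names are only illustrative):

for f in *.fastq.gz; do
    python timefilt.py "$f" --time_from 2017-10-09T18:00:00Z --time_to 2017-10-09T20:00:00Z > "${f%.fastq.gz}_subset.fastq" &
done
wait
cat *_subset.fastq > all_subset.fastq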

ADD REPLY
0
Entering edit mode

Alternatively we could modify the code to make it read from stdin.

ADD REPLY
0
Entering edit mode

Would that be difficult to do?

ADD REPLY
0
Entering edit mode

Oh, not at all :) You would read from sys.stdin rather than from the file.

For inspiration you can take a look at https://github.com/wdecoster/nanofilt
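Concretely, the parsing line in the script above would become (a sketch; you would also make the positional fastq argument optional, since the reads now come in via the pipe):

for record in SeqIO.parse(sys.stdin, "fastq"):

and you could then pipe any number of files through it at once, for example:

zcat part1.fastq.gz part2.fastq.gz | python timefilt.py --time_from 2017-10-09T18:00:00Z --time_to 2017-10-09T20:00:00Z > new.fastq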

ADD REPLY
0
Entering edit mode

Sorry, I've just come back to this. Please excuse my limited programming knowledge (I'm in the process of learning), but what would be the easiest way of using this to loop through multiple files? Unfortunately, I'm not fully sure how to use sys.stdin.

ADD REPLY
3
Entering edit mode
6.2 years ago
GenoMax 147k

Try this. It keeps every read whose start_time hour is 18 or 19, i.e. between 18:00:00 and 19:59:59 on that date:

grep -A 3 --no-group-separator -E 'start_time=2017-10-09T18|start_time=2017-10-09T19' your.fastq > seq_you_want.fastq

If your files are gzipped then:

zcat your_file.fq.gz | grep -A 3 --no-group-separator -E 'start_time=2017-10-09T18|start_time=2017-10-09T19' > seq_you_want.fastq
ADD COMMENT
0
Entering edit mode

Thanks genomax, it is indeed nanopore data.

Thanks for the simple solution. Worked quite well for looking at only 2 hours of data.

ADD REPLY
1
Entering edit mode
22 months ago
michael ▴ 10

I wrote a tool for this: https://github.com/mbhall88/ontime

You can specify your time range as a date/timestamp, or as duration from the start/end of sequencing.

For example:

$ ontime --from 1h30m --to -2h in.fq

will extract reads sequenced after the first hour and a half and before the last two hours of the run.

ADD COMMENT
