Question

Write Script For Selection Of Fastq File With Sanger Format

3

Entering edit mode

14.2 years ago

Bioscientist ★ 1.7k

Now I have many FASTQ files, and I need to select those with sanger format. Advisor told me: zcat *.fastq.gz |grep "J" (seems there's no "J" in the information of sanger format; while you can always find "J" in other format, though I'm not sure about this) If the output is NOTHING, then it's sanger format that we want; if there's anything in the output, then it should be discarded.

Question is how I can write the script to finish this task? I'm beginner and never write code before.thanks!

fastq • 3.7k views

ADD COMMENT • link updated 14.2 years ago by brentp 24k • written 14.2 years ago by Bioscientist ★ 1.7k

0

Entering edit mode

I'm unclear on the question. Your advisor gave you a series of bash commands that will tell you whether the file is Phred+33 or Phred+64. That should be all you need. Are you trying to figure out how to parse the output?

BTW, for info on different formats, I like the [?]Wikipedia page on fastq format[?]. That's where your advisor got his "J" trick from.

ADD REPLY • link 14.2 years ago by Mitch Bekritsky ★ 1.3k

0

Entering edit mode

I'm unclear on the question. Your advisor gave you a series of bash commands that will tell you whether the file is Phred+33 or Phred+64. That should be all you need. Are you trying to figure out how to parse the output? BTW, for info on different formats, I like the Wikipedia page on fastq format, http://en.wikipedia.org/wiki/FASTQ_format. That's where your advisor got his "J" trick from.

ADD REPLY • link 14.2 years ago by Mitch Bekritsky ★ 1.3k

0

Entering edit mode

thx mitch, yes you are right. I'm trying to figure out how to parse output based on the trick advisor told me; ie., I hope sb can give me clues for the SCRIPT for parsing the output. I understand the logic but don't know how to write script.

ADD REPLY • link 14.2 years ago by Bioscientist ★ 1.7k

score 7 · Answer 1 · 2011-06-22

7

Entering edit mode

14.2 years ago

brentp 24k

For kicks, I implemented this here.

It uses the information from the helpful wikipedia page about the FASTQ format. It assumes the input is from STDIN and that it's only getting quality lines, so, for FASTQ, you'll call it like:

awk 'NR % 4 == 0' your.fastq | python guess-encoding.py

Where the awk command takes every 4th line (the quality line) from your file. By default, it will run until it reads the entire file, or narrows down the file-type to a single possible encoding; you can make it check 1000 qual lines like this:

awk 'NR % 4 == 0' your.fastq | python guess-encoding.py -n 1000

Usually that's enough to determine if it's sanger/non-sanger. Or you could just edit RANGES to include only 'Sanger' and 'Illumina'.

ADD COMMENT • link 14.2 years ago by brentp 24k

0

Entering edit mode

Neat script! One suggestion if you used islice you would not need to pipe it through awk.

stream = islice(sys.stdin, 3, None, 4)

ADD REPLY • link 14.2 years ago by Istvan Albert 103k

0

Entering edit mode

Nice script indeed. But you don't even need the islice trick, either. Why not just add:

if i % 4 == 0: continue

At the start of your for loop in the python script (and, obviously, divide i/4 in your "if done" test)?

ADD REPLY • link 14.2 years ago by Steve Lianoglou 5.2k

0

Entering edit mode

Yeah, I originally had % 4 == 0 and that's probably the most user friendly. But then what if you just have a couple chars? As it is now, you can just do echo 'XZA' | python guess-encoding.py. Plus, I think it's probably a lot faster to have awk cut down the number of lines to 1/4th