Hi,
I'm looking for an easy and efficient way to get basic information from a given fastq file programmatically, like number of sequences, average sequence length, average GC%... I adapted this code but it's actually very slow for big fastq files. I was considering parsing FastQC's results but there must be an easier way? I've googled it for hours but I can't find anything.
Thanks a lot!
EDIT
I forgot to mention I'm trying to put that into a python function, so here's what I've got right now:
def fastq_stats(fastq_file):
os.system("reformat.sh in=" + fastq_file + " gchist=temp")
with open('temp', 'r') as f:
first_line = f.readline()
gc_rate = float(first_line.split()[1])
os.system("rm temp")
read_count = int(sum(1 for line in open(fastq_file))/4)
return (read_count, gc_rate)
It's a bit ugly imo, but it will do for now. I'm still interested by better options! I'm coding in python and R mostly, and this is part of a pipeline that generates FastQC reports automatically, hence my quest for a ready-to-use FastQC parser.
To know number of seqs quickly, you can use on Linux machine
wc -l fastq_file | awk '{$1/4; print}'