Question

Extract basic info from fastq files in an efficient way?

1

Entering edit mode

7.6 years ago

rioualen ▴ 750

Hi,

I'm looking for an easy and efficient way to get basic information from a given fastq file programmatically, like number of sequences, average sequence length, average GC%... I adapted this code but it's actually very slow for big fastq files. I was considering parsing FastQC's results but there must be an easier way? I've googled it for hours but I can't find anything.

Thanks a lot!

EDIT

I forgot to mention I'm trying to put that into a python function, so here's what I've got right now:

def fastq_stats(fastq_file):
    os.system("reformat.sh in=" + fastq_file + " gchist=temp")
    with open('temp', 'r') as f:
        first_line = f.readline()
        gc_rate = float(first_line.split()[1])
    os.system("rm temp")
    read_count = int(sum(1 for line in open(fastq_file))/4)

    return (read_count, gc_rate)

It's a bit ugly imo, but it will do for now. I'm still interested by better options! I'm coding in python and R mostly, and this is part of a pipeline that generates FastQC reports automatically, hence my quest for a ready-to-use FastQC parser.

fastq fastqc • 14k views

ADD COMMENT • link updated 7.6 years ago by h.mon 35k • written 7.6 years ago by rioualen ▴ 750

2

Entering edit mode

To know number of seqs quickly, you can use on Linux machine wc -l fastq_file | awk '{$1/4; print}'

ADD REPLY • link 7.6 years ago by Renesh ★ 2.2k

score 6 · Answer 1 · 2017-09-27

reformat.sh from BBMap suite can produce the following summaries. reformat.sh in1=R1.fq.gz in2=R2.fq.gz bhist=filename

bhist=<file>            Base composition histogram by position.
qhist=<file>            Quality histogram by position.
qchist=<file>           Count of bases with each quality value.
aqhist=<file>           Histogram of average read quality.
bqhist=<file>           Quality histogram designed for box plots.
lhist=<file>            Read length histogram.
gchist=<file>           Read GC content histogram.
gcbins=100              Number gchist bins.  Set to 'auto' to use read length.
gcplot=f                Add a graphical representation to the gchist.

It will also write a summary to stderr that you can capture to a file.

Input:                          17037614 reads                  851456901 bases
Output:                         17037614 reads (100.00%)        851456901 bases (100.00%)

Time:                           14.574 seconds.
Reads Processed:      17037k    1169.08k reads/sec
Bases Processed:        851m    58.42m bases/sec

It will be fast :-)

score 5 · Answer 2 · 2017-09-28

fastx_quality_stats from fastxtoolkit:

$ gzip -dc hcc1395_normal_rep1_r1.fastq.gz | fastx_quality_stats | head
column  count   min max sum mean    Q1  med Q3  IQR lW  rW  A_Count C_Count G_Count T_Count N_Count Max_count
1   331958  2   37  10278936    30.96   32  32  32  0   32  32  28945   104041  174760  24163   49  331958

seqkit stats:

$ time(seqkit stats hcc1395_normal_rep1_r1.fastq.gz)
file                             format  type  num_seqs     sum_len  min_len  avg_len  max_len
hcc1395_normal_rep1_r1.fastq.gz  FASTQ   DNA    331,958  50,125,658      151      151      151

real    0m0.500s
user    0m0.280s
sys 0m0.052s

fastqutils from NGSutils:

    $ fastqutils stats hcc1395_normal_rep1_r1.fastq.gz > ~/Desktop/test.out

Output:

Space:  basespace
Pairing:    Fragmented
Quality scale:  Illumina
Number of reads:    331958
Length distribution
Mean:   151.0
StdDev: 0.0
Min:    151
25 percentile:  151
Median: 151
75 percentile:  151
Max:    151
Total:  331958

Quality distribution
pos mean    stdev   min 25pct   50pct   75pct   max count
1   30.9645678068   3.68287083487   2   32  32  32  37  331958
2   31.4920592364   2.91843166829   2   32  32  32  37  331958
.
.
.
150 35.7938052404   8.68052946923   7   32  37  42  42  331958
151 35.3446369721   8.77621250147   7   32  37  42  42  331958

Average quality string
?@DEEIIJJJJJJJJJJJJJJJJJJIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIGHIIIIIIIIIIIIIIIIIIIIIIIIIHIHHHHHHHHHHHHHHGGGGGGGGGGGGGGFFFFFFFFEEEEEEEEDD

size of the file:

$ du -ah hcc1395_normal_rep1_r1.fastq.gz 
26M hcc1395_normal_rep1_r1.fastq.gz

score 1 · Answer 3 · 2017-09-28

1

Entering edit mode

7.6 years ago

h.mon 35k

MultiQC will parse FastQC files for you, and will generate some terribly good-looking graphics as a bonus.

ADD COMMENT • link 7.6 years ago by h.mon 35k

0

Entering edit mode

I know MultiQC is great, but I'm interested in using these numbers as variables in my pipeline :-)

ADD REPLY • link 7.6 years ago by rioualen ▴ 750

0

Entering edit mode

Under the multiqc_data/ folder (default name), there are tab-separated text files with all the metrics collected. These files are easily parsed with awk / python / perl.

ADD REPLY • link 7.6 years ago by h.mon 35k

0

Entering edit mode

Oh yeah that's right, though it implies to run MultiQC in the first place. Thank you!

ADD REPLY • link 7.6 years ago by rioualen ▴ 750