Calculatio median in a Fastq file using bioawk/awk
2
1
Entering edit mode
3.1 years ago
zo • 0

I need to calculate the median in the fastq file, how can I do this with awk or bioawk?

awk median bioawk fastq • 1.6k views
ADD COMMENT
0
Entering edit mode

median of?

ADD REPLY
0
Entering edit mode

median length of reads in the fastq file.

ADD REPLY
0
Entering edit mode

You can use fastqc for that!

ADD REPLY
4
Entering edit mode
3.1 years ago

Straight up awk soliution would be:

cat *.fq  | awk 'NR % 4 == 0 { print length($1) } ' | awk ' { a[i++]=$1; } END { print a[int(i/2)]; }'

I would do it with bioawk and datamash

cat *.fq  | bioawk -c fastx '{ print(length($seq)) }' | datamash median 1

you can also implement a better median in awk as shown here:

that works like this:

 cat *.fq  | bioawk -c fastx '{ print(length($seq)) }' | bioawk ' { a[i++]=$1; } \
                       END { x=int((i+1)/2); if (x < (i+1)/2) print (a[x-1]+a[x])/2; else print a[x-1]; }'
ADD COMMENT
3
Entering edit mode
3.1 years ago

with seqkit

$ seqkit stats -a test.fq

In output, Q2 is median value. If you want to extract fastq name, and median, try

$ seqkit stats test.fq -aT | cut -f1,10

with awk and st (available in brew, conda):

$ awk 'NR % 4==2 {print length ($1)}' hcc1395_tumor_rep1_r1.fastq |st --summary

 min    q1  median  q3  max
    151 151 151 151 151

$ awk 'NR % 4==2 {print length ($1)}' hcc1395_tumor_rep1_r1.fastq |st --median 
151

Ministat is available in system repos for many GNU-Linux and BSD distros and you can try:

$ awk 'NR % 4==2 {print length ($1)}' hcc1395_tumor_rep1_r1.fastq | ministat -A |tr -s " "| cut -f5 -d " "

Median
151
ADD COMMENT

Login before adding your answer.

Traffic: 1359 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6