Entering edit mode
3.0 years ago
zo
•
0
I need to calculate the median in the fastq file, how can I do this with awk or bioawk?
I need to calculate the median in the fastq file, how can I do this with awk or bioawk?
Straight up awk
soliution would be:
cat *.fq | awk 'NR % 4 == 0 { print length($1) } ' | awk ' { a[i++]=$1; } END { print a[int(i/2)]; }'
I would do it with bioawk
and datamash
cat *.fq | bioawk -c fastx '{ print(length($seq)) }' | datamash median 1
you can also implement a better median in awk as shown here:
that works like this:
cat *.fq | bioawk -c fastx '{ print(length($seq)) }' | bioawk ' { a[i++]=$1; } \
END { x=int((i+1)/2); if (x < (i+1)/2) print (a[x-1]+a[x])/2; else print a[x-1]; }'
with seqkit
$ seqkit stats -a test.fq
In output, Q2 is median value. If you want to extract fastq name, and median, try
$ seqkit stats test.fq -aT | cut -f1,10
with awk and st (available in brew, conda):
$ awk 'NR % 4==2 {print length ($1)}' hcc1395_tumor_rep1_r1.fastq |st --summary
min q1 median q3 max
151 151 151 151 151
$ awk 'NR % 4==2 {print length ($1)}' hcc1395_tumor_rep1_r1.fastq |st --median
151
Ministat is available in system repos for many GNU-Linux and BSD distros and you can try:
$ awk 'NR % 4==2 {print length ($1)}' hcc1395_tumor_rep1_r1.fastq | ministat -A |tr -s " "| cut -f5 -d " "
Median
151
Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
median of?
median length of reads in the fastq file.
You can use
fastqc
for that!