Question: Mean and SD read length from a range of fastq files
2
2
Entering edit mode
7.7 years ago
eoin ▴ 30

Hi all,

I'm trying to write some code to generate mean read length data from a range of fastq files. awk '{if(NR%4==2) print NR"\t"$0"\t"length($0)}' HG1.fastq > readLength.txt

i've got as far as here from looking through other posts and trying to improve but i'm stuck on a couple of things. This command only works on a single file and will report the length of each read within that file separately.

I want to run a single command so the mean and Standard Dev of read lengths from all .fastq files within a folder are reported in a single .txt file, one sample per line. I gues SD might be difficult to calculate in a command so even just the mean read length.

e.g.the first 5 files in my folder are: ru1.fastq ru2.fastq hg3.fastq hg25.fastq ru7.fastq

obviously i'm a bit of a novice at this so all help would be appreciated !!

thanks a lot

fastq awk sed sequencing • 10.0k views
ADD COMMENT
0
Entering edit mode

Hello I'm so grateful for your help. I tried running your pipeline on a metagenomic file in the fastq format, and I got a very high stddev. Can you help me with this? total 84855052 avg=160.729787 stddev=2220.571952

ADD REPLY
8
Entering edit mode
7.7 years ago

Using awk:

awk 'BEGIN { t=0.0;sq=0.0; n=0;} ;NR%4==2 {n++;L=length($0);t+=L;sq+=L*L;}END{m=t/n;printf("total %d avg=%f stddev=%f\n",n,m,sq/n-m*m);}'  *.fastq
ADD COMMENT
0
Entering edit mode

thank you for your answer Pierre!!

This code will give us the total number of reads, their mean, and SD. But i think I wasn't clear in my post;

what I need is to get the mean and SD read length from each individual .fastq file within in the directory. So an example of the output

  • File Name Mean SD
  • ru1.fastq 235.4 -- 31.3
  • hg25.fastq 241.5 -- 35.5

and so on

sorry if I wasn't clear earlier and thank you so much for your response!!

ADD REPLY
3
Entering edit mode
for F in *.fasq
do 
echo  "$F   "
awk '(... same... script)' $F
done
ADD REPLY
0
Entering edit mode

Perfect. thank you so much. I had tried a loop but left out echo

thanks a lot

ADD REPLY
0
Entering edit mode

please mark this question as answered (green tick on the left)

ADD REPLY
0
Entering edit mode

Hello I'm so grateful for your help. I tried running your pipeline on a metagenomic file in the fastq format, and I got a very high stddev. Can you help me with this? total 84855052 avg=160.729787 stddev=2220.571952

ADD REPLY
0
Entering edit mode

The given algorithm is producing the variance instead the standard deviation. In awk, sqrt(x) gives the square root of x. So, you may change the last printf line by something like this:

printf("total %d avg=%f stddev=%f\n",n,m,sqrt(sq/n-m*m));
ADD REPLY
0
Entering edit mode
3.2 years ago
Ric • 0

seqtk comp FASTQ_FILE_HERE |cut -f 2 > temp.txt

and then

datamash mean 1 sstdev 1 < temp.txt

ADD COMMENT

Login before adding your answer.

Traffic: 1974 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6