Hi,
I’m looking for a tool or method to calculate the number of reads for different lengths in a .fq.gz file. Specifically, I want to evaluate how many reads drop out after applying a specific cutoff length to filter the reads.
Thanks
Hi,
I’m looking for a tool or method to calculate the number of reads for different lengths in a .fq.gz file. Specifically, I want to evaluate how many reads drop out after applying a specific cutoff length to filter the reads.
Thanks
Try seqkit.
Just call seqkit seq
with options -M/--max-len
and/or -m/--min-len
and pipe it to seqkit stats
.
For example, select reads shorter than 40 bp and count them:
seqkit seq -M 39 your.fastq.gz | seqkit stats
It will produce a pretty table like this:
file format type num_seqs sum_len min_len avg_len max_len
- FASTQ DNA 309 11,426 35 37 39
a tool or method to calculate the number of reads for different lengths in a .fq.gz file.
You could also run FastQC on the file. Once done look inside the zip file that FastQC produces. This contains data used for generating FastQC report. In that zip archive there is a file called fastqc_data.txt
which contains data used for various FastQC plots. A section in that file gives you exact sequence lengths and number of reads. An example excerpt below.
>>Sequence Length Distribution warn
#Length Count
50 1.0
51 1.0
52 3.0
53 1.0
54 0.0
55 1.0
56 1.0
57 0.0
58 1.0
59 2.0
60 1.0
61 2.0
62 4.0
63 1.0
64 1.0
65 5.0
66 6.0
Plain and simple bash:
gzip -cd your.fastq.gz | awk 'NR%4==2 {print length}' | sort -k1,1n | uniq -c
That will count for every read (that is the 2st, 6th, 10th and so on line in the fastq which is the sequence lines) the length of the sequence and then sorts and summarizes via uniq. For example, for a fastq file that underwent adapter trimming:
# Subset to 100k reads for speed
zcat my.fastq.gz | head -n 400000 | awk 'NR%4==2 {print length}' | sort -k1,1n | uniq -c
1 22
2 26
2 27
5 28
6 29
1 30
3 31
8 32
11 33
6 34
16 35
37 36
45 37
46 38
67 39
103 40
116 41
127 42
173 43
204 44
195 45
226 46
213 47
210 48
214 49
263 50
250 51
290 52
320 53
297 54
310 55
309 56
303 57
267 58
262 59
280 60
252 61
267 62
326 63
297 64
312 65
339 66
342 67
282 68
299 69
366 70
394 71
805 72
2464 73
88067 76
in the terminal, using gnuplot:
$ gunzip -c in.fastq.gz | paste - - - - | awk -F '\t' '{print length($2);}' | sort | uniq -c | awk '{printf("%s\t%d\n",$2,$1);}' | sort -n | gnuplot -e "set terminal dumb 120 30;set nokey ; plot '-' using 2:xtic(1) with lines"
600 ++--+--+---+--+---+--+---+--+---+--+---+---+--+---+--+---+--+---+--+---+---+--+---+--+---+--+---+--+---+--+--++
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + *
| *
| *|
500 ++ *+
| *|
| * |
| * |
400 ++ *++
| * |
| * |
| * |
300 ++ * ++
| * |
| * |
| * |
| * |
200 ++ * ++
| * |
| * |
| * |
100 ++ * ++
| * |
| *** |
****+ + + + + + + + + + + + + + + + + + + + + + + + + + + *** + + +
0 ++--**********************************************************************************************-+---+--+--++
35 51 61 63 64 65 66 67 70 73 74 75 76 77 78 80 81 83 85 88 89 90 91 92 93 94 95 96 97 98 99 100
Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
lovely to see pure bash in action :)
however, are we not missing the 6th line using this approach?
I made an edit, accidentally did
%8 and not %4.