Question

Analyze read lengths in Illumina data

0

Entering edit mode

12 weeks ago

María José ▴ 10

Hi,

I’m looking for a tool or method to calculate the number of reads for different lengths in a .fq.gz file. Specifically, I want to evaluate how many reads drop out after applying a specific cutoff length to filter the reads.

Thanks

human illumina libraries • 700 views

ADD COMMENT • link updated 12 weeks ago by Pierre Lindenbaum 166k • written 12 weeks ago by María José ▴ 10

5

Entering edit mode

12 weeks ago

ATpoint 87k

Plain and simple bash:

gzip -cd your.fastq.gz | awk 'NR%4==2 {print length}' | sort -k1,1n | uniq -c

That will count for every read (that is the 2st, 6th, 10th and so on line in the fastq which is the sequence lines) the length of the sequence and then sorts and summarizes via uniq. For example, for a fastq file that underwent adapter trimming:

    # Subset to 100k reads for speed
    zcat my.fastq.gz | head -n 400000 | awk 'NR%4==2 {print length}' | sort -k1,1n | uniq -c
      1 22
      2 26
      2 27
      5 28
      6 29
      1 30
      3 31
      8 32
     11 33
      6 34
     16 35
     37 36
     45 37
     46 38
     67 39
    103 40
    116 41
    127 42
    173 43
    204 44
    195 45
    226 46
    213 47
    210 48
    214 49
    263 50
    250 51
    290 52
    320 53
    297 54
    310 55
    309 56
    303 57
    267 58
    262 59
    280 60
    252 61
    267 62
    326 63
    297 64
    312 65
    339 66
    342 67
    282 68
    299 69
    366 70
    394 71
    805 72
   2464 73
   88067 76

ADD COMMENT • link 12 weeks ago by ATpoint 87k

1

Entering edit mode

lovely to see pure bash in action :)

however, are we not missing the 6th line using this approach?

ADD REPLY • link 12 weeks ago by lieven.sterck 15k

1

Entering edit mode

I made an edit, accidentally did %8 and not %4.

ADD REPLY • link 12 weeks ago by ATpoint 87k

1

Entering edit mode

12 weeks ago

Pierre Lindenbaum 166k

in the terminal, using gnuplot:

$ gunzip -c in.fastq.gz  | paste - - - - | awk -F '\t' '{print length($2);}' |  sort | uniq  -c | awk '{printf("%s\t%d\n",$2,$1);}'  | sort -n | gnuplot -e "set terminal dumb 120 30;set nokey ;  plot '-' using 2:xtic(1) with lines"



  600 ++--+--+---+--+---+--+---+--+---+--+---+---+--+---+--+---+--+---+--+---+---+--+---+--+---+--+---+--+---+--+--++
      +   +  +   +  +   +  +   +  +   +  +   +   +  +   +  +   +  +   +  +   +   +  +   +  +   +  +   +  +   +  +   *
      |                                                                                                             *
      |                                                                                                            *|
  500 ++                                                                                                           *+
      |                                                                                                            *|
      |                                                                                                           * |
      |                                                                                                           * |
  400 ++                                                                                                          *++
      |                                                                                                           * |
      |                                                                                                          *  |
      |                                                                                                          *  |
  300 ++                                                                                                         * ++
      |                                                                                                         *   |
      |                                                                                                         *   |
      |                                                                                                         *   |
      |                                                                                                        *    |
  200 ++                                                                                                       *   ++
      |                                                                                                        *    |
      |                                                                                                       *     |
      |                                                                                                       *     |
  100 ++                                                                                                      *    ++
      |                                                                                                      *      |
      |                                                                                                    ***      |
      ****+  +   +  +   +  +   +  +   +  +   +   +  +   +  +   +  +   +  +   +   +  +   +  +   +  +   + ***  +  +   +
    0 ++--**********************************************************************************************-+---+--+--++
     35  51 61  63 64  65 66  67 70  73 74  75  76 77  78 80  81 83  85 88  89  90 91  92 93  94 95  96 97  98 99  100

ADD COMMENT • link 12 weeks ago by Pierre Lindenbaum 166k

score 3 · Accepted Answer · 2025-01-30

Try seqkit.

Just call seqkit seq with options -M/--max-len and/or -m/--min-len and pipe it to seqkit stats.

For example, select reads shorter than 40 bp and count them:

seqkit seq -M 39 your.fastq.gz | seqkit stats

It will produce a pretty table like this:

file  format  type  num_seqs  sum_len  min_len  avg_len  max_len
-     FASTQ   DNA        309   11,426       35       37       39

score 1 · Accepted Answer · 2025-01-30

a tool or method to calculate the number of reads for different lengths in a .fq.gz file.

You could also run FastQC on the file. Once done look inside the zip file that FastQC produces. This contains data used for generating FastQC report. In that zip archive there is a file called fastqc_data.txt which contains data used for various FastQC plots. A section in that file gives you exact sequence lengths and number of reads. An example excerpt below.

>>Sequence Length Distribution  warn
#Length Count
50  1.0
51  1.0
52  3.0
53  1.0
54  0.0
55  1.0
56  1.0
57  0.0
58  1.0
59  2.0
60  1.0
61  2.0
62  4.0
63  1.0
64  1.0
65  5.0
66  6.0