Analyze read lengths in Illumina data
4
0
Entering edit mode
22 days ago
María José ▴ 10

Hi,

I’m looking for a tool or method to calculate the number of reads for different lengths in a .fq.gz file. Specifically, I want to evaluate how many reads drop out after applying a specific cutoff length to filter the reads.

Thanks

human illumina libraries • 506 views
ADD COMMENT
3
Entering edit mode
22 days ago

Try seqkit.

Just call seqkit seq with options -M/--max-len and/or -m/--min-len and pipe it to seqkit stats.

For example, select reads shorter than 40 bp and count them:

seqkit seq -M 39 your.fastq.gz | seqkit stats

It will produce a pretty table like this:

file  format  type  num_seqs  sum_len  min_len  avg_len  max_len
-     FASTQ   DNA        309   11,426       35       37       39
ADD COMMENT
1
Entering edit mode
22 days ago
GenoMax 149k

a tool or method to calculate the number of reads for different lengths in a .fq.gz file.

You could also run FastQC on the file. Once done look inside the zip file that FastQC produces. This contains data used for generating FastQC report. In that zip archive there is a file called fastqc_data.txt which contains data used for various FastQC plots. A section in that file gives you exact sequence lengths and number of reads. An example excerpt below.

>>Sequence Length Distribution  warn
#Length Count
50  1.0
51  1.0
52  3.0
53  1.0
54  0.0
55  1.0
56  1.0
57  0.0
58  1.0
59  2.0
60  1.0
61  2.0
62  4.0
63  1.0
64  1.0
65  5.0
66  6.0
ADD COMMENT
5
Entering edit mode
22 days ago
ATpoint 87k

Plain and simple bash:

gzip -cd your.fastq.gz | awk 'NR%4==2 {print length}' | sort -k1,1n | uniq -c

That will count for every read (that is the 2st, 6th, 10th and so on line in the fastq which is the sequence lines) the length of the sequence and then sorts and summarizes via uniq. For example, for a fastq file that underwent adapter trimming:

    # Subset to 100k reads for speed
    zcat my.fastq.gz | head -n 400000 | awk 'NR%4==2 {print length}' | sort -k1,1n | uniq -c
      1 22
      2 26
      2 27
      5 28
      6 29
      1 30
      3 31
      8 32
     11 33
      6 34
     16 35
     37 36
     45 37
     46 38
     67 39
    103 40
    116 41
    127 42
    173 43
    204 44
    195 45
    226 46
    213 47
    210 48
    214 49
    263 50
    250 51
    290 52
    320 53
    297 54
    310 55
    309 56
    303 57
    267 58
    262 59
    280 60
    252 61
    267 62
    326 63
    297 64
    312 65
    339 66
    342 67
    282 68
    299 69
    366 70
    394 71
    805 72
   2464 73
   88067 76
ADD COMMENT
1
Entering edit mode

lovely to see pure bash in action :)

however, are we not missing the 6th line using this approach?

ADD REPLY
1
Entering edit mode

I made an edit, accidentally did %8 and not %4.

ADD REPLY
1
Entering edit mode
22 days ago

in the terminal, using gnuplot:

$ gunzip -c in.fastq.gz  | paste - - - - | awk -F '\t' '{print length($2);}' |  sort | uniq  -c | awk '{printf("%s\t%d\n",$2,$1);}'  | sort -n | gnuplot -e "set terminal dumb 120 30;set nokey ;  plot '-' using 2:xtic(1) with lines"



  600 ++--+--+---+--+---+--+---+--+---+--+---+---+--+---+--+---+--+---+--+---+---+--+---+--+---+--+---+--+---+--+--++
      +   +  +   +  +   +  +   +  +   +  +   +   +  +   +  +   +  +   +  +   +   +  +   +  +   +  +   +  +   +  +   *
      |                                                                                                             *
      |                                                                                                            *|
  500 ++                                                                                                           *+
      |                                                                                                            *|
      |                                                                                                           * |
      |                                                                                                           * |
  400 ++                                                                                                          *++
      |                                                                                                           * |
      |                                                                                                          *  |
      |                                                                                                          *  |
  300 ++                                                                                                         * ++
      |                                                                                                         *   |
      |                                                                                                         *   |
      |                                                                                                         *   |
      |                                                                                                        *    |
  200 ++                                                                                                       *   ++
      |                                                                                                        *    |
      |                                                                                                       *     |
      |                                                                                                       *     |
  100 ++                                                                                                      *    ++
      |                                                                                                      *      |
      |                                                                                                    ***      |
      ****+  +   +  +   +  +   +  +   +  +   +   +  +   +  +   +  +   +  +   +   +  +   +  +   +  +   + ***  +  +   +
    0 ++--**********************************************************************************************-+---+--+--++
     35  51 61  63 64  65 66  67 70  73 74  75  76 77  78 80  81 83  85 88  89  90 91  92 93  94 95  96 97  98 99  100
ADD COMMENT

Login before adding your answer.

Traffic: 2400 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6