Question

Number of reads in Fastq and Bowtie

0

Entering edit mode

5.8 years ago

yp19 ▴ 70

Hi guys, When I check the number of reads in my fastq (paired end library), I get 4071779 paired reads.

I do this using "wc -l fwd_paired.fastq" and then dividing the number by 4. or just by checking the FASTQC report basic statistics.

When I try to align these reads using Bowtie, the first line of the summary indicates i only have 2761561 paired reads

2761561 reads; of these:
  2761561 (100.00%) were paired; of these:
    2682638 (97.14%) aligned concordantly 0 times
    6280 (0.23%) aligned concordantly exactly 1 time
    72643 (2.63%) aligned concordantly >1 times
    ----
    2682638 pairs aligned concordantly 0 times; of these:
      121 (0.00%) aligned discordantly 1 time
    ----
    2682517 pairs aligned 0 times concordantly or discordantly; of these:
      5365034 mates make up the pairs; of these:
        4990100 (93.01%) aligned 0 times
        17463 (0.33%) aligned exactly 1 time
        357471 (6.66%) aligned >1 times
9.65% overall alignment rate

Does anyone know why bowtie is reporting 2761561 when I have 4071779? I also get warning messages with the bowtie run that look like this:

Warning: skipping mate #2 of read 'MN00234:32:000H2LGW3:1:23103:22530:9439 2:N:0:GCCAAT' because length (1) <= # seed mismatches (0) Warning: skipping mate #2 of read 'MN00234:32:000H2LGW3:1:23103:22530:9439 2:N:0:GCCAAT' because it was < 2 characters long

but there are only 100 of these.

any ideas?

Thanks a lot.

sequencing genome bowtie fastq • 1.5k views

ADD COMMENT • link updated 5.8 years ago by h.mon 35k • written 5.8 years ago by yp19 ▴ 70

0

Entering edit mode

Did you trim the fastqs? If yes, how did you trim them?

ADD REPLY • link 5.8 years ago by h.mon 35k

0

Entering edit mode

Yes, I am using trimmomatic, code below:

java -jar $EBROOTTRIMMOMATIC/trimmomatic-0.36.jar PE -phred33 $dir/$R1 $dir/$R2 fwd_paired.fastq fwd_unpaired.fastq rev_paired.fastq rev_unpaired.fastq LEADING:3 TRAILING:3 SLIDINGWINDOW:4:25

ADD REPLY • link 5.8 years ago by yp19 ▴ 70

score 3 · Accepted Answer · 2019-02-21

3

Entering edit mode

5.8 years ago

h.mon 35k

You are not using MINLEN with Trimmomatic, so you end up with really short (or even zero length) sequences, and bowtie is discarding the pairs where at least of of the sequences is shorter than the seed length.

In addition, unless you have a specific reason to hard-trim 3 bases from beginning and end of sequences, I would advise you to avoid doing so.

ADD COMMENT • link 5.8 years ago by h.mon 35k

0

Entering edit mode

That makes sense, thanks a lot for the explanation! I will look into adding a MINLEN. Is there a recommended value for de novo genome assemblies? I'm thinking of starting off with 3 since the seed length default for bowtie appears to be 2

ADD REPLY • link 5.8 years ago by yp19 ▴ 70

1

Entering edit mode

A MINLEN of 3 is useless, it would map to hundreds or thousands of locations. The minimum I use depends on the project, quality of reads, sequencing depth, and other factors, but as a rule of thumb, I will never go bellow 35bp - unless, of course, if working with miRNA. See this discussion:

How to choose Trimmomatic's parameter 'MINLEN '?

ADD REPLY • link 5.8 years ago by h.mon 35k

0

Entering edit mode

Thanks so much for the information & the link!

ADD REPLY • link 5.8 years ago by yp19 ▴ 70