Question

Demultiplex of V4 reads

6.4 years ago

agata88 ▴ 870

Hi all!

I have reads in R1 and R2 files, plus an I1 file (also FASTQ) containing the 12 nt indexes. Before dividing the reads into separate per-sample files, I performed a merging step.

I used the join_paired_ends.py script with the following command:

join_paired_ends.py -f <R1> -r <R2> -o demultiplexed/ -m fastq-join -b <I1> -p 15

Almost 94% of the reads were joined. Next, I ran the split_libraries_fastq.py script. When I looked at the histogram, the number of reads per sequence length varied widely:

Length  Count
249.0   14545440
.
.
.
489.0   1467

The amplification targeted the V4 region, which is around 291 nt long, and the sequencing was 2x250 bp. So my question is: how come I have 1467 reads that are almost 500 bp long? Is this contamination?

Should I discard all reads longer than 300 bp before further analysis? What do you think?

Thanks in advance! Best, Agata

miseq demultiplex 16S • 1.2k views

updated 6.4 years ago by h.mon 35k • written 6.4 years ago by agata88 ▴ 870

score 0 · Answer 1 · 2018-06-25

Your 489 bp reads amount to 0.01% when compared to the 249 bp reads alone. If you compare them to the whole interval from, say, 249-333 bp, this percentage will probably be even smaller. This is pretty minimal, and some contamination and artifacts are usual for next generation sequencing.
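
As a quick check of that figure, here is a small Python sketch that uses only the two histogram counts quoted in the question. The rest of the histogram is elided there, so the share over all reads cannot be computed from the post and would be somewhat lower still.

```python
# Counts taken from the two histogram rows quoted in the question.
count_249 = 14_545_440  # reads of length 249
count_489 = 1_467       # reads of length 489

# Share of 489 bp reads relative to the 249 bp bin alone.
share = count_489 / count_249
print(f"{share:.4%}")  # prints 0.0101%
```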

You could provide more stringent parameters to join_paired_ends.py. You used its default joining method, fastq-join, and the default minimum overlap of fastq-join is 6 bases, if I am not mistaken.
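
To see why the minimum overlap matters here, a rough sketch: assuming both reads are the full 250 bp with no trimming, a joined read of length M implies an overlap of 2 x 250 - M bases. The function name `implied_overlap` is hypothetical and only for illustration. It is not part of QIIME or fastq-join.

```python
READ_LEN = 250  # 2x250 bp run, as stated in the question


def implied_overlap(merged_len, read_len=READ_LEN):
    """Overlap implied by a joined read, assuming two untrimmed full-length reads.

    Only meaningful when read_len <= merged_len <= 2 * read_len.
    """
    if not read_len <= merged_len <= 2 * read_len:
        raise ValueError("merged length outside the range this simple model covers")
    return 2 * read_len - merged_len


print(implied_overlap(489))  # 11
print(implied_overlap(291))  # 209
```

Under those assumptions, the 489 bp reads were joined on an overlap of only 11 bases, just above the 6-base minimum, while a 291 bp amplicon would overlap by about 209 bases. Joins on such short overlaps are more likely to be spurious, and a higher minimum overlap would reject them. Shorter joined reads, such as the 249 bp ones, fall outside this simple model.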