Question

Reasons for identical sequence ID in fastq files

0

Entering edit mode

7.0 years ago

lshepard ▴ 480

Hi,

I am currently analyzing some data where I noticed the fastq files contained identical sequence IDs (ex: NB501818:81:HWF5FBGX3:2:13210:10005:15702).

For example, here a small section from the output of a query sorted unaligned bam file (for simplicity, I am showing only the ID + sequence):

NB501818:81:HWF5FBGX3:2:13210:10005:15702  CAGCACTTTTGGAACAGGTTTTTTTTGTGTTTTTTATCACAATCGCAGCATCGTCGACAGATTGCAGTATTTCGT     
NB501818:81:HWF5FBGX3:2:13210:10005:15702  CAGCACTTTTGGAACAGGTTTTTTTTGTGTTTTTTATCACAATCGCAGCATCGTCGACAGATTGCAGTATTTCGT     
NB501818:81:HWF5FBGX3:2:13210:10005:15702  CTGTAACAGAAGACAAGTACGAAATACTGCAATCTGTCGACGATGCTGCGATTGTGATAAAAAACACAAAAAAAAC    
NB501818:81:HWF5FBGX3:2:13210:10005:15702  CTGTAACAGAAGACAAGTACGAAATACTGCAATCTGTCGACGATGCTGCGATTGTGATAAAAAACACAAAAAAAAC

Any ideas of why this could happen? While these do have the same sequence, obviously this is not the reason for same read names. Some of my search has pointed to an issue in the tiles within the flowcell, but I wanted to hear some suggestion from this community.

Thanks!

Note: For clarify, here are the first two IDs, from the fastq file:

@NB501818:81:HWF5FBGX3:2:13210:10005:15702 1:N:0:TCCTGAGC
CAGCACTTTTGGAACAGGTTTTTTTTGTGTTTTTTATCACAATCGCAGCATCGTCGACAGATTGCAGTATTTCGT
--
@NB501818:81:HWF5FBGX3:2:13210:10005:15702 1:N:0:TCCTGA
CAGCACTTTTGGAACAGGTTTTTTTTGTGTTTTTTATCACAATCGCAGCATCGTCGACAGATTGCAGTATTTCGT

next-gen • 3.3k views

ADD COMMENT • link updated 7.0 years ago by Devon Ryan 104k • written 7.0 years ago by lshepard ▴ 480

0

Entering edit mode

For example, here a small section from the output of a query sorted unaligned bam file (for simplicity, I am showing only the ID + sequence):

same read mapped at different genomic positions ? show us the locations and the sam flags.

ADD REPLY • link 7.0 years ago by Pierre Lindenbaum 164k

0

Entering edit mode

Hi Pierre, as I have said, these are unaligned bam files, thus no location, the flags are either 77, 77, 141, 141 in that order. Thus, these repetition are directly from the fastq files.

ADD REPLY • link 7.0 years ago by lshepard ▴ 480

0

Entering edit mode

Just in case you wanted to see the entire thing:

NB501818:81:HWF5FBGX3:2:13210:10005:15702       77      *       0       0       *       *       0       0       CAGCACTTTTGGAACAGGTTTTTTTTGTGTTTTTTATCACAATCGCAGCATCGTCGACAGATTGCAGTATTTCGT     AAAAAEEEEEEEEEEEEEEEAAEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEAEE/EAEEEEEEEEAEEEEEEE     RG:Z:A
NB501818:81:HWF5FBGX3:2:13210:10005:15702       77      *       0       0       *       *       0       0       CAGCACTTTTGGAACAGGTTTTTTTTGTGTTTTTTATCACAATCGCAGCATCGTCGACAGATTGCAGTATTTCGT     AAAAAEEEEEEEEEEEEEEEAAEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEAEE/EAEEEEEEEEAEEEEEEE     RG:Z:A
NB501818:81:HWF5FBGX3:2:13210:10005:15702       141     *       0       0       *       *       0       0       CTGTAACAGAAGACAAGTACGAAATACTGCAATCTGTCGACGATGCTGCGATTGTGATAAAAAACACAAAAAAAAC    AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEAEEEEEEEEEEEEEEEEEEEE<EEEEEEEEEE/E    RG:Z:A
NB501818:81:HWF5FBGX3:2:13210:10005:15702       141     *       0       0       *       *       0       0       CTGTAACAGAAGACAAGTACGAAATACTGCAATCTGTCGACGATGCTGCGATTGTGATAAAAAACACAAAAAAAAC    AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEAEEEEEEEEEEEEEEEEEEEE<EEEEEEEEEE/E    RG:Z:A

ADD REPLY • link 7.0 years ago by lshepard ▴ 480

0

Entering edit mode

Are these your data or did you download it from somewhere?

ADD REPLY • link 7.0 years ago by ATpoint 85k

0

Entering edit mode

Yes, they were not downloaded.

ADD REPLY • link 7.0 years ago by lshepard ▴ 480

1

Entering edit mode

at this point I would say that the bam is just broken... you know, sometimes, strange things happen...

Text justification + VCF file : Don't. Please. ftp://ftp.hgsc.bcm.edu/VWFVariantsStudy/VWF_filtered%20out%20pseudogene.vcf 😭 pic.twitter.com/Di2SiP3haF
— Pierre Lindenbaum (@yokofakun) November 7, 2017

ADD REPLY • link 7.0 years ago by Pierre Lindenbaum 164k

0

Entering edit mode

Not sure I follow? There are applications which you need unmapped bam files, so this is what is expected out of it. The point is that this is a representation from what is present in the raw fastq files.

ADD REPLY • link 7.0 years ago by lshepard ▴ 480

1

Entering edit mode

you shouldn't find the same read twice.

ADD REPLY • link 7.0 years ago by Pierre Lindenbaum 164k

0

Entering edit mode

Yes, that is exactly what I thought. Do you have any suggestions from what could have caused this? I have already contacted the core, and like I mentioned, some of my search indicated that it could be an issue with the tiles. Any other hypothesis would be welcomed!

ADD REPLY • link 7.0 years ago by lshepard ▴ 480

0

Entering edit mode

To clarify, here is the same IDs (from the first two lines) from the Fastq file if it helps.

@NB501818:81:HWF5FBGX3:2:13210:10005:15702 1:N:0:TCCTGAGC
CAGCACTTTTGGAACAGGTTTTTTTTGTGTTTTTTATCACAATCGCAGCATCGTCGACAGATTGCAGTATTTCGT
--
@NB501818:81:HWF5FBGX3:2:13210:10005:15702 1:N:0:TCCTGA
CAGCACTTTTGGAACAGGTTTTTTTTGTGTTTTTTATCACAATCGCAGCATCGTCGACAGATTGCAGTATTTCGT

ADD REPLY • link 7.0 years ago by lshepard ▴ 480

3

Entering edit mode

This data must have been reprocessed with two separate --use-base-masks for the index read with bcl2fastq (see truncated index sequence in header). As to why both are in the same file is a mystery.

ADD REPLY • link 7.0 years ago by GenoMax 147k

0

Entering edit mode

That's a good point!

ADD REPLY • link 7.0 years ago by lshepard ▴ 480

score 2 · Accepted Answer · 2017-11-17

To summarize the above discussion:

This seems to be due to whoever processed the data running bcl2fastq twice, with different masking settings (in one case using the full 8 base index and the second masking the last two bases). Presumably they did this because there were problems during the index read.
For some unknown reason, they concatenated the results from these two demultiplexing runs (for what it's worth, you can't get bcl2fastq to do this for you, you'd have to manually concatenate the files).
One should remove the duplicate reads. My guess is that the reads with the partially masked index should be used (otherwise you're presumably throwing away a lot of the data).

There's probably some awk magic for step 3, but here's a bit of python:

#!/usr/bin/env python
import sys

while True:
    try:
        # Read in an entry
        l1 = sys.stdin.next()
        l2 = sys.stdin.next()
        l3 = sys.stdin.next()
        l4 = sys.stdin.next()

        # check the length of the index
        if len(l1.split(" ")[1].split(":")[-1]) == 6:
            sys.stdout.write(l1)
            sys.stdout.write(l2)
            sys.stdout.write(l3)
            sys.stdout.write(l4)
    except:
        break

If you saved that as foo.py, then you'd run it with zcat sample_R1.fastq.gz | ./foo.py | gzip -c > sample_R1.filtered.gz.