Hi,
I am currently analyzing some data where I noticed that the fastq files contain identical sequence IDs (e.g. NB501818:81:HWF5FBGX3:2:13210:10005:15702).
For example, here is a small section from the output of a queryname-sorted unaligned BAM file (for simplicity, I am showing only the ID and the sequence):
NB501818:81:HWF5FBGX3:2:13210:10005:15702 CAGCACTTTTGGAACAGGTTTTTTTTGTGTTTTTTATCACAATCGCAGCATCGTCGACAGATTGCAGTATTTCGT
NB501818:81:HWF5FBGX3:2:13210:10005:15702 CAGCACTTTTGGAACAGGTTTTTTTTGTGTTTTTTATCACAATCGCAGCATCGTCGACAGATTGCAGTATTTCGT
NB501818:81:HWF5FBGX3:2:13210:10005:15702 CTGTAACAGAAGACAAGTACGAAATACTGCAATCTGTCGACGATGCTGCGATTGTGATAAAAAACACAAAAAAAAC
NB501818:81:HWF5FBGX3:2:13210:10005:15702 CTGTAACAGAAGACAAGTACGAAATACTGCAATCTGTCGACGATGCTGCGATTGTGATAAAAAACACAAAAAAAAC
Any ideas why this could happen? While these reads do have the same sequence, that is obviously not the reason for the identical read names. Some of my searching has pointed to an issue with the tiles within the flowcell, but I wanted to hear some suggestions from this community.
Thanks!
Note: To clarify, here are the first two IDs from the fastq file:
@NB501818:81:HWF5FBGX3:2:13210:10005:15702 1:N:0:TCCTGAGC
CAGCACTTTTGGAACAGGTTTTTTTTGTGTTTTTTATCACAATCGCAGCATCGTCGACAGATTGCAGTATTTCGT
--
@NB501818:81:HWF5FBGX3:2:13210:10005:15702 1:N:0:TCCTGA
CAGCACTTTTGGAACAGGTTTTTTTTGTGTTTTTTATCACAATCGCAGCATCGTCGACAGATTGCAGTATTTCGT
Is the same read mapped at different genomic positions? Show us the locations and the SAM flags.
Hi Pierre, as I said, these are unaligned BAM files, so there are no locations; the flags are 77, 77, 141, 141, in that order. These repetitions therefore come directly from the fastq files.
Just in case you wanted to see the entire thing:
Are these your data or did you download it from somewhere?
Yes, these are my own data; they were not downloaded.
At this point I would say that the BAM is just broken... you know, sometimes strange things happen...
Not sure I follow. There are applications for which you need unmapped BAM files, so this format is expected. The point is that this is a representation of what is present in the raw fastq files.
you shouldn't find the same read twice.
Yes, that is exactly what I thought. Do you have any suggestions for what could have caused this? I have already contacted the core, and as I mentioned, some of my searching indicated that it could be an issue with the tiles. Any other hypothesis would be welcome!
To clarify, here are the same IDs (from the first two lines) from the fastq file, if it helps.
This data must have been reprocessed with bcl2fastq using two separate --use-bases-mask settings for the index read (see the truncated index sequence in the header). Why both are in the same file is a mystery.
That's a good point!
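To check how widespread this is, something along the following lines could list every read ID that occurs more than once in a fastq file, together with the index sequences found in its headers. This is only a rough sketch: it assumes plain 4-line fastq records (no wrapped sequences, no grep "--" separators) and Illumina-style headers like the ones shown in the question; the function names are made up for illustration, and the quality strings in the demo are placeholders, not real data.

```python
import io
from collections import defaultdict


def index_variants_by_read(fastq_handle):
    """Map (read ID, read number) -> list of index strings seen in the headers.

    Assumes 4-line records with headers like
    '@<read ID> <read number>:<filter>:<control>:<index>'.
    """
    seen = defaultdict(list)
    for line_no, line in enumerate(fastq_handle):
        if line_no % 4 != 0:  # only every 4th line is a header
            continue
        read_id, _, comment = line.rstrip("\n").partition(" ")
        fields = comment.split(":")  # e.g. ['1', 'N', '0', 'TCCTGAGC']
        seen[(read_id.lstrip("@"), fields[0])].append(fields[-1])
    return seen


def duplicated_reads(seen):
    """Keep only the reads whose ID occurs more than once."""
    return {key: idx for key, idx in seen.items() if len(idx) > 1}


# Demo with the two headers from the question (qualities are placeholders).
seq = "CAGCACTTTTGGAACAGGTTTTTTTTGTGTTTTTTATCACAATCGCAGCATCGTCGACAGATTGCAGTATTTCGT"
read_id = "NB501818:81:HWF5FBGX3:2:13210:10005:15702"
demo = "".join(
    f"@{read_id} 1:N:0:{index}\n{seq}\n+\n{'I' * len(seq)}\n"
    for index in ("TCCTGAGC", "TCCTGA")
)
for (rid, read_number), indexes in duplicated_reads(
    index_variants_by_read(io.StringIO(demo))
).items():
    print(rid, "read", read_number, "indexes:", ", ".join(indexes))
```

If every duplicated ID shows one full-length and one truncated index, that would be consistent with two conversion runs having been concatenated into one file; for a real (possibly gzipped) file, the handle would come from open() or gzip.open(..., "rt") instead of io.StringIO.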