Duplicate/identical reads in fastq file
6.0 years ago

I got paired-end sequencing results from the core facility. While checking for duplicated reads as part of QC, I found read entries like these in the R1 file (headers only):

@NB501140:187:HHVLCBGX9:2:23206:9999:2086 1:N:0:CACCTTAC
@NB501140:187:HHVLCBGX9:2:23206:9999:2086 1:N:0:CACCTTAC

@NB501140:187:HHVLCBGX9:2:23206:9999:7430 1:N:0:CACCTTAC
@NB501140:187:HHVLCBGX9:2:23206:9999:7430 1:N:0:CACCTTAC

Is this duplication possible?

Querying the file further printed the following read information:

 zgrep -iw -A 3 "NB501140:187:HHVLCBGX9:2:23206:9999:2086 1:N:0:CACCTTAC" R1.fastq.gz 

@NB501140:187:HHVLCBGX9:2:23206:9999:2086 1:N:0:CACCTTAC
AGCATCAGAGGCACCCACCTGAGGAAAGTCCTCGCTGTCCATGGCCTGCAGAGTCTGGTTGGCTGTCTCCAGAAGCTCCTCGGAGCTCTCCAGGGCCCGCGTGCAGGCAGCCAGCTGGTTCTGTGGATACCAGGCACCAAAGGAGGGGACA
+
AAAAAEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEAAEEEAAEEEEAEEEAEE<E/
--
@NB501140:187:HHVLCBGX9:2:23206:9999:2086 1:N:0:CACCTTAC
AGCATCAGAGGCACCCACCTGAGGAAAGTCCTCGCTGTCCATGGCCTGCAGAGTCTGGTTGGCTGTCTCCAGAAGCTCCTCGGAGCTCTCCAGGGCCCGCGTGCAGGCAGCCAGCTGGTTCTGTGGATACCAGGCACCAAAGGAGGGGACA
+
AAAAAEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEAAEEEAAEEEEAEEEAEE<E/

How many are there? You can check using seqkit:

zcat your_fastq_R1.fq.gz | seqkit rmdup -s -i -m -o clean.fa.gz -d duplicated.fa.gz -D duplicated.detail.txt

Source : https://bioinf.shenwei.me/seqkit/usage/#duplicate


About 25k sequences are like that, @Vijay Lakhujani.


So, are you concerned about reads with identical headers and identical sequences, or about reads where only one of the two is identical?


All of them. All three (read name, read sequence, and base qualities) are exactly identical between the duplicated entries, which is kind of hard to believe. I wanted to know whether this is fairly common, as I haven't come across identically duplicated reads in the past, Vijay Lakhujani.
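
A check like this can be scripted by grouping records by read name and comparing sequence and quality strings. The following is only a minimal Python sketch, assuming a well-formed FASTQ with exactly four lines per record; `read_fastq` and `classify_duplicates` are illustrative names, not part of seqkit or any other tool mentioned in this thread. It keeps every record in memory, so it may not suit very large files.

```python
import gzip
from collections import defaultdict


def read_fastq(path):
    """Yield (name, seq, qual) from a 4-line-per-record FASTQ (plain or .gz)."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        while True:
            header = handle.readline()
            if not header:
                break
            seq = handle.readline().rstrip("\n")
            handle.readline()  # '+' separator line
            qual = handle.readline().rstrip("\n")
            # Drop the leading '@' so the name matches what grep would show.
            yield header.rstrip("\n")[1:], seq, qual


def classify_duplicates(records):
    """Count read names that occur more than once.

    Returns (exact, name_only):
      exact     - names whose copies all share the same sequence and quality
      name_only - names whose copies differ in sequence or quality
    """
    variants = defaultdict(set)
    counts = defaultdict(int)
    for name, seq, qual in records:
        variants[name].add((seq, qual))
        counts[name] += 1
    repeated = [name for name, count in counts.items() if count > 1]
    exact = sum(1 for name in repeated if len(variants[name]) == 1)
    return exact, len(repeated) - exact
```

For example, `classify_duplicates(read_fastq("R1.fastq.gz"))` would report how many repeated names are byte-for-byte copies versus reused names with different content.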


Well, I for one have never encountered this before.

It looks very dodgy to me as well.


Thanks, Vijay Lakhujani. I have used this for duplicate read identification. Since I had duplicate read names, I used '-n' instead of '-s'.

$ seqkit rmdup R1.fastq.gz -n -o clean.fastqc.gz -D duplicates.txt

Hi!

I also used the same command; however, it's producing a truncated FASTQ file for some reason.

    10KIVNT_S1_L001_R1_001.fastq.gz: unexpected end of file
@A00804:41:HNJ53DSXX:1:2125:14055:13244 1:N:0:TCCGGAGA+CCTATTCT
CAGGATGTGGAGTGCAGCAGTCCCACGGAGATCTGTTAACTCTGTGACTCTGAGCATGTACAGAACATAATCTCTGAGCCCATTCTCTCTTCTGAGAGAAAAAATAAACACACCTATTCAATGTTTTTGCTGCTTAATAAGTGATAATCTC
+
FFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFF,FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,FFF,FFFFFF
@A00804:41:HNJ53DSXX:1:2125:14470:13244 1:N:0:TCCGGAGA+CCTATCCT
GTGTGGCATTTCCAATCAGCATTGATAGATGTAATTAGGGTTCCACTCTACAAATTGTTCACAGTGAGGGTAAGAAGGTAAACAAGAATCAAAGGGTCTTGTCTGTGATGAAGATTCATTTTTCCCCTGGGAAAAGGGAAAATAACTGGA
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFF:FFFFFFFF:FFFFFFFFF:FFFFFFFF::FFFF,FFFFFFFFFF
@A00804:41:HNJ53DSXX:1:2125:14959:13244 1:N:0:TCCGGAGA+CCTATCCT
TTTAGTAGATTTAGTAATACTATCTTCCCAAGCCCTAAAATGCTCAAAACCTGCCAGCCAAAAATGGTGAGGAGGGACAGATAGGAACTCTGTGTGGCACTTGGTTATTAGCCTGG

You very likely have a corrupted data file to begin with. See if you can download a fresh copy. Ask for md5sum values to confirm you have a complete copy.
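
Both checks suggested here (comparing checksums and confirming the file is complete) can be done with the Python standard library. This is a rough sketch, not a replacement for `md5sum` or `gzip -t`; the function names `md5_of_file` and `gzip_is_complete` are made up for illustration.

```python
import gzip
import hashlib
import zlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def gzip_is_complete(path, chunk_size=1 << 20):
    """Return True if the gzip stream decompresses to its end without error."""
    try:
        with gzip.open(path, "rb") as handle:
            while handle.read(chunk_size):
                pass
    except (EOFError, OSError, zlib.error):
        # A truncated or corrupted stream raises one of these while reading.
        return False
    return True
```

The digest from `md5_of_file` would be compared against the value supplied by whoever generated the file; a mismatch, or `gzip_is_complete` returning False, points to an incomplete or damaged copy.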


Hi Genomax, thanks for your input. We have sent a request to check whether something went wrong at the demultiplexing step. However, I wonder whether I am missing something. Isn't it the job of MarkDuplicates to mark such duplicates rather than throw an error? Aren't PCR duplicates assigned the same read name? If not, how and why do the read names of PCR duplicates differ from that of their primary alignment?


"isn't it like PCR duplicates are assigned the same read name."

There should be no duplicate read names in a normal Illumina read file. Each sequence that passed QC came from a unique cluster that originated from a single library fragment. PCR duplicates are generally identified by identical end-to-end sequence in both reads of a pair.
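
To illustrate the distinction: sequence-level duplicates have different read names but the same sequence in both mates. A minimal Python sketch of that idea follows, assuming R1 and R2 records arrive in the same order as (name, seq, qual) tuples from any FASTQ parser; `count_sequence_duplicates` is a hypothetical helper. Note that alignment-based tools such as MarkDuplicates work from mapping positions rather than raw sequence, so this is only an approximation of the concept.

```python
from collections import Counter


def count_sequence_duplicates(r1_records, r2_records):
    """Count read pairs whose R1 and R2 sequences both match an earlier pair.

    Read names are ignored on purpose: duplicates of this kind come from
    different clusters and therefore carry different names.
    """
    pair_counts = Counter()
    for (_, seq1, _), (_, seq2, _) in zip(r1_records, r2_records):
        pair_counts[(seq1, seq2)] += 1
    # Every copy beyond the first of a given (R1, R2) sequence pair is a duplicate.
    return sum(count - 1 for count in pair_counts.values())
```

Pairs that match in R1 but differ in R2 are not counted, which is why both mates are needed for this kind of check.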


I've never seen this. Are the duplicated reads randomly placed, or do the duplicates immediately follow the original? Are the reads repeated in R2 as well?
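
These questions can be answered by recording where each repeated name occurs in the file. Below is a small Python sketch under the assumption that the read names have already been extracted in file order (for instance, every fourth line starting from the first); `duplicate_positions` and `adjacency_summary` are illustrative names only.

```python
from collections import defaultdict


def duplicate_positions(names):
    """Map each repeated read name to the 0-based record indices where it occurs."""
    positions = defaultdict(list)
    for index, name in enumerate(names):
        positions[name].append(index)
    return {name: idx for name, idx in positions.items() if len(idx) > 1}


def adjacency_summary(dup_positions):
    """Return (adjacent, scattered) counts of repeated names.

    A name counts as 'adjacent' when all of its copies occupy consecutive
    records; indices are in ascending order because they were collected
    in file order.
    """
    adjacent = sum(
        1 for idx in dup_positions.values() if idx[-1] - idx[0] == len(idx) - 1
    )
    return adjacent, len(dup_positions) - adjacent
```

Running the same two functions on the R2 names and comparing the resulting positions with those from R1 would show whether the repeats occur at the same records in both files.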


I guess this thread can be closed. I talked with the core and they agreed to take another look at the data. Thanks @Vijay Lakhujani, lieven.sterck, and h.mon.


This definitely looks like data/file corruption. If the original files don't have this problem then you would need to get a fresh copy (ask for md5sums to compare). Otherwise the core will have to re-run demultiplexing and regenerate the data.


Hi, I am facing the same issue. Is there a workaround for this? I have been unable to resolve it.

Because two reads have the same header, I am running into a problem at the MarkDuplicates step.
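
The advice earlier in this thread is to obtain a fresh, verified copy of the data, since repeated names point to corruption. If a stopgap is still needed, one option in the spirit of the `seqkit rmdup -n` command above is to keep only the first record for each read name. The Python sketch below is a hedged illustration, not a fix for the underlying problem; `drop_repeated_names` is a hypothetical helper, and it assumes records are (name, seq, qual) tuples from any FASTQ parser.

```python
def drop_repeated_names(records):
    """Yield records in order, skipping any whose read name was already seen."""
    seen = set()
    for name, seq, qual in records:
        if name in seen:
            continue  # later copy of a name that has already been emitted
        seen.add(name)
        yield name, seq, qual
```

For paired-end data the same filtering would have to be applied to R1 and R2, and the mates only stay in sync if the repeats occur at the same records in both files, which is an assumption worth verifying first.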
