Entering edit mode
4.1 years ago
C4
▴
30
Hi I would like to remove all the duplicates that show up in the fastqc report for my DNA libraries. I am using Clumpify from bbmap which doesn't seem to remove all the duplicates with -dedup=T. I would like to remove all PCR duplicates, optical duplicates, and biological duplicates from my paired-end fastq files. What is the best way to do it? Thank you!
What are biological duplicates in FASTQ files?
"Technical duplicates arising from PCR artefacts, or biological duplicates which are natural collisions where different copies of exactly the same sequence are randomly selected." I took this from the explanation of duplicates in fastqc (link below) [1] I would like to basically remove all the duplicates seen in fastqc report, and clumpify doesn't seem to be working so well. Please let me know if something was unclear, I am fairly new to this [1]:https://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/3%20Analysis%20Modules/8%20Duplicate%20Sequences.html
Are you sure you should be removing biological duplicates (which I guess are reads duplicated because they represent meaningful relative abundance of underlying biological sequences) is something you should be doing with ChIP-Seq data?
It makes sense to exclude such duplicates in Whole Genome/Exome Sequencing, but not in RNAseq, for example. Not all of FASTQC's "errors" are to be acted on and ameliorated in every case. In some cases, they can be explained by perfectly rational experimental reasons.
The library complexity of my ATAC-seq (not RNA-seq) is quite low, and since we sequenced quite deeply, it has led to many duplications. The overall biological duplicates present is actually very low as compared to PCR duplicates. Therefore, I would like to remove all the duplicates from fastq files that showed up on fastqc. Please let me know a tool that can efficiently remove PCR duplicates from paired-end fastq files? I tried clumpify which doesn't seem to be removing all of them, at least in my experience. Thanks for the advise.
I would first
markDuplicates
frompicard
suite, standard values should be fine here.Then you can use
samtools view
to keep only reads that are not duplicated.With
-F
flag you are excluding the reads marked as duplicates, not mapped, not in proper pairs.With
-f
flag you are including the reads mapped in proper pairs.See explanation of sam flags here
Note: You could also remove the duplicates directly from picard by setting the
REMOVE_DUPLICATES=TRUE
option. However, I prefer to do it with samtools.Hope it helps!
I appreciate this, but was hoping to remove duplicates from fastqs. Any suggestions for that?
How did you reach that conclusion? What is the
clumpify.sh
command you are using?Hi, I tested the output fastq using fastqc and saw that some reads were removed by clumpify but not all of them. This was my command for 100bp R1/R2:
Should I be merging the paired-end fastq files before clumpify using bbmerge? Please let me know if I could make any changes to my command. Thanks!
clumpify
is looking at the sequence of reads end-to-end when it is deciding if something is a sequence duplicate. In case of paired-end reads it looks at both reads in order to make that decision during comparisons. I hope you were supplying R1 and R2 files toin1=
andin2=
option (it is not clear from your command line). What is the length of your reads? Usingk=90
sounds overly long. Is there a reason to change that default?I would say try following first and then post the stats you get here.
By default
clumpify
allows 2 errors but you can make the comparison strict by usinghdist=0
option.FastQC does not look at entire dataset when it does QC estimates so you can't rely on it and say that clumpify is not working. Can you post a screenshot of what your pre-clumpify data looks like?
In case you have not seen this blog post from authors of FastQC about sequence duplication you may want to take a look.
Hi geomax, thanks a lot! I am using k=90 as my reads are 100bps, so as to remove more exact duplicates. This is my fastqc duplication before clumpify: https://ibb.co/yRtkNY8 This is after clumpify: https://ibb.co/Bck0Gtm I will also post the stats that I get after using cumpify with default
The value of
k
should be smaller than 1/2 length of your reads otherwise how would the program be able to identify k-mers to do initial matching.to do that you should just add
hdist=0
to your command. That will only allow perfect sequence matches.Would you suggest using bbmerge prior to using clumpify dedup for paired-end fastq files. I take this from here, where they suggest using bbmerge and concatenating merged and unmerged fastq files before clumpify, since when paired-end reads are provided, clumpify clumps only based on R1 https://jgi.doe.gov/data-and-tools/bbtools/bb-tools-user-guide/clumpify-guide/ I have noticed that bbmerged paired-end fastqs work like a charm for duplicate removal, as comapred to provding paired-end files with in1 and in2 streams. However, I would not know how to unmerge the merged fastq files, is there a tool for that? Perhaps, I could cat the two paired-end fastqs? (This is because I realized clumpify would only mark a read as duplicate if it is present in both read pairs). Thanks so much for your advise!
If your reads are able to merge then by all means follow Brian's advice on the guide page you linked above.
While there isn't a named tool that would do that but the following workaround should work.
grep "^@Seq_ID" merged file | sed 's/@//' > merged_headers
.grep "^@Seq_ID" clumped file | sed 's/@//' > clumped_headers
.merged_headers
andclumped_header
by usingcomm
. Something likecomm -12 merged_headers clumped_headers > file_with_headers_that_were_common
.filterbyname.sh
usingnames=file_with_headers_that_were_common
.Hi GenoMax, thanks a lot for this. Your work-around works really well, however I am losing many reads, because of which my mapping% is decreasing. I believe that is just because I do not have very good data. Could you tell me why you suggested finding common between merged and clumped? Why couldn't I just use all the clumped fastq headers (including merged and unmerged)?
Since you had originally asked
As we can't "unmerge" merged data we have to go back to the original data and get the reads from there. By doing an intersection of
clumped
andmerged
headers we can find out which headers made it to final clumped data.My understanding was that dedupe would work better if I bbmerge R1+R2. However, wouldn't clumped_headers have all the R1 and R2 fastq headers that got clumped and deduped. So, I could just use those clumped_headers to filter out R1 and R2, instead of only merged+clumped headers.
If you tried that and it works then no problem. Since you were going to merge "bbmerged" reads and singletons I was being careful.