7.0 years ago
Nicolas Rosewick
11k
Hi,
I have a specific enriched DNA-seq library to analyze (2x76 bp, sequenced on a NextSeq500).
The library is laid out as follows:
R1                              R2
==============>-----------------<===========#####@@@@@
=== : DNA fragment (should correctly align to the genome)
### : barcode
@@@ : some random sequence we introduce to increase the library complexity
Important things to know:
- The barcode and the random sequence always have the same lengths (12 and 14 nt, respectively)
- Each pair of reads has a different barcode (only PCR duplicates should share the same barcode and read sequences)
My goal is to remove the barcode and the random sequence from R2, but also from R1, since R1 and R2 can overlap if the DNA fragment to sequence is short (less than 2x76 = 152 bp).
Example of R1 and R2 overlapping. In this case R1 contains sequence from the barcode:
R1 =====================>
                     ||||
R2       <===========#####@@@@@
Is there a tool that handles such cases? My first idea would be to write an R script that extracts the barcode and random sequence from R2 and aligns them locally against R1.
Not what you are asking for, but chances are that you don't actually have to remove this and can just align it, and it will get soft-clipped.
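If you go the soft-clipping route, it may be worth checking how much actually gets clipped after alignment. Below is a minimal Python sketch that reads the soft-clip lengths from a SAM CIGAR string. The helper name `soft_clip_lengths` is made up for illustration. Note that SAM reports CIGAR in reference-forward orientation, so for a read aligned to the reverse strand the clipped tag shows up on the opposite side.

```python
import re

# One CIGAR operation: a length followed by an operation code from the SAM spec.
_CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def soft_clip_lengths(cigar):
    """Return (left, right) soft-clip lengths for a SAM CIGAR string.

    Hard clips (H) are ignored, and an unaligned read ("*") gives (0, 0).
    """
    ops = [(int(n), op) for n, op in _CIGAR_OP.findall(cigar) if op != "H"]
    if not ops:
        return (0, 0)
    left = ops[0][0] if ops[0][1] == "S" else 0
    # A read that is a single S operation is counted once, on the left.
    right = ops[-1][0] if len(ops) > 1 and ops[-1][1] == "S" else 0
    return (left, right)
```

A 76 bp read whose first 26 bases are the random sequence and barcode would ideally come back as `26S50M`, i.e. `soft_clip_lengths("26S50M")` gives `(26, 0)`.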
Yes, I know, but it would be nice to have clean reads for further analysis ;)
I think you can use cutadapt. If I'm not mistaken, it will remove the #### and the nucleotides that follow it from R1.
Yes, but in this case each read will have a different adapter to trim.
You can give only the #### sequence as input to cutadapt, allow it to match anywhere along the read, and have it removed together with the sequence that follows it.
Yes, but each read will have a different #### sequence.
Oh, I skipped this part when first reading :). Chances are good you'll end up coding it.
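For what it's worth, coding it could look roughly like the Python sketch below. It rests on several assumptions that are not confirmed in the thread. First, R2 is assumed to start with the 14-nt random sequence followed by the 12-nt barcode, which is how I read the diagram. Second, a read-through in R1 is assumed to show up as the reverse complement of those 26 nt, possibly cut off by the end of the read. Third, a simple mismatch count replaces a real local alignment, so indels are not handled. The names `trim_pair` and `find_tag_start`, and the parameters `min_overlap` and `max_error_rate`, are made up for illustration.

```python
RANDOM_LEN = 14   # length of the random sequence (from the question)
BARCODE_LEN = 12  # length of the barcode (from the question)
TAG_LEN = RANDOM_LEN + BARCODE_LEN  # assumed to be the first 26 nt of R2

_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def count_mismatches(a, b):
    """Count positions where two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def find_tag_start(r1, tag_rc, min_overlap=6, max_error_rate=0.1):
    """Return the leftmost index in r1 where tag_rc starts, or None.

    Near the 3' end of r1 only a prefix of tag_rc is compared, because
    the read may stop partway through the tag.
    """
    for start in range(len(r1) - min_overlap + 1):
        overlap = min(len(tag_rc), len(r1) - start)
        allowed = int(max_error_rate * overlap)
        window = r1[start:start + overlap]
        if count_mismatches(window, tag_rc[:overlap]) <= allowed:
            return start
    return None


def trim_pair(r1, r2):
    """Trim the per-pair tag from both reads.

    Returns (trimmed R1, trimmed R2, tag). The tag is cut from the 5' end
    of R2; its reverse complement and everything after it is cut from R1.
    """
    tag = r2[:TAG_LEN]
    r2_trimmed = r2[TAG_LEN:]
    start = find_tag_start(r1, reverse_complement(tag))
    r1_trimmed = r1 if start is None else r1[:start]
    return r1_trimmed, r2_trimmed, tag
```

To use this on FASTQ files you would iterate over R1 and R2 in lockstep and slice the quality strings with the same indices. Keeping the returned tag (for example in the read name) preserves the information needed to spot PCR duplicates later.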