I'm looking for a convenient tool to demultiplex my Illumina paired-end (PE) data, specifically to extract read pairs that have one particular sequence in the forward read and another particular sequence in the reverse read. Could you advise me, please?
For example, we start with two FASTQ files, one with forward reads and one with reverse reads.
Forward read sequences:
NNNNNNAGTCCGTATATGCCGAGNNNNNNNN
NNNNNNAGAGCGTATATGCCGAGNNNNNNNN
NNNNNNAGTCCGTATATGGGGAGNNNNNNNN
Reverse read sequences:
NNNNNNNNNGAGATGGACTACTCACNNNNNN
NNNNNNNNNGAGATGGATTACTCACNNNNNN
NNNNNNNNNGAGAAGGACTACTCACNNNNNN
So I'd like to keep only this pair for further analysis:
NNNNNNAGTCCGTATATGCCGAGNNNNNNNN
NNNNNNNNNGAGATGGACTACTCACNNNNNN
This pair qualifies because the forward read contains the tag AGTCCGTATATGCCGAG and the reverse read contains the tag GAGATGGACTACTCAC. For now I only need 100% (exact) matches.
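The selection rule described above can be written down as a small check. This is only a minimal sketch in Python, assuming the tags may sit anywhere in the read and must match exactly; the function name `pair_matches` is made up for illustration, and the sequences are the ones from the example.

```python
# Tags taken from the example in the post.
FWD_TAG = "AGTCCGTATATGCCGAG"
REV_TAG = "GAGATGGACTACTCAC"


def pair_matches(fwd_seq, rev_seq, fwd_tag=FWD_TAG, rev_tag=REV_TAG):
    """Return True only if both tags occur exactly (100% match) in their reads."""
    return fwd_tag in fwd_seq and rev_tag in rev_seq


forward_reads = [
    "NNNNNNAGTCCGTATATGCCGAGNNNNNNNN",
    "NNNNNNAGAGCGTATATGCCGAGNNNNNNNN",
    "NNNNNNAGTCCGTATATGGGGAGNNNNNNNN",
]
reverse_reads = [
    "NNNNNNNNNGAGATGGACTACTCACNNNNNN",
    "NNNNNNNNNGAGATGGATTACTCACNNNNNN",
    "NNNNNNNNNGAGAAGGACTACTCACNNNNNN",
]

# Reads are paired by position: the i-th forward read belongs to the i-th reverse read.
kept = [(f, r) for f, r in zip(forward_reads, reverse_reads) if pair_matches(f, r)]
print(kept)  # only the first pair passes
```

With the example data, only the first pair is kept, because the second and third pairs each carry a mismatch inside at least one tag.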
Hi Denis,
It is always useful to provide examples of the input and the desired output to clarify exactly what you are trying to achieve. Are you looking to select a subset of reads that contain a certain string? Have you looked at related posts on this forum?
Extract specific reads from FASTQ files based on subsequence
Count and location of strings in fastq file reads
Hi Sej,
I've updated my post to address your points. Thanks!
You can use the prinseq tool with the -custom_params option and the specific string that you are looking for.
Hello Denis,
Thanks for adding an example. However, your example doesn't look like your real input, as it is neither FASTA nor FASTQ. Furthermore, what does the task you are trying to solve have to do with demultiplexing?
From your description, I understand that you're trying to remove duplicate sequences. This can be done, for example, with seqkit:
Please use the formatting bar (especially the code option) to present your post better. I've done it for you this time.
fin swimmer
Hi fin swimmer!
Many thanks for your reply and for editing the post. No, that is not what I'm after. I'm working with Illumina amplicon data, so I'd like to extract the read pairs that contain the PCR primers and discard all other read pairs.
Are these Illumina barcodes or internal barcodes/sequences?
They are custom internal PCR primers.
Are the primers more or less always in the same place? I'm wondering whether you could use something like umi_tools or a variant of our demultiplexing script for RELACS data to handle this.
Yes, sure. The primers are at the 5' end of forward and reverse reads.
Then the options I mentioned should work (possibly with some tweaks) too.
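If none of the tools fits, the filtering can also be sketched directly in Python. The following is a rough sketch and not the umi_tools or RELACS approach. It assumes both FASTQ files list the mates in the same order, that records are plain 4-line FASTQ, that each primer appears in its read exactly as written (no reverse-complementing), and that it lies within the first `window` bases of the read. The function names and the default `window=40` are illustrative choices, not taken from any tool.

```python
import gzip
from itertools import islice


def open_text(path, mode="rt"):
    """Open a plain or gzip-compressed file in text mode."""
    return gzip.open(path, mode) if str(path).endswith(".gz") else open(path, mode)


def read_fastq(handle):
    """Yield 4-line FASTQ records as lists of lines without trailing newlines."""
    while True:
        record = [line.rstrip("\n") for line in islice(handle, 4)]
        if not record or not record[0]:  # end of file or trailing blank line
            return
        if len(record) < 4:
            raise ValueError("truncated FASTQ record")
        yield record


def has_primer(seq, primer, window):
    """Exact (100%) match of the primer within the first `window` bases."""
    return primer in seq[:window]


def filter_pairs(fwd_in, rev_in, fwd_out, rev_out, fwd_primer, rev_primer, window=40):
    """Write only pairs where both reads carry their primer; return (kept, total)."""
    kept = total = 0
    with open_text(fwd_in) as f1, open_text(rev_in) as f2, \
            open_text(fwd_out, "wt") as o1, open_text(rev_out, "wt") as o2:
        # zip stops at the shorter file, so files of unequal length are not detected here.
        for rec1, rec2 in zip(read_fastq(f1), read_fastq(f2)):
            total += 1
            if has_primer(rec1[1], fwd_primer, window) and has_primer(rec2[1], rev_primer, window):
                o1.write("\n".join(rec1) + "\n")
                o2.write("\n".join(rec2) + "\n")
                kept += 1
    return kept, total
```

A call could look like `filter_pairs("R1.fastq.gz", "R2.fastq.gz", "R1.kept.fastq.gz", "R2.kept.fastq.gz", "AGTCCGTATATGCCGAG", "GAGATGGACTACTCAC")`, where the file names are placeholders. Restricting the search to a window at the 5' end avoids keeping pairs in which the primer sequence happens to occur further inside the amplicon.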