Entering edit mode
5.7 years ago
genya35
▴
50
Hello,
I have a two adapter trimmed Illumina fastq files (read1 and read2). I would like to identify the primer sequence. I have some idea what the primer sequences are however, they appears to be degenerate. So far I've found over 200 variations for the primer sequence in reads and looking for a tool that would group them and produce degenerate primer sequence.
Next, I plan to separate the reads into separate files based on the primer sequence using bbtools.
Please suggest an approach/tool to use.
Can you give some details on what you mean by "primer sequence". What kind of assay are you using?
it's an amplicon based assay. By primer sequence I mean "NNCTGGGTCCGCCAAGCT". Thanks
I have a bunch of sequences that I'm trying to figure out at which base they are ambiguous. I wonder if there is a tool that can help with that.
@Brian had provided this solution in past. Change the number 20 depending on size of primer you expect.
Following will trim all but first 20 bases (I assume your primer is at the beginning of read)?
This will generate a file containing the counts of all 20-mers that occurred at least 10 times, in a 2-column format that is easy to sort (in excel if you wish)
If primers are 20 bp long they should be easily identifiable.
If this works let me know and I will move it to an answer.
This was really helpful, however, I ended up with 1560 unique sequences. Is there a tool that would group them based on the similarity?
Use unix
sort
to sort the first column containing the sequence. That said you did not get a few clear contenders (with lots of counts) as compared to rest? Are you explaining what you need accurately?@genomax There is too much variation in the first 3 bases of each read. Is there a way to tell reformat.sh to ignore the first three bases and look at the base 4 to 23? thanks
You can add
ftl=2
to remove first three bases and extendftr=22
to still get 20 bases total in firstreformat.sh
command above. Then follow the same process.Just in case both those options don't work right together you could do something like (two consecutive rounds of trimming)