Hello everyone,
Bioinformatics novice here, looking for help. I'm using Galaxy to try to remove adaptors from sequencing reads, but it's tricky and I would like some advice on the approach. Here's the experiment.
50 bp PE reads. The 5' end of read 1 contains adaptor then 3x G. The 5' end of read 2 contains 15x T derived from polyadenylation during the library prep. I would like to trim the G's off the 5' end of read 1 and the T's from the 5' end of read 2. In addition, for any inserts shorter than 50 bp, the 3' end of read 2 will contain 3x C (complement of the 3x G) and the 3' end of read 1 will have 15x A (the complement of the T's). Is there an additional trick to remove these instances too?
Thanks for any help!
It always helps to post data instead of explaining the problem. Post some example reads and the expected output. It can be done via the CLI. Similar tools (e.g. cutadapt, bbduk) are available in Galaxy.
So, reads will take the following format:
I have 50 bp paired-end reads and want to remove the ADAPTOR - GGG from the start and the AAAAAAAAAAAAAAA - ADAPTOR from the 3' end to leave the bit in the middle. Unfortunately, I'm only able to use Galaxy (have very limited programming knowledge).
Is this paired-end data? And do you have 2 FASTQ files (R1 and R2)? In that case you can upload both files to Galaxy and use cutadapt or fastp in paired-end mode. I think people here need that info to be able to give a good answer.
Thank you. Yes, I have two files per sample (read 1 and read 2).
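For anyone who wants to see the trimming logic spelled out, here is a minimal Python sketch, assuming the read layout described above (read 1: adaptor, GGG, insert, poly-A; read 2: poly-T, insert, CCC, reverse-complemented adaptor). The function names, the adaptor argument and the min_run threshold are illustrative choices of mine, not part of any tool, and the real adaptor sequence was never posted, so it has to be supplied. It only handles one read at a time and ignores sequencing errors, so in practice cutadapt or bbduk in Galaxy is the better route.

```python
import re


def revcomp(seq):
    """Reverse complement of a DNA string (A, C, G, T, N only)."""
    return seq.translate(str.maketrans("ACGTN", "TGCAN"))[::-1]


def trim_read1(seq, qual, adaptor="", min_run=15):
    """Trim read 1: 5' adaptor + GGG, then the 3' poly-A and everything after it.

    `adaptor` is a placeholder for the real 5' adaptor sequence (assumption).
    """
    prefix = adaptor + "GGG"
    start = len(prefix) if seq.startswith(prefix) else 0
    # Cut at the first run of at least `min_run` A's, if there is one.
    match = re.search("A{%d,}" % min_run, seq[start:])
    end = start + match.start() if match else len(seq)
    return seq[start:end], qual[start:end]


def trim_read2(seq, qual, adaptor="", min_run=15):
    """Trim read 2: 5' poly-T, then 3' CCC + reverse-complemented adaptor."""
    match = re.match("T{%d,}" % min_run, seq)
    start = match.end() if match else 0
    end = len(seq)
    if adaptor:
        # The 3' CCC is only trusted when it is followed by the adaptor.
        pos = seq.find("CCC" + revcomp(adaptor), start)
        if pos != -1:
            end = pos
    return seq[start:end], qual[start:end]


if __name__ == "__main__":
    adaptor = "ACGTAC"  # made-up adaptor, for illustration only
    insert = "CTGACTGC"  # made-up 8 bp insert
    r1 = adaptor + "GGG" + insert + "A" * 15
    r2 = "T" * 15 + revcomp(insert) + "CCC" + revcomp(adaptor)
    print(trim_read1(r1, "I" * len(r1), adaptor))
    print(trim_read2(r2, "I" * len(r2), adaptor))
```

A read that matches none of the patterns comes back unchanged. Note that an insert that itself ends in A, or begins with T on read 2, would lose those bases to the homopolymer trimming; the dedicated tools have the same ambiguity.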
Is there real sequence in your read where you have added the word ADAPTOR above?
You can use bbduk.sh from the BBMap suite in two-pass mode like this on the command line.
With the command
This will produce