Hi everyone, I'm having trouble figuring out which adapter sequence I should give cutadapt or Trimmomatic to trim adapters from my FASTQ files.
I have a set of FASTQ files, each containing 51 bp reads that begin with an N followed by the read sequence. I also have the index sequence for each FASTQ file (after demultiplexing) and two sequences for the primers used. For instance, these are the first few records from one of my FASTQ files:
@700470R:449:HVHH7BCXX:2:1101:1406:1948 1:N:0:GTGAAA
NGCAGCATTGTACAGGGCTATGAAGATCGGAAGAGCACACGTCTGAACTCC
+
#<DDDEHIIIIIIIIIIIIIIIHIIIIIIIIIIIIIIIIIIIIIIHIIIII
@700470R:449:HVHH7BCXX:2:1101:1814:1992 1:N:0:GTGAAA
NCCGGGTGCCGTAGGCTTAGATCGGAAGAGCACACGTCTGAACTCCAGTCA
+
#<DDDIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIH<DFGHIIIII
@700470R:449:HVHH7BCXX:2:1101:2184:1885 1:N:0:GTGAAA
NGGGGAGGTGGAGCAGATCGGAAGAGCACACGTCTGAACTCCAGTCACGTG
+
#<DDD<<CGHIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
The index sequence is, as determined in the header, GTGAAA. I also have information about the SR primer, which is:
5'-AATGATACGGCGACCACCGAGATCTACACGTTCAGAGTTCTACAGTGGGA-3'
and the Index primer, which is:
5'-CAAGCAGAAGACGGCATACGAGATNNNNNNGTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT-3'
Substituting the NNNNNN with the index sequence provided in the header of the corresponding fastq, I would obtain the barcoded adapter used for sequencing, if I'm not wrong.
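One way to check this against the data is to reverse-complement the constant 3' part of the Index primer (the bases after NNNNNN) and look for it in the example reads. The sketch below does that in plain Python, using only the sequences quoted above; the helper names and the `min_overlap` value are illustrative assumptions, and the search is exact-match only (no mismatches), so it is not how cutadapt or Trimmomatic match adapters. For these three reads, the reverse complement shows up right after the insert, and the third read continues with GTG, which agrees with the start of the header index GTGAAA. Three bases are only suggestive, but this hints that in read orientation the index appears as written in the header.

```python
# Constant 3' part of the Index primer (the bases after NNNNNN), as quoted in the post.
INDEX_PRIMER_3P = "GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT"

# The three example reads quoted in the post.
READS = [
    "NGCAGCATTGTACAGGGCTATGAAGATCGGAAGAGCACACGTCTGAACTCC",
    "NCCGGGTGCCGTAGGCTTAGATCGGAAGAGCACACGTCTGAACTCCAGTCA",
    "NGGGGAGGTGGAGCAGATCGGAAGAGCACACGTCTGAACTCCAGTCACGTG",
]

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Complement every base, then reverse the string."""
    return seq.translate(_COMPLEMENT)[::-1]


def adapter_start(read, adapter, min_overlap=10):
    """Return the 0-based position where `adapter` starts in `read`.

    The adapter may run off the 3' end of the read, so only the part that
    fits is compared. Exact matching only; returns -1 if nothing is found.
    `min_overlap` is an arbitrary illustrative threshold.
    """
    for start in range(len(read)):
        overlap = min(len(adapter), len(read) - start)
        if overlap < min_overlap:
            break
        if read[start:start + overlap] == adapter[:overlap]:
            return start
    return -1


# Adapter as it would appear in the reads.
ADAPTER_IN_READ = reverse_complement(INDEX_PRIMER_3P)

# Position of the adapter in each read = length of the insert (leading N included).
INSERT_LENGTHS = [adapter_start(read, ADAPTER_IN_READ) for read in READS]

for read, pos in zip(READS, INSERT_LENGTHS):
    print(pos, read[:pos], read[pos:])
```

The bases left of the reported position are what would remain after 3' adapter trimming; anything right of the full adapter (as in the third read) belongs to the index and the rest of the primer.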
So here is where I start getting lost. After running FastQC, the overrepresented sequences list contained a bunch of sequences matching the Illumina Multiplexing PCR primer, as if several different adapters were present within the same FASTQ file.
So, here is my question:
What sequence should I give cutadapt to trim in this case? Should I include more than one? In my opinion I should use the Index primer sequence, substituting the NNNNNN with the index sequence (barcode) for each FASTQ file, but I'm not sure whether this is correct or whether I should include more sequences.
I'm also not sure which parameters to use when running cutadapt. Do I need both -a and -g so that the adapter is trimmed from both ends, or would -a alone work? What about the error tolerance (-e) for adapter matching (I don't know what the default value is if I don't specify it)? Should I use wildcards (NNNNNN) to get one universal adapter, or build a list with the specific barcode for each sample's FASTQ file? Would quality trimming be useful, even though the average base quality in each read is very high (over 30)? And should I use the --trim-n option to trim possible flanking Ns from my reads?
As you can all see, I'm quite lost...
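As a rough starting point for these questions, here is a sketch that only assembles a cutadapt command line in Python (it does not run it). It assumes the adapter is given in read orientation, i.e. the reverse complement of the constant 3' part of the Index primer, without the 6-base index; since `-a` (3' adapter) also accepts an adapter that is only partially present at the end of a read, a per-sample barcode may not be needed, but that is an assumption to verify on your data. `-g` is for 5' adapters, which the example reads do not appear to show. The `-e` value of 0.1 is cutadapt's documented default maximum error rate; the minimum length of 15 and the file names are hypothetical placeholders.

```python
import shlex

# Assumption: adapter in read orientation (reverse complement of the constant
# 3' part of the Index primer quoted in the post), without the 6-base index.
ADAPTER = "AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC"


def cutadapt_command(in_fastq, out_fastq, adapter=ADAPTER,
                     error_rate=0.1, min_length=15, quality_cutoff=None):
    """Build (but do not run) a single-end cutadapt command as an argv list."""
    cmd = [
        "cutadapt",
        "-a", adapter,          # 3' adapter; partial matches at the read end are allowed
        "-e", str(error_rate),  # maximum error rate (0.1 is the documented default)
        "--trim-n",             # trim N bases from the ends of reads
        "-m", str(min_length),  # discard reads shorter than this after trimming (placeholder value)
    ]
    if quality_cutoff is not None:
        cmd += ["-q", str(quality_cutoff)]  # optional 3' quality trimming
    cmd += ["-o", out_fastq, in_fastq]
    return cmd


# Hypothetical file names, for illustration only.
print(shlex.join(cutadapt_command("sample.fastq", "sample.trimmed.fastq")))
```

The returned list can be passed to `subprocess.run` once cutadapt is installed; checking the trimming report it prints is a reasonable way to see whether the chosen adapter actually matches most reads.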
I'd just use Trimmomatic because it has built-in multithreading support, and just chuck out all the highly overrepresented sequences; if the reads are too short as a result, chuck them out too. Then do some basic quality trimming and filtering as well. You can just read the manual and play around; it is a well-written tool.
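One possible reading of this advice, sketched as a Python helper that only builds a single-end Trimmomatic command (it does not run it): clip the adapter with ILLUMINACLIP, quality-trim with SLIDINGWINDOW, and drop short reads with MINLEN. The adapter sequence is an assumption (the reverse complement of the constant 3' part of the Index primer from the question), the ILLUMINACLIP settings 2:30:10 follow the examples in the Trimmomatic manual, and the jar path, file names, thread count, and thresholds are hypothetical placeholders to tune.

```python
import shlex

# Assumption: adapter in read orientation, taken from the question's Index primer.
ADAPTER = "AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC"


def adapter_fasta(name="index_adapter", seq=ADAPTER):
    """Return FASTA text for the adapter file that ILLUMINACLIP expects."""
    return f">{name}\n{seq}\n"


def trimmomatic_command(jar, in_fastq, out_fastq, adapters_fa,
                        threads=4, min_length=15):
    """Build (but do not run) a single-end Trimmomatic command as an argv list."""
    return [
        "java", "-jar", jar, "SE",
        "-threads", str(threads),               # built-in multithreading
        in_fastq, out_fastq,
        f"ILLUMINACLIP:{adapters_fa}:2:30:10",  # adapter clipping
        "SLIDINGWINDOW:4:20",                   # basic quality trimming
        f"MINLEN:{min_length}",                 # drop reads that end up too short
    ]


# Hypothetical paths, for illustration only.
print(adapter_fasta(), end="")
print(shlex.join(trimmomatic_command(
    "trimmomatic.jar", "sample.fastq", "sample.trimmed.fastq", "adapters.fa")))
```

The steps run in the order given on the command line, so MINLEN is placed last to filter on the final read length.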