Hi all, I have some amplicon sequencing data from a company that has been endlessly frustrating to work with. It is a long story, but they basically dumped my sequencing data on a server without any information about the adapter sequences, etc., and then skedaddled.
So I looked at the sequences and found what I thought were the adapter sequences by eye, then checked that those adapter sequences were indeed in (almost) every read, right after the reverse complement of the primer at the 3' end of the purported amplified sequence. I did this separately for R1 and R2 because, after the first part of the adapter sequences (which is identical in both and also matches Illumina adapter sequences I found online), the adapter sequences appear to diverge.
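For anyone wanting to repeat that kind of check, here is a minimal Python sketch of the idea, not the exact procedure used above. It counts the fraction of reads in which a candidate adapter sits directly after the reverse complement of the primer. The primer and the reads are made-up placeholders, and the adapter string is only an example candidate; substitute your own. It also assumes exact matches, so real data with sequencing errors would score lower than a tolerant search.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    table = str.maketrans("ACGTNacgtn", "TGCANtgcan")
    return seq.translate(table)[::-1]


def adapter_hit_fraction(reads, primer, adapter):
    """Fraction of reads where `adapter` starts right after revcomp(primer).

    Exact string matching only (an assumption of this sketch).
    """
    primer_rc = reverse_complement(primer)
    hits = 0
    total = 0
    for read in reads:
        total += 1
        pos = read.find(primer_rc)
        # The adapter must begin at the first base after the primer revcomp.
        if pos != -1 and read.startswith(adapter, pos + len(primer_rc)):
            hits += 1
    return hits / total if total else 0.0


# Hypothetical inputs for illustration only.
primer = "ACGTAC"                 # placeholder primer
adapter = "AGATCGGAAGAGC"         # example candidate adapter prefix
reads = [
    "TTTTGGGG" + "GTACGT" + adapter + "AAAA",  # primer revcomp + adapter
    "CCCC" + "GTACGT" + adapter,               # primer revcomp + adapter
    "CCCC" + "GTACGT" + "TTTT",                # primer revcomp, no adapter
    "AAAACCCC",                                # neither
]
fraction = adapter_hit_fraction(reads, primer, adapter)
print(f"{fraction:.0%} of reads have the adapter after the primer revcomp")
```

Running it once per candidate, separately on the R1 reads and the R2 reads, shows whether a guessed adapter really is present in nearly every read.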
Because many of my amplified sequences are so short, the reads contain the adapter sequences followed by poly-A tails and a ton of noise at the 3' end, so I didn't want to merge my paired-end reads until I'd trimmed them.
I used cutadapt to trim my sequences, and no matter what I do - use only the identical beginning part of the adapter for both, use the full 'adapters' (different for R1 and R2; I put this in quotes because it's just me eyeballing what I think they are), use 50% or 75% of the full 'adapters', even use the reverse complements of the primers - I get uneven results for R1 and R2. That is, the distribution of sequence lengths after trimming differs between R1 and R2. I would expect some small differences, but in all of my FASTQ files the R2 sequences are longer after trimming.
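To put numbers on the R1/R2 difference, something like the following Python sketch could tabulate read lengths after trimming. It assumes plain four-line FASTQ records (no wrapped sequence lines) and uses tiny in-memory examples in place of real files; the record contents are invented for illustration.

```python
import io
from collections import Counter
from statistics import median


def read_lengths(handle):
    """Collect sequence lengths from a four-line-per-record FASTQ handle."""
    lengths = []
    for i, line in enumerate(handle):
        if i % 4 == 1:  # the second line of each record is the sequence
            lengths.append(len(line.rstrip("\n")))
    return lengths


def summarize(lengths):
    """Return read count, median length, and a length histogram."""
    return {
        "n": len(lengths),
        "median": median(lengths),
        "hist": Counter(lengths),
    }


# Made-up trimmed reads standing in for real R1/R2 files.
r1_text = "@a\nACGT\n+\nIIII\n@b\nACGT\n+\nIIII\n@c\nACGTAC\n+\nIIIIII\n"
r2_text = "@a\nACGTAC\n+\nIIIIII\n@b\nACGTAC\n+\nIIIIII\n@c\nACGTACGT\n+\nIIIIIIII\n"

r1_summary = summarize(read_lengths(io.StringIO(r1_text)))
r2_summary = summarize(read_lengths(io.StringIO(r2_text)))

for name, summary in (("R1", r1_summary), ("R2", r2_summary)):
    print(name, "n =", summary["n"], "median =", summary["median"])
    for length, count in sorted(summary["hist"].items()):
        print(f"  {length}\t{count}")
```

For real data, the `io.StringIO` objects can be replaced with open file handles, for example `gzip.open(path, "rt")` for compressed files.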
Has anyone dealt with this before, or can anyone suggest a better strategy? I'm thinking I might just lop off the ends of my sequences to get rid of some of the noise, merge the paired ends, and then try trimming...
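As a rough illustration of the "lop off the ends" idea, here is a small Python sketch that crops a FASTQ record to a fixed maximum length. The record and the cutoff of 5 bases are hypothetical; the point is only that the sequence and quality strings have to be cut at the same position so they stay the same length.

```python
def crop_record(record, max_len):
    """Crop a FASTQ record (header, seq, plus, qual) to at most max_len bases.

    Sequence and quality are truncated together so they stay in sync.
    """
    header, seq, plus, qual = record
    return (header, seq[:max_len], plus, qual[:max_len])


# Hypothetical record and cutoff, for illustration only.
record = ("@read1", "ACGTACGT", "+", "IIIIIIII")
cropped = crop_record(record, 5)
print("\n".join(cropped))
```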
Thanks - that worked out really well!
Stay with BBTools and use bbmap.sh to align your data. You will be pleased with the results :)