Question

cutadapt and paired-end reads when you don't know your adapter sequences...

0

7.6 years ago

beegrackle ▴ 90

Hi all, I have some amplicon sequencing data from a company that has been endlessly frustrating to work with. It is a long story, but they basically dumped my sequencing data on a server without any information about what the adapter sequences were, etc., and then skedaddled.

So I went and looked at the sequences and found what I thought were the adapter sequences by eye, then checked that those candidate adapter sequences were indeed in (almost) every read, right after the primer reverse complement at the 3' end of the purported amplified sequence. I did this separately for R1 and R2 because, after the first part of the adapter sequences (which is identical in both and also matches Illumina adapter sequences I found online), the adapter sequences appear to diverge.
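The check described above can be sketched in a few lines of Python. This is only a rough illustration, not a replacement for a proper adapter-detection tool: the function names are made up for this example, the reads are assumed to be already loaded as plain strings, and matching is exact (no mismatches allowed), so real data with sequencing errors will give a somewhat lower fraction.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T, N only)."""
    complement = str.maketrans("ACGTN", "TGCAN")
    return seq.upper().translate(complement)[::-1]


def adapter_after_primer_fraction(reads, opposite_primer, adapter_prefix):
    """Fraction of reads in which adapter_prefix directly follows the
    reverse complement of the primer from the opposite end of the amplicon.

    Exact string matching only; this is an assumption made to keep the
    sketch short.
    """
    reads = list(reads)
    if not reads:
        return 0.0
    junction = reverse_complement(opposite_primer) + adapter_prefix.upper()
    hits = sum(1 for read in reads if junction in read.upper())
    return hits / len(reads)


# Toy example: a made-up 4-bp "primer" and the shared start of the
# Illumina TruSeq adapters as the candidate adapter prefix.
toy_reads = [
    "TTTTGCTTAGATCGGAAGAGCAAAA",  # primer revcomp followed by the adapter
    "TTTTGCTTCCCC",               # primer revcomp, but no adapter
]
print(adapter_after_primer_fraction(toy_reads, "AAGC", "AGATCGGAAGAGC"))
```

Running it separately on R1 and R2, each with its own candidate adapter, gives a quick number to compare against the "almost every read" impression.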

Because many of my actual amplicons are shorter than the read length, the reads run through into the adapter sequences, followed by poly-A tails and a ton of noise at the 3' end, so I didn't want to merge my paired-end reads until I'd trimmed them.

I used cutadapt to trim my sequences, and no matter what I do, I get uneven results for R1 and R2. I have tried using only the identical beginning part of the adapter for both; the full 'adapters' (different for R1 and R2; I put this in quotation marks because it's just me eyeballing what I think they are); 50% or 75% of the full 'adapters'; and even the reverse complements of the primers. In every case I get a different distribution of sequence lengths for R1 and R2 after trimming. I would expect some small differences, but in all of my fastq files the R2 sequences are longer after trimming.
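One way to put numbers on the R1/R2 imbalance is to compare the trimmed lengths pair by pair. The following is a minimal sketch, assuming the trimmed files are standard four-line FASTQ (optionally gzipped) and that R1 and R2 are still in the same order; the function names and any file names are hypothetical.

```python
import gzip
from collections import Counter
from statistics import median


def read_fastq_sequences(path):
    """Yield the sequence line of each record in a four-line FASTQ file."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        for index, line in enumerate(handle):
            if index % 4 == 1:  # second line of every record is the sequence
                yield line.strip()


def length_summary(seqs):
    """Count, median length, and length histogram for a set of reads."""
    lengths = [len(s) for s in seqs]
    return {
        "n": len(lengths),
        "median": median(lengths) if lengths else None,
        "histogram": Counter(lengths),
    }


def paired_length_differences(r1_seqs, r2_seqs):
    """Histogram of len(R2) - len(R1) per pair; assumes matching read order."""
    return Counter(len(r2) - len(r1) for r1, r2 in zip(r1_seqs, r2_seqs))


# Toy data standing in for trimmed reads.
r1 = ["ACGT", "ACGTAC"]
r2 = ["ACGTAA", "ACGTAC"]
print(length_summary(r1)["median"], length_summary(r2)["median"])
print(paired_length_differences(r1, r2))
```

If the adapters were removed correctly from a read-through pair, the two mates should end up at about the same length, so a difference histogram that sits mostly above zero would point at under-trimming of R2 rather than a real biological difference.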

Has anyone dealt with this before, or can anyone suggest a better strategy? I'm thinking I might just lop off the ends of my sequences to get rid of some of the noise, merge the paired ends, and then try trimming...

next-gen sequencing • 4.8k views

ADD COMMENT • link updated 7.6 years ago by BioinfGuru ★ 2.1k • written 7.6 years ago by beegrackle ▴ 90

1

7.6 years ago

BioinfGuru ★ 2.1k

Don't FastQC + MultiQC both report overrepresented sequences? That should tell you the adapter sequence.

0

Unfortunately, no. Instead, all my overrepresented sequences are my forward primer (or reverse primer) followed by a 10-bp sequence. Which is rather dodgy, I admit.

ADD REPLY • link 7.6 years ago by beegrackle ▴ 90

0

I wouldn't be surprised if the company didn't provide a QC report with the data they gave you... money back... not good enough at all.

ADD REPLY • link 7.6 years ago by BioinfGuru ★ 2.1k

score 2 · Accepted Answer · 2017-04-06

2

7.6 years ago

h.mon 35k

You can use bbduk.sh (from BBTools) with the flags tbo (trim adapters based on where paired reads overlap) and tpe (when kmer right-trimming, trim both reads to the minimum length of either).
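For anyone wanting to script this, here is a small sketch that assembles such a bbduk.sh call from Python. The file names are placeholders, the adapter reference path is an assumption (BBTools ships an adapters.fa in its resources directory, but the location depends on the install), and the k, mink and hdist values are the ones commonly quoted for adapter trimming rather than anything tuned for this data set.

```python
import shutil
import subprocess


def build_bbduk_command(in1, in2, out1, out2, adapters="adapters.fa"):
    """Assemble a bbduk.sh argument list for paired-end adapter trimming."""
    return [
        "bbduk.sh",
        f"in1={in1}", f"in2={in2}",
        f"out1={out1}", f"out2={out2}",
        f"ref={adapters}",  # assumed path to an adapter FASTA
        "ktrim=r",          # kmer-based trimming towards the 3' end
        "k=23", "mink=11", "hdist=1",
        "tbo",              # also trim adapters found via read-pair overlap
        "tpe",              # trim both mates to the same length
    ]


def run_bbduk(command):
    """Run the command, but only if bbduk.sh is actually on the PATH."""
    if shutil.which(command[0]) is None:
        raise FileNotFoundError(f"{command[0]} not found on PATH")
    return subprocess.run(command, check=True)


cmd = build_bbduk_command(
    "sample_R1.fastq.gz", "sample_R2.fastq.gz",
    "trimmed_R1.fastq.gz", "trimmed_R2.fastq.gz",
)
print(" ".join(cmd))
```

Because tbo relies on the overlap between the two mates rather than on a known adapter sequence, it fits the situation in the question, where the adapters were only guessed by eye; tpe then keeps R1 and R2 at matching lengths after kmer trimming.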

0

Thanks - that worked out really well!

ADD REPLY • link 7.6 years ago by beegrackle ▴ 90

0

Stay with BBTools and use bbmap.sh to align your data. You will be pleased with the results :)

ADD REPLY • link 7.6 years ago by GenoMax 147k