Question

Identifying primer sequences from raw FASTQ files

0

5 months ago

Chandini • 0

Hello everyone.

I am trying to automate a 16S metagenome analysis workflow so that the user needs to provide nothing but the FASTQ files. The analysis requires primer sequences for cutadapt, QIIME 2 classification, and the downstream analysis, and I would like the workflow to determine them itself so the user does not have to supply them as input. The experiment is 16S rRNA amplicon sequencing on Illumina.

Does anyone know of any tools that can identify a forward and reverse primer sequence (consensus building with IUPAC codes) from FASTQ files, when even the primer sequence lengths are not known?

primers • 858 views

ADD COMMENT • link 4 months ago by Chandini • 0

2

Would running a tool such as FastQC work? It should give you an indication of adapters, overrepresented sequences, ...

Alternatively, many of the read-trimming tools have built-in "adapter" recognition functionality; perhaps they can also fish out the primer sequences?

ADD REPLY • link 5 months ago by lieven.sterck 15k

0

You can start with a list of known primer pairs (e.g. from Wikipedia and PMC8544895) and run cutadapt on your FASTQ files with each combination. Store the cutadapt output in region-specific folders (like v3v4, v6-v8) and write the reads that do not have the current pair to a temporary folder (using the --untrimmed-output and --untrimmed-paired-output options), so that the untrimmed pairs can be used as input for the next known primer pair.

If a substantial number of reads have none of the known primers, you can investigate them manually and then add your findings to the list of primers.
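
A minimal sketch of this loop in Python, assuming paired-end reads and cutadapt available on the PATH. The two primer pairs are commonly cited V3-V4 and V4 primers used here only as placeholders (please verify them against your own source), and the function name `build_cutadapt_commands` and the folder layout are hypothetical. The function only builds the command lines; the untrimmed output of each round becomes the input of the next:

```python
from pathlib import Path

# Placeholder list of known primer pairs (forward, reverse); extend as needed.
PRIMER_PAIRS = {
    "v3v4": ("CCTACGGGNGGCWGCAG", "GACTACHVGGGTATCTAATCC"),
    "v4": ("GTGYCAGCMGCCGCGGTAA", "GGACTACNVGGGTWTCTAAT"),
}


def build_cutadapt_commands(r1, r2, outdir, primer_pairs=PRIMER_PAIRS):
    """Return one cutadapt command per primer pair, chained via untrimmed reads."""
    outdir = Path(outdir)
    tmp_dir = outdir / "tmp"  # holds read pairs without the current primers
    in1, in2 = str(r1), str(r2)
    commands = []
    for region, (fwd, rev) in primer_pairs.items():
        region_dir = outdir / region
        un1 = str(tmp_dir / f"untrimmed_after_{region}_R1.fastq.gz")
        un2 = str(tmp_dir / f"untrimmed_after_{region}_R2.fastq.gz")
        commands.append([
            "cutadapt",
            "-g", fwd,  # 5' primer expected on read 1
            "-G", rev,  # 5' primer expected on read 2
            "-o", str(region_dir / "trimmed_R1.fastq.gz"),
            "-p", str(region_dir / "trimmed_R2.fastq.gz"),
            "--untrimmed-output", un1,
            "--untrimmed-paired-output", un2,
            in1, in2,
        ])
        # Pairs without this primer pair are the input for the next pair.
        in1, in2 = un1, un2
    return commands


for cmd in build_cutadapt_commands("sample_R1.fastq.gz", "sample_R2.fastq.gz", "out"):
    print(" ".join(cmd))
```

After creating the output folders, each command could be run in order with `subprocess.run(cmd, check=True)`. Whatever remains in the last pair of untrimmed files is the set of reads to investigate manually.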

ADD REPLY • link 5 months ago by michael.ante ★ 4.0k

0

Thank you Michael! Yes, I think I should also try looking for known primers in the FASTQ files. I tried to write a Python script to generate a consensus, but it did not give me the exact primer sequences expected, due to potential variations in the reads (I had also not removed low-quality bases, etc.). I do not want the consensus, just the exact primers. I'll try your method.

ADD REPLY • link 5 months ago by Chandini • 0

0

The primers don't need to be unambiguous sequences. To improve binding, they are often given with ambiguous DNA letters (IUPAC codes), which means the primer is in fact a mixture of similar sequences that differ at these positions.

For instance, a 'Y' represents 'C' or 'T'; ACYCG is written as one sequence, but the actual primer is composed of ACCCG and ACTCG molecules.
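
To make the ACYCG example concrete, here is a small Python sketch that expands a degenerate primer into the plain sequences it stands for, and checks whether the start of a read is compatible with it. The table is the standard IUPAC nucleotide code; the function names `expand_primer` and `matches_primer` are just illustrative:

```python
from itertools import product

# Standard IUPAC nucleotide codes mapped to the bases they allow.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def expand_primer(primer):
    """List every unambiguous sequence encoded by a degenerate primer."""
    choices = (IUPAC[code] for code in primer.upper())
    return ["".join(bases) for bases in product(*choices)]


def matches_primer(read, primer):
    """True if the read starts with a sequence allowed by the primer (no mismatches)."""
    if len(read) < len(primer):
        return False
    return all(base in IUPAC[code]
               for base, code in zip(read.upper(), primer.upper()))


print(expand_primer("ACYCG"))               # ['ACCCG', 'ACTCG']
print(matches_primer("ACTCGGGTA", "ACYCG"))  # True
```

This exact-match check tolerates no sequencing errors, so for real reads a tool with an error tolerance (such as cutadapt) is probably the safer choice.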

You can use sequence logo plots to better visualise the sequences and their composition.
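
As a rough sketch of the numbers behind such a logo, the per-position base frequencies at the start of the reads can be tabulated in plain Python. The function name `position_frequencies` and the toy input are hypothetical; a plotting library would then take this table as input:

```python
from collections import Counter


def position_frequencies(seqs, length):
    """Fraction of A/C/G/T at each of the first `length` positions."""
    table = []
    for i in range(length):
        # Count only called bases; reads that are too short or have N are skipped.
        counts = Counter(s[i] for s in seqs if len(s) > i and s[i] in "ACGT")
        total = sum(counts.values())
        table.append({b: (counts[b] / total if total else 0.0) for b in "ACGT"})
    return table


# Toy input: read starts carrying either variant of the primer ACYCG.
reads = ["ACCCG", "ACTCG", "ACCCG", "ACTCG"]
for pos, freqs in enumerate(position_frequencies(reads, 5), start=1):
    print(pos, freqs)
```

A position where two bases each appear in a large share of the reads (position 3 in the toy input) hints at a degenerate base in the primer rather than a sequencing error.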

ADD REPLY • link 5 months ago by michael.ante ★ 4.0k

0

Thanks a lot for your help. I limited myself to common primers used in amplicon sequencing and went about it the way you had suggested. I wrote a Python script and it is working now.

ADD REPLY • link 4 months ago by Chandini • 0