Question

How to search for primer sequences in fastq files generated after amplicon sequencing

3.8 years ago

Haseeb • 0

Hi all,

I need some help with grep or any other command that will help do the job. I am very new to the command line. Any help is appreciated, thank you.

I recently did some amplicon sequencing of a multiplexed PCR reaction. I used nearly 90 primer pairs to multiplex a PCR reaction to generate amplicons. Sequencing libraries were made from these amplicons and read on a MiSeq instrument. Four such reactions, differing in some primer pairs, were sequenced. I now have the fastq files, and I want to see the representation of each primer product in each fastq file to decide which primer pool I should proceed with for my actual experiments. The MiSeq run was single-end, so I want to look for the forward primer sequence in the resultant fastq files.

I have been using grep to get answers, but I only know how to do it one primer at a time:

 grep -c ^AAAGTGTGTGGGGATGATATGG ./*.fastq

-c counts the matching lines, and ^ anchors the pattern to the start of the line, so only reads that begin with the primer are counted.

The result that I get from this is

./myfastq1.fastq:number
./myfastq2.fastq:number
./myfastq3.fastq:number
./myfastq4.fastq:number

Then I take each number and paste it into an Excel file. I know, terrible!!!

I have been searching for help with something similar to what I need, but with no luck.

My request here is:

I have a tab-delimited file, forwardprimers.txt, with (col1) primer name and (col2) primer sequence for 90 primers.

I have 4 fastq files to query with these primer sequences.

Is there a way to query the fastq files with the sequences in the primer file and get the counts for each primer name in a new output file? Thank you.

Primers MiSeq grep Amplicon search Sequencing Fastq • 3.9k views

ADD COMMENT • link updated 3.8 years ago by Istvan Albert 102k • written 3.8 years ago by Haseeb • 0

Use a for loop to go through your forwardprimers.txt file, and use bbduk.sh from the BBMap suite in filter mode to get the stats for each run. You could provide output files if you actually want to pull out the reads that contain a given primer. A guide to using bbduk.sh is available.

It looks like openPrimeR (LINK) should also help.
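
For illustration, the loop over forwardprimers.txt could look roughly like the sketch below. Note that it swaps in the `grep -c` call from the question rather than bbduk.sh, so it only shows the looping part. It assumes uncompressed fastq files, a two-column tab-delimited primer file, and primers that sit at the very start of the read. The function name `count_primers` and the output file name `primer_counts.tsv` are made up for this example.

```shell
# count_primers PRIMER_TSV FASTQ...
# Hypothetical helper: prints one tab-separated line per primer/file pair
# (primer name, fastq file, number of lines starting with the primer).
count_primers() {
    primer_tsv=$1
    shift
    tab=$(printf '\t')
    cr=$(printf '\r')
    # Read primer name and sequence from the two tab-separated columns;
    # the "|| [ -n ... ]" part also handles a last line without a newline.
    while IFS="$tab" read -r name seq || [ -n "$name" ]; do
        seq=${seq%"$cr"}        # drop a trailing carriage return (Excel exports)
        if [ -z "$seq" ]; then
            continue            # skip blank or one-column lines
        fi
        for fq in "$@"; do
            # grep -c prints 0 but exits non-zero when nothing matches,
            # hence the "|| true".
            count=$(grep -c "^${seq}" "$fq" || true)
            printf '%s\t%s\t%s\n' "$name" "$fq" "$count"
        done
    done < "$primer_tsv"
}

# Example call, using the file names from the question:
# count_primers forwardprimers.txt ./*.fastq > primer_counts.tsv
```

The resulting table can be opened directly in Excel, which avoids pasting the numbers by hand.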

ADD REPLY • link 3.8 years ago by GenoMax 152k

score 1 · Answer 1 · 2021-10-03

If you already know the command, then this is simply a matter of running it for every combination of primer and file. You can write a nested for loop in bash as GenoMax suggests, where each loop reads a file, but I would recommend learning GNU parallel, as it can produce a much more elegant solution.

Put your primer sequences and the names of your files into two separate files, then run the command to see the output:

parallel 'echo {1},{2},`grep -c {1} {2}`' :::: primers.txt :::: filelist.txt

For an example input of two primers and two files, the above prints:

ATG,file1.fq,2
AAA,file1.fq,3
ATG,file2.fq,1
AAA,file2.fq,0

It produces a useful output that shows the primer, the filename, and the count on each line.
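
As a rough sketch, the two input files (primers.txt and filelist.txt) could be prepared from the files described in the question along these lines. The helper name `make_inputs` is made up for this example, and it assumes the sequences are in the second tab-separated column of the primer file.

```shell
# make_inputs PRIMER_TSV FASTQ...
# Hypothetical helper: writes primers.txt and filelist.txt in the
# current directory for use with the parallel command.
make_inputs() {
    primer_tsv=$1
    shift
    # Column 2 holds the sequences; tr removes carriage returns that
    # Excel exports sometimes leave at the end of each line.
    cut -f2 "$primer_tsv" | tr -d '\r' > primers.txt
    # One fastq path per line.
    printf '%s\n' "$@" > filelist.txt
}

# Example call, using the file names from the question:
# make_inputs forwardprimers.txt ./*.fastq
```

With these inputs the first output column is the primer sequence, not the primer name, so the names would have to be joined back in afterwards.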

Learn more about GNU parallel here:

Gnu Parallel - Parallelize Serial Command Line Programs Without Changing Them