Question

How to confirm forward primer matches each read without removing primer sequence?

0

Entering edit mode

9.2 years ago

gavinmdouglas ▴ 10

While filtering illumina MiSeq data I want to check whether the forward primer matches the beginning of each read (these are single-end reads). Lots of programs let you do this, but then they trim off the primer sequence. I'd actually like to keep the primer sequences in, but throw out any reads that don't have the primer sequence.

*Edit: note that degenerate primers were used, so there are multiple sequences that could be matched.

Does anyone know of a way to do this with fastq files? Ideally I'd like to use Trimmomatic's ILLUMINACLIP command, but I haven't figured out a way to stop it from clipping the sequence.

The mothur command trim.seqs does this for fasta files (when keepforward=T), but I'd rather not have to waste time converting between fastq and fasta since this is a pipeline I will need to re-run a lot.

Thanks in advance!

Gavin

filtering sequencing read • 6.9k views

ADD COMMENT • link updated 2.4 years ago by Ram 44k • written 9.2 years ago by gavinmdouglas ▴ 10

2

Entering edit mode

Why not use one of the barcode demultiplexing tools, with the degenerate primer sequences as your barcodes? You could cat all of the matched reads together at the end.

Alternatively, I'm 95% certain that BBMap can do the trick. After all, it seems to do everything else ;-).

ADD REPLY • link updated 5.1 years ago by Ram 44k • written 9.2 years ago by harold.smith.tarheel ★ 5.0k

2

Entering edit mode

As a matter of fact, it can!

e.g. for the adapter sequence GGACTGANNCGA

bbduk.sh in=reads.fq outm=matching.fq outu=nonmatching.fq literal=GGACTGANNCGA k=12 restrictleft=12 mm=f rcomp=f copyundefined

Here, literal is the literal sequence to look for; k should be set to the exact length of that sequence, restrictleft is optional but in this case will tell it to only look for matches in the first 12 bases (so it will not accept something where the primer is in the middle, for example); rcomp=f tells it to not look for reverse-complemented sequence; and copyundefined tells it to make copies of the literal sequence with all possible combinations of the degenerate bases.

ADD REPLY • link updated 5.1 years ago by Ram 44k • written 9.2 years ago by Brian Bushnell 20k

1

Entering edit mode

Thanks, this is perfect!

ADD REPLY • link updated 5.1 years ago by Ram 44k • written 9.2 years ago by gavinmdouglas ▴ 10

0

Entering edit mode

Aha, I knew it!

ADD REPLY • link updated 5.1 years ago by Ram 44k • written 9.2 years ago by harold.smith.tarheel ★ 5.0k

Ram · Answer 1 · 2015-10-29

Here is a potential python script. Use biopython to read in your FASTQ file, and then loop through each record and compare to your primer sequence. This is sort of lifted out of our UFITS package for processing NGS fungal amplicon sequences, where we do something similar, i.e. find primers and trim - but in your case you can just find them and print if match.

import sys
from Bio import SeqIO

#create dictionary with IUPAC codes
IUPAC = {}
IUPAC['A'] = "A"
IUPAC['C'] = "C"
IUPAC['G'] = "G"
IUPAC['T'] = "T"
IUPAC['M'] = "AC"
IUPAC['R'] = "AG"
IUPAC['W'] = "AT"
IUPAC['S'] = "CG"
IUPAC['Y'] = "CT"
IUPAC['K'] = "GT"
IUPAC['V'] = "ACG"
IUPAC['H'] = "ACT"
IUPAC['D'] = "AGT"
IUPAC['B'] = "CGT"
IUPAC['X'] = "GATC"
IUPAC['N'] = "GATC"

def MatchLetter(a, b):
    global IUPAC
    try:
        sa = IUPAC[a.upper()]
    except:
        return False
    try:
        sb = IUPAC[b.upper()]
    except:
        return False
    for ca in sa:
        if ca in sb:
            return True
    return False

def MatchPrefix(Seq, Primer):
    L = len(Seq)
    n = len(Primer)
    if L < n:
        n = L
    Diffs = 0
    for i in range(0, n):
        if not MatchLetter(Seq[i], Primer[i]):
            Diffs += 1
    return Diffs

primer = 'AGGCAATTRCTWGGA'

handle = open(sys.argv[1], 'rU')
SeqRecords = SeqIO.parse(handle, 'fastq')
for rec in SeqRecords:
    Seq = str(rec.seq)
    Diffs = MatchPrefix(Seq, primer) #count diffs between Seq and primer
    if Diffs < 2:  #if Diffs less than some threshold, then just print record
        SeqIO.write(rec, sys.stdout, 'fastq')

Ram · Answer 2 · 2015-10-29

1

Entering edit mode

9.2 years ago

Renesh ★ 2.2k

As I understand your question correctly, you want to filter out the reads without primer sequences.

For this you can use, grep command on Linux/Unix to achieve it

grep your_primer_sequence fastq_file > output_file

Use grep --help to explore more options for grep command

ADD COMMENT • link updated 5.1 years ago by Ram 44k • written 9.2 years ago by Renesh ★ 2.2k

0

Entering edit mode

Thanks for your response. I should have mentioned that degenerate primers were used, so it is not only one primer sequence to scan for. Also, I want to make sure the primer matches at the beginning of the read (not just anywhere in the sequence) and of course the command as you have typed it would leave out the non-sequence lines of the fastq as well, so I don't think that would work.

ADD REPLY • link updated 5.1 years ago by Ram 44k • written 9.2 years ago by gavinmdouglas ▴ 10

1

Entering edit mode

You can loop over all sequences and modify the GREP extracting the line before the hit and then two following ones. While grepping the primer-sequence with '^', you'll get only those starting with it:

for i in $(cat primers.txt); do grep -A 2 -B 1 ^${i} myfastq.fq > $i.fastq ; done

ADD REPLY • link updated 5.1 years ago by Ram 44k • written 9.2 years ago by michael.ante ★ 3.9k

0

Entering edit mode

Okay, I don't think so any program will do like as per your requirement. here you have to write your own code.

ADD REPLY • link 9.2 years ago by Renesh ★ 2.2k

Ram · Answer 3 · 2015-10-30

The following command may help,

zcat infile.fastq.gz | paste - - - - | awk '{if ($0~/\tSEQUENCE/)print}' | awk -F '\t' '{print $1"\n"$2"\n"$3"\n"$4}' > outfile.fastq

The paste command with four hyphen makes four lines into one line separated by tab.

The second column in each line will be the sequence and tab precedes the sequence.

so \tSEQUENCE gives the fastq records which starts with primer sequence.

The last fastq print command prints back in the standard fastq format.

|| symbol can be used in the pattern matching step to include multiple primer sequences.

Hope it helps!