Question

Retrieve and count variable barcodes from pooled sequencing fastq

3

Entering edit mode

3.0 years ago

yura.grabovska ▴ 660

I have a large fastq file with 100-base reads from a pooled barcoding experiment. This is not data I generated so I have limited options.

The barcodes are 21-mer and there are up to 100,000 different barcodes in the FASTQ.

The barcode is flanked by two static sequences from the constructs such that:

     Flanking 1      Barcode               Flanking 2 
[...]AAAGGACGAAACACC NNNNNNNNNNNNNNNNNNNNN GTTTCAGAGCTATGC[...]

I need to parse all the reads of the FASTQ file looking for all possible barcodes that fit this pattern of sequence and either count them, or output only the barcodes so I can count them manually.

I can use bbduk.sh to match a single barcode to the data, and I also looked at PoolQ but I can't generate a reference which PoolQ seems to require.

I only have the information included here and the FASTQ file.

Can someone suggest a tool to do this?

NGS fastq barcodes • 1.9k views

ADD COMMENT • link updated 3.0 years ago by shenwei356 8.7k • written 3.0 years ago by yura.grabovska ▴ 660

score 2 · Answer 1 · 2021-11-24

You can also use bbduk (GUIDE) for this. Something like this will work

bbduk.sh -Xmx2g in=read.fq.gz literal=AAAGGACGAAACACC ktrim=l restrictleft=NN1 k=10 out=stdout.fq | bbduk.sh -Xmx2g in=stdin.fq out=trimmed.fa literal=GTTTCAGAGCTATGC ktrim=r restrictright=NN2 k=10

OR

bbduk.sh -Xmx2g in=read.fq.gz literal=AAAGGACGAAACACC,GTTTCAGAGCTATGC ktrim=lr k=10 out=trimmed.fa

At this point your barcodes should be in fasta format. You can simply grep -v headers | sort | uniq -c to count them.

Replace NN1/NN2 with values equal to base pairs you want bbduk to restrict its search in.

score 2 · Answer 2 · 2021-11-24

seqkit amplicon can help to extract the barcode between two constant sequences, using region m:-n, where m = len(flank1)+1, and n=len(flank2)+1.

# reverse complement of GTTTCAGAGCTATGC
$ echo -en ">s\nGTTTCAGAGCTATGC\n" | seqkit seq -r -p -v -t dna | seqkit seq -s
GCATAGCTCTGAAAC

$ cat t.fq
@s1
tttAAAGGACGAAACACC_NNNNNNNNNNNNNNNNNNNNN_GTTTCAGAGCTATGCggg
+
GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
@s2
gggAAAGGACGAAACACC_NNNNNNNNNNNNNNNNNNNN_GTTTCAGAGCTATGCaaa
+
GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG

$ seqkit amplicon -F AAAGGACGAAACACC -R GCATAGCTCTGAAAC -r 16:-16 t.fq \
    | seqkit seq -s
[INFO] 1 primer pair loaded
_NNNNNNNNNNNNNNNNNNNNN_
_NNNNNNNNNNNNNNNNNNNN_

It also allows mismatch via -m, and then you can increase -j/--threads to accelerate matching.