Question

Extract unique barcodes and find their frequency

4.9 years ago

kspata ▴ 90

Hi All,

I have paired-end 250 bp sequencing data for a sample. The read count is around 2 million. The data contain 70-mer barcodes embedded between common upstream and downstream regions.

I have to determine how many unique barcodes are present in the sample and their frequency relative to the total number of reads. So far, I have mapped the reads to a reference that has N's in the barcode region. I found some common sequences which may be the barcodes.

I then merged the forward and reverse reads with a minimum overlap of 50 and grepped for the observed barcode sequences. However, I am not sure whether this approach is correct.

Is there any better way to perform this kind of analysis?

Help would be appreciated.

Thanks in advance !!

ngs illumina mapping • 1.4k views

ADD COMMENT • link 4.9 years ago by kspata ▴ 90

Are the barcodes in exactly the same location in all reads (e.g. base pairs 1-30)? If so, you could cut those regions out (use bbduk.sh from the BBMap suite) and then pipe the result through sort | uniq -c to get the counts.
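If the barcodes really do sit at a fixed position, the cut-and-count step can also be sketched in a few lines of Python. This is only an illustration, not a replacement for bbduk.sh: the function names, the toy reads, and the 0-based half-open coordinates are assumptions made for the example, and `read_fastq_sequences` assumes a plain uncompressed 4-line-per-record FASTQ.

```python
from collections import Counter


def read_fastq_sequences(path):
    """Yield the sequence line of each record in an uncompressed 4-line FASTQ file."""
    with open(path) as handle:
        for i, line in enumerate(handle):
            if i % 4 == 1:  # second line of every record is the sequence
                yield line.strip()


def count_fixed_position_barcodes(reads, start, end):
    """Count the substrings at 0-based, half-open [start, end) in each read.

    Reads shorter than `end` are skipped, because they cannot contain
    the full barcode window.
    """
    counts = Counter()
    for seq in reads:
        if len(seq) >= end:
            counts[seq[start:end]] += 1
    return counts


# Toy reads (hypothetical); in practice use read_fastq_sequences("reads.fastq").
reads = ["ACGTAAAA", "ACGTCCCC", "TTGACCCC", "ACGTGG", "ACG"]
counts = count_fixed_position_barcodes(reads, 0, 4)
total_reads = len(reads)

# Print each barcode, its count, and its frequency relative to all reads.
for barcode, n in counts.most_common():
    print(barcode, n, f"{n / total_reads:.3f}")
```

The frequency here is relative to all reads, including those too short to be counted; divide by `sum(counts.values())` instead if you want it relative to reads that actually contained the window.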

ADD REPLY • link 4.9 years ago by GenoMax 153k

Thanks for replying. I know the location of the barcodes in the reference, which is base pairs 807-896. I do not know their sequence or location in the reads.

Will bbduk work in this way?

ADD REPLY • link 4.9 years ago by kspata ▴ 90

You are going to need to use a custom solution for this situation. A couple of options.

Sort and index your alignment files. Retrieve the SAM alignment lines that align to the region of interest (+/- N bases). You could then look at the CIGAR string of each alignment and figure out which bases you need to excise from the original reads.

Or retrieve the reads that map to the region with samtools view and then do a multiple-sequence alignment against the reduced reference to identify the section you are interested in.
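For the first option (the CIGAR-based one), here is a rough Python sketch of walking the CIGAR string to pull out the read bases that align to a reference interval. It assumes plain-text SAM lines (for example from samtools view restricted to the region), 1-based inclusive reference coordinates, and that only alignments spanning the whole interval should be counted. The function names and the toy records are made up for illustration. How an aligner treats a stretch of N's in the reference varies, so inspect the resulting CIGARs before trusting the output.

```python
import re
from collections import Counter

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def extract_region(pos, cigar, seq, region_start, region_end):
    """Return read bases aligned to reference [region_start, region_end].

    Coordinates are 1-based and inclusive; `pos` is the SAM POS field.
    Returns None if the alignment does not span the whole region.
    """
    ref = pos  # next reference position to be consumed
    qi = 0     # next index into the read sequence
    out = []
    for length, op in CIGAR_RE.findall(cigar):
        length = int(length)
        if op in "M=X":  # consumes both read and reference
            for _ in range(length):
                if region_start <= ref <= region_end:
                    out.append(seq[qi])
                ref += 1
                qi += 1
        elif op == "I":  # consumes read only; keep if strictly inside region
            if region_start < ref <= region_end:
                out.append(seq[qi:qi + length])
            qi += length
        elif op == "S":  # soft clip: consumes read only
            qi += length
        elif op in "DN":  # deletion/skip: consumes reference only
            ref += length
        # H and P consume neither read nor reference
    if pos > region_start or ref <= region_end:
        return None
    return "".join(out)


def count_region_sequences(sam_lines, region_start, region_end):
    """Count the extracted region sequences over an iterable of SAM lines."""
    counts = Counter()
    for line in sam_lines:
        if line.startswith("@"):  # header line
            continue
        fields = line.rstrip("\n").split("\t")
        pos, cigar, seq = int(fields[3]), fields[5], fields[9]
        if cigar == "*" or seq == "*":  # unmapped or sequence not stored
            continue
        barcode = extract_region(pos, cigar, seq, region_start, region_end)
        if barcode is not None:
            counts[barcode] += 1
    return counts


# Toy SAM records (hypothetical) with a region of interest at 8-11.
demo = [
    "@SQ\tSN:ref\tLN:20",
    "r1\t0\tref\t5\t60\t12M\t*\t0\t0\tAAACGTACGTTT\t*",
    "r2\t0\tref\t5\t60\t2S10M\t*\t0\t0\tGGAAACGTACGT\t*",
    "r3\t0\tref\t9\t60\t12M\t*\t0\t0\tAAACGTACGTTT\t*",
]
demo_counts = count_region_sequences(demo, 8, 11)
for barcode, n in demo_counts.most_common():
    print(barcode, n)
```

For the data in this thread, the interval would be 807-896. SAM stores SEQ on the reference strand, so reverse-strand reads need no extra reverse-complementing. With paired-end data both mates may cover the region, so you may want to count each read name only once.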

ADD REPLY • link 4.9 years ago by GenoMax 153k