Hi All,
I have paired-end 250 bp sequencing data for a sample. The read count is around 2 million. The data contain 70-mer barcodes that are embedded between common upstream and downstream regions.
I have to determine how many unique barcodes are present in the sample and their frequency relative to the total number of reads. So far, I have mapped the reads to a reference that has N's in the barcode region. I found some common sequences which may be the barcodes.
I then merged the forward and reverse reads with a minimum overlap of 50 and grepped for the observed barcode sequences. However, I am not sure this approach is correct.
Is there any better way to perform this kind of analysis?
Help would be appreciated.
Thanks in advance !!
Are the barcodes in exactly the same location in all reads (e.g. base pairs 1-30)? If so, you could cut those regions out (use bbduk.sh from the BBMap suite) and then pipe the extracted sequences through sort | uniq -c to get the counts.
Thanks for replying. I know the location of the barcodes in the reference, which is base pairs 807-896. I do not know their sequence or their location in the reads.
Will bbduk work in this way?
You are going to need to use a custom solution for this situation. A couple of options.
Sort/index your alignment files. Retrieve the SAM alignment lines that align to the region of interest (+/- N bases). You could then look at the CIGAR string of each alignment and figure out which bases you need to excise from the original reads.
Or retrieve the reads that map to the region with samtools view, and then do a multiple-sequence alignment against the reduced reference to identify the section you are interested in.
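For the CIGAR option, a rough sketch in Python might look like the following. It assumes plain-text SAM lines as input and uses the 807-896 coordinates mentioned above; the function names extract_region and count_barcodes are made up for illustration and are not part of samtools or BBMap. It walks each CIGAR string, collects the read bases aligned to the barcode interval, and tallies them, much like sort | uniq -c would.

```python
import re
import sys
from collections import Counter

# One CIGAR operation: a length followed by an operation letter.
CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def extract_region(pos, cigar, seq, ref_start, ref_end):
    """Return the read bases aligned to the 1-based, inclusive reference
    interval [ref_start, ref_end], or None if the alignment does not
    span the whole interval. (Hypothetical helper, for illustration.)"""
    if pos > ref_start:
        return None
    ref = pos   # 1-based reference position of the next aligned base
    read = 0    # 0-based index into seq
    pieces = []
    for length, op in CIGAR_RE.findall(cigar):
        length = int(length)
        if op in "M=X":            # consumes both reference and read
            lo = max(ref, ref_start)
            hi = min(ref + length - 1, ref_end)
            if lo <= hi:
                pieces.append(seq[read + (lo - ref): read + (hi - ref) + 1])
            ref += length
            read += length
        elif op == "I":            # consumes read only
            # keep insertions that fall strictly inside the interval
            if ref_start < ref <= ref_end:
                pieces.append(seq[read: read + length])
            read += length
        elif op == "S":            # soft clip: consumes read only
            read += length
        elif op in "DN":           # deletion/skip: consumes reference only
            ref += length
        # H and P consume neither read nor reference
    if ref - 1 < ref_end:
        return None
    return "".join(pieces)


def count_barcodes(sam_lines, ref_start=807, ref_end=896):
    """Tally the sequences found in the barcode interval across SAM lines."""
    counts = Counter()
    for line in sam_lines:
        if line.startswith("@"):
            continue               # header line
        fields = line.rstrip("\n").split("\t")
        flag, pos, cigar, seq = int(fields[1]), int(fields[3]), fields[5], fields[9]
        # skip unmapped (0x4), secondary (0x100), supplementary (0x800)
        if flag & 0x904 or cigar == "*" or seq == "*":
            continue
        barcode = extract_region(pos, cigar, seq, ref_start, ref_end)
        if barcode is not None:
            counts[barcode] += 1
    return counts


if __name__ == "__main__":
    counts = count_barcodes(sys.stdin)
    total = sum(counts.values())
    for barcode, n in counts.most_common():
        print(f"{n}\t{n / total:.6f}\t{barcode}")
```

You could feed it with something like samtools view aln.sorted.bam ref:807-896 | python count_barcodes.py, where the BAM file, reference name, and script name are placeholders. Note that the printed fraction is relative to reads spanning the barcode region, not to all reads in the sample, and that with unmerged paired-end data both mates of a pair can span the region and be counted twice, so running this on the merged reads is probably safer.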