Question

Processing barcoded sequencing data

0

7.5 years ago

bioplanet ▴ 60

Hi all,

I have recently obtained some Illumina sequencing data (amplicon-seq) where each of the sequences that were amplified is characterised by a distinct (unique) barcode (an 8-mer).

With help from people here at Biostars, I collected the different barcodes. My question now, in case anyone has done this or has seen a paper describing this kind of analysis, is the following:

Suppose you do not know the barcodes beforehand and, like me, you end up with, say, 5000 different 8-mers. Obviously, some of them are very frequent and some are not. My only way of deciding whether a given 8-nt sequence I extracted is an actual barcode is to base it on its frequency and on qPCR estimates of the number of integrations.

But is this actually correct? I will add 2 examples:

The first cell line gave me 40 barcodes with frequencies of 60,000 - 130,000. The next one in line had a frequency of 1,500, for example. I then said, "OK, since the drop from 60,000 to 1,500 is very big, probably all 8-mers with a frequency below 60,000 are PCR artifacts." Do you think this is correct? I would like to know if there is a publication describing how to select/set such cut-offs.
Another experiment was more ambiguous, since there the barcode frequencies dropped off "smoothly", like 100,000 - 90,000 ... 10,000, 8,000 ... So there was no huge jump. What do you do then?
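
To make the "big drop" reasoning from the two examples concrete, here is a minimal Python sketch that ranks barcodes by read count and splits them at the largest fold-drop between neighbours. The function name `largest_drop_cutoff` and the example counts are made up for illustration; this is only one possible heuristic, not a published method.

```python
from typing import Dict, List, Tuple


def largest_drop_cutoff(counts: Dict[str, int]) -> Tuple[List[str], float]:
    """Keep every barcode ranked above the largest fold-drop in read count.

    Returns the kept barcodes and the fold-drop at the split point.
    """
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    if len(ranked) < 2:
        return [bc for bc, _ in ranked], 1.0

    best_idx, best_ratio = 0, 1.0
    for i in range(len(ranked) - 1):
        higher, lower = ranked[i][1], ranked[i + 1][1]
        ratio = higher / max(lower, 1)  # guard against zero counts
        if ratio > best_ratio:
            best_idx, best_ratio = i, ratio

    kept = [bc for bc, _ in ranked[: best_idx + 1]]
    return kept, best_ratio


# Made-up counts resembling the first example (clear jump from 60,000 to 1,500).
example_counts = {
    "ACCTAATT": 130000,
    "GGTTCCAA": 60000,
    "ACCTAATA": 1500,
    "TTTTGGGG": 900,
}
kept, fold_drop = largest_drop_cutoff(example_counts)
print(kept, fold_drop)
```

In the "smooth" case the largest fold-drop will be small (close to 1), which is itself a signal that frequency alone does not separate true barcodes from artifacts and that extra evidence is needed.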

Any help, idea, or publication you might have come across would be very valuable!

sequencing • 1.8k views

ADD COMMENT • link updated 7.5 years ago by Sparrow_kop ▴ 260 • written 7.5 years ago by bioplanet ▴ 60

0

What are those barcodes exactly, unique molecular identifiers?

ADD REPLY • link 7.5 years ago by VHahaut ★ 1.2k

score 0 · Answer 1 · 2017-11-02

0

7.5 years ago

Sparrow_kop ▴ 260

Hi, the main analysis strategy for random barcodes is as follows. First, group the reads into families according to their barcode sequence. Second, within each family, collapse the reads into a consensus read, using the barcode to filter out sequencing errors. Then you can use the consensus reads for mapping and downstream analysis.
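
The two steps described here (grouping by barcode, then building a consensus per family) could look roughly like the sketch below. It assumes reads are already available as `(barcode, sequence)` pairs and that sequences within a family are aligned and free of indels; `group_by_barcode` and `consensus` are illustrative names, not functions from any particular tool.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple


def group_by_barcode(reads: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Collect read sequences into families keyed by their barcode."""
    families: Dict[str, List[str]] = defaultdict(list)
    for barcode, seq in reads:
        families[barcode].append(seq)
    return dict(families)


def consensus(seqs: List[str]) -> str:
    """Majority base at each position (assumes aligned reads, no indels).

    Positions beyond the shortest read in the family are ignored.
    """
    length = min(len(s) for s in seqs)
    return "".join(
        Counter(s[i] for s in seqs).most_common(1)[0][0] for i in range(length)
    )


# Made-up reads: the third read in the first family carries one error (T -> A).
reads = [
    ("ACCTAATT", "ACGTACGT"),
    ("ACCTAATT", "ACGTACGT"),
    ("ACCTAATT", "ACGAACGT"),
    ("GGTTCCAA", "TTGCA"),
]
families = group_by_barcode(reads)
consensus_reads = {bc: consensus(seqs) for bc, seqs in families.items()}
print(consensus_reads)
```

A simple majority vote like this only corrects errors in the read body; it does nothing about errors in the barcode itself, which would create a separate small family.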

ADD COMMENT • link 7.5 years ago by Sparrow_kop ▴ 260

0

Hello and thank you very much for your answers. My barcodes are random 8mers that I do not know beforehand. They are tagged to each sequence that is ordered and gets amplified (I am sorry if I am not explaining it perfectly, I am new to this).

The key thing is that nobody knows these 8-mers beforehand.

Regarding Sparrow's answer, I am not sure I get it completely... So far, what I have done is basically map all my reads against my reference and isolate the part of each read that corresponds to the NNNNNNNN region in the construct (which is the barcode).

And what I have is basically an Excel sheet with some thousands of barcodes and their respective frequency in my sequencing reads.
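
For reference, a table of barcodes and frequencies like this can also be produced without a mapping step by matching the constant sequence on either side of the NNNNNNNN region. This is a different route from the mapping-based one described above. The flanking sequences `LEFT_FLANK` and `RIGHT_FLANK` below are hypothetical placeholders and would need to be replaced with the real construct sequence.

```python
import re
from collections import Counter
from typing import Iterable

# Hypothetical constant flanks around the NNNNNNNN region (placeholders only).
LEFT_FLANK = "GATTACA"
RIGHT_FLANK = "CCGGTT"

# Capture exactly 8 unambiguous bases between the two flanks.
BARCODE_PATTERN = re.compile(f"{LEFT_FLANK}([ACGT]{{8}}){RIGHT_FLANK}")


def count_barcodes(reads: Iterable[str]) -> Counter:
    """Count the 8-mer found between the flanks in each read.

    Reads without both flanks, or with an N inside the barcode, are skipped.
    """
    counts: Counter = Counter()
    for read in reads:
        match = BARCODE_PATTERN.search(read)
        if match:
            counts[match.group(1)] += 1
    return counts


# Made-up reads for illustration.
example_reads = [
    "TTGATTACAACCTAATTCCGGTTAA",
    "GATTACAACCTAATTCCGGTT",
    "GATTACAGGTTCCAACCGGTT",
    "GATTACAACCTNATTCCGGTT",  # N in barcode: skipped
    "ACGTACGTACGT",           # no flanks: skipped
]
barcode_counts = count_barcodes(example_reads)
for barcode, n in barcode_counts.most_common():
    print(barcode, n, sep="\t")
```

Exact flank matching discards reads with sequencing errors in the flanks, so the counts will be somewhat lower than with a mapping-based approach.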

What do you mean I should do next? If, for example, I have barcode ACCTAATT that is found in 10,000 reads, take all reads and create a consensus sequence out of them? And how do I use this to decide if another barcode that has been found only in 500 of my sequencing reads is a true barcode or not?

Also, is there some paper you might have come across where they do it and maybe you can refer me to?

Thank you very much!

ADD REPLY • link 7.5 years ago by bioplanet ▴ 60

1

Hi, in fact you do not need to know the random barcode sequence beforehand. It is a tag used to track and identify the original DNA fragment. The figure below illustrates how this works under the hood.

the figure

As for your question, 'What do you mean I should do next? If, for example, I have barcode ACCTAATT that is found in 10,000 reads, take all reads and create a consensus sequence out of them? And how do I use this to decide if another barcode that has been found only in 500 of my sequencing reads is a true barcode or not?': there are many DNA fragments before the library is constructed, and each fragment gets a random barcode added whether or not its sequence is the same as another fragment's. So if a certain barcode has many copies, the barcode is true.

Please first read the paper in the link. I hope this is helpful to you :-)
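
For the 500-read versus 10,000-read case, one additional check (not described in this thread, so treat it as an assumption) is whether the rare barcode differs by a single base from a much more abundant one, which would be consistent with a sequencing or PCR error in the barcode itself. The sketch below flags such barcodes; `flag_likely_errors` and the `min_fold` threshold of 10 are illustrative choices, not established values.

```python
from typing import Dict


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def flag_likely_errors(counts: Dict[str, int], min_fold: float = 10.0) -> Dict[str, str]:
    """Map each suspicious barcode to the abundant barcode it may derive from.

    A barcode is flagged when it is exactly one mismatch away from a barcode
    that is at least `min_fold` times more abundant (hypothetical threshold).
    """
    ranked = sorted(counts, key=counts.get, reverse=True)
    flagged: Dict[str, str] = {}
    for i, minor in enumerate(ranked):
        for parent in ranked[:i]:
            if parent in flagged:
                continue  # do not chain through barcodes already flagged
            if hamming(minor, parent) == 1 and counts[parent] >= min_fold * counts[minor]:
                flagged[minor] = parent
                break
    return flagged


# Made-up counts: ACCTAATA is one base away from the abundant ACCTAATT.
example = {"ACCTAATT": 10000, "ACCTAATA": 500, "GGTTCCAA": 450}
suspects = flag_likely_errors(example)
print(suspects)
```

A low-count barcode that is not close to any abundant barcode (like GGTTCCAA here) is not flagged, so this check cannot by itself confirm that it is real; it only removes one class of artifact.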

ADD REPLY • link 7.5 years ago by Sparrow_kop ▴ 260