Question

How can I find out an unknown Illumina barcode?

6

10.9 years ago

samsara ▴ 630

I have a FASTQ file of undetermined reads from an Illumina HiSeq run. The reads contain an Illumina barcode, but I do not know which barcode was used in the lab, and I have no way of finding out. Is it possible to work it out from the FASTQ file I received? Are there any tools that show the most frequently repeated subsequence across the reads of a FASTQ file?

[14-01-2014] Edit1: I have the Illumina base call files (raw data from the sequencer), and I can convert the .bcl files to .fastq using bcl2fastq, provided by Illumina.

illumina hiseq fastq reads • 22k views

ADD COMMENT • link updated 7.8 years ago by elgart ▴ 30 • written 10.9 years ago by samsara ▴ 630

3

10.9 years ago

IV ★ 1.3k

I think that you'd be thrilled to see what minion can do for you.

http://ebi.edu.au/ftp/contrib/enrightlab/kraken/reaper/src/reaper-13-100/doc/minion.html

You can run it as follows:

minion search-adapter -do $millions -show 10 -i $file

Where $millions is the number of reads you want to use for the sequence analysis and $file is the filename to analyze.

Cheers,

IV

ADD COMMENT • link 10.9 years ago by IV ★ 1.3k

1

10.9 years ago

Steve Lianoglou 5.2k

If it isn't a custom adapter/barcode, simply running FastQC on the FASTQ file might be all that you need.

The program is distributed with a "contaminant_list.txt" file that lists many of the known adapters from different library prep methods. FastQC looks for over-represented sequences in the FASTQ file. If they match any of the known adapters there, it will report that. If there is no match, you will still see the over-represented sequence, which you might consider stripping out anyway (you might want to BLAST the unknown sequence against the reference genome first to ensure that you aren't removing any "signal" from your assay).
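To make the idea of "over-represented sequences" concrete, here is a minimal Python sketch that counts exact duplicate read sequences in the first reads of a FASTQ file. It is not FastQC's implementation; the function name `overrepresented_sequences`, the `max_reads` and `min_fraction` defaults, and the toy records are all assumptions chosen for illustration, and it assumes plain 4-line FASTQ records.

```python
from collections import Counter
from itertools import islice


def overrepresented_sequences(fastq_lines, max_reads=100000, min_fraction=0.001):
    """Return (sequence, count, fraction) tuples, most frequent first.

    Assumes 4-line FASTQ records. Only the first `max_reads` records are
    examined; both defaults are illustrative, not FastQC's settings.
    """
    # The sequence is the 2nd line of each record: line indices 1, 5, 9, ...
    sequences = islice(fastq_lines, 1, max_reads * 4, 4)
    counts = Counter(seq.strip() for seq in sequences)
    total = sum(counts.values())
    return [
        (seq, n, n / total)
        for seq, n in counts.most_common()
        if n / total >= min_fraction
    ]


# Toy in-memory example (hypothetical records); for a real file, pass an
# open file handle instead of a list of lines.
example = [
    "@r1", "ACGTACGT", "+", "IIIIIIII",
    "@r2", "ACGTACGT", "+", "IIIIIIII",
    "@r3", "TTTTGGGG", "+", "IIIIIIII",
]
for seq, n, frac in overrepresented_sequences(example, min_fraction=0.5):
    print(seq, n, round(frac, 3))
```

Exact-match counting like this only finds a barcode or adapter if whole reads are identical; a barcode embedded at a fixed position in otherwise different reads would need counting on a substring (for example the first few bases) instead.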

ADD COMMENT • link 10.9 years ago by Steve Lianoglou 5.2k

0

I ran FastQC to check for over-represented sequences before posting this question, but it did not report any.

ADD REPLY • link 10.9 years ago by samsara ▴ 630

0

minion is a lot more sensitive, especially if you use all the available sequences in the analysis.

For instance, if you have 44.3M sequences in your file, you can run:

minion search-adapter -do 44 -show 10 -i sequences.fastq

FastQC utilizes 1M reads for its analysis. The only drawback of minion is that you have to manually inspect the 10 over-represented sequences afterwards.

ADD REPLY • link 10.9 years ago by IV ★ 1.3k

0

7.8 years ago

elgart ▴ 30

I found that for Nextera-type barcodes (where there are two different barcodes on Forward and Reverse reads which together identify the sample) none of the above solutions work. So here is my take on it (python 2.7 code without dependencies):

ADD COMMENT • link 7.8 years ago by elgart ▴ 30

0

Are you sure? What does your undetermined FASTQ look like? You should be able to use the same grep-based command (the chosen answer) on all the different barcode configurations. For dual-index Nextera, they would come out as XXXXXXXX-XXXXXXXX (similar to single index, but with a dash).
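For the dual-index case, a rough Python sketch of how the two halves could be separated and counted as pairs is shown here. It assumes the index is the last colon-separated field of the header line and that the two indexes are joined by "-" (some demultiplexer versions write "+" instead, so both are handled). The function names and the example headers are hypothetical.

```python
from collections import Counter


def split_index(header):
    """Return (index1, index2) from a FASTQ header line.

    Assumes the index is the last colon-separated field. index2 is None
    when the field contains no "-" or "+" separator (single-index read).
    """
    barcode = header.strip().rsplit(":", 1)[-1]
    for sep in ("-", "+"):
        if sep in barcode:
            index1, index2 = barcode.split(sep, 1)
            return index1, index2
    return barcode, None


def count_index_pairs(fastq_lines):
    """Count (index1, index2) pairs over the header line of each 4-line record."""
    counts = Counter()
    for i, line in enumerate(fastq_lines):
        if i % 4 == 0:  # header is the 1st line of every record
            counts[split_index(line)] += 1
    return counts


# Hypothetical records for illustration only.
example = [
    "@HISEQ:1:FC:1:1101:1000:2000 1:N:0:ATCACGAT-GGCTACAA", "ACGT", "+", "IIII",
    "@HISEQ:1:FC:1:1101:1001:2001 1:N:0:ATCACGAT-GGCTACAA", "ACGT", "+", "IIII",
    "@HISEQ:1:FC:1:1101:1002:2002 1:N:0:CGATGTAT-GGCTACAA", "ACGT", "+", "IIII",
]
for pair, n in count_index_pairs(example).most_common():
    print(pair, n)
```

Counting pairs rather than the raw string makes it easy to also tally each index on its own, which can help when only one of the two indexes was misassigned.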

ADD REPLY • link 7.8 years ago by igor 13k

Ram · Accepted Answer · 2014-01-13

9

10.9 years ago

BruceB ▴ 340

Assuming the demultiplexing was carried out specifying the index sequence location, then this could be used:

grep ^@HISEQ lane1_Undetermined_L001_R1_001.fastq | cut -d":" -f10 | uniq -c | sort -nr > barcodes.txt

This outputs a text file containing every barcode in the FASTQ file, with a count of how many times it is seen. The most frequent is at the top of the list. This could help you find the barcode.

ADD COMMENT • link 10.9 years ago by BruceB ▴ 340

2

I think you should sort before counting with uniq -c, because uniq only collapses adjacent duplicate lines. Then sort again to put the most frequent barcode at the top of the list.

grep ^@HISEQ lane1_Undetermined_L001_R1_001.fastq | cut -d":" -f10 | sort -r | uniq -c | sort -nrk1,1 > barcodes.txt
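If a shell pipeline is not convenient, the same counting can be sketched in Python. This is only an approximation of the command above: it assumes header lines start with "@HISEQ" and that the barcode is the 10th colon-separated field, exactly as the grep/cut command does. The function name and the sample records are made up for illustration.

```python
from collections import Counter


def count_header_barcodes(fastq_lines, prefix="@HISEQ"):
    """Return [(barcode, count), ...] sorted by descending count.

    Mirrors `grep ^@HISEQ | cut -d":" -f10 | sort | uniq -c | sort -nr`,
    except that header lines with fewer than 10 fields are skipped.
    """
    counts = Counter()
    for line in fastq_lines:
        if line.startswith(prefix):
            fields = line.rstrip("\n").split(":")
            if len(fields) >= 10:
                counts[fields[9]] += 1  # 10th field, zero-based index 9
    return counts.most_common()


# Hypothetical records; for a real file, iterate over an open file handle.
example = [
    "@HISEQ:1:FC:1:1101:1000:2000 1:N:0:ATCACG", "ACGT", "+", "IIII",
    "@HISEQ:1:FC:1:1101:1001:2001 1:N:0:TTAGGC", "ACGT", "+", "IIII",
    "@HISEQ:1:FC:1:1101:1002:2002 1:N:0:ATCACG", "ACGT", "+", "IIII",
]
for barcode, n in count_header_barcodes(example):
    print(n, barcode)
```

A Counter does not need sorted input, so the adjacency problem with uniq does not arise here.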

ADD REPLY • link updated 5.0 years ago by Ram 44k • written 9.1 years ago by henryvuong ▴ 810