How Can I Find Out an Unknown Illumina Barcode?
4
6
11.0 years ago
samsara ▴ 630

I have a FASTQ file of undetermined reads from an Illumina HiSeq run. The reads in the FASTQ file contain an Illumina barcode. I do not know which barcode was used in the lab, and there is no way I can find out. Is it possible to work it out from the FASTQ file I received? Are there any tools that show the most frequently repeated subsequence across the reads of a FASTQ file?

[14-01-2014] Edit1: I have the Illumina base call files (raw data from the sequencer), and I can convert the .bcl files to .fastq using bcl2fastq, provided by Illumina.

illumina hiseq fastq reads • 22k views
9
11.0 years ago
BruceB ▴ 340

Assuming demultiplexing was run with the index sequence location specified (so that the index is recorded in each read header), this could be used:

grep ^@HISEQ lane1_Undetermined_L001_R1_001.fastq | cut -d":" -f10 | uniq -c | sort -nr > barcodes.txt

This outputs a text file containing every barcode in the FASTQ file, with a count of how many times it is seen. The most frequent is at the top of the list. This might help you find the barcode.

2

I think you should sort before counting with uniq (uniq only merges adjacent identical lines), then sort again by count to see the most frequent at the top of the list.

grep ^@HISEQ lane1_Undetermined_L001_R1_001.fastq | cut -d":" -f10 | sort -r | uniq -c | sort -nrk1,1 > barcodes.txt
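
For anyone who prefers a script, here is a minimal Python sketch of the same counting idea. It is not a replacement for the command above, just one way it might be written. It assumes a four-line-per-record FASTQ whose header ends with the index as the last colon-separated field; the function name `count_barcodes` and the sample records are made up for illustration.

```python
from collections import Counter
from itertools import islice


def count_barcodes(fastq_lines):
    """Count the last colon-separated field of every FASTQ header line."""
    counts = Counter()
    # Headers are lines 0, 4, 8, ... of a four-line-per-record FASTQ, so a
    # quality string that happens to start with "@" is never read as a header.
    for header in islice(fastq_lines, 0, None, 4):
        counts[header.rstrip("\n").rsplit(":", 1)[-1]] += 1
    return counts


# Made-up records, only to show the call; pass an open file handle instead.
sample = [
    "@HISEQ:1:FC:1:1101:1000:2000 1:N:0:ACGTAC\n", "ACGT\n", "+\n", "IIII\n",
    "@HISEQ:1:FC:1:1101:1001:2001 1:N:0:TTGCAA\n", "ACGT\n", "+\n", "IIII\n",
    "@HISEQ:1:FC:1:1101:1002:2002 1:N:0:ACGTAC\n", "ACGT\n", "+\n", "IIII\n",
]
for barcode, n in count_barcodes(sample).most_common():
    print(n, barcode)
```

Because the counts are kept in a dictionary, no pre-sorting is needed, and `most_common()` lists the most frequent barcode first.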
3
11.0 years ago
IV ★ 1.3k

I think that you'd be thrilled to see what minion can do for you.

http://ebi.edu.au/ftp/contrib/enrightlab/kraken/reaper/src/reaper-13-100/doc/minion.html

You can run it as follows:

minion search-adapter -do $millions -show 10 -i $file

Here $millions is the number of reads, in millions, that you want to use for the sequence analysis, and $file is the name of the file to analyze.

Cheers,

IV

1
11.0 years ago

If it isn't a custom adapter/barcode, simply running FastQC on the FASTQ file might be all that you need.

The program is distributed with a "contaminant_list.txt" file that lists many of the known adapters from different library prep methods. FastQC looks for over-represented sequences in the FASTQ file. If they match any of the known adapters in that list, it will report the match. If there is no match, you will still see the over-represented sequence, which you might consider stripping out anyway (though you might first want to BLAST the unknown sequence against the reference genome to make sure that you aren't removing any "signal" from your assay).
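
To make the idea of an over-represented sequence concrete, here is a rough Python sketch. It is not FastQC's implementation: it simply counts the first few bases of each read in a sample of the file and reports the prefixes that exceed a chosen fraction. The function name and the parameters `max_reads`, `prefix_len` and `min_fraction` are hypothetical, and looking only at read prefixes assumes that an in-line barcode, if there is one, sits at the start of the read.

```python
from collections import Counter
from itertools import islice


def overrepresented_prefixes(fastq_lines, max_reads=100000, prefix_len=8,
                             min_fraction=0.01):
    """Return (prefix, count) pairs that make up at least min_fraction of the sampled reads."""
    # Sequence lines are lines 1, 5, 9, ... of a four-line-per-record FASTQ.
    seqs = islice(fastq_lines, 1, None, 4)
    counts = Counter(s.strip()[:prefix_len] for s in islice(seqs, max_reads))
    total = sum(counts.values())
    # most_common() yields nothing for empty input, so total is never zero here.
    return [(p, n) for p, n in counts.most_common() if n / total >= min_fraction]
```

It can be called with an open file handle, for example `overrepresented_prefixes(open("sequences.fastq"))`, where the file name is a placeholder.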

0

I ran FastQC to check for over-represented sequences before posting this question, but it did not report any.

0

minion is a lot more sensitive, especially if you use all the available sequences in the analysis.

For instance, if you have 44.3M sequences in your file, you can run:

minion search-adapter -do 44 -show 10 -i sequences.fastq

FastQC uses only 1M reads for its analysis. The only drawback of minion is that afterwards you have to inspect the 10 over-represented sequences manually.
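
If you do not know how many reads the file holds, a small Python sketch like the following could work out the whole number of millions used in the example above (44.3M reads giving 44). It assumes a plain, uncompressed, four-line-per-record FASTQ; both function names are hypothetical.

```python
def count_reads(fastq_lines):
    """Count records in a four-line-per-record FASTQ."""
    return sum(1 for _ in fastq_lines) // 4


def whole_millions(n_reads):
    """Round a read count down to whole millions, e.g. 44.3M reads gives 44."""
    return n_reads // 1_000_000


# Example call (file name is a placeholder):
# with open("sequences.fastq") as fh:
#     print(whole_millions(count_reads(fh)))
```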

0
7.9 years ago
elgart ▴ 30

I found that for Nextera-type barcodes (where there are two different barcodes on Forward and Reverse reads which together identify the sample) none of the above solutions work. So here is my take on it (python 2.7 code without dependencies):

0

Are you sure? What does your undetermined FASTQ look like? You should be able to use the same grep-based command (the chosen answer) on all the different barcode configurations. For dual-index Nextera, the barcodes would come out as XXXXXXXX-XXXXXXXX (similar to single index, but with a dash).
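
As an illustration of the dual-index case, here is a minimal Python sketch that reads the index field from each header and counts the two halves as a pair. It assumes the index is the last colon-separated field of the header and that the two indices are separated by "-" as described above; accepting "+" as well is my own assumption, since the separator may depend on the software version. The function name `count_index_pairs` is hypothetical.

```python
import re
from collections import Counter
from itertools import islice

# Assumed separators between the two indices: "-" (as described above) or "+".
_SEPARATOR = re.compile(r"[-+]")


def count_index_pairs(fastq_lines):
    """Count (first_index, second_index) pairs; second_index is None for single-index reads."""
    counts = Counter()
    # Headers are lines 0, 4, 8, ... of a four-line-per-record FASTQ.
    for header in islice(fastq_lines, 0, None, 4):
        index_field = header.rstrip("\n").rsplit(":", 1)[-1]
        parts = _SEPARATOR.split(index_field, maxsplit=1)
        second = parts[1] if len(parts) == 2 else None
        counts[(parts[0], second)] += 1
    return counts
```

Calling `count_index_pairs(fh).most_common(10)` on an open file handle `fh` would list the ten most frequent index pairs, which can then be compared against the known Nextera index sequences.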
