How Can I Find Out Unknown Illumina Barcode.

Entering edit mode

11.7 years ago

samsara ▴ 630

I have fastq file with undetermined reads from Illumina HiSeq. The reads in fastq file contains illumina barcode. I do now know what barcode has been used in the lab and there is no way i could find it out. Is it possible to find it out from the fastq file I received. Are there any tools that show most repeated sub-sequence in each reads of fastq file ?

[14-01-2014] Edit1: I have ilumina base call files (raw data from sequencer). I have possibility to convert .bcl files to .fastq using bcl2fastq provided by Illumina.

illumina hiseq fastq reads • 24k views

ADD COMMENT • link updated 8.6 years ago by elgart ▴ 30 • written 11.7 years ago by samsara ▴ 630

Entering edit mode

11.7 years ago

BruceB ▴ 350

Assuming the demultiplexing was carried out specifying the index sequence location, then this could be used:

grep ^@HISEQ lane1_Undetermined_L001_R1_001.fastq | cut -d":" -f10 | uniq -c | sort -nr > barcodes.txt

This outputs a text file containing every barcode in the FASTQ file, with a count of how many times it is seen. The most frequent is at the top of the list. Could help you find the barcode?

ADD COMMENT • link 11.7 years ago by BruceB ▴ 350

Entering edit mode

I think you should sort before count uniq, then sort again to see the most frequent at the top of the list.

grep ^@HISEQ lane1_Undetermined_L001_R1_001.fastq | cut -d":" -f10 | sort -r | uniq -c | sort -nrk1,1 > barcodes.txt

ADD REPLY • link updated 5.8 years ago by Ram 45k • written 9.9 years ago by henryvuong ▴ 810

Entering edit mode

11.7 years ago

IV ★ 1.3k

I think that you'd be thrilled to see what minion can do for you.

http://ebi.edu.au/ftp/contrib/enrightlab/kraken/reaper/src/reaper-13-100/doc/minion.html

You can run it as follows:

minion search-adapter -do $millions -show 10 -i $file

Where $millions is the number of reads you want to use for the sequence analysis and $file is the filename to analyze.

Cheers,

ADD COMMENT • link 11.7 years ago by IV ★ 1.3k

Entering edit mode

11.7 years ago

Steve Lianoglou 5.2k

If it isn't a custom barcode adapter/barcode, simply running FastQC on the fastq file might be all that you need.

The program is distributed with a "contimants.txt" file that lists many of the known adapters from different library prep methods. FastQC looks for over-represented sequences in the fastq file. If they match to any of the known adapters there, it will report that. If there is no match, you will still see the over-represented sequence, which you might consider stripping out anyway (perhaps you might want to blast the unknown sequence against the reference genome to ensure that you aren't removing any "signal" from you assay).

ADD COMMENT • link 11.7 years ago by Steve Lianoglou 5.2k

Entering edit mode

I run FastQC to check over-represented sequences, before posting this question, but FastQC did not give any over-represented sequences.

ADD REPLY • link 11.7 years ago by samsara ▴ 630

Entering edit mode

minion is a lot more sensitive, especially if you use all the available sequences in the analysis.

For instance, if you have 44.3M sequences in your file, you can run:

minion search-adapter -do 44 -show 10 -i sequences.fastq

Fastqc utilizes 1M reads for its analysis. The only problem with minion is that afterwards you have to manually inspect the 10 overrepresented sequences.

ADD REPLY • link 11.7 years ago by IV ★ 1.3k

Entering edit mode

8.6 years ago

elgart ▴ 30

I found that for Nextera-type barcodes (where there are two different barcodes on Forward and Reverse reads which together identify the sample) none of the above solutions work. So here is my take on it (python 2.7 code without dependencies):

	from __future__ import print_function
	from itertools import islice
	import sys

	Nextera_S_barcodes=['CTCTCTAT','GCGTAAGA','TATCCTCT','AGAGTAGA','GTAAGGAG','ACTGCATA','AAGGAGTA','CTAAGCCT']
	Nextera_N_barcodes=['TAAGGCGA','CGTACTAG','AGGCAGAA','TCCTGAGC','GGACTCCT','TAGGCATG','CTCTCTAC','CAGAGAGG',
	'GCTACGCT','CGAGGCTG','AAGAGGCA','GTAGAGGA']

	def eprint(args, *kwargs):
	print(args, file=sys.stderr, *kwargs)

	def levenshtein(a, b):
	n, m = len(a), len(b)
	if n > m:
	# Make sure n <= m, to use O(min(n,m)) space
	a, b = b, a
	n, m = m, n

	current = range(n + 1)
	for i in range(1, m + 1):
	previous, current = current, [i] + [0] * n
	for j in range(1, n + 1):
	add, delete = previous[j] + 1, current[j - 1] + 1
	change = previous[j - 1]
	if a[j - 1] != b[i - 1]:
	change = change + 1
	current[j] = min(add, delete, change)

	return current[n]

	try:
	from Levenshtein import distance
	lev_dist=distance
	eprint("Levenshtein module found, using it for speedup")
	except:
	lev_dist=levenshtein
	eprint("Levenshtein module not found, using slow python alternative")


	def ScanIndexFiles(I1,I2,barcodelength):
	data={}
	tRecords=0
	with open(I1) as i1, open(I2) as i2:
	while True:
	next_record_1 = list(islice(i1, 4))
	next_record_2 = list(islice(i2, 4))
	if not next_record_1:
	break
	if not next_record_2:
	print("New record in file 1, but not in file 2. Record %d"%(tRecords))
	return
	if next_record_1[0][0]!='@':
	print("Supposed to be New record in file 1, but is not. Record %d"%(tRecords))
	return
	if next_record_2[0][0] != '@':
	print("Supposed to be New record in file 2, but is not. Record %d" % (tRecords))
	return
	tBc="%s-%s"%(next_record_1[1][:barcodelength],next_record_2[1][:barcodelength])
	if not data.has_key(tBc):
	data[tBc]=0
	data[tBc] += 1
	tRecords+=1
	if not (tRecords%10000):
	eprint("Processed %d sequences"%(tRecords))
	return data

	def CollapseSimilar(data,dist,minoccur):
	ndata={}
	ndatal={}
	ndatar={}
	cnt=0
	eprint("Collapsing %d barcodes into similar by distance %d"%(len(data),dist))
	for w in sorted(data, key=data.get, reverse=True):
	if data[w]>minoccur:
	tl=w.split('-')[0]
	tr=w.split('-')[1]
	il=""
	for l in ndatal.keys():
	if lev_dist(l,tl)<=dist:
	if il=="":
	il=l
	else: #if found more than one similar
	il="" #its like didn't find anything - can't collapse
	break
	ir = ""
	for r in ndatar.keys():
	if lev_dist(r, tr) <= dist:
	if ir == "":
	ir = r
	else: # if found more than one similar
	ir = "" # its like didn't find anything - can't collapse
	break
	if ((ir!="")and(il!="")):
	bc="%s-%s"%(il,ir)
	if not ndata.has_key(bc):
	ndata[bc]={}
	ndata[bc]["count"]=data[w]
	ndata[bc]["mismatches"] = {}
	ndatal[il] = 1
	ndatar[ir] = 1
	else:
	ndata[bc]["count"] += data[w]
	ndata[bc]["mismatches"][w]=data[w]
	else:
	bc=w
	if ndata.has_key(bc):
	print("Encountered an unexpected key (%s), probably algorithm bug "%(bc))
	return
	ndata[bc]={}
	ndata[bc]["count"]=data[w]
	ndata[bc]["mismatches"]={}
	ndatal[tl] = 1
	ndatar[tr] = 1
	cnt+=1
	if not (cnt % 100):
	eprint("Processed %f %%" % (100.0*cnt/len(data)))
	return ndata

	def read_processedfile(fname):
	data={}
	with open(fname) as f:
	for l in f:
	if ((l.startswith("Processed"))or(l.startswith("Done."))or(l=="\n\n\nCollapsed\n\n")):
	continue
	else:
	d=l.strip().split(' ')
	if ((len(d)!=2)or(len(d[0].split('-'))!=2)or(len(d[0].split('-')[0])!=len(d[0].split('-')[1]))):
	print("Unknown format")
	return
	else:
	data[d[0]]=int(d[1])
	return data

	if __name__ == "__main__":

	if len(sys.argv)<5:
	print("usage: Generate_runscripts_v1.py IndexFile1 IndexFile2 barcode_length allowedmutations min_occurrences")
	exit()
	I1f = sys.argv[1]
	I2f = sys.argv[2]
	NexteraBarcodeL = int(sys.argv[3])
	okmut= int(sys.argv[4])
	min_occur = int(sys.argv[5])

	data=ScanIndexFiles(I1f,I2f,NexteraBarcodeL)
	for w in sorted(data, key=data.get, reverse=True):
	if data[w]>min_occur:
	print(w, data[w])
	else:
	break
	print("\n\n\nCollapsed\n\n")

	coldata=CollapseSimilar(data,okmut,min_occur)
	sd={}
	for i in coldata.keys():
	sd[i]=coldata[i]["count"]
	for w in sorted(sd, key=sd.get, reverse=True):
	ts=""
	for i in sorted(coldata[w]["mismatches"], key=coldata[w]["mismatches"].get, reverse=True):
	ts="%s\t%s\t%d"%(ts,i,coldata[w]["mismatches"][i])
	print("%s\t%d\t%d\t%s"%(w,coldata[w]["count"],len(coldata[w]["mismatches"]),ts))

	eprint("Done.")

view raw guess_Nextera_Barcodes_v1.py hosted with ❤ by GitHub

ADD COMMENT • link 8.6 years ago by elgart ▴ 30

Entering edit mode

Are you sure? What does you undetermined FASTQ look like? You should be able to us same grep-based command (the chosen answer) on all the different barcode configurations. For dual-index Nextera, they would come out as XXXXXXXX-XXXXXXXX (similar to single index, but with a dash).

ADD REPLY • link 8.6 years ago by igor 13k