Question

Demultiplex Illumina With Barcodes On Identifier Line

4

Entering edit mode

13.0 years ago

Israel Barrantes ▴ 790

I got a set of Illumina files which are barcoded in the sequence identifier instead (barcode is not part of the sequence), therefore we cannot use fastx_barcode_splitter.pl or similar scripts. Example:

@HWI-ST132_459:6:2208:20745:200766#AGTTCC/1
CCCAGGGGGTTGCTAGGTTGAAAGAGAAGAACTAAGCTTAAATTTGTTGTACATTGTATATAATTACAAAGTGTTATGTTATTATTATTAAAAAAAAAAA
+
ca^WcZX[D_T]GQI^]^BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
@HWI-ST132_459:6:2208:21328:200860#AGTTCC/1
CATTTTGGTGGGTTGTGGTTTTGGGGGGTTTGTTGTTGGGTTTTATAAGGTGGTTTTTTTTAATAAGTAAAAATAAAAAAAAAAATTAAGAATAAAAAAA
+
]TPKODYF[TSHWUQRRGZV`N_Y`c\abc]]D_BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB

Anybody knows a program that is able to split/demultiplex these datasets?

illumina barcode • 17k views

ADD COMMENT • link updated 13.0 years ago by Iris • 0 • written 13.0 years ago by Israel Barrantes ▴ 790

0

Entering edit mode

Are the barcodes and read names delimited by something? It looks like the "_" character separates the barcoding and read names?

ADD REPLY • link 13.0 years ago by Damian Kao 16k

0

Entering edit mode

The barcode in the above mentioned example is AGTTCC, always after the number sign (#) and before the slash, although in some other datasets it comes as the six last characters of the identifier line

ADD REPLY • link 13.0 years ago by Israel Barrantes ▴ 790

score 5 · Answer 1 · 2011-12-15

Try this python script:

import sys

inFile = open(sys.argv[1],'r')

header = ''
data = ''

outFile = {}

for line in inFile:
    if line[0] == "@":
        if header != '':
            barcode = header.split('#')[-1].split('/')[0]
            if not outFile.has_key(barcode):
                outFile[barcode] = open(barcode + ".fastq",'a')

            outFile[barcode].write(header + "\n" + data)

        header = line.strip()
        data = ''
    else:
        data += line

barcode = header.split('#')[-1].split('/')[0]
if not outFile.has_key(barcode):
    outFile[barcode] = open(barcode + ".fastq",'a')

outFile[barcode].write(header + "\n" + data)

Save as yourName.py. Use by: python yourName.py file.fastq. You might have to mess with the barcode parsing code. Right now, it parses the barcode by taking anything between the last '#' and "/" character. It should work if there is no "/" character also.

Ram · Answer 2 · 2011-12-15

4

Entering edit mode

13.0 years ago

Jeremy Leipzig 22k

Fastq-tools allows you to grep sequence identifiers:

fastq-grep -i AGTTCC mysequence.fq > AGTTCC_seqs.fq

ADD COMMENT • link updated 5.2 years ago by Ram 44k • written 13.0 years ago by Jeremy Leipzig 22k

1

Entering edit mode

+1, thanks for the heads up on this utility set.

ADD REPLY • link 13.0 years ago by Casey Bergman 18k

1

Entering edit mode

nice, but, using it on this problem requires knowing in advance the barcode's you are trying to split on. No?

See my perl solution below for data driven approach.

ADD REPLY • link 13.0 years ago by Malcolm.Cook ★ 1.5k

0

Entering edit mode

fqgrep is another utility that can be used to grep for sequence identifiers. It also accounts for mismatches too.

ADD REPLY • link 11.6 years ago by indraniel • 0

score 4 · Answer 3 · 2011-12-16

The following perl 1-liner will split its standard input into newly created files named after the its input file and the barcodes appearing on its standard input.

perl -p -e 'open(STDOUT,">>","$1_$ARGV") if m|^@.*\#([ATGC]+)\/\d+|' your.fastq

will split your.fastq into files named, i.e.

AGTTCC_your.fastq
AGAGGA_your.fastq
etc, based on its input.

score 3 · Answer 4 · 2011-12-15

You could use the awk program below to generate a fastq file that has the barcodes appended to each sequence (and quality string). Put the awk code into prog.awk then run it as:

awk -f prog.awk < data.fq

Here is the awk file:

/^@HWI-ST/ {
    # extract the barcode as the sequence stored
    # between # and /
    split($0,parts,"#")
    split(parts[2], code, "/")
    barcode = code[1] 
    print 
    next;
}

/^+$/ {
    print
    next
}

{
    print barcode "" $0
}

Ram · Answer 5 · 2011-12-15

3

Entering edit mode

13.0 years ago

Martin A Hansen 3.0k

You can do this with Biopieces:

read_fastq -i test.fq |                      # read in the fastq entries
split_vals -k SEQ_NAME -d '#' |              # split the sequence name on #
split_vals -k SEQ_NAME_1 -d '/' -K BARCODE | # split again on / to get the barcode
write_fastq_files -d Files -k BARCODE -x     # write files in Files dir using barcodes as names

ADD COMMENT • link updated 5.2 years ago by Ram 44k • written 13.0 years ago by Martin A Hansen 3.0k