Demultiplex Illumina With Barcodes On Identifier Line
6
4
Entering edit mode
12.9 years ago

I got a set of Illumina files which are barcoded in the sequence identifier instead (barcode is not part of the sequence), therefore we cannot use fastx_barcode_splitter.pl or similar scripts. Example:

@HWI-ST132_459:6:2208:20745:200766#AGTTCC/1
CCCAGGGGGTTGCTAGGTTGAAAGAGAAGAACTAAGCTTAAATTTGTTGTACATTGTATATAATTACAAAGTGTTATGTTATTATTATTAAAAAAAAAAA
+
ca^WcZX[D_T]GQI^]^BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
@HWI-ST132_459:6:2208:21328:200860#AGTTCC/1
CATTTTGGTGGGTTGTGGTTTTGGGGGGTTTGTTGTTGGGTTTTATAAGGTGGTTTTTTTTAATAAGTAAAAATAAAAAAAAAAATTAAGAATAAAAAAA
+
]TPKODYF[TSHWUQRRGZV`N_Y`c\abc]]D_BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB

Anybody knows a program that is able to split/demultiplex these datasets?

illumina barcode • 17k views
ADD COMMENT
0
Entering edit mode

Are the barcodes and read names delimited by something? It looks like the "_" character separates the barcoding and read names?

ADD REPLY
0
Entering edit mode

The barcode in the above mentioned example is AGTTCC, always after the number sign (#) and before the slash, although in some other datasets it comes as the six last characters of the identifier line

ADD REPLY
5
Entering edit mode
12.9 years ago

Try this python script:

import sys

inFile = open(sys.argv[1],'r')

header = ''
data = ''

outFile = {}

for line in inFile:
    if line[0] == "@":
        if header != '':
            barcode = header.split('#')[-1].split('/')[0]
            if not outFile.has_key(barcode):
                outFile[barcode] = open(barcode + ".fastq",'a')

            outFile[barcode].write(header + "\n" + data)

        header = line.strip()
        data = ''
    else:
        data += line

barcode = header.split('#')[-1].split('/')[0]
if not outFile.has_key(barcode):
    outFile[barcode] = open(barcode + ".fastq",'a')

outFile[barcode].write(header + "\n" + data)

Save as yourName.py. Use by: python yourName.py file.fastq. You might have to mess with the barcode parsing code. Right now, it parses the barcode by taking anything between the last '#' and "/" character. It should work if there is no "/" character also.

ADD COMMENT
0
Entering edit mode

Minor modification to the above script, to allow for the case where @ is the first character on the quality line:

import sys

inFile = open(sys.argv[1],'rU')

header = ''
data = ''

outFile = {}

for i,line in enumerate(inFile):
    if (line[0] == "@" and i % 4 == 0):
        if header != '':
            barcode = header.split('#')[-1].split('/')[0]
            if not outFile.has_key(barcode):
                outFile[barcode] = open(barcode + ".fastq",'a')

            outFile[barcode].write(header + "\n" + data)

        header = line.strip()
        data = ''
    else:
        data += line

barcode = header.split('#')[-1].split('/')[0]
if not outFile.has_key(barcode):
    outFile[barcode] = open(barcode + ".fastq",'a')

outFile[barcode].write(header + "\n" + data)
ADD REPLY
4
Entering edit mode
12.9 years ago

Fastq-tools allows you to grep sequence identifiers:

fastq-grep -i AGTTCC mysequence.fq > AGTTCC_seqs.fq
ADD COMMENT
1
Entering edit mode

+1, thanks for the heads up on this utility set.

ADD REPLY
1
Entering edit mode

nice, but, using it on this problem requires knowing in advance the barcode's you are trying to split on. No?

See my perl solution below for data driven approach.

ADD REPLY
0
Entering edit mode

fqgrep is another utility that can be used to grep for sequence identifiers. It also accounts for mismatches too.

ADD REPLY
4
Entering edit mode
12.9 years ago
Malcolm.Cook ★ 1.5k

The following perl 1-liner will split its standard input into newly created files named after the its input file and the barcodes appearing on its standard input.

perl -p -e 'open(STDOUT,">>","$1_$ARGV") if m|^@.*\#([ATGC]+)\/\d+|' your.fastq

will split your.fastq into files named, i.e.

  • AGTTCC_your.fastq
  • AGAGGA_your.fastq
  • etc, based on its input.
ADD COMMENT
3
Entering edit mode
12.9 years ago

You could use the awk program below to generate a fastq file that has the barcodes appended to each sequence (and quality string). Put the awk code into prog.awk then run it as:

awk -f prog.awk < data.fq

Here is the awk file:

/^@HWI-ST/ {
    # extract the barcode as the sequence stored
    # between # and /
    split($0,parts,"#")
    split(parts[2], code, "/")
    barcode = code[1] 
    print 
    next;
}

/^+$/ {
    print
    next
}

{
    print barcode "" $0
}
ADD COMMENT
3
Entering edit mode
12.9 years ago

You can do this with Biopieces:

read_fastq -i test.fq |                      # read in the fastq entries
split_vals -k SEQ_NAME -d '#' |              # split the sequence name on #
split_vals -k SEQ_NAME_1 -d '/' -K BARCODE | # split again on / to get the barcode
write_fastq_files -d Files -k BARCODE -x     # write files in Files dir using barcodes as names
ADD COMMENT

Login before adding your answer.

Traffic: 955 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6