Question

Splitting a fasta file on the basis of header barcodes

7.4 years ago

eoin ▴ 30

Hi folks,

Having a bit of a brain fart; I'm sure there's a very simple solution to this. I have a fasta file containing reads from 48 different samples, with a barcode in each header line:

>10_13 M01383:135:000000000-A7LW3:1:1101:16875:1408 1:N:0:1 orig_bc=GTACATACCGGT new_bc=GTACATACCGGT bc_diffs=0
TACGGAAGGTCCGGGCGTTATCCGGATTTATTGGGTTTAAAGGGAGCGTAGGCCGGAGATTAAGCGTGTTGTGAAATGTAGACGCTCAACGTCTGCACTGCAGCGCGAACT

I'm trying to split this into three separate files based on this particular experiment, let's say f1.fa, f2.fa, and f3.fa. I have a list of all the barcodes and the sample each relates to.

I've been playing with awk, but to no avail. Is there either a bit of code for this or a useful tool?

TIA

eóin

RNA-Seq DNA-seq fasta • 3.8k views

updated 7.4 years ago by glihm ▴ 660 • written 7.4 years ago by eoin ▴ 30

BBMap's demuxbyname.sh almost does that; it may work if you modify your header.

7.4 years ago by h.mon 35k

demuxbyname.sh would need a modified header to work in suffix mode (where it would automatically create one file per barcode without providing a list of barcodes, which is convenient when you don't know the barcodes). But if you do know the barcodes, you can list all 48 of them and run it in substring mode, like this:

demuxbyname.sh in=samples.fa out=out_%.fa substringmode names=GTACATACCGGT,AAAAAAAAAAA

For reference: if the reads have standard Illumina headers that end with the barcode, you can generate one output file per barcode like this:

demuxbyname.sh in=all.fq.gz delimiter=: suffixmode out=%.fq.gz

That works for reads named like this:

@A00178:23:H2Y3GDMXX:1:1101:1344:1000 1:N:0:CGTACTAG+CTAAGCCT

...and would create a file named "CGTACTAG+CTAAGCCT.fq.gz".

7.3 years ago by Brian Bushnell 20k

Rolling something in Biopython wouldn't be too painful. You should have a look at the tutorial and cookbook. Feel free to ask for help if you get stuck (but show us the code and what goes wrong). For sure there are also multiple other solutions.
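
As a starting point, here is a minimal sketch in plain Python (standard library only, so not Biopython itself) of the kind of script meant here. It assumes every header carries a new_bc=... field as in the example above. The function name split_fasta_by_barcode and the barcode_to_file dict are hypothetical; you would build the dict from your own barcode/sample list.

```python
import re
from contextlib import ExitStack

# Matches the new_bc=<barcode> field seen in the example header.
NEW_BC = re.compile(r"\bnew_bc=(\S+)")


def split_fasta_by_barcode(fasta_path, barcode_to_file):
    """Write each record to the file its new_bc barcode maps to.

    barcode_to_file is a hypothetical dict such as
    {"GTACATACCGGT": "f1.fa", ...} built from your barcode/sample list.
    Records whose barcode is not in the dict are skipped.
    Returns the number of records written per output file.
    """
    counts = {}
    with ExitStack() as stack:
        handles = {}  # one open handle per distinct output file
        out = None    # destination of the current record, if any
        with open(fasta_path) as fasta:
            for line in fasta:
                if line.startswith(">"):
                    match = NEW_BC.search(line)
                    name = barcode_to_file.get(match.group(1)) if match else None
                    if name is None:
                        out = None
                    else:
                        if name not in handles:
                            handles[name] = stack.enter_context(open(name, "w"))
                        out = handles[name]
                        counts[name] = counts.get(name, 0) + 1
                # Header and sequence lines both go to the current destination.
                if out is not None:
                    out.write(line)
    return counts
```

Calling split_fasta_by_barcode("input.fa", {"GTACATACCGGT": "f1.fa"}) would send every record with that barcode to f1.fa. Because several barcodes can map to the same file name, the 48 barcodes collapse naturally into three outputs.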

7.4 years ago by WouterDeCoster 47k

score 1 · Answer 1 · 2017-07-11

7.4 years ago

Pierre Lindenbaum 164k

Assuming you want the new_bc barcode and it's always at the same place:

awk -F '[ =]' '/^>/{f=sprintf("%s.fa",$7);} { print $0 >> f;}' input.fa
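
To see why the seventh field is the one picked, here is a small Python sketch that mimics the same splitting on the example header from the question. It only illustrates the field positions and is not part of the awk command; the variable names are made up for the example.

```python
import re

# Example header from the question.
header = (">10_13 M01383:135:000000000-A7LW3:1:1101:16875:1408 1:N:0:1 "
          "orig_bc=GTACATACCGGT new_bc=GTACATACCGGT bc_diffs=0")

# Split on every single space or '=' character, as -F '[ =]' does in awk.
fields = re.split(r"[ =]", header)

# awk fields are 1-based, so awk's $7 is fields[6] here.
barcode = fields[6]
print(barcode)
```

If a header has extra spaces or an extra key=value pair before new_bc, the barcode moves to a different field, which is why the "always at the same place" assumption matters.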

7.4 years ago by Pierre Lindenbaum 164k

score 0 · Answer 2 · 2017-07-11
