Splitting a fasta file on the basis of header barcodes
7.5 years ago
eoin ▴ 30

Hi folks,

Having a bit of a brain fart here; I'm sure there's a very simple solution to this. I have a fasta file containing reads from 48 different samples, with a barcode in each header line:

>10_13 M01383:135:000000000-A7LW3:1:1101:16875:1408 1:N:0:1 orig_bc=GTACATACCGGT new_bc=GTACATACCGGT bc_diffs=0
TACGGAAGGTCCGGGCGTTATCCGGATTTATTGGGTTTAAAGGGAGCGTAGGCCGGAGATTAAGCGTGTTGTGAAATGTAGACGCTCAACGTCTGCACTGCAGCGCGAACT

I'm trying to split this into three separate files based on this particular experiment, let's say f1.fa, f2.fa, and f3.fa. I have a list of all the barcodes and the sample each relates to.

I've been playing with awk, but to no avail. Is there either a bit of code for this or a useful tool?

TIA

eóin

RNA-Seq DNA-seq fasta • 3.9k views

BBMap's demuxbyname.sh almost does that; it may work if you modify your headers.


demuxbyname.sh would need a modified header to work in suffix mode (where it would automatically create one file per barcode without providing a list of barcodes, which is convenient when you don't know the barcodes). But if you do know the barcodes, you can list all 48 of them and run it in substring mode, like this:

demuxbyname.sh in=samples.fa out=out_%.fa substringmode names=GTACATACCGGT,AAAAAAAAAAA

For reference: if the file has standard Illumina headers that end with the barcode, you can generate one output file per barcode like this:

demuxbyname.sh in=all.fq.gz delimiter=: suffixmode out=%.fq.gz

That works for reads named like this:

@A00178:23:H2Y3GDMXX:1:1101:1344:1000 1:N:0:CGTACTAG+CTAAGCCT

...and would create a file named "CGTACTAG+CTAAGCCT.fq.gz".
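To make the suffix idea concrete, here is a rough Python sketch of the naming logic only: take whatever follows the last delimiter in the header and use it as the file name. This is an illustration of the concept under that assumption, not how demuxbyname.sh is actually implemented, and `barcode_suffix` is a made-up helper name.

```python
def barcode_suffix(header, delimiter=":"):
    """Return the text after the last delimiter in a read header."""
    # rsplit with maxsplit=1 splits once, from the right-hand side
    return header.rstrip().rsplit(delimiter, 1)[-1]


header = "@A00178:23:H2Y3GDMXX:1:1101:1344:1000 1:N:0:CGTACTAG+CTAAGCCT"
print(barcode_suffix(header) + ".fq.gz")  # CGTACTAG+CTAAGCCT.fq.gz
```

The original poster's headers end in bc_diffs=0 rather than the barcode, which is why they would need modifying before this kind of suffix matching could work.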


Rolling something in Biopython wouldn't be too painful. You should have a look at the tutorial and cookbook. Feel free to ask for help if you get stuck (but show us the code and what goes wrong). For sure there are also multiple other solutions.
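As a starting point, here is a minimal sketch in plain Python (standard library only, so not Biopython itself) of what such a script might look like. It assumes every header carries a new_bc=... field as in the example read, and that you have already turned your barcode list into a dictionary; `BARCODE_TO_FILE` and `split_fasta` are hypothetical names, and the single mapping shown is a placeholder.

```python
import re
from contextlib import ExitStack

# Hypothetical mapping from barcode to output file; in practice this would
# be built from your list of 48 barcodes and the sample each relates to.
BARCODE_TO_FILE = {
    "GTACATACCGGT": "f1.fa",
}

# Look the barcode up by key name rather than by field position.
NEW_BC = re.compile(r"\bnew_bc=([ACGTN]+)")


def split_fasta(in_path, barcode_to_file):
    """Write each record to the file mapped to its new_bc barcode.

    Returns the number of records skipped because the barcode was
    missing from the header or absent from the mapping.
    """
    skipped = 0
    with ExitStack() as stack, open(in_path) as fasta:
        handles = {}  # output path -> open file handle
        out = None    # handle for the record currently being copied
        for line in fasta:
            if line.startswith(">"):
                match = NEW_BC.search(line)
                path = barcode_to_file.get(match.group(1)) if match else None
                if path is None:
                    out = None
                    skipped += 1
                else:
                    if path not in handles:
                        # "w" overwrites any existing file of the same name
                        handles[path] = stack.enter_context(open(path, "w"))
                    out = handles[path]
            if out is not None:
                out.write(line)
    return skipped
```

Calling split_fasta("input.fa", BARCODE_TO_FILE) would then write the matching records and return how many were skipped. Several barcodes can point to the same file name, which is how 48 barcodes would collapse into f1.fa, f2.fa and f3.fa.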

7.5 years ago

Assuming you want the new_bc barcode and it's always at the same place:

awk -F '[ =]' '/^>/{f=sprintf("%s.fa",$7);} { print $0 >> f;}' input.fa
7.5 years ago
glihm ▴ 660

FASTX Barcode Splitter sounds like your solution! ;)

Documentation and Download.
