Hi folks,
Having a bit of a brain fart; I'm sure there's a very simple solution to this. I have a FASTA file containing reads from 48 different samples, with a barcode in each header line:
>10_13 M01383:135:000000000-A7LW3:1:1101:16875:1408 1:N:0:1 orig_bc=GTACATACCGGT new_bc=GTACATACCGGT bc_diffs=0
TACGGAAGGTCCGGGCGTTATCCGGATTTATTGGGTTTAAAGGGAGCGTAGGCCGGAGATTAAGCGTGTTGTGAAATGTAGACGCTCAACGTCTGCACTGCAGCGCGAACT
I'm trying to split this into three separate files based on this particular experiment, let's say f1.fa, f2.fa, and f3.fa. I have a list of all the barcodes and the sample each relates to.
I've been playing with awk, but to no avail. Is there either a bit of code for this or a useful tool?
TIA
eóin
BBMap demuxbyname.sh almost does that; it may work if you modify your header.
demuxbyname.sh would need a modified header to work in suffix mode (where it automatically creates one file per barcode without being given a list of barcodes, which is convenient when you don't know the barcodes). But if you do know the barcodes, you can list all 48 of them and run it in substring mode, like this:
For reference - if the reads have standard Illumina headers that end with the barcode, you can generate one output file per barcode like this:
That works for reads named like this:
...and would create a file named "CGTACTAG+CTAAGCCT.fq.gz".
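To make the suffix versus substring distinction concrete, here is a small Python sketch of the two matching ideas applied to the header from the question. The helper names and the second barcode are hypothetical, and this is only an illustration of the concept, not how demuxbyname.sh is implemented.

```python
# Illustration only: hypothetical helpers, not BBMap code.

def match_suffix(header, barcodes):
    """Return the barcode that the header ends with, or None."""
    for bc in barcodes:
        if header.endswith(bc):
            return bc
    return None


def match_substring(header, barcodes):
    """Return the first barcode found anywhere in the header, or None."""
    for bc in barcodes:
        if bc in header:
            return bc
    return None


# Header taken from the question; the second barcode is made up.
header = (">10_13 M01383:135:000000000-A7LW3:1:1101:16875:1408 1:N:0:1 "
          "orig_bc=GTACATACCGGT new_bc=GTACATACCGGT bc_diffs=0")
barcodes = ["GTACATACCGGT", "ACGTACGTACGT"]

# The header ends with "bc_diffs=0", so a suffix match finds nothing,
# while a substring match still picks up the barcode.
print(match_suffix(header, barcodes))     # None
print(match_substring(header, barcodes))  # GTACATACCGGT
```

This is why the header in the question would need modifying before a suffix-based split could work, whereas a substring-based split can use it as is.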
Rolling something in Biopython wouldn't be too painful. You should have a look at the tutorial and cookbook. Feel free to ask for help if you get stuck (but show us the code and what goes wrong). For sure there are also multiple other solutions.
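As a rough starting point, here is a minimal sketch in plain Python (standard library only, so not actually Biopython) that splits a FASTA file by the barcode in the header. It assumes every header carries a `new_bc=...` field as in the example above, and that you can turn your barcode list into a dictionary from barcode to output file name. The function name `split_fasta_by_barcode`, the mapping, and the file names are all hypothetical.

```python
import re
from contextlib import ExitStack
from pathlib import Path

# Assumption: the barcode to split on is the value of "new_bc=" in the header.
BC_PATTERN = re.compile(r"\bnew_bc=([ACGTN]+)")


def split_fasta_by_barcode(fasta_path, barcode_to_file, out_dir="."):
    """Write each record to the file its barcode maps to.

    barcode_to_file: dict such as {"GTACATACCGGT": "f1.fa", ...}.
    Records whose barcode is missing or not in the dict are skipped
    and counted under "unassigned". Returns a dict of record counts.
    """
    out_dir = Path(out_dir)
    counts = {}
    with ExitStack() as stack:
        # One open handle per distinct output file.
        handles = {
            name: stack.enter_context(open(out_dir / name, "w"))
            for name in set(barcode_to_file.values())
        }
        current = None  # handle for the record being read, if any
        with open(fasta_path) as fasta:
            for line in fasta:
                if line.startswith(">"):
                    match = BC_PATTERN.search(line)
                    name = barcode_to_file.get(match.group(1)) if match else None
                    current = handles.get(name)
                    key = name if name is not None else "unassigned"
                    counts[key] = counts.get(key, 0) + 1
                # Sequence lines follow their header, so wrapped
                # (multi-line) sequences are handled too.
                if current is not None:
                    current.write(line)
    return counts


# Example call (file name and mapping are placeholders):
# counts = split_fasta_by_barcode(
#     "reads.fa",
#     {"GTACATACCGGT": "f1.fa"},  # add the other barcodes here
# )
# print(counts)
```

With 48 barcodes going to three files, the dictionary would simply have 48 keys and three distinct values. If you would rather use Biopython, the same loop can be written around its FASTA parser, reading the barcode from each record's description instead.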