2.6 years ago
genomes_and_MGEs
Hi everyone,
When I want to append the filename to the contig header in a multi-fasta file, I usually use
for F in *.fasta; do N=$(basename "$F" .fasta); bbrename.sh in="$F" out="${N}_mod.fasta" prefix="$F" addprefix=t; done
However, this doesn't work on GenBank files. When I want to split multi-GenBank files, I use
cat > splitgbk.py
from Bio import SeqIO
import sys
# Write each record to its own file, named after the record ID
for rec in SeqIO.parse(sys.stdin, "genbank"):
    SeqIO.write(rec, rec.id + ".gbk", "genbank")
for F in *.gbff; do python splitgbk.py < "$F"; done
This generates multiple *.gbk files named "accession_number.gbk". However, I would like the source filename to be prepended to the accession number, so that each split GenBank file is named "filename_accession.gbk". Can you guys help me out? Thanks!
How many contigs does a file have? Can you post an example with input and expected output?
If you want to append the file name to the ID/header of a FASTA file, try the following with a small file:
Thanks for the reply. However, my goal is not to append filename to the header of a fasta file, but to append filename to the accession number of a genbank file. For example: if I split the genome with filename GCF_000007805.1_ASM780v1.gbk, I'll have 3 replicons: NC_004578.1.gbk, NC_004633.1.gbk, NC_004632.1.gbk. My goal is to produce the following output: GCF_000007805.1_ASM780v1_NC_004578.1.gbk, GCF_000007805.1_ASM780v1_NC_004633.1.gbk, GCF_000007805.1_ASM780v1_NC_004632.1.gbk.
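One way to get that naming is to pass the file path to the script instead of piping through stdin, so the script knows the source filename. Below is a minimal sketch, assuming Biopython is installed; the function names output_name and split_genbank and the script name used afterwards are made up for illustration and are not part of Biopython.

```python
import sys
from pathlib import Path


def output_name(source, record_id, suffix=".gbk"):
    """Build '<source filename without extension>_<record id><suffix>'."""
    # Path.stem drops only the last extension, so
    # 'GCF_000007805.1_ASM780v1.gbk' becomes 'GCF_000007805.1_ASM780v1'.
    return f"{Path(source).stem}_{record_id}{suffix}"


def split_genbank(source, out_dir="."):
    """Write each record of a multi-record GenBank file to its own file.

    Hypothetical helper: returns the list of paths that were written.
    """
    # Imported here so output_name() stays usable without Biopython.
    from Bio import SeqIO

    written = []
    for rec in SeqIO.parse(str(source), "genbank"):
        target = Path(out_dir) / output_name(source, rec.id)
        SeqIO.write(rec, str(target), "genbank")
        written.append(target)
    return written


if __name__ == "__main__":
    # Each command-line argument is treated as one multi-record GenBank file.
    for path in sys.argv[1:]:
        for target in split_genbank(path):
            print(target)
```

Saved as, say, splitgbk_prefixed.py, it could be run as `python splitgbk_prefixed.py *.gbff`. If the inputs already end in .gbk, writing to a separate output directory avoids picking up the split files on a second run.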
Can you post a small example and expected output?