I've run into a problem while indexing reference genomes: the sequence names in the FASTA file are contig identifiers rather than chromosome numbers.
Several FASTA reference files from NCBI have sequence header information in the following format:
gi|224514618|ref|NT_077402.2| Homo sapiens chromosome 1 genomic contig, GRCh37.p13 Primary Assembly
The chromosome number is embedded in the sequence header. I tried to use the FASTA file in my sequencing pipeline but ran into a problem: the aligners (Bowtie2, Isaac, etc.) produce an index that uses the contig names instead of the chromosome numbers. As a result, the VCF files produced with this indexed reference also carry the contig names in the chromosome column.
The reference FASTA files from UCSC, by contrast, seem to have sequence headers already in the expected format, such as: chr1
These UCSC FASTA files work in my sequencing pipeline without any reformatting (it produces VCF files with chromosome numbers).
There are, however, many NCBI FASTA files I'd like to index as references that UCSC does not provide. Is there a way to reformat a FASTA file so that the chromosome number is extracted from each header and the header is rewritten in the UCSC style, so that the file can be used for indexing?
Not tested:
perl -pe 's/^>.*chromosome (\w+).*/>chr$1/' ref.fa > renamed.fa
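One caveat with a plain substitution: if the file holds several contigs for the same chromosome, every one of them would be renamed to the same chrN, and duplicate sequence names are likely to upset the indexer. Below is a rough Python sketch of the same renaming idea that also keeps names unique. The function names and the `_2`, `_3` suffix scheme are made up for illustration, and it assumes headers contain the literal text "chromosome <name>" as in the example above. Note that a suffixed contig is still a contig, so positions in the VCF remain contig-relative, not chromosome-relative.

```python
import re
import sys

# Assumption: the chromosome name is alphanumeric (1-22, X, Y, MT, ...),
# so trailing punctuation such as the comma in "chromosome 1," is excluded.
CHROM_RE = re.compile(r"chromosome (\w+)")


def rename_header(line, seen):
    """Return a UCSC-style header line, or the line unchanged if no chromosome is found."""
    match = CHROM_RE.search(line)
    if match is None:
        return line
    name = "chr" + match.group(1)
    seen[name] = seen.get(name, 0) + 1
    if seen[name] > 1:
        # Hypothetical suffix scheme to avoid duplicate sequence names.
        name = f"{name}_{seen[name]}"
    return ">" + name + "\n"


def rename_fasta(lines):
    """Yield FASTA lines with header lines rewritten and sequence lines untouched."""
    seen = {}
    for line in lines:
        if line.startswith(">"):
            yield rename_header(line, seen)
        else:
            yield line


if __name__ == "__main__" and len(sys.argv) == 3:
    # Usage: python rename_fasta.py ref.fa renamed.fa
    with open(sys.argv[1]) as src, open(sys.argv[2], "w") as dst:
        dst.writelines(rename_fasta(src))
```

Headers that do not mention a chromosome are passed through unchanged, so it is worth checking the renamed headers (for example with grep '>' renamed.fa) before building the index.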