Question

Issue indexing reference genome FASTA files - index file does not contain chromosome numbers?

0

Entering edit mode

10.1 years ago

Irene@Sequencing.com ▴ 280

I've encountered an issue while indexing reference genomes - the file doesn't contain any chromosome numbers.

Several FASTA reference files from NCBI have sequence header information in the following format:

gi|224514618|ref|NT_077402.2| Homo sapiens chromosome 1 genomic contig, GRCh37. p13 Primary Assembly

The sequence header has the chromosome number embedded within. I tried to use the fasta file as a part of my sequencing pipeline but have encountered an issue: the aligners (BowTie2, Isaac, etc.) produce an index file which contains the contig names instead of chromosome number. Due to this the resultant VCF files produced using this indexed reference file also contains the contig data as the chromosome name.

It also seems that the reference FASTA files when used from UCSC contain the sequence header in the correct format such as: chr 1

and these UCSC FASTA files can be used without any issue (or need to reformat) via my sequencing pipeline (produces VCF files with chromosome numbers)

There are, however, many NCBI FASTA files I'd like to index as reference files that are not provided by UCSC - is there a way to format a FASTA file so that the chromosome number is extracted from the FASTA and reformatted and the FASTA file updated so that it can be used for indexing? (i.e. is in the UCSC style?)

FASTA Reference • 5.2k views

ADD COMMENT • link updated 2.9 years ago by Ram 45k • written 10.1 years ago by Irene@Sequencing.com ▴ 280

1

Entering edit mode

Not tested: cat ref.fa | perl -pe 's/>.*chromosome (\S+).*/>chr$1/' > renamed.fa

ADD REPLY • link 10.1 years ago by lh3 33k

Ram · Answer 1 · 2015-04-28

I'm not sure I understand you correctly, does your aligner report gi|224514618|ref|NT_077402.2| as the hit?

If so, that's one way FASTA files are defined - BioPython, for example, distinguishes "names" and "IDs" for sequences in FASTA files. Everything before the space is the ID, everything after is the name, so for example the ID for your sequence is gi|224514618|ref|NT_077402.2|, the name is Homo sapiens chromosome 1 genomic contig, GRCh37. p13 Primary Assembly. By default (and I don't know right now how to override this, there doesn't seem to be an option for this) bowtie2-build just uses the ID.

The UCSC downloads have a much simpler ID and no name, the first human chromosome for example just has >chr1, no space.

The laziest way to override this behavior is to replace spaces by _, then it will report the entire name. Or you can write a small program that replaces the entire ID/name by the chromosome number, is that one always in the same position?