5.9 years ago
bioinfo456
Where can I download the required reference genome? I noticed that UCSC provides it in .2bit format. When I downloaded it from NCBI, I got the following error from bedtools getfasta. Please help.
WARNING. chromosome (22) was not found in the FASTA file. Skipping.
Does your BED file happen to have 22 instead of chr22, so no "chr" prefix?
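One way to check this is to compare the names in both files directly. Below is a minimal sketch, assuming an uncompressed FASTA file and a tab- or space-delimited BED file; the function names are made up for illustration and are not part of bedtools.

```python
def fasta_names(fasta_path):
    """Collect sequence names: the first word after '>' on each header line."""
    names = set()
    with open(fasta_path) as handle:
        for line in handle:
            if line.startswith(">"):
                fields = line[1:].split()
                if fields:
                    names.add(fields[0])
    return names


def bed_names(bed_path):
    """Collect the chromosome column (column 1) of a BED file."""
    names = set()
    with open(bed_path) as handle:
        for line in handle:
            # Skip blank lines and the optional BED header lines.
            if not line.strip() or line.startswith(("#", "track", "browser")):
                continue
            names.add(line.split()[0])
    return names


def missing_from_fasta(fasta_path, bed_path):
    """Return the BED chromosome names that have no matching FASTA record."""
    return sorted(bed_names(bed_path) - fasta_names(fasta_path))
```

Calling missing_from_fasta("genome.fa", "regions.bed") (both file names are placeholders) returns the names that would trigger the "was not found in the FASTA file" warning; an empty list means the two files agree.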
Even after prefixing, I'm getting the same error. I believe it has something to do with the reference genome. Do you reckon? Also, I noticed that the index file generated for the reference genome looks very different; here's a snippet:
Your chromosomes have the NCBI contig identifier, not chromosome names. All the files you use across a pipeline need to follow the same naming convention for chromosomes. Software is dumb; it has no idea that NC_00001.10 and chr1 mean the same thing in your current context.
Obviously, the identifiers are not standard chromosome names, neither 22 nor chr22, but the contig names from NCBI. This nonsensical nomenclature (at least in terms of standard usage as a reference genome) is part of why I try at all costs to avoid anything that comes from RefSeq-related pages. I would get the genome from UCSC:
and then use cat to combine all chromosomes into one genome file, followed by indexing and running bedtools. Note that UCSC uses a chr prefix in its nomenclature, so you'll have to add this to your BED file.
Alright, thanks for the clarification, both of you. Regarding indexing, it's going to happen automatically when I run bedtools getfasta, right?
Try running it or reading the documentation. If an index is created, well and good. If not, create one. Please do not ask for such easily accessible information before trying your best.
I'm more of a fan of the Ensembl genome versions. In my view, they have better and more thorough annotation and a better-documented release schedule. Just my 2 cents.
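Whichever source you pick, the BED file has to follow that source's naming. Here is a minimal sketch for switching between the two common styles, assuming only the primary chromosomes matter: UCSC-style names carry a chr prefix (chr22, chrM) and Ensembl-style names do not (22, MT). The function names are hypothetical, and unplaced scaffolds or NCBI accessions would need a proper lookup table instead.

```python
def convert_chrom(name, style):
    """Convert a primary chromosome name to 'ucsc' or 'ensembl' style."""
    if style == "ucsc":
        if name.startswith("chr"):
            return name
        # The mitochondrial genome is MT in Ensembl but chrM in UCSC.
        return "chrM" if name == "MT" else "chr" + name
    if style == "ensembl":
        if not name.startswith("chr"):
            return name
        return "MT" if name == "chrM" else name[len("chr"):]
    raise ValueError("style must be 'ucsc' or 'ensembl'")


def convert_bed_line(line, style):
    """Rewrite column 1 of a tab-delimited BED line; pass header lines through."""
    if not line.strip() or line.startswith(("#", "track", "browser")):
        return line
    chrom, sep, rest = line.partition("\t")
    return convert_chrom(chrom, style) + sep + rest
```

Applying convert_bed_line to every line of the BED file and writing the result to a new file leaves the coordinates untouched and changes only the chromosome column.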