2.2 years ago
magnolia
Hi,
Assemblies downloaded from NCBI (for example GCF_000001405.25 for GRCh37.p13) use the RefSeq-Accn (NC_000001.10, NC_000002.11, NC_000003.11) as chromosome names.
I want to replace these names with the Sequence-Name (1, 2, 3) and the UCSC-style-name (chr1, chr2, chr3).
Is there a reliable method to do this?
Thank you in advance!
Sort both files (the NCBI file and chrom-change.tsv) on chromosome name and use join.
Thank you. I'm not sure how to apply this to a FASTA file, but I'll try to figure it out.
Ah, it's a FASTA file. I thought it was a TSV file. In that case you could use
sed -f pattern.txt < in.fa
with pattern.txt:
This is wonderful, thank you so much! By the way, a second ^ had been added to the new name, so I removed it.
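For anyone who would rather not hand-write the sed patterns, here is a minimal Python sketch of the same header renaming. It assumes you have the NCBI assembly report for your assembly and that it contains the RefSeq-Accn, Sequence-Name and UCSC-style-name columns mentioned in the question. The function names (load_name_map, rename_fasta_headers) and the trimmed sample report are made up for illustration, so check them against your real files before relying on this.

```python
import io


def load_name_map(report_lines, source="RefSeq-Accn", target="UCSC-style-name"):
    """Build {source_name: target_name} from assembly-report lines.

    Assumption: the last '#' line before the data rows holds the
    tab-separated column names, and missing values are written as 'na'.
    """
    header = None
    mapping = {}
    for line in report_lines:
        line = line.rstrip("\r\n")
        if line.startswith("#"):
            # Keep overwriting; the final comment line is the column header.
            header = line.lstrip("# ").split("\t")
            continue
        if not line or header is None:
            continue
        fields = dict(zip(header, line.split("\t")))
        src = fields.get(source, "na")
        dst = fields.get(target, "na")
        if src != "na" and dst != "na":
            mapping[src] = dst
    return mapping


def rename_fasta_headers(fasta_lines, mapping):
    """Yield FASTA lines, replacing the sequence ID in each '>' header.

    Only the first whitespace-delimited token of the header is replaced;
    IDs that are not in the mapping are passed through unchanged.
    """
    for line in fasta_lines:
        if line.startswith(">"):
            seq_id, sep, rest = line[1:].rstrip("\r\n").partition(" ")
            yield ">" + mapping.get(seq_id, seq_id) + sep + rest + "\n"
        else:
            yield line


# Trimmed, illustrative report (a real report has more columns and rows).
REPORT = (
    "# Assembly name:  GRCh37.p13\n"
    "# Sequence-Name\tRefSeq-Accn\tUCSC-style-name\n"
    "1\tNC_000001.10\tchr1\n"
    "2\tNC_000002.11\tchr2\n"
)

name_map = load_name_map(io.StringIO(REPORT))
fasta = [">NC_000001.10 Homo sapiens chromosome 1\n", "ACGT\n"]
print("".join(rename_fasta_headers(fasta, name_map)), end="")
```

Passing target="Sequence-Name" to load_name_map gives the 1, 2, 3 style instead. For a real genome, iterate over the open file handles rather than lists so the whole FASTA is never held in memory.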
A somewhat related question: is there a source other than NCBI from which I can download GRCh37.p13 with 'normal' chromosome names?
The reason I'm looking for the latest version is that the PAR regions are missing on chromosome Y in previous versions.
Not from NCBI, since they always use the NC_* nomenclature. You can download the assembly from UCSC, which should have the chr names. Take a look at the notes on the page to understand the slight differences in the mitochondrial genome in the UCSC assembly: https://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/
Thank you. Too bad that I can't get the latest GRCh37 from anywhere else.
That is the latest release at UCSC. It is in https://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/latest/
Thank you! Actually, it seems they have what I'm looking for. I'm just not sure whether I can use the 'hg' assemblies for everything. If I have a txt file containing chromosome, position, and genotype that comes from a BAM mapped to GRCh37, is it safe to use it with the UCSC assemblies?
Should be. Patches never change chromosome co-ordinates. They remain stable for each major genome release.
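Since the coordinates stay the same, the only thing that has to change in such a txt file is the chromosome name. A rough sketch of that conversion, assuming a tab-separated file with the chromosome in the first column and GRCh37 Sequence-Name values (1 to 22, X, Y); the function names and sample rows are invented for illustration. MT is deliberately left untouched because, as noted above, the mitochondrial sequence in the UCSC assembly is not identical, and other names (scaffolds, header rows) are also passed through as-is.

```python
# Primary chromosomes whose UCSC-style name is simply "chr" + name.
PRIMARY = {str(n) for n in range(1, 23)} | {"X", "Y"}


def to_ucsc_name(chrom):
    """Return the UCSC-style name for a primary chromosome.

    Anything else (MT, scaffolds, column headers, names that already
    start with 'chr') is returned unchanged.
    """
    return "chr" + chrom if chrom in PRIMARY else chrom


def convert_rows(lines, sep="\t"):
    """Yield lines with the first column converted by to_ucsc_name."""
    for line in lines:
        chrom, found_sep, rest = line.partition(sep)
        yield to_ucsc_name(chrom) + found_sep + rest


# Made-up example rows: chromosome, position, genotype.
rows = [
    "chromosome\tposition\tgenotype\n",
    "1\t12345\tAG\n",
    "X\t500\tCC\n",
    "MT\t73\tG\n",
]
for row in convert_rows(rows):
    print(row, end="")
```

Positions and genotypes are not touched, so this only makes sense when the file really was produced against GRCh37 and is used with the matching hg19 build.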
Great to hear! Thanks a lot for your help.