NCBI Reference Fasta Tar Issues
2.8 years ago
jensen.416 • 0

I am trying to extract an older assembly of the C57BL/6 mouse genome (GRCm38, NCBI link).

I downloaded the genome.fna and genome.gtf files, but when I try to untar them

$ tar -xvf <genome.fna.tar>

I get only the README and a report.txt that says the following:

HistoryId: MCID_623bc5f046334b04741fec30
QueryKey: 1
ReleaseType: RefSeq
FileType: GENOME_FASTA
Flat: true

Query title: Select 1 document(s)

Search results count: 1
Filtered out 1 entries that do not have the requested ReleaseType, or are suppressed.
Entries to download: 0

GENOME_FASTA files in archive: 0
Total size (bytes): 44529
Total time: 225 milliseconds

I have not been able to find any information on the ReleaseType or how the reference genome file could be suppressed. Has anyone had a similar issue?
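For anyone who hits the same symptom, the archive can be inspected without unpacking it. The following is a minimal Python sketch, not an NCBI tool: `summarize_archive` is a hypothetical helper name, and it assumes the report is a member whose name ends in `report.txt` and consists of `key: value` lines like the ones quoted above.

```python
import io
import tarfile


def summarize_archive(fileobj):
    """Return (member names, report fields) for a downloaded tar archive.

    Assumption: the report is stored in a member ending in 'report.txt'
    and uses 'key: value' lines; lines without a colon are ignored.
    """
    names, fields = [], {}
    with tarfile.open(fileobj=fileobj, mode="r:*") as tar:
        for member in tar.getmembers():
            names.append(member.name)
            if member.isfile() and member.name.endswith("report.txt"):
                text = tar.extractfile(member).read().decode("utf-8", "replace")
                for line in text.splitlines():
                    key, sep, value = line.partition(":")
                    if sep:
                        fields[key.strip()] = value.strip()
    return names, fields
```

Opening the downloaded file in binary mode and passing it in, as in `summarize_archive(open("genome.fna.tar", "rb"))`, gives the member list plus the report fields; a value of `0` for `Entries to download` means the archive contains no sequence files at all, so there is nothing for `tar` to extract.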

fasta NCBI tar • 1.7k views
2.8 years ago
GenoMax 148k

I assume you are using the Download Assembly button at the top right of that page. It is indeed not working at the moment.

Get the files you need directly from here: https://ftp.ncbi.nih.gov/genomes/refseq/vertebrate_mammalian/Mus_musculus/all_assembly_versions/. There are multiple patch versions but you could simply get the latest for GRCm38: https://ftp.ncbi.nih.gov/genomes/refseq/vertebrate_mammalian/Mus_musculus/all_assembly_versions/GCF_000001635.26_GRCm38.p6/
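If you prefer to script the download, here is a minimal Python sketch using only the standard library. `file_url` and `fetch` are hypothetical helper names, and the sketch assumes the usual RefSeq FTP naming, where each file in an assembly directory is prefixed with the directory name (for example `..._genomic.fna.gz`, `..._genomic.gtf.gz`, `..._assembly_report.txt`). Check the directory listing to confirm the files you want are there.

```python
import shutil
import urllib.request

# Directory listing quoted in the answer above.
BASE = ("https://ftp.ncbi.nih.gov/genomes/refseq/vertebrate_mammalian/"
        "Mus_musculus/all_assembly_versions/")


def file_url(assembly_dir, suffix, base=BASE):
    """Build the URL of one file, assuming files are named '<dir>_<suffix>'."""
    return f"{base}{assembly_dir}/{assembly_dir}_{suffix}"


def fetch(url, dest):
    """Stream a remote file to disk without loading it all into memory."""
    with urllib.request.urlopen(url) as resp, open(dest, "wb") as out:
        shutil.copyfileobj(resp, out)
```

For example, `fetch(file_url("GCF_000001635.26_GRCm38.p6", "genomic.fna.gz"), "GRCm38.p6_genomic.fna.gz")` would save the genomic FASTA under a local name of your choosing.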


Thank you so much!

2.8 years ago
vkkodali_ncbi ★ 3.8k

Indeed, the 'Download Assembly' button on this page does not download a FASTA file for the original GRCm38 assembly. This is because of changes made to the FTP structure since the assembly was made public ten years ago. Any files we have for GRCm38 date back to 2012 and are stored in archival locations that are not compatible with our newer delivery tools.

If possible, I highly recommend using the latest assembly and annotation: GRCm39 assembly with the corresponding annotation released in Sep 2020. You can use the 'Download Assembly' button on this page: https://ncbi.nlm.nih.gov/assembly/GCF_000001635.27/. Alternatively, you can use NCBI Datasets for this.

If you specifically need the GRCm38 assembly, you should download GRCm38.p6, the latest patch release for this assembly, as suggested by @genomax. GRCm38.p6 is a newer version of GRCm38 that differs only in having a set of extra patch sequences that address various problems in the genome. The patches are a snapshot of some of the changes used to make GRCm39. GRCm38.p6 is supported through our various tools, for both sequence and annotation. To download data, you can navigate to the FTP path, use the 'Download Assembly' button on this page: https://ncbi.nlm.nih.gov/assembly/GCF_000001635.26/, or use NCBI Datasets. The latest annotation on the GRCm38 assembly was released for GRCm38.p6 in Jul 2020; GRCm38 is no longer annotated. If you would like to use the latest annotation, which is generated using the latest software and data and undergoes active curation by the RefSeq staff, please consider switching to the GRCm39 assembly.

Finally, if you require a GRCm38 assembly FASTA file without any alt/patch sequences, you can download the FASTA for GRCm38.p6 along with the assembly report file (known as 'Assembly structure report' in the Download Assembly menu, and with an _assembly_report.txt suffix on the FTP site). Then make a list of the RefSeq accessions (column 7) for rows that have 'C57BL/6J' in column 8. Be sure to add 'NC_005089.1' to your list if you require the mitochondrion as well. Once you have the list of accessions, you can use something like seqkit grep to extract the relevant FASTA records from the GRCm38.p6 FASTA file.
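If seqkit is not available, the steps above can be sketched in plain Python. This is an illustration under stated assumptions, not an NCBI-provided script: the function names are hypothetical, the assembly report is assumed to be tab-separated with `#` comment lines, and a FASTA record ID is taken to be the first word after `>`.

```python
import gzip
from typing import Iterable, Iterator, Set


def primary_accessions(report_lines: Iterable[str],
                       unit: str = "C57BL/6J",
                       extra: Iterable[str] = ("NC_005089.1",)) -> Set[str]:
    """Collect column 7 (RefSeq accession) for rows whose column 8 equals `unit`.

    `extra` defaults to the mitochondrion accession mentioned above.
    """
    keep = set(extra)
    for line in report_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comment/header lines and blank lines
        cols = line.rstrip("\r\n").split("\t")
        if len(cols) >= 8 and cols[7] == unit:
            keep.add(cols[6])
    return keep


def filter_fasta(fasta_lines: Iterable[str], keep: Set[str]) -> Iterator[str]:
    """Yield the lines of every FASTA record whose ID is in `keep`."""
    wanted = False
    for line in fasta_lines:
        if line.startswith(">"):
            parts = line[1:].split()
            wanted = bool(parts) and parts[0] in keep
        if wanted:
            yield line


def extract_primary(report_path: str, fasta_gz_path: str, out_path: str) -> int:
    """Write the selected records to `out_path`; return how many were written."""
    with open(report_path, encoding="utf-8") as fh:
        keep = primary_accessions(fh)
    written = 0
    with gzip.open(fasta_gz_path, "rt") as src, open(out_path, "w") as dst:
        for line in filter_fasta(src, keep):
            if line.startswith(">"):
                written += 1
            dst.write(line)
    return written
```

`extract_primary` takes the path to the `_assembly_report.txt` file, the path to the gzipped GRCm38.p6 FASTA, and an output path. It is worth comparing the returned record count with the number of accessions in your list to confirm that nothing was missed.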
