How to download CDS from ncbi for pangenome analysis?
4.3 years ago
sidraas • 0

How to download *_cds_from_genomic.fna.gz (CDS from genomic FASTA) from ncbi for pangenome analysis?

genome fasta • 2.1k views

What have you tried? Please add more detail to your question.

2.9 years ago

You can use ncbi-genome-download. The installation and usage instructions are available here:

https://github.com/kblin/ncbi-genome-download

For what you are asking, set the -F (--formats) parameter to 'cds-fasta'.
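
A minimal sketch of such a call, in case it helps. The genus ("Streptomyces"), the group ("bacteria") and the output folder name are placeholders I picked for illustration, not values from the question; the wrapper function fetch_cds and the DRY_RUN variable are also made up for this example. By default it only prints the command so you can check it before starting a large download.

```shell
# Print (DRY_RUN=1, the default) or run an ncbi-genome-download call
# that requests only the CDS FASTA files (*_cds_from_genomic.fna.gz).
# Genus, group and output folder are illustrative placeholders.
fetch_cds() {
    genus="$1"; group="$2"; outdir="$3"
    # Rebuild the positional parameters as the full command line.
    set -- ncbi-genome-download --formats cds-fasta --section refseq \
        --genera "$genus" --output-folder "$outdir" "$group"
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"      # show the command without contacting NCBI
    else
        "$@"           # actually run the download
    fi
}

fetch_cds "Streptomyces" bacteria cds_download
```

Setting DRY_RUN=0 before the call runs the download for real. The tool also has its own --dry-run option, which lists the matching assemblies without fetching them.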

2.9 years ago
MirianT_NCBI ▴ 760

Hi,

You can use NCBI Datasets. The default genome package includes:

  • genomic FASTA (chr*.fna, unplaced.scaf.fna)
  • transcript FASTA (rna.fna)
  • protein FASTA (protein.faa)
  • CDS FASTA (cds_from_genomic.fna)
  • GFF3 (genomic.gff)
  • metadata files (sequence_report.jsonl for each assembly, assembly_data_report.jsonl and dataset_catalog.json)

To download only the CDS FASTA, exclude everything else with the following command (I'm using human as an example, but you can use any taxonomic level):

datasets download genome taxon human \
--exclude-gff3 --exclude-protein --exclude-rna --exclude-seq \
--filename cds_only.zip
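
A note on versions, as far as I know: more recent releases of the datasets command-line tool replaced the --exclude-* flags with a single --include option, so on a current install the equivalent request would be --include cds. The small helper below only prints the command line so you can compare it with your installed version's help output. The function name cds_download_cmd and its "dehydrated" argument are made up for this sketch.

```shell
# Print a datasets download command that uses the newer --include syntax.
# Pass "dehydrated" as the second argument to append --dehydrated.
# Nothing is downloaded; the command is only echoed for inspection.
cds_download_cmd() {
    taxon="$1"; mode="${2:-}"
    cmd="datasets download genome taxon \"$taxon\" --include cds --filename cds_only.zip"
    if [ "$mode" = "dehydrated" ]; then
        cmd="$cmd --dehydrated"
    fi
    echo "$cmd"
}

cds_download_cmd human
cds_download_cmd vertebrates dehydrated
```

If your version rejects --include, it is probably an older release and the --exclude-* form above should still apply.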

If you're downloading a really large number of files (let's say all vertebrates), I would recommend adding the flag --dehydrated. With this flag, datasets downloads only the json and jsonl metadata files, plus a file called fetch.txt that lists the paths of the data files still to be downloaded (rehydrated). To rehydrate a package, follow the steps below:

unzip cds_only.zip -d cds_only
datasets rehydrate --directory cds_only
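
Once the package is unzipped (and rehydrated, if needed), every assembly has a file with the same name, cds_from_genomic.fna, in its own folder, which is awkward for pangenome tools that expect one distinctly named file per genome. Here is a small sketch that copies them into one folder and names each after its accession. It assumes the layout <package>/ncbi_dataset/data/<accession>/cds_from_genomic.fna; the function name collect_cds and the _cds.fna suffix are my own choices.

```shell
# Copy each assembly's CDS FASTA into one folder, named after its accession.
# Assumed layout: <pkg_dir>/ncbi_dataset/data/<accession>/cds_from_genomic.fna
collect_cds() {
    pkg_dir="$1"; out_dir="$2"
    mkdir -p "$out_dir"
    for f in "$pkg_dir"/ncbi_dataset/data/*/cds_from_genomic.fna; do
        [ -e "$f" ] || continue              # the glob matched nothing
        acc=$(basename "$(dirname "$f")")    # folder name = assembly accession
        cp "$f" "$out_dir/${acc}_cds.fna"
    done
}
```

With the commands above, the call would be: collect_cds cds_only cds_by_assembly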

I hope it helps!
