Question

How to download entire human protein sequences from RefSeq database through command line?

4.2 years ago

mathavanbioinfo ▴ 80

Dear All, I am trying to download the entire set of human protein sequences from the RefSeq database. I used the following command:

./datasets download genome taxon "homo sapiens" --exclude-protein

I got an error:

New version of client (11.8.1) available at https://ftp.ncbi.nlm.nih.gov/pub/datasets/command-line/LATEST/linux-amd64/datasets
Downloading: ncbi_dataset.zip 434MB done
Error: Internal error (invalid zip archive). Please try again
Usage
  datasets download genome taxon <taxon> [flags]

Ref: https://www.ncbi.nlm.nih.gov/datasets/docs/command-line-gene/

Refseq programming • 3.1k views

updated 4.2 years ago by vkkodali_ncbi ★ 3.8k • written 4.2 years ago by mathavanbioinfo ▴ 80

score 2 · Answer 1 · 2021-03-26

4.2 years ago

GenoMax 151k

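Use NCBI datasets with the human RefSeq genome accession to get the sequences of all proteins. The --exclude-* flags leave out the GFF3 annotation, the RNA sequences and the genomic sequence, so the package is left with the protein FASTA: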

$ datasets download genome accession GCF_000001405.39 --exclude-gff3 --exclude-rna --exclude-seq
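Once the download finishes, the proteins should be in a FASTA file inside the archive. Here is a minimal sketch for locating that file and counting its sequences. It assumes the archive has already been unzipped into a directory and that it follows the ncbi_dataset/data/<accession>/protein.faa layout, which may differ between datasets versions. The function name is made up for illustration.

```shell
# Hypothetical helper: list the protein FASTA files in an unzipped datasets
# package and report how many sequences each one holds.
# Assumption: layout is <dir>/ncbi_dataset/data/<accession>/protein.faa
count_package_proteins() {
    pkg_dir="$1"
    found=0
    for faa in "$pkg_dir"/ncbi_dataset/data/*/protein.faa; do
        # If the glob matched nothing, the literal pattern is left over; skip it.
        [ -f "$faa" ] || continue
        found=1
        # Every FASTA record starts with a '>' header line.
        printf '%s\t%s\n' "$faa" "$(grep -c '^>' "$faa")"
    done
    # Return a non-zero status when no protein.faa was found at all.
    [ "$found" -eq 1 ]
}
```

For example, after unzip ncbi_dataset.zip -d human_refseq (the directory name is arbitrary), running count_package_proteins human_refseq prints one line per protein.faa with its path and sequence count.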

score 0 · Answer 2 · 2021-03-26

Using EntrezDirect to get all human proteins in RefSeq database (truncated for brevity):

$ esearch -db protein -query "human [orgn] AND srcdb refseq [PROPERTIES]"  | efetch -format fasta > refseq_human.fa

>NP_001380844.1 polyamine-modulated factor 1 isoform 10 [Homo sapiens]
MAEASSANLGSGCEEKRHEGSSSESVPPGTTISRVKLLDTMVDTFLQKLVAAGSNGTPCGAMCRNRRPRT
SSWQMPSWQGGGRWRSCSYRSRPSSRPGRLYTENRGSWLLC
>NP_001380831.1 26S proteasome complex subunit SEM1 isoform e [Homo sapiens]
MSEKKQPVDLGLLEEDDEFEEFPAEDWAGLDEDEDAHVWEDNWDDDNVEDDFSNQLRAGYSELEE
>NP_001310960.2 semaphorin-4B isoform 1 precursor [Homo sapiens]
MLRTAMGLRSWLAAPWGALPPRPPLLLLLLLLLLLQPPPPTWALSPRISLPLGSEERPFLRFEAEHISNY
TALLLSRDGRTLYVGAREALFALSSNLSFLPGGEYQELLWGADAEKKQQCSFKGKDPQRDCQNYIKILLP
LSGSHLFTCGTAAFSPMCTYINMENFTLARDEKGNVLLEDGKGRCPFDPNFKSTALVVDGELYTGTVSSF
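
As a quick sanity check on the resulting file, it can help to tally the records by RefSeq accession prefix (NP_ for curated proteins, XP_ for model predictions). This is only a sketch: it assumes every header starts with the accession, as in the output above, and the function name is made up for illustration.

```shell
# Hypothetical helper: tally FASTA records by accession prefix (NP, XP, ...).
# Assumption: headers look like ">NP_001380844.1 description [Homo sapiens]".
count_by_prefix() {
    # Keep header lines only, reduce each to the letters between '>' and the
    # first '_', then count how often each prefix occurs.
    grep '^>' "$1" \
        | sed 's/^>\([A-Za-z]*\)_.*/\1/' \
        | sort \
        | uniq -c \
        | awk '{print $2 "\t" $1}'
}
```

Running count_by_prefix refseq_human.fa prints one line per prefix with its record count. A file with no NP records at all would suggest the query or the download went wrong.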

score 0 · Answer 3 · 2021-03-26

I just tried the same command with the latest version (11.8.1) of datasets and it worked without any issue. Maybe upgrade the datasets application and try again? That said, please read on for a bit more info.

It is possible that you are downloading a lot more data than you really want. You see, the command datasets download genome taxon "homo sapiens" will download _all_ human assemblies (~800 of them). I am assuming that you don't want to download the genome sequence for nearly 800 assemblies.

You should add the --refseq flag to your command to restrict the results to the latest RefSeq assemblies; there are 2 of them: GRCh38 (aka hg38) and GRCh37 (aka hg19). You may not even need _both_ of them. Very likely, you need just the one _latest reference assembly_, which is GRCh38/hg38. For that, run the command with the --refseq and --reference flags, like so:

datasets download genome taxon "homo sapiens" --refseq --reference

How do you know how many assemblies will end up in your data package _before_ you spend the time to download it? There are two ways to figure this out.

1. Use the datasets summary command. The following command, datasets summary genome taxon "homo sapiens" --limit NONE, will return the total count of assemblies for your query: 791. If you drop the --limit NONE flag, you will download metadata for all of those assemblies in JSON, from which you can pick the assemblies of interest, copy their assembly accessions, and use them with the datasets download command (look for the --inputfile option of the datasets download genome accession command).
2. Download a dehydrated package. Running the datasets download command with the --dehydrated flag is like doing a dry run of the download; you will download a data package that contains metadata but not the sequence data itself. You can read more about it here. Once you run this command and download the zip file (which should be very small), unzip it and look inside the file ncbi_dataset/data/assembly_data_report.jsonl to see a report of all assemblies that will end up in a real package. If everything appears as it should, you can either 'rehydrate' the package that you have already downloaded or run the same command without the --dehydrated flag to download the actual data.
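
To turn that report into a list you can pass to --inputfile, something like the following may work. It is a rough sketch: it matches anything that looks like an assembly accession (GCA_ or GCF_) instead of reading a specific JSON key, because I am not assuming the key names here, and the function name is made up for illustration.

```shell
# Hypothetical helper: pull the unique assembly accessions out of a report
# file, one per line.
# Assumption: accessions look like GCF_000001405.39 (GCA_/GCF_, nine digits,
# a dot, then a version number).
list_accessions() {
    # -o prints only the matched text, -E enables extended regex syntax;
    # sort -u drops accessions that are mentioned more than once.
    grep -oE 'GC[AF]_[0-9]{9}\.[0-9]+' "$1" | sort -u
}
```

For example, list_accessions ncbi_dataset/data/assembly_data_report.jsonl > accessions.txt writes the list to a file. A report line may also mention the paired GenBank or RefSeq accession of the same assembly, so review the list (or filter it with grep '^GCF_') before using it with the --inputfile option.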