Hi!
As Genomax mentioned, you can use NCBI Datasets.
Here's how to do that starting from a file with a list of human genes (genes.txt), one per line:
tp53
brca1
mc1r
You can use the following command to download all proteins as a single file:
datasets download gene symbol --inputfile genes.txt --include protein --taxon human --filename human-proteins.zip
unzip human-proteins.zip -d human-proteins
Archive: human-proteins.zip
inflating: human-proteins/README.md
inflating: human-proteins/ncbi_dataset/data/protein.faa
inflating: human-proteins/ncbi_dataset/data/data_report.jsonl
inflating: human-proteins/ncbi_dataset/data/dataset_catalog.json
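As a quick sanity check after unzipping, you can count how many protein records ended up in the FASTA file. This is only a minimal sketch: count_fasta_records is a hypothetical helper name (not part of the datasets CLI), and it simply counts header lines starting with ">".

```shell
# Hypothetical helper (not part of NCBI Datasets): count the records in a
# FASTA file by counting header lines, i.e. lines that start with ">".
count_fasta_records() {
    grep -c '^>' "$1"
}
```

For the package above, that would be called as: count_fasta_records human-proteins/ncbi_dataset/data/protein.faa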
There are a few additional options that might be useful for you:
You can download other files in addition to the protein sequences. Below is a list of the file types that can be requested with the --include flag. By default (i.e., without this flag), the rna and protein sequence files are included.
- gene: gene sequence
- rna: transcript
- protein: amino acid sequences
- cds: nucleotide coding sequences
- 5p-utr: 5'-UTR
- 3p-utr: 3'-UTR
- product-report: gene transcript and protein locations and metadata
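To request several of these file types at once, my understanding is that --include accepts a comma-separated list (please double-check with datasets download gene symbol --help for your version). Here is a small sketch that builds that list from its arguments. The function name download_gene_files and the DRY_RUN variable are my own inventions for illustration; with DRY_RUN=1 it only prints the command instead of running it.

```shell
# Hypothetical wrapper: download_gene_files INPUT_FILE OUT_ZIP TYPE [TYPE...]
# Joins the requested file types with commas and passes them to --include.
# Assumption: --include takes a comma-separated list (check --help).
download_gene_files() {
    input_file="$1"
    out_zip="$2"
    shift 2
    # Join the remaining arguments with commas, e.g. "protein,cds"
    include="$(IFS=,; printf '%s' "$*")"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        # Dry run: show the command without contacting NCBI
        echo "datasets download gene symbol --inputfile $input_file --taxon human --include $include --filename $out_zip"
    else
        datasets download gene symbol --inputfile "$input_file" \
            --taxon human --include "$include" --filename "$out_zip"
    fi
}
```

For example: DRY_RUN=1 download_gene_files genes.txt human-genes.zip protein cds product-report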
If you want the proteins of each gene in a separate FASTA file, you need to loop over the list of symbols and download one zip archive per gene. Like this:
while read -r GENE; do
    datasets download gene symbol "$GENE" --taxon human --include protein --filename "$GENE.zip"
done < genes.txt
Collecting 1 records [================================================] 100% 1/1
Downloading: tp53.zip 3.08kB done
Collecting 1 records [================================================] 100% 1/1
Downloading: brca1.zip 29kB done
Collecting 1 records [================================================] 100% 1/1
Downloading: mc1r.zip 2.61kB done
In that case, each data package is named after the gene symbol and contains all protein isoforms for that gene. To rename them in a way that might make more sense to you, take a look at this post here.
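If you would rather have one file per isoform, you can also split an extracted protein.faa into one FASTA file per record. This is just a sketch under a couple of assumptions: split_fasta_by_record is a hypothetical helper name, and I'm assuming the first whitespace-delimited token of each header (after the ">") is the protein accession, which is then used as the file name.

```shell
# Hypothetical helper: split_fasta_by_record FASTA OUTDIR
# Writes each record of a multi-FASTA file to OUTDIR/<accession>.faa.
# Assumption: the first token of the header line is the accession.
split_fasta_by_record() {
    fasta="$1"
    outdir="$2"
    mkdir -p "$outdir"
    awk -v dir="$outdir" '
        /^>/ {
            # New record: close the previous file and open the next one
            if (out != "") close(out)
            acc = substr($1, 2)          # drop the leading ">"
            out = dir "/" acc ".faa"
        }
        out != "" { print > out }        # header and sequence lines
    ' "$fasta"
}
```

For example, after unzipping tp53.zip into a folder named tp53: split_fasta_by_record tp53/ncbi_dataset/data/protein.faa tp53-isoforms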
I hope it helps! Feel free to reach out if you run into any issues :)