I would like to retrieve all the FASTA RefSeq files for a given Gene entry in NCBI, e.g. http://www.ncbi.nlm.nih.gov/gene/3803#reference-sequences
How should I go about this?
[EDIT: In more detail, I would like 1) the RefSeq sequence independent of genome build, 2) the hg38 version if it exists, and 3) the hg38 alt_loci versions if they exist.]
My inclination is to do some Python web scraping (collect all the "FASTA" links under that div, then scrape each linked page and write it out to a FASTA file), but possibly there is an easier way to do this through NCBI Entrez Batch Entry or the Biopython interface to Entrez, or by filtering results in nuccore, which allows me to download all results as a single FASTA file.
My manual way is to click each "FASTA" link, click send > File > FASTA > Create File and save the file, which is not reasonable for >100 sequences (the KIR genes on hg38 and hg38 alt loci).
I tried NCBI Nucleotide, but it returns more results than NCBI Gene lists, e.g. http://www.ncbi.nlm.nih.gov/nuccore/?term=KIR2DL2. Specifically, "KIR2DL2[All Fields] AND ("Homo sapiens"[Organism] AND biomol_genomic[PROP] AND refseq[filter])" returns 38 results instead of 6. Regardless, using nuccore is still a little cumbersome, though it would reduce my work by an order of magnitude.
Duplicate of "Get Fasta File With Protein Sequences Given Entrez Gene Ids"
It's not an exact duplicate: that question asks for the protein sequences, whereas I'm looking for the genomic reference sequences.
Note: It turns out that "(KIR2DL2[All Fields] AND ("Homo sapiens"[Organism] AND biomol_genomic[PROP] AND refseq[filter])) AND KIR2DL2[Gene Name]" gives me what I expect (7 sequences) and is handy for double-checking. The accepted answer + bash script linked in the comments really allow me to batch the process, though.
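For anyone who would rather script that nuccore query than run it in the browser, here is a minimal sketch using the NCBI E-utilities (esearch, then efetch) with only the Python standard library. This is not the bash script from the comments. The function names (build_term, esearch_url, efetch_url, fetch_fasta) and the retmax default of 500 are my own choices for illustration, and I am assuming the result set is small enough to request in a single GET.

```python
import json
import urllib.parse
import urllib.request

# Base address of the NCBI E-utilities.
EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def build_term(gene):
    """Build the nuccore query from the note above for a given gene symbol."""
    return (
        f'({gene}[All Fields] AND ("Homo sapiens"[Organism] '
        f"AND biomol_genomic[PROP] AND refseq[filter])) AND {gene}[Gene Name]"
    )


def esearch_url(term, retmax=500):
    """URL that asks esearch for the nuccore UIDs matching the query (JSON)."""
    params = {"db": "nuccore", "term": term, "retmax": retmax, "retmode": "json"}
    return f"{EUTILS}/esearch.fcgi?" + urllib.parse.urlencode(params)


def efetch_url(ids):
    """URL that asks efetch for the given UIDs as one plain-text FASTA stream."""
    params = {
        "db": "nuccore",
        "id": ",".join(ids),
        "rettype": "fasta",
        "retmode": "text",
    }
    return f"{EUTILS}/efetch.fcgi?" + urllib.parse.urlencode(params)


def fetch_fasta(gene):
    """Run the search, then download every hit; returns "" if nothing matched."""
    with urllib.request.urlopen(esearch_url(build_term(gene))) as resp:
        ids = json.load(resp)["esearchresult"]["idlist"]
    if not ids:
        return ""
    with urllib.request.urlopen(efetch_url(ids)) as resp:
        return resp.read().decode()
```

Calling fetch_fasta("KIR2DL2") and writing the returned string to a file should give a single multi-FASTA file for that gene. When looping over many genes, it is worth pausing between requests so as to stay within NCBI's usage limits.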