Get genomes in genbank that are NOT in RefSeq
5.1 years ago
rororo ▴ 10

NCBI has two sections for assemblies: GenBank (all submitted sequences) and RefSeq (curated GenBank sequences).

A summary list for each is available here: ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/assembly_summary_refseq.txt and ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/assembly_summary_genbank.txt.

I want to get a list of assembly (genome) accession numbers that are in GenBank but not in RefSeq. Unfortunately, I could not find any mapping file on NCBI's sites. Does anyone have an idea how to obtain that list?

genome refseq genbank ncbi • 1.3k views

Unless I'm missing a trick, this should be as simple as something like:

comm -23 assembly_summary_genbank.txt assembly_summary_refseq.txt

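I haven't double-checked that this is 100% accurate, though, and it assumes I got the files the right way round. Note also that comm expects both inputs to be sorted and compares whole lines, not just the accession column.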
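Since comm compares whole sorted lines, one way to make this kind of comparison work is to reduce both files to sorted accession lists first. The following is only a sketch: `genbank_only` is a hypothetical helper name, and it assumes that column 1 holds the assembly accession and column 18 the paired accession in both summary files, which should be checked against the header line of the downloaded files.

```shell
# Hypothetical helper: GenBank accessions that no RefSeq record lists as its pair.
# Assumption: column 1 = assembly accession, column 18 = paired accession
# (or "na") in both summary files; check the header of your download.
genbank_only() {
    gb_list=$(mktemp)
    rs_list=$(mktemp)
    # comm needs both inputs sorted identically, so pin the locale.
    awk -F '\t' '!/^#/ { print $1 }' "$1" | LC_ALL=C sort -u > "$gb_list"
    awk -F '\t' '!/^#/ && $18 != "na" { print $18 }' "$2" | LC_ALL=C sort -u > "$rs_list"
    # -2 hides lines unique to the second list, -3 hides lines common to both.
    LC_ALL=C comm -23 "$gb_list" "$rs_list"
    rm -f "$gb_list" "$rs_list"
}

# Usage:
#   genbank_only assembly_summary_genbank.txt assembly_summary_refseq.txt
```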

5.1 years ago
vkkodali_ncbi ★ 3.8k

The assembly_summary_genbank.txt file has a field, gbrs_paired_asm (column 18), which indicates whether there is a matched RefSeq pair for a given GenBank assembly; it is set to na when there is none. You should be able to get the entire list of assemblies without a matching RefSeq assembly as follows:

awk 'BEGIN{FS="\t";OFS="\t"}($18=="na")' assembly_summary_genbank.txt
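The command above prints the full record for each unpaired assembly. If only the accession numbers are wanted, as in the question, a small variation can print just the first column. This is a sketch under the same assumption that column 18 is gbrs_paired_asm; `unpaired_accessions` is a hypothetical helper name.

```shell
# Hypothetical helper: print only the GenBank accession (column 1) of
# assemblies whose gbrs_paired_asm field (assumed to be column 18) is "na".
unpaired_accessions() {
    # Skip comment/header lines that start with '#'.
    awk 'BEGIN { FS = "\t" } !/^#/ && $18 == "na" { print $1 }' "$1"
}

# Usage:
#   unpaired_accessions assembly_summary_genbank.txt > genbank_only_accessions.txt
```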