Confused by reference files
6 weeks ago
Nesma • 0

Hello everyone, I'm trying to align my samples to generate VCF files, but I'm a bit confused by the reference files. I downloaded this reference from NCBI (shown in the attached image) and it contains two files. Should I merge them? I tried that once, but I don't think it was the right thing to do. I took a look at them and I think they are identical. If that's the case, which file should I use?

reference alignment • 507 views

align my samples to generate vcf files

What software are you using? Have you read the manual? What are the components of your command? An aligner will typically take an index, which you may have to create from a FASTA file. We need more info to help.

6 weeks ago

What you see there are the same data:

One is the GenBank version (GCA), and the other is the RefSeq version (GCF) of the same data.

Basically, some genomes in GenBank get "promoted" to be included in RefSeq and when they get the "promotion" they also get a RefSeq accession.

I agree that it is counterproductive and confusing that NCBI distributes the same data under different accession numbers.
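If you want to check the "same data" claim for your own download rather than eyeball it, here is a minimal Python sketch. It assumes both files are uncompressed FASTA, and it compares sequence content only, because the GenBank and RefSeq copies carry different sequence identifiers in their headers. The function names are my own, not part of any NCBI tool, and upper-casing the sequence is a choice I made so that case differences do not count as mismatches.

```python
import hashlib


def sequence_digests(lines):
    """Return a sorted list of (length, SHA-256 hex digest), one per FASTA record.

    Header lines are ignored and sequence is upper-cased, so differing
    accessions or letter case do not affect the comparison.
    """
    digests = []
    hasher, length = None, 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts: store the digest of the previous one.
            if hasher is not None:
                digests.append((length, hasher.hexdigest()))
            hasher, length = hashlib.sha256(), 0
        elif hasher is not None:
            seq = line.upper()
            hasher.update(seq.encode("ascii"))
            length += len(seq)
    if hasher is not None:
        digests.append((length, hasher.hexdigest()))
    return sorted(digests)


def same_sequences(path_a, path_b):
    """True if both FASTA files contain the same set of sequences."""
    with open(path_a) as a, open(path_b) as b:
        return sequence_digests(a) == sequence_digests(b)
```

Call `same_sequences()` with the paths of the two files you downloaded. If it returns `True`, the sequences match and there is nothing to merge; merging would only duplicate every contig in the reference.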

6 weeks ago
GenoMax 148k

it contains two files.

There is a reason for that. From FAQ: https://www.ncbi.nlm.nih.gov/datasets/docs/v2/troubleshooting/faq/

A GenBank (GCA) genome assembly contains assembled genome sequences submitted by investigators or sequencing centers to GenBank or another member of the International Nucleotide Sequence Database Collaboration (INSDC). The GenBank (GCA) assembly is an archival record that is owned by the submitter and may or may not include annotation. A RefSeq (GCF) genome assembly represents an NCBI-derived copy of a submitted GenBank (GCA) assembly. RefSeq (GCF) assembly records are maintained by NCBI. In some cases the RefSeq (GCF) assembly may not be completely identical to the GenBank (GCA) assembly due to assembly improvements made by NCBI staff. All RefSeq (GCF) genome assemblies include annotation.

When a GCF* record is available it is always best to use that version.
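As a rough illustration of that rule, the sketch below picks the GCF file when one is present and falls back to GCA otherwise. It assumes the file names contain the assembly accession (GCA_ or GCF_ followed by nine digits and a version number); the file names in the usage lines are made up for the example, and `pick_reference` is a hypothetical helper, not an NCBI function.

```python
import re

# Assembly accessions look like GCA_123456789.1 (GenBank) or GCF_123456789.1 (RefSeq).
ACCESSION = re.compile(r"(GC[AF])_(\d{9})\.(\d+)")


def pick_reference(names):
    """Return the first GCF name if any, else the first GCA name, else None."""
    best = None
    for name in names:
        match = ACCESSION.search(name)
        if match is None:
            continue  # no assembly accession in this name
        rank = 1 if match.group(1) == "GCF" else 0
        if best is None or rank > best[0]:
            best = (rank, name)
    return best[1] if best else None


# Hypothetical file names, for illustration only.
files = ["GCA_000000000.1_demo_genomic.fna", "GCF_000000000.1_demo_genomic.fna"]
print(pick_reference(files))
```

Whichever file you pick, use that one file on its own for indexing and alignment.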
