What is in the UCSC/NCBI genomes please?
10.2 years ago
Aurelie MLB ▴ 360

Hello,

I am looking at the human genomes available through Bioconductor packages: NCBI GRCh38 and UCSC.hg19. I do not understand all the different sequence names I see. Could you help, please?

In UCSC.hg19, I have chromosomes 1 to 22 plus X and Y. But I also have chrM, chr1_gl000191_random, chr4_ctg9_hap1, chrUn_gl000212, and so on.

In NCBI GRCh38, I can see sequences called MT, HSCHR1_CTG2_UNLOCALIZED, HSCHR3UN_CTG2, HSCHR2_RANDOM_CTG1, and so on.

What are those _random, _unlocalized, chrUn, _ctg9_hap1 sequences, please?

Should all of those sequences be used when aligning NGS reads to the genome, for instance, or only a subset?

Many thanks

Aurelie

10.2 years ago

chrM == MT == mitochondrial DNA. You should probably use this.

The chrUn_* sequences are unplaced contigs. So they may belong in the genome, but we don't know where. I personally use these, but I know that's not universal.

chr*_unlocalized and chr??_*_random are contigs that are known to belong to a specific chromosome (I've never looked into how that was determined) but haven't yet been placed at a position within it. You'll want to use these.

The various *hap* chromosomes are alternate haplotypes. There aren't many good ways to deal with these during alignment yet, so a lot of people don't use them. An upcoming version of BWA is supposed to handle them well, so if you plan to use BWA, keep an eye out for that update and then definitely use them.
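
If it helps, here is a rough Python sketch of the categories above, sorting sequence names into groups and picking a subset to align against. The function names (`classify_seqname`, `select_for_alignment`) and the matching rules are my own assumptions based only on the example names in this thread, not an official naming specification, so check them against your actual sequence list. Anything that matches none of the rules (HSCHR3UN_CTG2, for example) is reported as "other" rather than guessed at.

```python
import re

# Hypothetical patterns, derived only from the names quoted in this thread.
PRIMARY = re.compile(r"^(chr)?([1-9]|1[0-9]|2[0-2]|x|y)$")
ALT_HAP = re.compile(r"_hap\d+$")


def classify_seqname(name: str) -> str:
    """Return a coarse category for a UCSC- or NCBI-style sequence name."""
    n = name.lower()
    if n in ("chrm", "mt"):
        return "mitochondrial"
    if PRIMARY.match(n):
        return "primary"
    if ALT_HAP.search(n):
        return "alt_haplotype"
    if n.startswith("chrun"):
        return "unplaced"          # belongs to the genome, location unknown
    if "_random" in n or "unlocalized" in n:
        return "unlocalized"       # chromosome known, position unknown
    return "other"                 # not covered by the rules above


def select_for_alignment(names, keep=("primary", "mitochondrial",
                                      "unlocalized", "unplaced")):
    """Keep only names whose category is in `keep` (alt haplotypes are
    left out by default, which is a choice, not a rule)."""
    return [n for n in names if classify_seqname(n) in keep]


if __name__ == "__main__":
    examples = ["chr1", "chrM", "chr1_gl000191_random",
                "chr4_ctg9_hap1", "chrUn_gl000212", "MT",
                "HSCHR1_CTG2_UNLOCALIZED", "HSCHR2_RANDOM_CTG1"]
    for name in examples:
        print(name, classify_seqname(name))
    print(select_for_alignment(examples))
```

To drop the unplaced contigs as well, pass a shorter `keep` tuple, e.g. `select_for_alignment(names, keep=("primary", "mitochondrial", "unlocalized"))`.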

Thanks a lot !!!
