I am stumped as to which reference genome has these contigs... Any help? I've looked through the common ones (GRCh37/38, hg19/38, hs37d5)
1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, X, Y, M, 1_gl000191_random, 1_gl000192_random, 4_ctg9_hap1, 4_gl000193_random, 4_gl000194_random, 6_apd_hap1, 6_cox_hap2, 6_dbb_hap3, 6_mann_hap4, 6_mcf_hap5, 6_qbl_hap6, 6_ssto_hap7, 7_gl000195_random, 8_gl000196_random, 8_gl000197_random, 9_gl000198_random, 9_gl000199_random, 9_gl000200_random, 9_gl000201_random, 11_gl000202_random, 17_ctg5_hap1, 17_gl000203_random, 17_gl000204_random, 17_gl000205_random, 17_gl000206_random, 18_gl000207_random, 19_gl000208_random, 19_gl000209_random, 21_gl000210_random, Un_gl000211, Un_gl000212, Un_gl000213, Un_gl000214, Un_gl000215, Un_gl000216, Un_gl000217, Un_gl000218, Un_gl000219, Un_gl000220, Un_gl000221, Un_gl000222, Un_gl000223, Un_gl000224, Un_gl000225, Un_gl000226, Un_gl000227, Un_gl000228, Un_gl000229, Un_gl000230, Un_gl000231, Un_gl000232, Un_gl000233, Un_gl000234, Un_gl000235, Un_gl000236, Un_gl000237, Un_gl000238, Un_gl000239, Un_gl000240, Un_gl000241, Un_gl000242, Un_gl000243, Un_gl000244, Un_gl000245, Un_gl000246, Un_gl000247, Un_gl000248, Un_gl000249, AC_000005.1, AC_000006.1, AC_000007.1, AC_000008.1, AC_000017.1, AC_000018.1, AC_000019.1, NC_000883.2, NC_000898.1, NC_001348.1, NC_001352.1, NC_001354.1, NC_001355.1, NC_001356.1, NC_001357.1, NC_001405.1, NC_001430.1, NC_001434.1, NC_001436.1, NC_001454.1, NC_001457.1, NC_001458.1, NC_001460.1, NC_001472.1, NC_001488.1, NC_001489.1, NC_001490.1, NC_001526.2, NC_001531.1, NC_001576.1, NC_001583.1, NC_001586.1, NC_001587.1, NC_001591.1, NC_001593.1, NC_001595.1, NC_001596.1, NC_001612.1, NC_001617.1, NC_001653.2, NC_001655.1, NC_001664.2, NC_001676.1, NC_001690.1, NC_001691.1, NC_001693.1, NC_001694.1, NC_001710.1, NC_001716.2, NC_001722.1, NC_001781.1, NC_001796.2, NC_001798.1, NC_001802.1, NC_001806.1, NC_001837.1, NC_001897.1, NC_001943.1, NC_002645.1, NC_003266.2, NC_003443.1, NC_003461.1, NC_003977.1, NC_004102.1, NC_004104.1, NC_004148.2, NC_004295.1, NC_004500.1, NC_005134.2, 
NC_005147.1, NC_005831.2, NC_006273.2, NC_006577.2, NC_007018.1, NC_007026.1, NC_007027.1, NC_007455.1, NC_007605.1, NC_008188.1, NC_008189.1, NC_009333.1, NC_009334.1, NC_009823.1, NC_009824.1, NC_009825.1, NC_009826.1, NC_009827.1, NC_009887.1, NC_009996.1, NC_010329.1, NC_010810.1, NC_010956.1, NC_011202.1, NC_011203.1, NC_011800.1, NC_012042.1, NC_012213.1, NC_012485.1, NC_012486.1, NC_012564.1, NC_012729.2, NC_012798.1, NC_012800.1, NC_012801.1, NC_012802.1, NC_012950.1, NC_012959.1, NC_012986.1, NC_013035.1, NC_013114.1, NC_013115.1, NC_014185.1, NC_014952.1, NC_014953.1, NC_014954.1, NC_014955.1, NC_014956.1, NC_015150.1, NC_015630.1, NC_016157.1, NC_017993.1, NC_017994.1, NC_017995.1, NC_017996.1, NC_017997.1, NC_019023.1, NC_019026.1, NC_019027.1, NC_019028.1]
Side note: the BAM files I was given have no helpful info in the header. The contig names listed above come from the BAM header and from GATK error messages.
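For anyone wanting to pull the contig list out of a header programmatically, here is a minimal sketch. It assumes the header has already been dumped to text (for example with `samtools view -H`) and only reads the tab-separated `@SQ` lines; the function name and the example header are made up for illustration, and the lengths shown are illustrative values rather than something taken from the poster's files.

```python
def parse_sq_lines(header_text):
    """Return one dict per @SQ line, keyed by the two-letter SAM tag (SN, LN, ...)."""
    contigs = []
    for line in header_text.splitlines():
        if not line.startswith("@SQ"):
            continue
        fields = {}
        # Skip the leading "@SQ" token; every other field looks like TAG:value.
        for field in line.split("\t")[1:]:
            tag, _, value = field.partition(":")
            fields[tag] = value
        contigs.append(fields)
    return contigs


# Hypothetical header text, only for demonstration.
example_header = (
    "@HD\tVN:1.4\tSO:coordinate\n"
    "@SQ\tSN:1\tLN:249250621\n"
    "@SQ\tSN:Un_gl000211\tLN:166566\n"
    "@SQ\tSN:NC_007605.1\tLN:171823\n"
)

for contig in parse_sq_lines(example_header):
    print(contig["SN"], contig.get("LN"), contig.get("M5", "no M5 tag"))
```

If the `@SQ` lines carry `M5`, `UR` or `AS` tags, those are usually the quickest way to identify the reference; here the poster says the header had nothing of the sort, so only names and lengths are left to compare.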
A quick search for those accessions at NCBI suggests that isn't a single genome. One of those accessions is a complete coronavirus, another is a complete papillomavirus.
I would guess, based on the 2 I checked, they're all viral genome records.
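Rather than checking accessions one at a time, the names can be bucketed by pattern first. The sketch below is only a rough heuristic: the regular expressions and category labels are my own assumptions based on the names in the list above, and matching the `NC_`/`AC_` pattern only says a name looks like a RefSeq accession, not that it is viral. That still has to be confirmed by looking the accessions up.

```python
import re
from collections import Counter

# Assumed naming patterns, derived from the contig list in the question.
PRIMARY = re.compile(r"(?:[1-9]|1[0-9]|2[0-2]|X|Y|M|MT)")
UNPLACED = re.compile(r"(?:\w+_)?gl\d{6}(?:_random)?")
ALT_HAP = re.compile(r"\w+_hap\d+")
REFSEQ = re.compile(r"[A-Z]{2}_\d+\.\d+")


def classify(name):
    """Assign a contig name to a coarse category based on its shape alone."""
    if PRIMARY.fullmatch(name):
        return "primary"
    if UNPLACED.fullmatch(name):
        return "unplaced_or_random"
    if ALT_HAP.fullmatch(name):
        return "alt_haplotype"
    if REFSEQ.fullmatch(name):
        return "refseq_accession"
    return "other"


def summarize(names):
    """Count how many contig names fall into each category."""
    return Counter(classify(n) for n in names)


sample = ["17", "M", "1_gl000191_random", "Un_gl000211",
          "6_cox_hap2", "NC_007605.1", "AC_000005.1"]
print(summarize(sample))
```

Anything landing in `refseq_accession` is a candidate for a batch lookup at NCBI; anything in `other` deserves a manual look.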
Sorry, I'm new to this stuff - so this means the ref genome used to align isn't one of the common ones? Is it some proprietary/concatenated reference?
To me, it looked like hs37d5 plus additional viral genomes. I just can't find an updated version.
I don't think you will. I was discussing this with someone else a couple of weeks back here. See: Where can I download GRCh38-lite.fa file and all_sequences.fa file for hg38 version
You will need to get GRCh38 and append viral genomes yourself if you need an updated version.
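Appending the viral sequences is essentially FASTA concatenation, with the one catch that sequence names must stay unique. A minimal sketch, assuming plain uncompressed FASTA files; the function names and file paths are hypothetical:

```python
def merge_fasta_lines(sources):
    """Yield lines from several FASTA line-iterables, refusing duplicate names."""
    seen = set()
    for source in sources:
        for line in source:
            line = line.rstrip("\n")
            if not line:
                continue  # drop blank lines so records stay contiguous
            if line.startswith(">"):
                # The sequence name is the first whitespace-delimited token.
                parts = line[1:].split()
                name = parts[0] if parts else ""
                if name in seen:
                    raise ValueError(f"duplicate sequence name: {name!r}")
                seen.add(name)
            yield line + "\n"


def write_combined_reference(in_paths, out_path):
    """Concatenate the FASTA files in in_paths, in order, into out_path."""
    handles = [open(path) for path in in_paths]
    try:
        with open(out_path, "w") as out:
            out.writelines(merge_fasta_lines(handles))
    finally:
        for handle in handles:
            handle.close()


# Hypothetical usage; these paths are placeholders, not real downloads:
# write_combined_reference(["GRCh38.fa", "viral_genomes.fa"], "GRCh38_plus_viral.fa")
```

The combined file then needs its own index and sequence dictionary (for example via `samtools faidx` and GATK's `CreateSequenceDictionary`), and reads have to be realigned to it; BAMs aligned to a different reference can't simply be paired with the new FASTA.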
Thanks so much!! I was also given a GVCF file and was originally just using that. However, I wanted to do my own preprocessing and variant calling and compare the results to the given GVCF. Another point of confusion: when filtering the GVCF, I used hg38 as the reference. How can the reference genome differ between the BAM files and the GVCF?
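One way to check whether the GVCF and the BAM files really refer to different references is to compare the `##contig` lines in the GVCF header against the BAM contig names and lengths. The sketch below assumes the header text is already in hand and that `##contig` values contain no quoted commas (a simplification); the function names and example values are illustrative only.

```python
import re

CONTIG_LINE = re.compile(r"^##contig=<(.*)>$")


def parse_vcf_contigs(header_text):
    """Map contig ID -> length (or None) from ##contig header lines."""
    contigs = {}
    for line in header_text.splitlines():
        match = CONTIG_LINE.match(line.strip())
        if not match:
            continue
        # Simplification: assumes no commas inside quoted values.
        fields = dict(item.split("=", 1)
                      for item in match.group(1).split(",") if "=" in item)
        if "ID" in fields:
            length = fields.get("length", "")
            contigs[fields["ID"]] = int(length) if length.isdigit() else None
    return contigs


def compare_contigs(first, second):
    """Report names unique to each side and shared names whose lengths differ."""
    shared = set(first) & set(second)
    return {
        "only_in_first": sorted(set(first) - set(second)),
        "only_in_second": sorted(set(second) - set(first)),
        "length_mismatch": sorted(n for n in shared if first[n] != second[n]),
    }


# Illustrative inputs: an hg38-style GVCF header vs. a contig taken from a BAM header.
vcf_header = "##fileformat=VCFv4.2\n##contig=<ID=chr1,length=248956422>\n"
bam_contigs = {"1": 249250621}
print(compare_contigs(parse_vcf_contigs(vcf_header), bam_contigs))
```

No shared names, or shared names with different lengths, would both point to the two files having been produced against different references.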
Sorry, reached my 5 post limit as a newbie :(
I'm not sure I really follow. What has you under the impression that that is a reference genome?
Anything can be a reference genome. It's not so common to see multiple genomes concatenated together like that, unless someone was trying to make a 'viral database' or something.
Where did this file originate?