I am trying to make a comprehensive ROI file for MuSiC using data from the 1000 Genomes Project, since our BAMs and variant callers used its reference FASTA (human_g1k_v37.fasta). My question is how to make sure I have the right GTF file defining the exon and CDS features, so that I can build a ROI file for the v37 reference.
On the 1000 Genomes site I found a README for the GENCODE GTF in
ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/
where the FASTA reference was, but no associated file. There were GTF annotation files in
It seems that the file gencode7.coding.20120326.gtf.gz has only CDS and start/stop_codon features, but no exon features. Following the GTF suggested by Cyriac in his excellent post about ROI file creation (Best Reference Sequence For Music And "Bit_Test" Error), I kept looking for a GTF that defines both CDS and exon regions. The archive gencode7_GRCh37.tgz contains a level 1 & 2 annotation file with both feature types, but with GRCh37 in the name I worry that I will run into problems with chromosome names or gene coordinates, since our BAMs and SNP calls were made against the 1000 Genomes v37 reference. Is there a way of knowing whether this GTF is the right annotation for human_g1k_v37.fasta?
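For reference, this is roughly how the feature types in a GTF can be tallied to confirm whether both exon and CDS records are present. It is only a minimal sketch: the function names are my own placeholders, and it assumes a standard tab-delimited GTF with the feature type in the third column.

```python
import gzip
from collections import Counter


def open_text(path):
    """Open a plain or gzip-compressed text file for reading."""
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path, "r")


def count_feature_types(gtf_path):
    """Count how often each feature type (column 3) occurs in a GTF file."""
    counts = Counter()
    with open_text(gtf_path) as handle:
        for line in handle:
            # Skip header/comment lines and blank lines.
            if line.startswith("#") or not line.strip():
                continue
            fields = line.rstrip("\n").split("\t")
            # A GTF record has 9 tab-separated columns; ignore anything else.
            if len(fields) < 9:
                continue
            counts[fields[2]] += 1
    return counts
```

Calling `count_feature_types("gencode7.coding.20120326.gtf.gz")` (path relative to wherever the file was downloaded) returns a `Counter`; a file suitable for this purpose should show non-zero counts for both `exon` and `CDS`.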
Thank you, DD
PS: To make things even more confusing, the file "gencode.v7.annotation.level_1_2.gtf" from gencode7_GRCh37.tgz, which I used for ROI creation, names chromosomes as "chr1", which I understand to be the UCSC hg19 naming convention. So this file is on the 1000 Genomes server, has GRCh37 in its name, and uses hg19-style chromosome names. Is this the right file?
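A rough way to see how far the naming mismatch goes is to compare the sequence names in the GTF with the headers of the reference FASTA. The sketch below is hypothetical: the function names are my own, and the renaming rule (drop the "chr" prefix, map "chrM" to "MT") is an assumption about how the two conventions correspond. Matching names alone do not prove that the underlying sequences or coordinates agree.

```python
import gzip


def open_text(path):
    """Open a plain or gzip-compressed text file for reading."""
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path, "r")


def fasta_contigs(fasta_path):
    """Collect sequence names (first token of each '>' header) from a FASTA."""
    names = set()
    with open_text(fasta_path) as handle:
        for line in handle:
            if line.startswith(">"):
                parts = line[1:].split()
                if parts:
                    names.add(parts[0])
    return names


def gtf_contigs(gtf_path):
    """Collect the distinct sequence names (column 1) used in a GTF."""
    names = set()
    with open_text(gtf_path) as handle:
        for line in handle:
            if line.startswith("#") or not line.strip():
                continue
            names.add(line.split("\t", 1)[0])
    return names


def strip_chr_prefix(name):
    """ASSUMED mapping from UCSC-style names to 1000 Genomes-style names."""
    if name == "chrM":
        return "MT"
    return name[3:] if name.startswith("chr") else name


def compare_contigs(gtf_path, fasta_path):
    """Report which GTF sequence names match the FASTA, before and after renaming."""
    fasta = fasta_contigs(fasta_path)
    gtf = gtf_contigs(gtf_path)
    renamed = {strip_chr_prefix(n) for n in gtf}
    return {
        "shared_as_is": sorted(gtf & fasta),
        "shared_after_renaming": sorted(renamed & fasta),
        "gtf_only_after_renaming": sorted(renamed - fasta),
    }
```

If `shared_as_is` is empty but `shared_after_renaming` covers all chromosomes in the GTF, the difference is at least consistent with a pure naming issue. Scanning the full FASTA line by line is slow; reading the first column of a `samtools faidx` index (.fai) would be a faster source of the same names.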