I am a complete newbie to WGS. So far I have only worked with SNP-array data (plink format, .vcf, .haps/.sample), and now I have to check the allele frequency of a single SNP in WGS data. So I have to perform SNP calling, right?
The thing is, I downloaded the data in .bam format from ENA. However, I do not know which reference genome it was aligned to, and I need a fasta reference to do the SNP calling with HaplotypeCaller from GATK. Does anyone know how I can find out the reference genome (build version) in order to download the right fasta file?
I saw a pretty similar question in this post, where @matted wrote:
In the worst case, you can infer the reference from the chromosome names (and number of chromosomes) and the assembly version by the sizes. I think they differ by a few bases e.g. from hg17 to hg18 to hg19. If for some reason they don't, you can look at reads around inter-reference variant sites and see which allele is called as matching the reference.
However, as I said, I am a newbie. How can I infer the reference from the chromosome names and the assembly version from the sizes? My purpose is really straightforward: I just want to see the allele frequency of one SNP in these data.
Any help would be much appreciated!
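To make the quoted suggestion concrete, here is a minimal sketch of the idea, not a definitive tool. It assumes you have already dumped the BAM header to text (for example with `samtools view -H file.bam`) and compares the chr1 length in the @SQ lines against the chr1 lengths of a few common human builds. The function names and the example header are made up for illustration; only the three chr1 lengths are real values.

```python
# chr1 lengths of three common human assemblies.
CHR1_LENGTHS = {
    247249719: "NCBI36 / hg18",
    249250621: "GRCh37 / hg19",
    248956422: "GRCh38 / hg38",
}


def parse_sq_lines(header_text):
    """Return {sequence name: length} from the @SQ lines of a SAM/BAM header."""
    lengths = {}
    for line in header_text.splitlines():
        if not line.startswith("@SQ"):
            continue
        # Each field after the record type looks like TAG:VALUE, tab-separated.
        fields = dict(f.split(":", 1) for f in line.split("\t")[1:] if ":" in f)
        if "SN" in fields and "LN" in fields:
            lengths[fields["SN"]] = int(fields["LN"])
    return lengths


def guess_build(lengths):
    """Guess (build, naming style) from the chr1 entry; 'unknown' if no match."""
    for name in ("chr1", "1"):
        if name in lengths:
            build = CHR1_LENGTHS.get(lengths[name], "unknown")
            style = "UCSC-style (chr1)" if name == "chr1" else "Ensembl/GRC-style (1)"
            return build, style
    return "unknown", "unknown"


# Hypothetical header text, only for illustration (not taken from PRJEB29074).
example_header = (
    "@HD\tVN:1.5\tSO:coordinate\n"
    "@SQ\tSN:1\tLN:249250621\n"
    "@SQ\tSN:2\tLN:243199373\n"
)

if __name__ == "__main__":
    print(guess_build(parse_sq_lines(example_header)))
```

The naming style matters as much as the build, because the fasta you give to HaplotypeCaller has to use the same contig names as the BAM. It is also worth looking at the @PG lines of the header, since aligners often record their command line there, including the path to the reference they used.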
Can you provide a link for the data you downloaded? Sometimes data is supplied as unaligned BAM files instead of plain fastq.
Sure! https://www.ebi.ac.uk/ena/data/view/PRJEB29074
There is a paper linked to it. Is the assembly/annotation mentioned there? (I cannot access it because of the paywall.)
According to the Methods section of the paper, the reads were aligned to GRCh37.
Yes, it did! My bad, sorry! I'm probably a little bit anxious. But all your tips were of great value!
You could also check a few reads manually: align them to both hg19 and hg38 and compare the resulting coordinates with the ones in the BAM.