In my case, We had used the Target enrichment sequencing approach for European Bison. We had sequenced a specific chromosome, for example, Chr1, Chr 2, Chr 15, Chr 9 for a specific gene panel.
I had used given below command to call Variant using GATK Haplotype Caller,
/home/anaconda3/share/gatk4-4.2.4.0-0/gatk HaplotypeCaller --java-options -Xmx100g -R genome.fa -I fixed.bam -O NmMdAndUqTag_fixed.g.vcf.gz -ERC GVCF --minimum-mapping-quality 20 --min-base-quality-score 20
But When I check the results obtained using Haplotype Caller, I am getting variants (SNPs) for all chromosomes. I am very confused. Shouldn't We have Variant only on the specific Chromosome that I had mentioned above? I was mainly interested to get Variant on specific sequenced (Mentioned above) chromosomes.
Do we have any specific program or parameter in GATK to call variants within targeted regions?
I had tried to use -L Haplotype caller parameter also. I had created a file named "interval_list", where I had mentioned the regions that were used to sequence.
head -n 1 interval_list
Chr "\t" Start Position "\t" End Position
But It didn't work for me.
Your suggestions would be extremely helpful.
Thank you so much in advance.
what is the output of
add
-L chr1
to your command to only get gvcf ofr chrom chr1Thank you so much for Reply,
I have a file like this (Chr , Start, End). This is the file that we had used to sequence specific region of few chromosomes.
Is there any possibility to provide this information in the command?
Or Should I provide -L chr 25 chr 9 in the command given above?
show me the output of
please..
First column is chromosome Number.
samtools idxstats fixed.bam
10 104305016 141994 399
11 107310763 158019 587
12 91163125 615931 1624
13 84240350 313695 875
14 84648390 130584 382
15 85296676 558964 1515
16 81724687 123275 429
17 75158596 133060 376
18 66004023 138564 465
19 64057457 100898 318
1 158337067 260952 780
20 72042655 88197 283
21 71599096 116152 339
22 61435874 82316 259
23 52530062 106692 308
24 62714930 89024 247
25 42904170 2469549 6367
26 51681464 1484592 4001
27 45407902 64325 217
28 46312546 88778 229
29 51505224 91403 326
2 137060424 194638 582
3 121430405 178447 518
4 120829699 181654 572
5 121191424 190725 574
6 119458736 179055 510
7 112638659 155326 474
8 113384836 176518 498
9 105708250 244368 665
MT 16338 6234 43
X 148823899 845334 2294
so, although you said
you can see that your mapper found some hits one the whole genome. So GATK HaplotypeCaller did his job the right way and found some SNPs on all chromosomes.
It means We have a problem with our Target enrichment Protocol? Should I consider these variants for downstream pipeline such as GWAS?
you can convert your file as a BED file.
and use the the option
-L file.bed
I would like to update regarding previous Query,
I would like to mention once again that for target enrichment sequencing, we sequenced multiple specific regions of specific chromosomes. In my case, I had focused on chr12, 13, 15, 25. We had sequenced different specific regions related to specific genes for each chromosome. So Should I include these regions information also in this -L interval_list. As I am mainly interested to see variants on these specific regions.
I have tried to perform like this, Is It the correct way?
/home/anaconda3/share/gatk4-4.2.4.0-0/gatk HaplotypeCaller --java-options -Xmx100g -R genome.fa -I fixed.bam -O fixed_25_1.g.vcf.gz -ERC GVCF --minimum-mapping-quality 20 --min-base-quality-score 20 --include-non-variant-sites -L interval_list.bed
head -n 3 interval_list.bed
25 4088138 4103061
25 5371120 6637687
15 45220566 45250906
First Column is the Chromosome number.
Interval_list.bed -> shows the specific regions on the specific chromosomes in which we are interested to see variants.
Your suggestions would be really helpful. Thank you so much in advance. Have a nice day :)
this is not an interval list. A header is missing. https://gatk.broadinstitute.org/hc/en-us/articles/360035531852-Intervals-and-interval-lists