Hi,
I'm encountering an issue with multiple gVCF files, all of which contain "N" nucleotides in the REF column. In an attempt to resolve this, I used the fill-from-fasta plugin from bcftools with the following command:
bcftools +fill-from-fasta input_vcf.g.vcf.gz -- -c REF -f path/to/fasta_file/Homo_sapiens_assembly38.fasta
I generated the gVCF files with GATK, which is why I downloaded and used the FASTA file from the GATK GitHub repository.
However, the problem persists and the "N" nucleotides remain in the REF column, as this check shows:
bcftools query -f '[%CHROM\t%POS\t%REF\t%ALT\n]' inputvcf.g.vcf.gz|grep -v -E '[ACTG]' | head -n 10
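The same check can be sketched without bcftools, which makes it easy to run on any file, including whatever the fill step writes out. The helper name `count_n_ref` and the file name `output.g.vcf.gz` are hypothetical, chosen only for illustration.

```shell
# Hypothetical helper (not part of bcftools): count records whose REF is exactly "N".
# Reads uncompressed VCF text on stdin; header lines starting with "#" are skipped.
count_n_ref() {
  awk -F '\t' '!/^#/ && toupper($4) == "N" { n++ } END { print n + 0 }'
}

# Example usage, assuming output.g.vcf.gz is the file written by the fill step:
# gzip -dc output.g.vcf.gz | count_n_ref
```

As far as I know, bcftools plugins print to standard output unless the usual `-o`/`-O` output options are given before the `--`, so it may be worth running this check on the newly written file rather than on the original input. This is a suggestion to rule out, not a confirmed diagnosis; it is also possible that the reference itself has N at those positions.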
Is there something wrong with my command, the FASTA file, or something else? Any insights or suggestions on how to solve this problem would be greatly appreciated.
Thank you in advance.
Cross-posted: https://stackoverflow.com/questions/77880020/
Thanks to Pierre for finding the cross-post
Please keep in mind that posting the same question to multiple sites can be perceived as bad etiquette, because efforts may be made to address a problem that has already been solved elsewhere in the meantime.
The helpful thing to do if you do decide to post on multiple forums is to add a link to the other forum posts on each post so people will look at the other posts before investing their effort.