11.4 years ago
Jordan
Hi,
I have a 62 GB BAM file. When I run variant calling with samtools, I get a huge BCF file of about 34 GB, which seems quite unusual. That isn't even the size of a plain-text VCF; BCF is the compressed binary form of VCF.
I'm not sure if I'm doing it right. I used the following command:
samtools mpileup -uf ~/refs/human_g1k_v37.fasta normal.bam | bcftools view -bvcg - > normal.raw.bcf
When I looked at this bcf file, I found something rather strange.
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Unknown
chr1 10114 . N T 5.45 . DP=38;VDB=2.588829e-01;AF1=1;AC1=2;DP4=0,0,11,15;MQ=0;FQ=-105 GT:PL:GQ 1/1:37,78,0:45
chr1 10115 . N A 4.76 . DP=26;VDB=5.001104e-03;AF1=1;AC1=2;DP4=0,0,8,17;MQ=0;FQ=-99 GT:PL:GQ 1/1:36,72,0:42
chr1 10116 . N A 9.51 . DP=26;VDB=1.725300e-02;AF1=1;AC1=2;DP4=0,0,7,17;MQ=0;FQ=-99 GT:PL:GQ 1/1:42,72,0:60
chr1 10117 . N C 8.64 . DP=25;VDB=8.678101e-02;AF1=1;AC1=2;DP4=0,0,6,16;MQ=0;FQ=-93 GT:PL:GQ 1/1:41,66,0:57
chr1 10118 . N C 9.51 . DP=25;VDB=1.464226e-01;AF1=1;AC1=2;DP4=0,0,6,17;MQ=0;FQ=-96 GT:PL:GQ 1/1:42,69,0:60
As you can see, all the REF alleles are reported as N. I'm not sure why. Can anyone help?
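To check whether this really affects every record and not just the first few, a small count can help. This is only a sketch: `count_ref_n` is a hypothetical helper name, and the usage line assumes that `bcftools view` from the same samtools release prints the BCF as VCF text.

```shell
# Reads VCF text on stdin and prints two numbers:
#   <records whose REF column is N> <total records>
# Header lines (starting with '#') are skipped.
count_ref_n() {
    awk '!/^#/ { total++; if ($4 == "N") n++ }
         END   { print n + 0, total + 0 }'
}

# Assumed usage (not run here):
#   bcftools view normal.raw.bcf | count_ref_n
```

If the two numbers come out nearly equal, the problem is global rather than limited to a few regions.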
Actually, I think I figured out why the file is so large. The reference base at each position is being read as N, so every allele seen in the reads gets called as a variant. Basically every covered position ends up labeled as a variant, hence the huge file size. I'm still not sure why this is occurring, though.
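One thing that might be worth ruling out is a mismatch between the contig names in the BAM header and those in the reference FASTA (the records above use `chr1`). This is an assumption, not a confirmed cause. The sketch below lists contig names that appear in a SAM/BAM header but not in a FASTA; the function names and `header.sam` are hypothetical.

```shell
# Contig names from SAM/BAM header text on stdin (@SQ lines, SN: tag).
header_contigs() {
    awk -F'\t' '$1 == "@SQ" {
        for (i = 2; i <= NF; i++)
            if ($i ~ /^SN:/) print substr($i, 4)
    }' | sort
}

# Contig names from a FASTA file: the first word after '>' on each header line.
fasta_contigs() {
    grep '^>' "$1" | sed 's/^>//; s/[[:space:]].*$//' | sort
}

# Prints contigs present in the header ($1) but absent from the FASTA ($2).
contigs_missing_from_ref() {
    hdr_names=$(mktemp)
    ref_names=$(mktemp)
    header_contigs < "$1" > "$hdr_names"
    fasta_contigs "$2" > "$ref_names"
    comm -23 "$hdr_names" "$ref_names"
    rm -f "$hdr_names" "$ref_names"
}

# Assumed usage (not run here):
#   samtools view -H normal.bam > header.sam
#   contigs_missing_from_ref header.sam ~/refs/human_g1k_v37.fasta
```

Any name printed by the last function is a contig the caller cannot look up in the reference under that exact name.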