Question

Issue with ANGSD realSFS

11 months ago

Begonia_pavonina ▴ 210

I want to calculate the overall heterozygosity (the proportion of heterozygous sites) in single samples. For this, I am using ANGSD, following this tutorial: https://www.popgen.dk/angsd/index.php/Heterozygosity

This gives the following commands:

In BASH:

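# Compute site allele frequency likelihoods for one sample; writes "$SAF".saf.idx
angsd -i "$SAMPLE".sorted.bam -anc "$REF" -dosaf 1 -gl 1 -out "$SAF"
# Estimate the site frequency spectrum from the .saf.idx file
realSFS -nSites 1000000 "$SAF".saf.idx > "$SFS".ml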

In Python:

import pandas as pd
# file_path: path to the .ml file written by realSFS
df = pd.read_csv(file_path, sep=r'\s+', header=None, index_col=False)
het_by_site = df[1] / (df[0] + df[1] + df[2])

The issue is that the .ml file produced by realSFS contains several lines. I am aware that the three columns correspond to three categories:

Homozygous major allele
Heterozygous
Homozygous minor allele

But I do not know why the .ml file has several lines rather than a single one. There are not enough of them for each line to represent one polymorphic locus. It has been suggested that the file holds the distribution of allele frequencies across the genome of the individual, but what does each line represent?

Example .ml file content:

943312.107557 20980.200710 35707.691734 
950777.510501 12764.727800 36457.761699 
944188.395957 20469.017996 35342.586048 
942475.799303 21229.656928 36294.543769 
946783.768672 16380.623403 36835.607925
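
To make the Python step concrete, here is a minimal sketch that applies the same ratio (middle column over the row total) to the five lines above, using only the standard library instead of pandas. The names `ML_LINES` and `heterozygosity` are made up for this illustration, and the column meaning is assumed to be the one listed above.

```python
# The five example lines from the .ml file, copied from the post.
ML_LINES = [
    "943312.107557 20980.200710 35707.691734",
    "950777.510501 12764.727800 36457.761699",
    "944188.395957 20469.017996 35342.586048",
    "942475.799303 21229.656928 36294.543769",
    "946783.768672 16380.623403 36835.607925",
]


def heterozygosity(line):
    """Return the heterozygous fraction of one whitespace-separated line."""
    hom_first, het, hom_second = (float(x) for x in line.split())
    return het / (hom_first + het + hom_second)


het_per_line = [heterozygosity(line) for line in ML_LINES]
for i, h in enumerate(het_per_line, start=1):
    print(f"line {i}: {h:.4f}")
```

For these five lines the ratio ranges from about 0.013 to about 0.021, so the lines clearly differ from one another.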

heterozygosity realsfs angsd • 678 views

Coming back to my post, I just realized that the three values on each line sum to 1,000,000, which matches the -nSites 1000000 option I passed to realSFS. So I suppose that each line of the .ml file is the SFS estimated by realSFS for one window of 1,000,000 sites. Could anyone confirm this?
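
For what it is worth, the observation can be checked on the five example lines from the question with a short sketch. It assumes that each line is an independent window and that pooling the windows (total heterozygous count over total sites) is an acceptable way to get one genome-wide value. `ML_LINES`, `N_SITES`, and `pooled_het` are illustrative names, not anything ANGSD defines.

```python
import math

# Example .ml lines from the question: three expected site counts per line.
ML_LINES = [
    "943312.107557 20980.200710 35707.691734",
    "950777.510501 12764.727800 36457.761699",
    "944188.395957 20469.017996 35342.586048",
    "942475.799303 21229.656928 36294.543769",
    "946783.768672 16380.623403 36835.607925",
]
N_SITES = 1_000_000  # value given to realSFS -nSites in the question

rows = [[float(x) for x in line.split()] for line in ML_LINES]

# Each row should add up to the window size, up to rounding of the printed values.
row_sums = [sum(r) for r in rows]
all_match = all(math.isclose(s, N_SITES, abs_tol=0.01) for s in row_sums)
print("every line sums to N_SITES:", all_match)

# Pooled estimate: heterozygous counts summed over windows, divided by all sites.
pooled_het = sum(r[1] for r in rows) / sum(row_sums)
print(f"pooled heterozygosity: {pooled_het:.4f}")
```

On these five lines the check passes and the pooled value comes out at about 0.0184. If the last window of a real genome is shorter than the others, pooling weights it by its actual number of sites, whereas averaging the per-line ratios would not.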

11 months ago by Begonia_pavonina ▴ 210