Issue with ANGSD realSFS
3 months ago

I want to calculate the overall heterozygosity (the proportion of heterozygous sites) in single samples. For this, I am using ANGSD, following this tutorial: https://www.popgen.dk/angsd/index.php/Heterozygosity

Which gives the following commands:

In BASH:
angsd -i "$SAMPLE".sorted.bam -anc "$REF" -dosaf 1 -gl 1 -out "$SAF"


realSFS -nSites 1000000 "$SAF".saf.idx > "$SFS".ml

In Python:
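import pandas as pd
df = pd.read_csv(file_path, sep=r'\s+', header=None, index_col=False)
# Heterozygous column divided by the row total: one value per line of the .ml file
het_by_site = df.iloc[:, 1] / (df[0] + df[1] + df[2])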

The issue is that the .ml file produced by realSFS contains several lines. I am aware that the three columns are the three categories:

  1. Homozygous major allele
  2. Heterozygous
  3. Homozygous minor allele

But I do not know why the .ml file has several lines instead of a single one. There are far too few of them for each line to represent one polymorphic locus. It has been suggested that they are the distribution of allele frequencies across the individual's genome, but what does each line actually represent?

Example .ml file content:

943312.107557 20980.200710 35707.691734 
950777.510501 12764.727800 36457.761699 
944188.395957 20469.017996 35342.586048 
942475.799303 21229.656928 36294.543769 
946783.768672 16380.623403 36835.607925

Coming back to my post, I just realized that the three values on each line sum to 1,000,000, which is the value I passed to -nSites. So I suppose that each line in the .ml file is the SFS that realSFS estimated for one window of 1,000,000 sites. Could anyone confirm this?
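
Here is the quick check I ran, as a rough sketch rather than anything authoritative. It parses the five example lines from my post, sums each line, and computes heterozygosity per line. The names `ML_TEXT`, `N_SITES` and `parse_ml` are mine, not part of ANGSD. The pooled value at the end assumes that every line is an independent window covering the same number of sites, which is exactly the interpretation I am asking about, so treat it as tentative.

```python
# Example lines copied from the .ml file shown in the post.
ML_TEXT = """\
943312.107557 20980.200710 35707.691734
950777.510501 12764.727800 36457.761699
944188.395957 20469.017996 35342.586048
942475.799303 21229.656928 36294.543769
946783.768672 16380.623403 36835.607925
"""

N_SITES = 1_000_000  # the value passed to realSFS -nSites


def parse_ml(text):
    """Return one list of floats per non-empty line of a realSFS .ml file."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if fields:
            rows.append([float(x) for x in fields])
    return rows


rows = parse_ml(ML_TEXT)

# Total expected site count on each line.
row_sums = [sum(r) for r in rows]

# Heterozygosity per line: middle category divided by the line total.
het_per_row = [r[1] / sum(r) for r in rows]

# ASSUMPTION: each line is a separate window, so the counts can be pooled.
pooled_het = sum(r[1] for r in rows) / sum(row_sums)

for i, (total, het) in enumerate(zip(row_sums, het_per_row), start=1):
    print(f"line {i}: total = {total:.3f}, heterozygosity = {het:.6f}")
print(f"pooled heterozygosity = {pooled_het:.6f}")
```

Every line total comes out at `N_SITES` up to rounding, which is consistent with one line per window but does not prove it. If that interpretation holds, I would also expect the last line of a real run to sum to less than `N_SITES` when the genome length is not a multiple of it.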
