Hello,
I am trying to adapt David Marques' script for estimating sfs across windows along chromosomes (this script).
I got it mostly working, but my output is missing a single space.
Desired output
scaffold_1 0 50000 21348.131641 376.408856 121.723237
scaffold_1 50000 100000 23732.131372 280.090745 156.812893
scaffold_1 100000 150000 16644.385783 295.719270 245.601252
scaffold_1 150000 200000 15948.860330 484.115095 231.792219
scaffold_1 200000 250000 18815.201865 493.863920 247.090436
The output I obtain instead:
scaffold_1 0 5000021348.131641 376.408856 121.723237
scaffold_1 50000 10000023732.131372 280.090745 156.812893
scaffold_1 100000 15000016644.385783 295.719270 245.601252
scaffold_1 150000 20000015948.860330 484.115095 231.792219
scaffold_1 200000 25000018815.201865 493.863920 247.090436
You can see there is a space missing after the 'end' position of the chromosome window for which the SFS was estimated.
This is the script I am using:
cat mybedfile.bed | \
awk -v ref="$REF" -v anc="$ANC" -v a="$a" -v b="$b" -v i="$i" '
{
cmd1 = "echo \"" $1 ":" $2 "-" $3 "\" > 7_windowDxy/temp/" I ".rf;"
cmd2 = "angsd -b 7_windowDxy/7_0_metadata/" a ".bamlist -out tmp." a "." I
cmd2 = cmd2 " -doSaf 1 -GL 2 -rf 7_windowDxy/temp/" I ".rf -minQ 20 -minMapQ 20 -ref " ref " -anc " anc ";"
cmd3 = "angsd -b 7_windowDxy/7_0_metadata/" b ".bamlist -out tmp." b "." I
cmd3 = cmd3 " -doSaf 1 -GL 2 -rf 7_windowDxy/temp/" I ".rf -minQ 20 -minMapQ 20 -ref " ref " -anc " anc ";"
cmd4 = "echo -n \"" $1 " " $2 " " $3 " \" ;"
cmd5 = "realSFS tmp." a "." I ".saf.idx tmp." b "." I ".saf.idx"
print cmd1 cmd2 cmd3 cmd4 cmd5
}' | xargs -I CMD bash -c CMD | awk '{if(NF>4){print $N}}' > "${a}.${b}.scaffold${i}.50kwin.sfs"
The key part I believe is the final awk '{if(NF>4){print $N}}'
command. I tried adding OFS=" "
, changing print $N
to print $0
, as well as multiple variations of print $1, $2, $3, $4, substr()
*.
This script is in large part the result of me struggling to change the original marqueda script, and I have to admit I am not very experienced in awk scripting, so I've been punching in the dark a lot.
Any suggestions would be appreciated.
Cheers,
L
- I don't remember what I put in
substr()
, I was following some online suggestions, I have now changed it and didn't save it.
Fair point,
I will try that instead. Thanks Pierre
Thanks for the suggestion Pierre.
I changed the script as follows and it now works as desired. (a and b refer to the two pop IDs, i is the scaffold number, and REF and ANC are the paths to reference and ancestral sequence)
Previous while loop script was bugged (it sfs could not be calculated for an interval the interval coordinates and estimated sfs would become unaligned).
This later version fixes that.