AWK script missing space after 3rd field
1
0
Entering edit mode
4 months ago
MboiTui ▴ 20

Hello,

I am trying to adapt David Marques' script for estimating sfs across windows along chromosomes (this script).

I got it mostly working, but my output is missing a single space.

Desired output

scaffold_1 0 50000 21348.131641 376.408856 121.723237
scaffold_1 50000 100000 23732.131372 280.090745 156.812893
scaffold_1 100000 150000 16644.385783 295.719270 245.601252
scaffold_1 150000 200000 15948.860330 484.115095 231.792219
scaffold_1 200000 250000 18815.201865 493.863920 247.090436

The output I obtain instead:

scaffold_1 0 5000021348.131641 376.408856 121.723237
scaffold_1 50000 10000023732.131372 280.090745 156.812893
scaffold_1 100000 15000016644.385783 295.719270 245.601252
scaffold_1 150000 20000015948.860330 484.115095 231.792219
scaffold_1 200000 25000018815.201865 493.863920 247.090436

You can see there is a space missing after the 'end' position of the chromosome window for which the SFS was estimated.

This is the script I am using:

cat mybedfile.bed | \
awk -v ref="$REF" -v anc="$ANC" -v a="$a" -v b="$b" -v i="$i" '
{
  cmd1 = "echo \"" $1 ":" $2 "-" $3 "\" > 7_windowDxy/temp/" I ".rf;"
  cmd2 = "angsd -b 7_windowDxy/7_0_metadata/" a ".bamlist -out tmp." a "." I
  cmd2 = cmd2 " -doSaf 1 -GL 2 -rf 7_windowDxy/temp/" I ".rf -minQ 20 -minMapQ 20 -ref " ref " -anc " anc ";"
  cmd3 = "angsd -b 7_windowDxy/7_0_metadata/" b ".bamlist -out tmp." b "." I
  cmd3 = cmd3 " -doSaf 1 -GL 2 -rf 7_windowDxy/temp/" I ".rf -minQ 20 -minMapQ 20 -ref " ref " -anc " anc ";"
  cmd4 = "echo -n \"" $1 " " $2 " " $3 " \" ;"
  cmd5 = "realSFS tmp." a "." I ".saf.idx tmp." b "." I ".saf.idx"

  print cmd1 cmd2 cmd3 cmd4 cmd5
}' | xargs -I CMD bash -c CMD | awk '{if(NF>4){print $N}}' > "${a}.${b}.scaffold${i}.50kwin.sfs"

The key part I believe is the final awk '{if(NF>4){print $N}}' command. I tried adding OFS=" ", changing print $N to print $0, as well as multiple variations of print $1, $2, $3, $4, substr() *.

This script is in large part the result of me struggling to change the original marqueda script, and I have to admit I am not very experienced in awk scripting, so I've been punching in the dark a lot.

Any suggestions would be appreciated.

Cheers,
L

  • I don't remember what I put in substr(), I was following some online suggestions, I have now changed it and didn't save it.
sfs awk angsd • 482 views
ADD COMMENT
1
Entering edit mode
4 months ago

Why an ugly (sorry) awk script when you can just use a bash loop?

cat mybedfile.bed | while read CHROM START END
do
(...)
done
ADD COMMENT
0
Entering edit mode

Fair point,

I will try that instead. Thanks Pierre

ADD REPLY
0
Entering edit mode

Thanks for the suggestion Pierre.

I changed the script as follows and it now works as desired. (a and b refer to the two pop IDs, i is the scaffold number, and REF and ANC are the paths to reference and ancestral sequence)

cat 7_1_Windows_scaffold${i}_50k.bed | while read CHROM START END
do
# Make rf file with info on window to analyze
echo ${CHROM}:${START}-${END} > ${i}.rf
# Make saf file for each pop with angsd
angsd -b ${a}.bamlist -out tmp.${a}.${i} -doSaf 1 -GL 2 -rf ${i}.rf -minQ 20 -minMapQ 20 -ref ${REF} -anc ${ANC} 
angsd -b ${b}.bamlist -out tmp.${b}.${i} -doSaf 1 -GL 2 -rf ${i}.rf -minQ 20 -minMapQ 20 -ref ${REF} -anc ${ANC}
# Make space delimited bed file (this is redundant and should be removed in next iteration)
echo ${CHROM} ${START} ${END} >> tmp.${a}.${b}.scaffold${i}.50kwin.pos
# Estimate sfs from saf files
realSFS tmp.${a}.${i}.saf.idx tmp.${b}.${i}.saf.idx | awk '{if(NF>4){print $N}}' >> tmp.${a}.${b}.scaffold${i}.50kwin.sfs
# Bind bed file and sfs to obtain input for windowDxy script
paste --delimiters=" " tmp.${a}.${b}.scaffold${i}.50kwin.pos tmp.${a}.${b}.scaffold$[i}.50kwin.sfs > ${a}.${b}.scaffold${i}.50kwin.sfs
done
ADD REPLY
0
Entering edit mode

Previous while loop script was bugged (it sfs could not be calculated for an interval the interval coordinates and estimated sfs would become unaligned).

This later version fixes that.

#############################################
# Loop through the bed file with while read #
#############################################
cat 7_windowDxy/7_0_metadata/7_1_Windows_scaffold${i}_50k.bed | while read CHROM START END
do
# Make rf file with info on window to analyze
echo ${CHROM}:${START}-${END} > 7_windowDxy/temp_${a}_${b}/${i}.rf
# Make saf file for each pop with angsd
angsd -b 7_windowDxy/7_0_metadata/${a}.bamlist -out 7_windowDxy/temp_${a}_${b}/tmp.${a}.${i} -doSaf 1 -GL 2 -rf 7_windowDxy/temp_${a}_${b}/${i}.rf -minQ 20 -minMapQ 20 -ref ${REF} -anc ${ANC} -P 2
angsd -b 7_windowDxy/7_0_metadata/${b}.bamlist -out 7_windowDxy/temp_${a}_${b}/tmp.${b}.${i} -doSaf 1 -GL 2 -rf 7_windowDxy/temp_${a}_${b}/${i}.rf -minQ 20 -minMapQ 20 -ref ${REF} -anc ${ANC} -P 2
# Estimate sfs from saf files
realSFS 7_windowDxy/temp_${a}_${b}/tmp.${a}.${i}.saf.idx 7_windowDxy/temp_${a}_${b}/tmp.${b}.${i}.saf.idx | awk '{if(NF>4){print $N}}' > 7_windowDxy/temp_${a}_${b}/tmp.${a}.${b}.scaffold${i}.50kwin.sfs
# Make a bed file with one line
echo ${CHROM} ${START} ${END} > 7_windowDxy/temp_${a}_${b}/${i}.bed
# Append to sfs file
paste --delimiters=" " 7_windowDxy/temp_${a}_${b}/${i}.bed 7_windowDxy/temp_${a}_${b}/tmp.${a}.${b}.scaffold${i}.50kwin.sfs >> 7_windowDxy/7_1_sfs/${a}.${b}.scaffold${i}.50kwin.sfs
done
ADD REPLY

Login before adding your answer.

Traffic: 1903 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6