Does anyone know where to find (or how to create) a gc-content file that serves as input to PennCNV? The software has a gc model, but it is for hg18 and its format is different fom gc content file found in UCSC genomes.
Does anyone know where to find (or how to create) a gc-content file that serves as input to PennCNV? The software has a gc model, but it is for hg18 and its format is different fom gc content file found in UCSC genomes.
The following bash script should do the job:
IFS="
"
echo -e "Name\tChr\tPosition\tGC"
curl -s "http://hgdownload.cse.ucsc.edu/goldenpath/hg19/database/snp138.txt.gz" | gunzip -c |\
cut -f 2,3,5|\
while read L
do
C=`echo $L | cut -f 1`
P=`echo $L | cut -f 2`
R=`echo $L | cut -f 3`
S=`echo $L |awk '{beg=int($2)-500; if(beg<1) beg=1; printf("%s:%d,%d",$1,beg,int($2)+500);}'`
echo -e -n "${R}\t${C}\t${P}\t" | sed 's/chr//'
xmllint --xpath "/DASDNA/SEQUENCE/DNA/text()" "http://genome.ucsc.edu/cgi-bin/das/hg19/dna?segment=$S" |\
tr -d "\n" | awk '{L1=1.0*length($0);c=$0;gsub(/[^GCgc]/,"",c);L2=1.0*length(c); printf("%f\n",100.0*(L2/L1));}'
done
explain: IFS is the internal field sepator (using the newline). Download and g-unzip the dbSNP138: cut the chrom,position,name.
extends the position +/ 500pb , use the UCSC das server to fetch the sequence. remove everything that is not GC to get a ratio length(all)/length(GC_only.)
Note: I'm away from the lab, I've used a DAS server to get the GC% but you'd better use a local resource like samtools faidx
$ bash shell.sh
Name Chr Position GC
rs144773400 1 10144 36.263736
rs201752861 1 10176 38.961039
rs201694901 1 10179 39.260739
rs143255646 1 10228 43.456543
rs200462216 1 10228 43.456543
rs200279319 1 10230 43.656344
rs145599635 1 10233 43.956044
rs148908337 1 10247 44.955045
rs199706086 1 10249 45.154845
rs140194106 1 10254 45.654346
Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
This is very nice code! :-)
Hi Pierre,
I got the following error when I was trying to run the script you post. Any idea? thanks.
B.
Hi Pierre. Is it possible to post the file ? I was not able to work with your code. Thanks