Hello, for CNV analysis QC I am looking for a reliable bed file of regions with extreme (>90%, <10%) GC content.
Any idea where I can find such a file? I tried UCSC, but it provides GC content in bins of 5bp, which is not very convenient.
Thanks
It's not too difficult to generate a bed file of GC content along the genome. You just need the reference fasta file and a genome file giving the length of each chromosome. Then with bedtools, first create fixed-size windows along the genome, calculate the %GC for each window, and then use e.g. awk to keep the rows where %GC is above/below a threshold, something like:
bedtools makewindows -g hg19.genome -w 1000 | nucBed -fi hg19.genome.fasta -bed - | awk '$5 > 0.9 || $5 < 0.1'
See 'nucBed -h' for the output format, I think %GC is going to be in column 5, not sure though.
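The filtering step can be sketched on its own, so it is easy to check before running it on the whole genome. This is only a sketch under two assumptions that should be verified with 'nucBed -h' for your bedtools version: the output starts with a header line beginning with '#', and for a 3-column BED input the GC fraction is in column 5. The function name filter_extreme_gc and the default thresholds are illustrative and not part of bedtools.

```shell
# Hypothetical helper: keep windows whose GC fraction is extreme.
# Assumes nucBed / bedtools nuc output for a 3-column BED on stdin,
# i.e. column 5 holds the GC fraction (0-1) and header lines start with '#'.
filter_extreme_gc() {
    lo="${1:-0.1}"   # lower GC threshold (fraction)
    hi="${2:-0.9}"   # upper GC threshold (fraction)
    awk -v lo="$lo" -v hi="$hi" 'BEGIN { FS = OFS = "\t" }
        /^#/ { next }                          # skip the header line
        ($5 + 0) < lo || ($5 + 0) > hi {       # force a numeric comparison
            print $1, $2, $3, $5               # chrom, start, end, GC fraction
        }'
}

# Possible usage (file names taken from the command above):
# bedtools makewindows -g hg19.genome -w 1000 \
#   | nucBed -fi hg19.genome.fasta -bed - \
#   | filter_extreme_gc 0.1 0.9 > extreme_gc.bed
```

Skipping the header explicitly matters because awk compares a non-numeric field such as a column name as a string, so the header could otherwise slip through the threshold test.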
What is the bed file in this command?
Did you notice the | in the command? The output of bedtools makewindows is piped directly into nucBed (it is in bed format).
Actually I have an error:
"Less than the req'd two fields were encountered in the genome file (hg19.genome) at line 1."
and my hg19.genome is:
I was able to fix my error... thanks for your help.
What turned out to be the problem?
I changed "chr1" to "1" and so on... then my error was fixed.
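For anyone hitting the same message: bedtools expects the genome file to have two tab-separated columns (chromosome name, then length), and the chromosome names need to match those in the fasta file. A quick sanity check could look like the sketch below. check_genome_file is a hypothetical helper, not a bedtools command.

```shell
# Hypothetical helper: report lines of a genome file that are not
# "name<TAB>length" with a non-negative integer length.
# Exits non-zero if any malformed line is found.
check_genome_file() {
    awk 'BEGIN { FS = "\t"; bad = 0 }
        NF < 2 || $2 !~ /^[0-9]+$/ {
            printf "line %d looks malformed: %s\n", NR, $0
            bad = 1
        }
        END { exit bad }' "$1"
}

# Possible usage:
# check_genome_file hg19.genome && echo "genome file looks OK"
```

If the fasta has been indexed with samtools faidx, the first two columns of the resulting .fai file already have this layout and the same chromosome names as the fasta, so they can serve as the genome file.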
Do you know how I can compute mappability? I want to correct for mappability bias, so I need per-region mappability values, like the GC content.