How to find total size of regions NOT in bigWig file
1
0
Entering edit mode
5.5 years ago
Rubal ▴ 350

Hello All,

I would like to calculate the 'callable genome'. I have a reference genome in fasta format and a bed file with a list of genome coordinates that were excluded from variant calling. Could anybody advise on how to combine these two files to calculate how many sites in the reference genome were NOT excluded from analyses. So number of bases in the fasta file minus number of bases in regions listed in the bed file? I have found some solutions that would involved converting into bigWig format but I was wondering if there is a simpler tool for this.

Thanks in advance for your help.

Example of bed file regions:

contig1  1588    1589
contig1  3424    3428
contig2  0       401
fasta bed genome • 1.0k views
ADD COMMENT
4
Entering edit mode
5.5 years ago
ATpoint 85k

Maybe I get this wrong, but can't you simply count the bp in the fasta as:

seqtk comp your.fasta | awk '{sum += $2} END {print sum}'

and from this subtract the bp in the BED, where bp in BED is:

awk '{sum += ($3-$2)} END {print sum}'  your.bed

Requires seqtk.

ADD COMMENT
1
Entering edit mode

As an alternative to seqtk could I also use: grep -v ">" file.fasta | wc | awk '{print $3-$1}'

ADD REPLY

Login before adding your answer.

Traffic: 2053 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6