Entering edit mode
3.9 years ago
Jimpix
▴
10
Hey! It is possible to check the size of genome (number of bp) in bam (or sam or fastq) file?
Thanks in advance.
Hey! It is possible to check the size of genome (number of bp) in bam (or sam or fastq) file?
Thanks in advance.
Reads will overlap, so you can't simply count the lengths of reads. To count the number of unique bases over reads per chromosome for assembly hg38
, for example, using BEDOPS, bash, awk, and UCSC Kent utilities:
$ ASSEMBLY=hg38
$ bam2bed --reduced < reads.bam | bedmap --echo --bases-uniq <( fetchChromSizes ${ASSEMBLY} | grep -v "*_*" | awk -v FS="\t" -v OFS="\t" '{ print $1, "0", $2 }' | sort-bed - ) - > unique_bases_per_chromosome.bed
To get the total unique bases over the genome, sum up the last column in the result:
$ awk -v FS="\t" '{ s += $4 } END { print s }' unique_bases_per_chromosome.bed > total_unique_bases_over_assembly.txt
For SAM files, use sam2bed
in place of bam2bed
. For FASTQ, you would use a mapping tool to map raw sequence to locations on the genome, turning that into BAM.
Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
You are not checking the size of the genome if you are simply looking at your raw or aligned data. You are just counting the number of bases sequenced.
If you want to estimate size of the genome then you need to do something like this.
I want to know how to check number of base pairs in bam file
You could simply run
reformat.sh -Xmx10g in=your.bam
from BBMap suite. It will produce output that will contain this information.mappedonly=t
etc should modify the output. Test it out.That's not the same as the size of the genome though. What exactly do you want?