Fasta length does not match Ensembl info
1
0
Entering edit mode
6.1 years ago
firestar ★ 1.6k

I am calculating the total genome length from a fasta file using the following code

zcat genome.fa.gz | grep -v ">" | wc | awk '{print $3-$1}'

For Yeast, I get 12,157,105, and the Ensembl info indicates exactly 12,157,105. So, that adds up.

For Human, I get 56,917,651,860, but the Ensembl info indicates 3,609,003,417.

Anyone know why? I must be missing something.

annotation fasta • 1.2k views
ADD COMMENT
1
Entering edit mode

From the ftp://ftp.ensembl.org/pub/release-94/fasta/homo_sapiens/dna/README:

TOPLEVEL --------- These files contains all sequence regions flagged as toplevel in an Ensembl schema. This includes chromsomes, regions not assembled into chromosomes and N padded haplotype/patch regions.

ADD REPLY
0
Entering edit mode

Ah right! Ensembl count for human is based on the primary assembly. And some organisms don't have primary assemblies, just the top-level.

ADD REPLY
0
Entering edit mode
6.1 years ago

here you are counting the number of lines.

You should try wc -m as mentionned in wc man page :

-m 
displays a character count. You cannot specify this option with -c.

Thus :

zcat genome.fa.gz | grep -v ">" | wc -m

-

edit : to count only character and not newlines :

zcat genome.fa.gz | grep -v ">" |  tr -d '\n' | wc -m
ADD COMMENT
0
Entering edit mode

I think wc -m counts the newline characters as well. What we actually want is total_chars - newline_chars = bases.

ADD REPLY
0
Entering edit mode

I edit my answer accordingly. You have to use tr -d '\n' to remove newlines

ADD REPLY

Login before adding your answer.

Traffic: 1525 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6