Maximum bits in ggseqlogo is less then 2?
3
1
Entering edit mode
6 months ago
Mattan ▴ 10

Hi all, I'm using this code as a toy example to see if I understand how ggseqlogo works

# Create a data frame with the sequences
sequences <- c("ACA","ATA","AGA","AAA","ATA","AGA")
df <- data.frame(Sequence = sequences)

# Plot the sequence logo
ggseqlogo(df$Sequence)

I fail to understand the results I'm getting. enter image description here

Why isn't the maximuum 2 bits for positions 0,2? Also, why is the middle poistion (1) isn't populated? We do know that G and T are more common there.

ggseqlogo • 507 views
ADD COMMENT
2
Entering edit mode
6 months ago
Mensur Dlakic ★ 28k

Most sequence logo implementations have a small sample correction by default. This is easy to test by using 15-20 sequences, in which case the height of terminal residues should be closer to 2 bits and the middle stack might show up. This is what I get after increasing the number of sequences to ~100.

enter image description here

You may need to turn the small sample correction off.

ADD COMMENT
2
Entering edit mode
6 months ago

The original paper describes the information at a position as:

Information(l) = 2 - (H(l) + e(n))

Here, Information(l) is the value along the y-axis in your plots. H(l) is the uncertainty at position l (which decreases when you have more of the same bases at that position in the sequence), while e(n) is a correction factor required when the number n of sample sequences is small. That correction factor is apparently described in this paper but it is also defined in the source code for ggseqlogo.

ADD COMMENT
1
Entering edit mode
6 months ago
bk11 ★ 3.0k

Did you check with method="prob" too?

ggseqlogo(df$Sequence, method="prob")

enter image description here

ADD COMMENT

Login before adding your answer.

Traffic: 1131 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6