I swear I have seen this somewhere but I can't find it again. Is there a script somewhere to produce text Stem-and-Leaf Histograms from sequence lengths of fasta file?
I swear I have seen this somewhere but I can't find it again. Is there a script somewhere to produce text Stem-and-Leaf Histograms from sequence lengths of fasta file?
You can do this in R with a combination of the seqinR and the aplpack packages:
library(seqinr)
data=read.fasta("file.fasta")
l=getLength(x)
library(aplpack)
stem.leaf(l)
There are a number of options to suit your needs:
stem.leaf(l, unit=100)
Just have a look here for more output options and details:
?stem.leaf
This functions extends far more than the simple stem() function in the graphics package.
My initial idea was to use Biopython
for the parsing of the fasta file and matplotlib
for the drawing of the stem-leaf plot. I couldn't find a function that draws a text histogram in matplotlib
to my amazement, so here is my take:
from Bio import SeqIO
from collections import defaultdict
input_file = r'path\to\file.fasta'
lengths = [len(record.seq) for record in SeqIO.parse(open(input_file), 'fasta')]
d = defaultdict(list)
for length in lengths:
d[length/10].append(length%10)
for nr in range(min(d.keys()), max(d.keys())+1):
print "{0:>3} | {1}".format(nr, ' '.join(map(str, sorted(d[nr]))) if nr in d else '')
For this fictitious input file it produces this histogram:
5 | 3 4 8
6 | 0 0 6 7 9
7 | 0
8 | 6
9 | 0
10 | 9
11 | 2 5 6
12 | 0 0 0 0 5 8
There is something in the kent source tree called textHistogram. It takes a file with one number per line, which you can get from fasta using awk...
textHistogram <(awk < test.fa '$0 !~ /^>/ {print length($0)}')
Use of this site constitutes acceptance of our User Agreement and Privacy Policy.