FASTA to Position Frequency Matrix
2.6 years ago
KH ▴ 90

What tools can generate a position frequency matrix from a FASTA file (or a BED file)? I have a file containing many short (10 bp) reads and want the count of each base at each position. I could write something myself, but I'm working with 50-100M sequences per file and was wondering whether any existing tool is particularly fast.

motif fasta PWM PFM • 1.0k views
2.6 years ago
KH ▴ 90

I eventually used the Biostrings package in R to do most of the heavy lifting.

After loading the sequences, I converted them to a DNAStringSet object using DNAStringSet(seq_list) and then got the counts at each position with consensusMatrix(DNAStringSet_object).

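This can then be converted to a proportion table with base R prop.table(); pass margin = 2 so that each position (column) sums to 1, since without a margin prop.table() divides by the total of the whole matrix.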
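
For anyone who wants to see what the counting step amounts to outside R, here is a minimal sketch in plain Python. It is not the Biostrings implementation, just an illustration of per-position base counting followed by column-wise normalization. The helper names (read_fasta, position_frequency_matrix, to_proportions) and the toy input are hypothetical, and it assumes only A/C/G/T are counted and that anything else is skipped. I have not benchmarked it, so treat it as a reference for the logic rather than a fast option for 50-100M reads.

```python
BASES = "ACGT"

def read_fasta(lines):
    """Yield one uppercase sequence per FASTA record from an iterable of lines."""
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if chunks:
                yield "".join(chunks)
                chunks = []
        else:
            chunks.append(line.upper())
    if chunks:
        yield "".join(chunks)

def position_frequency_matrix(seqs, width):
    """Count each base at each of the first `width` positions."""
    counts = {base: [0] * width for base in BASES}
    for seq in seqs:
        for i, base in enumerate(seq[:width]):
            # Assumption: characters other than A/C/G/T (e.g. N) are ignored.
            if base in counts:
                counts[base][i] += 1
    return counts

def to_proportions(pfm):
    """Normalize each position (column) so its counted bases sum to 1."""
    width = len(pfm[BASES[0]])
    totals = [sum(pfm[base][i] for base in BASES) for i in range(width)]
    return {
        base: [pfm[base][i] / totals[i] if totals[i] else 0.0 for i in range(width)]
        for base in BASES
    }

# Toy input standing in for a real FASTA file (hypothetical data).
example = [">r1", "ACGT", ">r2", "ACGA", ">r3", "TCGA"]
pfm = position_frequency_matrix(read_fasta(example), width=4)
ppm = to_proportions(pfm)
```

For a real file, the same generator can be fed an open file handle instead of the list, for example read_fasta(open(path)) with path being your own FASTA file; set width to the read length (10 in my case).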

2.6 years ago
Trivas ★ 1.8k

You could also try the DiffLogo R package (specifically, getPwmFromFile), but I have no experience with it, so I'm not sure how easy to use or how fast it is.
