Entering edit mode
2.2 years ago
pentro
•
0
In the protein fasta files I downloaded from the human NCBI genome, I got the longest isoform for each gene from R with the help of the code below.
library(Biostrings)
fasta.s <- readDNAStringSet("homo.faa")
names.fasta <- names(fasta.s)
gene.iso <- sapply(names.fasta,function(j)
cbind(unlist(strsplit(j,'\\.'))[2:3]))
gene.iso.df <- data.frame(t(gene.iso)) colnames(gene.iso.df) <-
c('gene','isoform')
gene.iso.df$width <- width(fasta.s)
gene.iso.df.split <- split(gene.iso.df,gene.iso.df$gene)
best.id <- sapply(gene.iso.df.split,function(x)
row.names(x)[order(x$width)[1]])
fasta.s.best <- fasta.s[best.id]
writeXStringSet(fasta.s, filepath='homo_isoform.faa')
But how can I find out how many isoforms there are in total?
You can use
sapply
as you used to get the longest isoform, only here you want to get thelength
of each list element so something like:should work