I was wondering if there are any pre built packages in R or Python/ scripts, where one can extract FASTA sequences that are longer than N bases or shorter or exactly N bases from a multi fasta file.
I was wondering if there are any pre built packages in R or Python/ scripts, where one can extract FASTA sequences that are longer than N bases or shorter or exactly N bases from a multi fasta file.
I think this post should help you.
bioawk -c fastx '(length($seq) > 60){ print ">"$name"\n"$seq }' input.fasta > output.fasta
One option for R is the seqinr package.
Use read.fasta() to read in a multi-sequence fasta file:
library(seqinr)
f <- read.fasta("myfile.fa") # default is DNA; add ", seqtype = "AA" for protein
Use summary() for a range of sequence statistics, including length:
summary(f[[1]]) # first sequence
summary(f[[1]])$length # length of first sequence
Use lapply() to get sequences longer than N bases:
f1 <- f[which(lapply(f, function(x) summary(x)$length) > 60)] # N > 60
# ignore warnings; they relate to sequence composition
There is also sure to be a R/Bioconductor solution using Biostrings.
Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
I believe biopieces can do that as well https://code.google.com/p/biopieces/wiki/HowTo#Howto_filter_sequences_on_length