I'm trying to find all instances of a degenerate DNA sequence (one containing N's, R's, K's, etc.) in the human genome, using the matchPattern function provided in Biostrings. However, when I call matchPattern(pattern, subject, fixed=FALSE) to force the IUPAC extended letters to be interpreted as ambiguity codes, it returns a lot of matches that are all N's, because the beginning and end of each sequenced human chromosome contain thousands of N's. Is there any way to ignore those regions, or to ignore matches that are all N's? Thank you very much.
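One possible way to keep the N-only regions out of the results, sketched under assumptions: the Biostrings documentation describes passing fixed="subject", in which case only the ambiguity codes in the pattern act as wildcards and the letters in the subject are compared literally. The toy subject and pattern below are made up for illustration; for a real chromosome the subject would come from a genome package instead.

```r
library(Biostrings)

# Toy stand-in for a chromosome with N runs at both ends (hypothetical data).
subject <- DNAString("NNNNNNACGTAGGTNNNNNN")
# Degenerate pattern: R stands for A or G.
pattern <- DNAString("ACRT")

# fixed = FALSE: ambiguity codes on BOTH sides act as wildcards,
# so windows inside the N runs are reported as matches too.
hits_all <- matchPattern(pattern, subject, fixed = FALSE)

# fixed = "subject": only the pattern's ambiguity codes act as wildcards;
# an N in the subject is no longer treated as "could be any base".
hits <- matchPattern(pattern, subject, fixed = "subject")

# Start positions of the remaining matches.
start(hits)
```

This sketch does not settle how an N inside the pattern compares against an N in the subject under fixed="subject"; that is worth checking on a small example like this one before running it on whole chromosomes.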
So you want to remove all the N's? Why not use:
gsub("N", "", genome)
Antonio, thanks for the response. I tried using trimLRPatterns, but it seems to trim only up to a fixed amount. For example, if I trim away "NNNN", it only removes the first four and last four N's. Is there a way to trim away all of the N's from all of the chromosomes in one shot, given that the number of N's varies and I don't know beforehand how many there are on either side? Thanks again!
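A minimal sketch of trimming N runs of unknown length from both ends, assuming the chromosome sequences are available as a named vector of plain character strings (for example via as.character() on each DNAString). The names and sequences below are hypothetical toy data. Anchoring the regular expression to the start and end of the string removes only the flanking N's and leaves internal N's in place, whereas gsub("N", "", genome) would remove every N and shift all downstream coordinates.

```r
# Toy stand-ins for chromosome sequences (hypothetical data).
chroms <- c(chr1 = "NNNNNACGTNNACGTNNN",
            chr2 = "NNACGTACGTNNNNNNN",
            chr3 = "ACGT")

# Strip every leading and trailing N, however many there are.
# gsub is vectorised, so all sequences are handled in one call.
trimmed <- gsub("^N+|N+$", "", chroms)

# Count the leading N's removed from each sequence. Add this offset to any
# position found in the trimmed sequence to get the original coordinate.
lead <- attr(regexpr("^N+", chroms), "match.length")
lead[lead < 0] <- 0L  # regexpr reports -1 when there is no leading N run

trimmed
lead
```

Full human chromosomes are very large as character strings, so this approach may be memory-hungry; it is meant to show the idea rather than to be the most efficient route.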