I have a dataset of 5640 protein sequences in FASTA format. I want to build a new dataset containing only the proteins with more than 100 amino acids. How can I do this from the command line (cmd) or in RStudio? What would the script look like? I badly need an answer, please help!
What format is your data in already? Have you created a dataframe?
Please provide more information, ideally a minimal representative dataset (for instance, the first 10 lines of a dataframe, as shown by head(), would do).
I haven't created a separate dataframe. The data are in FASTA format in a FASTA file. For example, below are the first 10 sequences of this dataset...
I just want a command that drops the protein sequences with 100 or fewer amino acids and writes the remaining proteins (more than 100 amino acids) to a new dataset.
For an R solution, see https://stackoverflow.com/questions/8640377/remove-all-rows-where-length-of-string-is-more-than-n for a hint.
However, your data is in a horrible, horrible format at the moment. There's no easy way to tell the headers from the sequences, because they're separated by a single space and both the headers and the sequences contain spaces themselves.
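If the file can be restored to standard FASTA layout, a command-line filter is fairly short. The following is only a minimal sketch, and it rests on several assumptions: a Unix-like shell with awk is available (not the Windows cmd prompt), every header line starts with ">", and the sequence follows on one or more lines until the next header. The function name filter_fasta and the file names in the example are made up for illustration.

```shell
# Keep only FASTA records whose sequence is longer than a threshold.
# Assumption: standard FASTA, i.e. header lines start with ">" and the
# sequence follows on one or more lines until the next header.
filter_fasta() {
    # $1 = input FASTA file, $2 = length threshold (defaults to 100)
    awk -v min_len="${2:-100}" '
        function emit() {
            # print the buffered record only if it is strictly longer than min_len
            if (header != "" && length(seq) > min_len) {
                print header
                print seq
            }
        }
        # a new header: output the previous record, then start a new one
        /^>/ { emit(); header = $0; seq = ""; next }
        # a sequence line: strip spaces, tabs and carriage returns, then append
        { gsub(/[ \t\r]/, ""); seq = seq $0 }
        # do not forget the last record in the file
        END { emit() }
    ' "$1"
}

# Example call (both file names are hypothetical):
# filter_fasta proteins.fasta 100 > proteins_gt100.fasta
```

Note that this writes each kept sequence on a single line, so any line wrapping in the input is lost, and a protein with exactly 100 amino acids is excluded because the comparison is strict.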