3.9 years ago
I am trying to find the unique sequences in a FASTA file, along with their counts and IDs, in R using Biostrings. For example:
>random sequence 1
tatgtgcgag
>random sequence 2
agggtgttat
>random sequence 3
tatgtgcgag
>random sequence 4
gactcgcggt
>random sequence 5
tatgtgcgag
>random sequence 6
gcagccatcg
>random sequence 7
gactcgcggt
>random sequence 8
tatgtgcgag
>random sequence 9
tatgtgcgag
>random sequence 10
tatgtgcgag
The following code gives me a list of unique sequences
library(Biostrings)
random <- readDNAStringSet("random.fasta")
unique(random)
DNAStringSet object of length 4:
    width seq         names
[1]    10 TATGTGCGAG  random sequence 1
[2]    10 AGGGTGTTAT  random sequence 2
[3]    10 GACTCGCGGT  random sequence 4
[4]    10 GCAGCCATCG  random sequence 6
But I am not sure how to return “count” and “IDs” for each unique sequence and how to remove sequences with ambiguous characters. Can anyone help please? Thanks
This operation might be a lot easier in bioawk. Do you absolutely need to use R? If so, I'd recommend using dplyr to group_by and summarise.
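Something along these lines might work as a starting point for the group_by/summarise route. It is only a sketch: it assumes dplyr is installed alongside Biostrings, and it treats "ambiguous" as any character other than A, C, G, or T. The helper name summarise_unique and the column names (id, sequence, count, ids) are made up for illustration.

```r
library(Biostrings)
library(dplyr)

# Hypothetical helper: takes a DNAStringSet and returns one row per unique
# sequence with its count and the IDs of the records that carry it.
summarise_unique <- function(seqs) {
  # alphabetFrequency() with baseOnly = TRUE reports A, C, G, T and lumps
  # every other letter (N, R, Y, ...) into an "other" column
  freq <- alphabetFrequency(seqs, baseOnly = TRUE)
  clean <- seqs[freq[, "other"] == 0]

  # One row per FASTA record, then collapse the rows by sequence
  tibble(
    id = names(clean),
    sequence = unname(as.character(clean))
  ) %>%
    group_by(sequence) %>%
    summarise(count = n(), ids = paste(id, collapse = "; ")) %>%
    arrange(desc(count))
}

# Usage with the file from the question:
# summarise_unique(readDNAStringSet("random.fasta"))
```

On the ten example records this should report TATGTGCGAG six times (sequences 1, 3, 5, 8, 9, 10), GACTCGCGGT twice (sequences 4 and 7), and the other two sequences once each. Filtering before grouping means a sequence containing N never shows up in the counts at all; if you would rather see those and flag them, move the filter after the summarise step.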
I am trying to learn R, but if there is a simpler command in awk, I would really appreciate it if you could share it.
Did zx8754's solution work? Like I said, you could use awk but it will be more complicated. Even bioawk may not help if your identifiers have white spaces in them.