6.7 years ago
rmash
I have a text file with sequences:
1 TACCCTGTAGAACCGAATTTGT miRNA mmu-mir-10b PM
2 GCATTGGTGGTTCAGTGGTAGAATTCTCGCCT tRNA Mus_musculus_tRNA-Gly-GCC-4-1 PM
3 TACCCTGTAGATCCGAATTTGT miRNA mmu-mir-10a PM
4 GCATTGTGGTTCAGTGGTAGAATTCTCGCCT tRNA Mus_musculus_tRNA-Gly-GCC-2-2 IM
5 ACCCTGTAGAACCGAATTTGT other other NA
6 TACCCTGTAGAACCGAATTTG other other NA
7 GCATTGGTTCAGTGGTAGAATTCTCGCCT tRNA Mus_musculus_tRNA-Gly-GCC-2-7 IM
8 GCATTTGTGGTTCAGTGGTAGAATTCTCGCCT tRNA Mus_musculus_tRNA-Gly-GCC-4-1 IM
9 TACCCTGTAGAACCGAATTTGTG miRNA mmu-mir-10b PM
10 GGTGAATATAGTTTACAAAAAACATTAGACTGTGAATC tRNA tRNA-His IM
What's the best way to do this? I'd like a count matrix based on the 3rd value in each line, so that I get something like:
mmu-mir-10b 2
tRNA-His 1
other 2
etc
That means you have no repeated reads. I don't see any repeated read in the given example.
The command did what it was supposed to do. You have repeated substrings, but not repeated strings. For example, "ACCCTGTAGATCCGAATTTGT" is repeated 4 times in other sequences, but only as a substring.
Yes, I was doing something silly, but I've fixed it and there should now be repeats in the 4th item of each line. How would I go about making a count matrix from this? Sorry, I'm a novice.
Did you derive this strange file from your alignments in SAM/BAM format? Since you are dealing with miRNA (which should have no gaps), why not just count the instances of the miRNA each read is aligned to (the third field of your alignment file) to get the count matrix?
You are interested in miRNA counts and not read counts, is that correct?
Yes, that's what I'd like to do, but I don't know how to do it in Unix.
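A minimal sketch of the suggestion above (counting the third field of the alignment file), written in Python and assuming an uncompressed, tab-separated SAM file. The function name `count_sam_references` and the example records (read names `r1` to `r4`, positions, CIGAR strings) are made up for illustration; only the sequences and miRNA names come from the post.

```python
from collections import Counter

def count_sam_references(lines):
    """Tally alignments per reference name (3rd SAM field, RNAME)."""
    counts = Counter()
    for line in lines:
        # Header lines start with '@'; skip them and blank lines.
        if not line.strip() or line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # malformed record
        rname = fields[2]
        if rname != "*":  # '*' marks a read with no reference (unmapped)
            counts[rname] += 1
    return counts

# Hypothetical SAM records, for illustration only.
sam_lines = [
    "@HD\tVN:1.6\tSO:unsorted",
    "r1\t0\tmmu-mir-10b\t1\t255\t22M\t*\t0\t0\tTACCCTGTAGAACCGAATTTGT\t*",
    "r2\t0\tmmu-mir-10a\t1\t255\t22M\t*\t0\t0\tTACCCTGTAGATCCGAATTTGT\t*",
    "r3\t0\tmmu-mir-10b\t1\t255\t23M\t*\t0\t0\tTACCCTGTAGAACCGAATTTGTG\t*",
    "r4\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\t*",
]

sam_counts = count_sam_references(sam_lines)
for name, n in sorted(sam_counts.items()):
    print(f"{name}\t{n}")
```

Note that this counts alignment records, not reads: if the aligner reports several alignments for one read, each of them is tallied. A BAM file would need converting to SAM text first.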
That's correct, I have edited the main post to make it clearer.
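For the text file in the main post, one possible approach is to tally the name column directly. The sketch below is in Python and assumes the fields are separated by whitespace and that the leading line number is part of the file, so the name is the 4th item (zero-based index 3). The function name `count_column` is hypothetical.

```python
from collections import Counter

def count_column(lines, column=3):
    """Count how often each value occurs in one whitespace-separated column.

    `column` is a zero-based index; 3 assumes the leading line number
    is part of the file. Use 2 if the file has no leading number.
    """
    counts = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) > column:  # skip blank or short lines
            counts[fields[column]] += 1
    return counts

# The ten example lines from the main post.
example = """\
1 TACCCTGTAGAACCGAATTTGT miRNA mmu-mir-10b PM
2 GCATTGGTGGTTCAGTGGTAGAATTCTCGCCT tRNA Mus_musculus_tRNA-Gly-GCC-4-1 PM
3 TACCCTGTAGATCCGAATTTGT miRNA mmu-mir-10a PM
4 GCATTGTGGTTCAGTGGTAGAATTCTCGCCT tRNA Mus_musculus_tRNA-Gly-GCC-2-2 IM
5 ACCCTGTAGAACCGAATTTGT other other NA
6 TACCCTGTAGAACCGAATTTG other other NA
7 GCATTGGTTCAGTGGTAGAATTCTCGCCT tRNA Mus_musculus_tRNA-Gly-GCC-2-7 IM
8 GCATTTGTGGTTCAGTGGTAGAATTCTCGCCT tRNA Mus_musculus_tRNA-Gly-GCC-4-1 IM
9 TACCCTGTAGAACCGAATTTGTG miRNA mmu-mir-10b PM
10 GGTGAATATAGTTTACAAAAAACATTAGACTGTGAATC tRNA tRNA-His IM
""".splitlines()

counts = count_column(example)
for name, n in sorted(counts.items()):
    print(f"{name}\t{n}")
```

To run it on a real file, pass the open file handle instead of `example`, e.g. `with open("reads.txt") as fh: counts = count_column(fh)` (`reads.txt` is a placeholder name). On the command line, the same tally should come out of `awk '{print $4}' reads.txt | sort | uniq -c`, under the same assumption about the column position.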