Extracting counts of reads into a matrix
2
0
Entering edit mode
6.7 years ago
rmash ▴ 20

I have a txt file with sequences

1   TACCCTGTAGAACCGAATTTGT  miRNA   mmu-mir-10b PM
2   GCATTGGTGGTTCAGTGGTAGAATTCTCGCCT    tRNA    Mus_musculus_tRNA-Gly-GCC-4-1   PM
3   TACCCTGTAGATCCGAATTTGT  miRNA   mmu-mir-10a PM
4   GCATTGTGGTTCAGTGGTAGAATTCTCGCCT tRNA    Mus_musculus_tRNA-Gly-GCC-2-2   IM
5   ACCCTGTAGAACCGAATTTGT   other   other   NA
6   TACCCTGTAGAACCGAATTTG   other   other   NA
7   GCATTGGTTCAGTGGTAGAATTCTCGCCT   tRNA    Mus_musculus_tRNA-Gly-GCC-2-7   IM
8   GCATTTGTGGTTCAGTGGTAGAATTCTCGCCT    tRNA    Mus_musculus_tRNA-Gly-GCC-4-1   IM
9   TACCCTGTAGAACCGAATTTGTG miRNA   mmu-mir-10b PM
10  GGTGAATATAGTTTACAAAAAACATTAGACTGTGAATC  tRNA    tRNA-His    IM

I'd like a count matrix based on the 3rd value in each line such that I have something like. What's the best way to do this?

mmu-mir-10b 2
tRNA-His 1
other 2

etc

counts reads unix grep • 2.0k views
ADD COMMENT
0
Entering edit mode

That means you have no repeated reads. I don't see any repeated read in given example.

ADD REPLY
0
Entering edit mode

The command did what it supposed to do. You have repeated substrings, but not repeated strings. For example, "ACCCTGTAGATCCGAATTTGT" is repeated 4 times in other sequences but as a substring.

ADD REPLY
0
Entering edit mode

Yes, I was doing something silly, but I've fixed it and there should now be repeats, as the 4th item in each line, how would I go about making a count matrix from this? Sorry, i'm a novice

ADD REPLY
0
Entering edit mode

Did you derive this strange file from your alignments in SAM/BAM format? Since you are dealing with miRNA (which should have no gaps) why not just count the instances of the miRNA a read is aligned to (in third field of your alignment file) to get the count matrix.

You are interested in miRNA counts and not read counts, is that correct?

ADD REPLY
0
Entering edit mode

yes thats what id like to do but dont know how to in unix

ADD REPLY
0
Entering edit mode

thats correct, have edited main post to make it clearer

ADD REPLY
0
Entering edit mode
6.7 years ago
rmash ▴ 20

This seems to work

awk '{seen[$4]++} END{for(x in seen) print x, seen[x]}' test.txt > test_counts.txt

Is there a better option?

ADD COMMENT
0
Entering edit mode

Better in what terms? If it gives you the right answer then this is fine.

Just to be safe I suggest that you do something similar on your original SAM file so there are less chance of errors.

ADD REPLY
0
Entering edit mode
6.7 years ago
 $ datamash -s -g 4 count 4 < test.list 
Mus_musculus_tRNA-Gly-GCC-2-2   1
Mus_musculus_tRNA-Gly-GCC-2-7   1
Mus_musculus_tRNA-Gly-GCC-4-1   2
mmu-mir-10a 1
mmu-mir-10b 2
other   2
tRNA-His    1

test.list is data from OP and first column is serial numbers from 1 to 10 from OP.

ADD COMMENT

Login before adding your answer.

Traffic: 1873 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6