Hi all, After mapping, I got a SAM file containing approximately 30 million mapped reads. Then, using bedtools bamtobed, I was able to obtain a BED file. However, it contains many identical reads. To reduce the size of the final BED file, is there any way to list each read only once, together with the number of times it appears in the SAM file?
For instance:
var_1 0 15 ATGCATGCATGCCGTA
var_1 0 15 ATGCATGCATGCCGTA
var_1 0 15 ATGCATGCATGCCGTA
var_2 5 20 ATGCATGCGGGCCCC
Will become:
var_1 0 15 ATGCATGCATGCCGTA 3
var_2 5 20 ATGCATGCGGGCCCC 1
Thank you in advance!
That's so cool, thank you. However, if I want the output in order like this:
var_1 0 15 ATGCATGCATGCCGTA 3
var_2 5 20 ATGCATGCGGGCCCC 1
Is there any command that I can use? Thanks.
Add | awk '{print $2,$3,$4,$5,$1}' to the end of the original command. This moves the count from the first column to the last, which gives the order you asked for.
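To make the whole pipeline concrete, here is a minimal sketch, assuming the original command was along the lines of sort | uniq -c and that the reads have exactly four whitespace-separated columns as in the example above. The printf input is just the toy data from the question; in practice you would start from your own file (e.g. sort myfile.bed | ...).

```shell
# Toy input: the four example reads from the question.
# sort puts identical lines next to each other,
# uniq -c collapses them and prefixes each distinct line with its count,
# awk moves that count (field 1) to the last column.
result=$(printf '%s\n' \
  'var_1 0 15 ATGCATGCATGCCGTA' \
  'var_1 0 15 ATGCATGCATGCCGTA' \
  'var_1 0 15 ATGCATGCATGCCGTA' \
  'var_2 5 20 ATGCATGCGGGCCCC' \
  | sort | uniq -c | awk '{print $2,$3,$4,$5,$1}')

echo "$result"
# var_1 0 15 ATGCATGCATGCCGTA 3
# var_2 5 20 ATGCATGCGGGCCCC 1
```

Note that awk joins the fields with single spaces by default. If you need tab-separated output, something like awk -v OFS='\t' '{print $2,$3,$4,$5,$1}' should work, and if your BED file has more than four columns the field list has to be extended accordingly.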
Oh you beat me by a second :D
Sure, you can just pipe the results into awk, for instance:

Hi, I notice that for my BED file (which contains ~30 million reads and is ~2.5 GB in size), I need to sort the file using sort myfile.bed before using uniq -c. Otherwise, uniq -c alone does not give the expected result. What could be the reason for that? When do I need to sort my file? Thanks.
Indeed, uniq works only on sorted input because it only checks for duplicates in adjacent lines. So basically, you always need to sort your file first (unless it is already sorted).
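A small sketch of the difference, using a made-up three-line input where the two "a" lines are not adjacent. The trailing awk only strips the padding that uniq -c puts in front of the counts. The last variant is an alternative that was not suggested in this thread: counting with an awk associative array, which needs no sorted input but keeps every distinct line in memory and prints them in no particular order (hence the final sort).

```shell
# Without sort: the two "a" lines are separated by "b", so uniq treats them as different runs
unsorted=$(printf 'a\nb\na\n' | uniq -c | awk '{print $1, $2}')

# With sort: identical lines become adjacent and are counted together
sorted=$(printf 'a\nb\na\n' | sort | uniq -c | awk '{print $1, $2}')

# Alternative (not from this thread): count in an awk array, no pre-sorting required
hashed=$(printf 'a\nb\na\n' | awk '{n[$0]++} END {for (line in n) print n[line], line}' | sort)

echo "$unsorted"
# 1 a
# 1 b
# 1 a
echo "$sorted"
# 2 a
# 1 b
echo "$hashed"
# 1 b
# 2 a
```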