Couting number of similar reads in sam file and output in bed file?
1
1
Entering edit mode
5.1 years ago
bobbyle0210 ▴ 10

Hi all, Currently, after mapping, I got a sam file which contained approximately 30 million mapped reads. Then, by using bedtool bamtobed, I was able to obtain a bed file. However, there were so many many similar reads. Therefore, in order to reduce the size of the final bed file, is there any way that I can mention each read only once as well as its number of presentation in the sam file.


For instance:


var_1 0 15 ATGCATGCATGCCGTA


var_1 0 15 ATGCATGCATGCCGTA


var_1 0 15 ATGCATGCATGCCGTA


var_2 5 20 ATGCATGCGGGCCCC


Will become:


var_1 0 15 ATGCATGCATGCCGTA 3


var_2 5 20 ATGCATGCGGGCCCC 1

Thank you in advance!

RNA-Seq samtools bedtools • 1.3k views
ADD COMMENT
3
Entering edit mode
5.1 years ago

Use uniq -c your bed file.

uniq -c mybed.bed > uniqbed.bed

or

echo "var_1 0 38 ATGCATGCATGCCGTA
var_1 0 38 ATGCATGCATGCCGTA
var_1 0 38 ATGCATGCATGCCGTA
var_2 5 40 ATGCATGCGGGCCCC" | uniq -c

3 var_1 0 38 ATGCATGCATGCCGTA
1 var_2 5 40 ATGCATGCGGGCCCC
ADD COMMENT
0
Entering edit mode

That's so cool, thank you. However, if I want the output in order like this:


var_1 0 15 ATGCATGCATGCCGTA 3


var_2 5 20 ATGCATGCGGGCCCC 1


Is there any command that I can use? Thanks.

ADD REPLY
2
Entering edit mode

Add | awk '{print $2,$3,$4,$5,$1}' to end of original command.

$ echo "var_1 0 38 ATGCATGCATGCCGTA
var_1 0 38 ATGCATGCATGCCGTA
var_1 0 38 ATGCATGCATGCCGTA
var_2 5 40 ATGCATGCGGGCCCC" | uniq -c |  awk '{print $2,$3,$4,$5,$1}'

Results in

var_1 0 38 ATGCATGCATGCCGTA 3
var_2 5 40 ATGCATGCGGGCCCC 1
ADD REPLY
0
Entering edit mode

Oh you beat me by a second :D

ADD REPLY
1
Entering edit mode

sure, you can just pipe the results in awk, for instance:

echo "var_1 0 38 ATGCATGCATGCCGTA
var_1 0 38 ATGCATGCATGCCGTA
var_1 0 38 ATGCATGCATGCCGTA
var_2 5 40 ATGCATGCGGGCCCC" | uniq -c | awk '{ print $2,$3,$4,$5,$1; }'
ADD REPLY
0
Entering edit mode

Hi, I notice that for my bed file (which contains ~30 million reads and ~ 2.5 GB in size), I need to sort my file using sort myfile.bed before using uniq -c. Otherwise uniq -c only will not provide expected result. What could be the reason for that? When will I need to sort my file? Thanks.

ADD REPLY
0
Entering edit mode

Indeed, uniq works only on sorted files because it only check for duplicate in contiguous rows. So basically, you always need to sort your files (unless they are already sorted).

ADD REPLY

Login before adding your answer.

Traffic: 2469 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6