Question

Couting number of similar reads in sam file and output in bed file?

1

Entering edit mode

5.1 years ago

bobbyle0210 ▴ 10

Hi all, Currently, after mapping, I got a sam file which contained approximately 30 million mapped reads. Then, by using bedtool bamtobed, I was able to obtain a bed file. However, there were so many many similar reads. Therefore, in order to reduce the size of the final bed file, is there any way that I can mention each read only once as well as its number of presentation in the sam file.

For instance:

var_1 0 15 ATGCATGCATGCCGTA

var_2 5 20 ATGCATGCGGGCCCC

Will become:

var_1 0 15 ATGCATGCATGCCGTA 3

var_2 5 20 ATGCATGCGGGCCCC 1

Thank you in advance!

RNA-Seq samtools bedtools • 1.3k views

ADD COMMENT • link updated 5.1 years ago by Carlo Yague 8.9k • written 5.1 years ago by bobbyle0210 ▴ 10

score 3 · Accepted Answer · 2019-11-08

3

Entering edit mode

5.1 years ago

Carlo Yague 8.9k

Use uniq -c your bed file.

uniq -c mybed.bed > uniqbed.bed

or

echo "var_1 0 38 ATGCATGCATGCCGTA
var_1 0 38 ATGCATGCATGCCGTA
var_1 0 38 ATGCATGCATGCCGTA
var_2 5 40 ATGCATGCGGGCCCC" | uniq -c

3 var_1 0 38 ATGCATGCATGCCGTA
1 var_2 5 40 ATGCATGCGGGCCCC

ADD COMMENT • link 5.1 years ago by Carlo Yague 8.9k

0

Entering edit mode

That's so cool, thank you. However, if I want the output in order like this:

var_1 0 15 ATGCATGCATGCCGTA 3

var_2 5 20 ATGCATGCGGGCCCC 1

Is there any command that I can use? Thanks.

ADD REPLY • link 5.1 years ago by bobbyle0210 ▴ 10

2

Entering edit mode

Add | awk '{print $2,$3,$4,$5,$1}' to end of original command.

$ echo "var_1 0 38 ATGCATGCATGCCGTA
var_1 0 38 ATGCATGCATGCCGTA
var_1 0 38 ATGCATGCATGCCGTA
var_2 5 40 ATGCATGCGGGCCCC" | uniq -c |  awk '{print $2,$3,$4,$5,$1}'

Results in

var_1 0 38 ATGCATGCATGCCGTA 3
var_2 5 40 ATGCATGCGGGCCCC 1

ADD REPLY • link 5.1 years ago by GenoMax 147k

0

Entering edit mode

Oh you beat me by a second :D

ADD REPLY • link 5.1 years ago by Carlo Yague 8.9k

1

Entering edit mode

sure, you can just pipe the results in awk, for instance:

echo "var_1 0 38 ATGCATGCATGCCGTA
var_1 0 38 ATGCATGCATGCCGTA
var_1 0 38 ATGCATGCATGCCGTA
var_2 5 40 ATGCATGCGGGCCCC" | uniq -c | awk '{ print $2,$3,$4,$5,$1; }'

ADD REPLY • link 5.1 years ago by Carlo Yague 8.9k

0

Entering edit mode

Hi, I notice that for my bed file (which contains ~30 million reads and ~ 2.5 GB in size), I need to sort my file using sort myfile.bed before using uniq -c. Otherwise uniq -c only will not provide expected result. What could be the reason for that? When will I need to sort my file? Thanks.

ADD REPLY • link 5.1 years ago by bobbyle0210 ▴ 10

0

Entering edit mode

Indeed, uniq works only on sorted files because it only check for duplicate in contiguous rows. So basically, you always need to sort your files (unless they are already sorted).

ADD REPLY • link 5.1 years ago by Carlo Yague 8.9k