Question

Sorting And Binning Reads From A Sam File

1

Entering edit mode

11.1 years ago

zeropoint1 ▴ 10

Hello everyone,

I'm a chemist branching out into bioinformatics and I may use some of the incorrect nomenclature so I apologize in advance. The following questions concern NGS data from an in vitro selection of DNA aptamers.

What I have:

A library consisting of millions of 50 nt long DNA sequences in SAM format (the files is about 2 Gb). Each sequence has an identical 5 nt at the 3' end and the remaining 45 nt are semi-random. Some sequences should appear only once in the library and some sequences should appear hundreds of times.

What I need to do:

I first need to remove any sequences without the proper 5 nt at the 3' end (for some reason, some of the sequences were misread). I then need to collect identical sequences into groups (called binning, I believe), and then sort the groups by number of sequences. As stated previously, most of the groups will have only a single member but some should be quite large.

That's really it.

If someone could please suggest a method to perform this task, i would be very grateful. I would prefer software compatible with windows 7 but I could also do this on linux. I have access to matlab.

Your help is very much appreciated.

Thanks!

sam ngs sequencing sort • 3.1k views

ADD COMMENT • link updated 11.1 years ago by Pierre Lindenbaum 164k • written 11.1 years ago by zeropoint1 ▴ 10

0

Entering edit mode

FYI if you're going to be doing a lot of NGS work as a bioinformatician, it will be helpful to learn R as many libraries exist that will help you with tasks you'll face. For you it will be easy to learn R if you already know MATLAB. I went from MATLAB to R and it was pretty straightforward.

ADD REPLY • link 11.1 years ago by Dan D 7.4k

score 2 · Answer 1 · 2013-10-16

2

Entering edit mode

11.1 years ago

Istvan Albert 102k

You can use a combination of bioawk and shell scripts to do this like so:

cat data.sam | ~/bin/bioawk -c sam '{ print substr($seq, 5, 45) }' | sort | uniq -c | sort -rg

will give you a count for each sequence

ADD COMMENT • link 11.1 years ago by Istvan Albert 102k

0

Entering edit mode

Badass one-liner

ADD REPLY • link 11.1 years ago by Dan D 7.4k

score 2 · Answer 2 · 2013-10-16

2

Entering edit mode

11.1 years ago

Pierre Lindenbaum 164k

export LC_ALL=C
cut -f 10 data.sam | cut -c 5-50 | sort | uniq -c | sort -rg

ADD COMMENT • link 11.1 years ago by Pierre Lindenbaum 164k

1

Entering edit mode

Unexpectedly your plain version is slower than the bioawk version.

$ time cat data.sam | ~/bin/bioawk -c sam '{ print substr($seq, 5, 45) }' > /dev/null 

real    0m1.046s
user    0m1.008s
sys    0m0.052s

$ time cut -f 10 data.sam | cut -c 5-50 > /dev/null

real    0m1.288s
user    0m1.764s
sys    0m0.044s

I only ran it to see how much slower the bioawk version was ;-) plus it has the advantage of not including the headers