Hello everyone,
I'm a chemist branching out into bioinformatics, so I may get some of the nomenclature wrong; I apologize in advance. The following questions concern NGS data from an in vitro selection of DNA aptamers.
What I have:
A library consisting of millions of 50 nt long DNA sequences in SAM format (the file is about 2 GB). Each sequence has the same 5 nt at the 3' end, and the remaining 45 nt are semi-random. Some sequences should appear only once in the library and some should appear hundreds of times.
What I need to do:
I first need to remove any sequences without the proper 5 nt at the 3' end (for some reason, some of the sequences were misread). I then need to collect identical sequences into groups (called binning, I believe), and then sort the groups by number of sequences. As stated previously, most of the groups will have only a single member but some should be quite large.
That's really it.
If someone could please suggest a method to perform this task, I would be very grateful. I would prefer software compatible with Windows 7, but I could also do this on Linux. I have access to MATLAB.
Your help is very much appreciated.
Thanks!
FYI, if you're going to be doing a lot of NGS work as a bioinformatician, it will be helpful to learn R, as many libraries exist that will help you with the tasks you'll face. R should be easy to learn if you already know MATLAB. I went from MATLAB to R and it was pretty straightforward.
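For the specific task in the question (filter on the 3' end, bin identical sequences, sort bins by size), here is a minimal sketch in Python rather than R or MATLAB. It rests on several assumptions that are not stated in the question: the reads are unaligned, so the SEQ column (the 10th tab-separated field of a SAM record) holds each read as sequenced; the set of unique sequences fits in memory; and the function names, file names, and the 5 nt tail used in the usage note are all hypothetical placeholders.

```python
from collections import Counter


def count_sequences(sam_lines, tail, length=50):
    """Return (sequence, count) pairs, largest group first.

    sam_lines: any iterable of SAM text lines (e.g. an open file).
    tail:      the expected constant region at the 3' end (placeholder value).
    length:    the expected read length; other lengths are discarded.
    """
    counts = Counter()
    for line in sam_lines:
        if line.startswith("@"):          # skip SAM header lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:              # SAM records have 11 mandatory fields
            continue
        seq = fields[9].upper()           # SEQ is the 10th column
        # Assumption: reads are unaligned, so SEQ is not reverse-complemented.
        if len(seq) == length and seq.endswith(tail):
            counts[seq] += 1              # identical sequences share one bin
    return counts.most_common()           # sorted by count, descending


def write_table(sam_path, out_path, tail):
    """Read a SAM file and write 'sequence<TAB>count' lines, largest bin first."""
    with open(sam_path) as sam:
        ranked = count_sequences(sam, tail)
    with open(out_path, "w") as out:
        for seq, n in ranked:
            out.write(f"{seq}\t{n}\n")
```

A call such as `write_table("library.sam", "counts.tsv", "GGTCA")` would then produce a tab-separated table that opens in a spreadsheet or in MATLAB; substitute your real file names and constant region. The file is read one line at a time, so only the table of unique sequences is held in memory, and the script runs the same way on Windows and Linux.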