Question

Five Step Method For Small RNA Analysis

4


12.6 years ago


Hi-

CS student here, totally new to bioinformatics. My supervisor explained a master's thesis topic to me, which sounded interesting and I thought I understood. However, when I tried to read an article describing the method I could use I understood nada. Can you help me?

The task my supervisor proposed sounded to me like "find piRNA clusters (many short strings next to each other) in genome (one darn long string)".

As I have so very little understanding of the topic, framing an intelligible question is not feasible for me (so perhaps this thread is beyond redemption already?) Anyhow, I have pasted a summary of the method (described in "Bioinformatic analysis of barcoded cDNA libraries for small RNA profiling by next-generation sequencing") in quotes and tried to explain what I think it might mean below. Would love it if you could explain and clear up any misconceptions.

We begin:

Next generation sequencing outputs are text files which report sequence and a quality score for each sequenced base.

So each letter in "ACC" gets a score like "1","3" and "8"? Based on what? Does the substring/piRNA get the score or the long string/genome?

These files are processed to (1) trim the 3′ barcoded adapter sequence from each read and assign the read to a specific subsample according to the barcode,

What is a read? I suspect it is a piRNA string to be tested against the genome.

Why are reads assigned to a subsample? Is it something like you try to put piRNA1, piRNA2, piRNA3 next to each other to create a longer substring to test against the genome (to find a "piRNA-cluster"?)

What is a barcode?

(2) generate files with unique (non-redundant) reads for each subsample listing the times each unique read is encountered

So you don't want to test reads that are too similar against the genome to save time. Think I got this one.

(3) remove low complexity sequences and adapter–adapter ligation products

So you don't want to test for the string "AAAAAAAA", which presumably is an example of "low complexity". Do not understand what the second part is about at all.

(4) map the unique reads to the genome

Whatever a read is.

and (5) annotate the reads with a specific hierarchy of small RNA annotation databases.

So you look up the suspected piRNA-candidates in a database. Why? To try to find their possible function?

RNA-seq • 5.2k views


What would be some relevant literature that deals with similar bioinformatics topics?


score 5 · Answer 1 · 2012-10-07

To be honest - I am not sure if such a project makes much sense for you as fundamental background knowledge seems to be missing. But let's try to fill the gaps (and maybe open new ones ...):

Regarding the scores of the sequencing output:
- https://en.wikipedia.org/wiki/Phred_score
- https://en.wikipedia.org/wiki/Fastq
The reads are the actual sequences that you get from the sequencer.
Regarding the barcodes: Search for "multiplexing" / "demultiplexing". The barcode is a short (mostly 4-6 nt) sequence that is attached to the end of the sequences of each sample library (each library gets its own barcode). These libraries are then sequenced together in one sequencing lane. The barcode can then be used to bin (demultiplex) the reads into different groups. The motivation is that the capacity of a sequencer lane is so large that it would be a waste to sequence only a single sample in it. But if several samples were sequenced together without barcodes, the results could not be assigned back to the samples they came from.
- http://www.illumina.com/technology/multiplexing_sequencing_assay.ilmn
Regarding uniquely mapped reads: You map (align) reads against a reference genome. Ideally, each read should be placed at only one location in the genome. Short reads, or reads originating from sequences that occur multiple times in the genome, are placed at several locations. In that case you don't know the real location for sure, and such reads are often discarded.
- I think regarding the annotation comparison you are on the right track. If you find similar sequences in a database that have an annotation (function, structure, etc.), you can infer that your sequences have similar features/functions.
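To make the quality scores and the barcode binning described above more concrete, here is a minimal Python sketch. It is not the method from the paper: it assumes the common Phred+33 quality encoding, exact barcode matches, and that the barcode sits at the very end of each read, which real data rarely guarantees. The function names and the barcode-to-sample table are made up for illustration.

```python
from collections import defaultdict


def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string into one Phred score per base.

    Assumes the Phred+33 encoding; older files may use a different offset.
    """
    return [ord(ch) - offset for ch in quality_string]


def error_probability(q):
    """Phred definition: Q = -10 * log10(P), so P = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


def demultiplex(reads, barcodes):
    """Bin reads by the barcode at their 3' end and strip the barcode.

    reads    -- iterable of read sequences (strings)
    barcodes -- dict mapping barcode sequence -> sample name (hypothetical)
    """
    bins = defaultdict(list)
    for seq in reads:
        for barcode, sample in barcodes.items():
            if seq.endswith(barcode):
                bins[sample].append(seq[:-len(barcode)])
                break
        else:
            # No barcode matched exactly; keep the read aside untouched.
            bins["unassigned"].append(seq)
    return dict(bins)
```

For example, `phred_scores("II5!")` decodes the four quality characters of a four-base read, and `demultiplex(reads, {"AAAA": "sample1", "CCCC": "sample2"})` returns one list of barcode-free reads per sample. So the score belongs to each base of a read, not to the genome.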

Hope this helps.

score 2 · Answer 2 · 2012-10-07

I think you should be able to educate yourself at least a little bit, both in the technology and in the biological system you are investigating, when doing a master's thesis in bioinformatics. You could actually have read the complete paper, where every step is explained in detail with graphics, although it might be lacking background (the why). There are many sources that explain the basics, and the supervisor could also do some education/discussion.

quality: http://en.wikipedia.org/wiki/FASTQ_format Of course the read gets the quality score. The genome is, for simplicity, treated as accurate.
barcode: is explained in the paper in figure 3. http://origin-ars.els-cdn.com/content/image/1-s2.0-S1046202312001764-gr3.jpg and by "assign the reads to subsamples according to their corresponding barcodes"
unique: means exactly identical, not "too similar"
(3) is explained in detail in point 5 of the procedure section: "Low complexity reads are defined as mono-, di- and tri-nucleotide repeats and are removed from analysis. Reaction by-products, (such as adapter–adapter ligation products) that are the same length as the desired products containing the size-selected insert RNA, are filtered out using the Needleman–Wunsch alignment algorithm."
(5) yes. To understand what is already known in the sample and what is new and unknown to the world. However, it's not just piRNA, but an extensive list of known sequences. Point 10 in the paper.
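Steps (2) and (3) can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: "unique" means exactly identical strings, and "low complexity" is taken to mean that the whole read is a perfect repeat of a 1-, 2- or 3-nucleotide unit. The paper's actual thresholds may differ, and the adapter-adapter filtering with Needleman–Wunsch is not shown. All function names are made up.

```python
from collections import Counter


def collapse_reads(reads):
    """Step (2): map each distinct read to the number of times it occurs."""
    return Counter(reads)


def is_low_complexity(seq, max_unit=3):
    """Step (3), simplified: is the whole read a repeat of a short unit?

    Checks units of length 1..max_unit (mono-, di-, tri-nucleotide).
    A read must be longer than the unit to count as a repeat.
    """
    for k in range(1, max_unit + 1):
        if len(seq) <= k:
            continue
        unit = seq[:k]
        # Build a repeat at least as long as the read, then cut to size.
        repeated = (unit * (len(seq) // k + 1))[:len(seq)]
        if repeated == seq:
            return True
    return False


def unique_informative_reads(reads):
    """Collapse identical reads, then drop the low complexity ones."""
    counts = collapse_reads(reads)
    return {seq: n for seq, n in counts.items() if not is_low_complexity(seq)}
```

Collapsing first means each distinct sequence only has to be filtered and mapped once, while the count keeps the information about how often it was sequenced.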

"Whatever a read is." I am sure you manage to read up on this yourself.

I wonder if this is really what your supervisor had in mind, because it's an experiment-driven pipeline. http://www.biomedcentral.com/1471-2105/13/5/abstract (found on the piRNA wiki page) might be more appropriate.

score 2 · Answer 3 · 2012-10-08

A lot of software already exists that does this. This does not imply it is not a good project. Rather, it could help you in increasing your understanding of the requirements. In our lab we've written software for all five steps. I'll just highlight two programs that do the first three steps in your list. These are reaper for stripping adapters, removing low complexity sequence, quality-based trimming, and demultiplexing, and tally for deduplication and tallying of reads. Manuals can be found here.
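To give an idea of what "stripping adapters" means, here is a toy Python sketch. It is not how reaper works and does not use its interface: it only handles exact matches, whereas real tools tolerate sequencing errors. The function name, the `min_overlap` parameter and the adapter sequence in the example are all made up.

```python
def trim_3prime_adapter(read, adapter, min_overlap=3):
    """Remove a 3' adapter (and anything after it) from a read.

    Exact matching only. If the full adapter is not present, the read may
    still end with the first few bases of the adapter, so those are
    removed as well when at least min_overlap bases match.
    """
    pos = read.find(adapter)
    if pos != -1:
        return read[:pos]
    # The read may end in the middle of the adapter: try the longest
    # adapter prefix first, down to min_overlap bases.
    longest = min(len(adapter) - 1, len(read))
    for k in range(longest, min_overlap - 1, -1):
        if read.endswith(adapter[:k]):
            return read[:-k]
    return read
```

With a hypothetical adapter `"TCGTATGCCG"`, both `trim_3prime_adapter("ACGTACGTTCGTATGCCGAAA", "TCGTATGCCG")` and `trim_3prime_adapter("ACGTACGTTCGTA", "TCGTATGCCG")` leave the insert `"ACGTACGT"`. What remains after trimming is the small RNA itself, which is what gets collapsed and mapped in the later steps.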

Added/edit: If existing software works for you I should be a bit more complete. The following is a list without web links, but programs should be easy to locate. They are in no particular order, and I am not really acquainted with most of them. FASTQ/A Clipper software, cutadapt, BTRIM, tagdust, UEA toolkit, mirExpress, seqbuster, DSAP. The reaper/tally software mentioned above will be submitted in a paper at the end of this month, and is actively maintained.