Five Step Method For Small RNA Analysis
12.2 years ago

Hi-

CS student here, totally new to bioinformatics. My supervisor explained a master's thesis topic to me, which sounded interesting and I thought I understood. However, when I tried to read an article describing the method I could use I understood nada. Can you help me?

The task my supervisor proposed sounded to me like "find piRNA clusters (many short strings next to each other) in genome (one darn long string)".

As I have so very little understanding of the topic, framing an intelligible question is not feasible for me (so perhaps this thread is beyond redemption already?) Anyhow, I have pasted a summary of the method (described in "Bioinformatic analysis of barcoded cDNA libraries for small RNA profiling by next-generation sequencing") in quotes and tried to explain what I think it might mean below. Would love it if you could explain and clear up any misconceptions.

We begin:

Next generation sequencing outputs are text files which report sequence and a quality score for each sequenced base.

So each letter in "ACC" gets a score like "1","3" and "8"? Based on what? Does the substring/piRNA get the score or the long string/genome?

These files are processed to (1) trim the 3′ barcoded adapter sequence from each read and assign the read to a specific subsample according to the barcode,

What is a read? I suspect it is a piRNA string to be tested against the genome.

Why are reads assigned to a subsample? Is it something like you try to put piRNA1, piRNA2, piRNA3 next to each other to create a longer substring to test against the genome (to find a "piRNA-cluster"?)

What is a barcode?

(2) generate files with unique (non-redundant) reads for each subsample listing the times each unique read is encountered

So you don't want to test reads that are too similar against the genome to save time. Think I got this one.

(3) remove low complexity sequences and adapter–adapter ligation products

So you don't want to test for the string "AAAAAAAA", which presumably is an example of "low complexity". Do not understand what the second part is about at all.

(4) map the unique reads to the genome

Whatever a read is.

and (5) annotate the reads with a specific hierarchy of small RNA annotation databases.

So you look up the suspected piRNA-candidates in a database. Why? To try to find their possible function?


What would be some relevant literature that deals with similar bioinformatics topics?

12.2 years ago
Konrad ▴ 710

To be honest - I am not sure if such a project makes much sense for you as fundamental background knowledge seems to be missing. But let's try to fill the gaps (and maybe open new ones ...):

  • Regarding the scores of the sequencing output:
  • The reads are the actual sequences that you get from the sequencer.
  • Regarding the barcodes: Search for "multiplexing" / "demultiplexing". The barcode is a short (mostly 4-6 nt) sequence that is attached to the end of the sequences of each sample library (each library gets its own barcode). These libraries are then sequenced together in one sequencing lane. The barcode can then be used to bin (demultiplex) the reads into different groups. The motivation is that the capacity of a sequencer lane is so large that it would be a waste to sequence only a single sample. But if several samples were sequenced together without barcodes, the results could not be assigned to the originating samples.
  • Regarding uniquely mapped reads: You map (align) reads against a reference genome. Ideally, each read should be placed at only one location in the genome. Short reads, or reads originating from sequences that occur multiple times in the genome, are placed at several locations. In that case you don't know the real location for sure, and such reads are often discarded.
  • I think regarding the annotation comparison you are on the right track. If you find similar sequences in a database that have an annotation (function, structure, etc.), you can infer that your sequences have similar features/functions.
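Since you come from CS, the demultiplexing idea may be easier to see as code. This is only a rough sketch: it assumes each read ends exactly with its barcode and that barcodes must match exactly, which real tools do not require. The barcodes, sample names, and the function `demultiplex` are invented for illustration.

```python
from collections import defaultdict

# Hypothetical barcodes; real ones are defined by the library preparation.
BARCODES = {"ACGT": "sample_A", "TGCA": "sample_B"}


def demultiplex(reads, barcodes=BARCODES):
    """Bin reads by the barcode at their 3' end and strip that barcode."""
    bins = defaultdict(list)
    for read in reads:
        for barcode, sample in barcodes.items():
            if read.endswith(barcode):
                # Keep only the insert, i.e. the read without its barcode.
                bins[sample].append(read[:-len(barcode)])
                break
        else:
            # No known barcode found at the end of this read.
            bins["unassigned"].append(read)
    return dict(bins)


reads = ["TTGACCAACGT", "GGATCCTGCA", "TTGACCATGCA", "CCCCCCCC"]
result = demultiplex(reads)
print(result)
```

Each read lands in exactly one bin, so everything downstream (counting, mapping, annotation) can be run per sample.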

Hope this helps.


Thanks. Will read the paper again with updated knowledge.


Maybe this video helps regarding the barcode questions: https://www.youtube.com/watch?v=hgSoJiOoSQQ And here are two videos that cover NGS in general: https://www.youtube.com/watch?v=oPIQ7sre5vk https://www.youtube.com/watch?v=g0vGrNjpyA8 This should be easier to grasp than dry articles.


Holy moly what a resource youtube is. Thanks.

12.2 years ago
Ido Tamir 5.2k

I think you should be able to educate yourself at least a little about the technology and the biological system you are investigating when doing a master's thesis in bioinformatics. You could have read the complete paper, where every step is explained in detail with graphics, although it might lack background (the why). There are many sources that explain the basics, and your supervisor could also provide some education and discussion.

  • quality: http://en.wikipedia.org/wiki/FASTQ_format. Of course it is the read that gets the quality scores. The genome is, for simplicity, treated as accurate.
  • barcode: is explained in the paper in figure 3. http://origin-ars.els-cdn.com/content/image/1-s2.0-S1046202312001764-gr3.jpg and by "assign the reads to subsamples according to their corresponding barcodes"
  • unique: means exactly identical, not "too similar".
  • (3) is explained in detail in point 5 of the procedure section: "Low complexity reads are defined as mono-, di- and tri-nucleotide repeats and are removed from analysis. Reaction by-products, (such as adapter–adapter ligation products) that are the same length as the desired products containing the size-selected insert RNA, are filtered out using the Needleman–Wunsch alignment algorithm."
  • (5) yes. To understand what is already known in the sample and what is new and unknown to the world. However, it's not just piRNA, but an extensive list of known sequences. Point 10 in the paper.
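To make the quality point concrete, here is a minimal Python sketch of decoding one FASTQ record. It assumes the common Phred+33 encoding (older Illumina files used an offset of 64); the record contents and the helper names `parse_fastq_record` and `error_probability` are made up for illustration.

```python
def parse_fastq_record(lines):
    """Split one 4-line FASTQ record into name, sequence and Phred scores."""
    header, sequence, _plus, quality = (line.strip() for line in lines)
    # Assumption: Phred+33, i.e. score = ASCII code of the character minus 33.
    phred = [ord(ch) - 33 for ch in quality]
    return header[1:], sequence, phred


def error_probability(q):
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


# Made-up record: the read "ACC" with one quality character per base.
record = ["@read_1", "ACC", "+", "I5#"]
name, seq, scores = parse_fastq_record(record)

for base, q in zip(seq, scores):
    print(base, q, error_probability(q))
```

So every base of every read carries its own score, and a higher score means the sequencer was more confident in that base call.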

"Whatever a read is." I am sure you manage to read up on this yourself.

I wonder if this is really what your supervisor had in mind, because it's an experiment-driven pipeline. http://www.biomedcentral.com/1471-2105/13/5/abstract (found on the piRNA wiki page) might be more appropriate.


I read the paper several times over. Finding out what a read is seems hard; googling "read" and "bioinformatics" returns nada that seems relevant. And a barcode is a string of bases, but not explained more in depth. Why and how they are sorted according to barcodes is not explained. Still UV.

12.2 years ago
Micans ▴ 270

A lot of software already exists that does this. This does not imply it is not a good project. Rather, it could help you understand the requirements better. In our lab we've written software for all five steps. I'll just highlight two programs that do the first three steps in your list: reaper, for stripping adapters, removing low complexity sequence, quality-based trimming, and demultiplexing, and tally, for deduplication and tallying of reads. Manuals can be found here.
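To give a feel for what steps 2 and 3 amount to, here is a toy Python sketch. It is not how reaper or tally are implemented; it only illustrates the idea. "Unique" means exactly identical, and low complexity follows the paper's definition quoted earlier in the thread (mono-, di- and tri-nucleotide repeats). Removing adapter–adapter ligation products by alignment is left out, and all function names and reads are invented for illustration.

```python
from collections import Counter


def is_low_complexity(seq):
    """True if seq is a mono-, di- or tri-nucleotide repeat (AAAA, ACACA, ACGACG)."""
    for unit_len in (1, 2, 3):
        if len(seq) >= 2 * unit_len:
            unit = seq[:unit_len]
            # Repeat the unit past the read length, then cut to the same length.
            repeated = unit * (len(seq) // unit_len + 1)
            if repeated[:len(seq)] == seq:
                return True
    return False


def collapse_reads(reads):
    """Step 2: count how often each exactly identical read occurs."""
    return Counter(reads)


def remove_low_complexity(counts):
    """Step 3 (partly): drop repeat-only reads, keep the counts of the rest."""
    return {seq: n for seq, n in counts.items() if not is_low_complexity(seq)}


reads = ["TGACCATG", "TGACCATG", "TGACCATC", "AAAAAAAA", "ACACACAC", "TGACCATG"]
counts = collapse_reads(reads)
kept = remove_low_complexity(counts)
print(kept)
```

Note that "TGACCATG" and "TGACCATC" differ in one base and stay separate entries; only identical reads are merged, and each unique read then needs to be mapped just once.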

Added/edit: If existing software works for you I should be a bit more complete. The following is a list without web links, but programs should be easy to locate. They are in no particular order, and I am not really acquainted with most of them. FASTQ/A Clipper software, cutadapt, BTRIM, tagdust, UEA toolkit, mirExpress, seqbuster, DSAP. The reaper/tally software mentioned above will be submitted in a paper at the end of this month, and is actively maintained.


Ahh, so I'll most likely be using software to do this then. Thanks.
