Hi-
CS student here, totally new to bioinformatics. My supervisor explained a master's thesis topic to me, which sounded interesting and which I thought I understood. However, when I tried to read an article describing the method I could use, I understood nada. Can you help me?
The task my supervisor proposed sounded to me like "find piRNA clusters (many short strings next to each other) in a genome (one darn long string)".
As I have so very little understanding of the topic, framing an intelligible question is not feasible for me (so perhaps this thread is beyond redemption already?). Anyhow, I have pasted a summary of the method (described in "Bioinformatic analysis of barcoded cDNA libraries for small RNA profiling by next-generation sequencing") piece by piece and, below each piece, tried to explain what I think it might mean. Would love it if you could explain and clear up any misconceptions.
We begin:
Next generation sequencing outputs are text files which report sequence and a quality score for each sequenced base.
So each letter in "ACC" gets a score like "1", "3" and "8"? Based on what? Does the substring/piRNA get the score or the long string/genome?
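To make my mental picture concrete, here is a minimal Python sketch. It assumes the files are in the common FASTQ format with Phred+33 encoding, where each base has one quality character and the score is that character's ASCII code minus 33. The summary does not name the file format, so this is my assumption, and the record below is made up.

```python
# Sketch only: assumes FASTQ with Phred+33 quality encoding (my assumption,
# the method summary does not say which format is used).

def phred_scores(quality_line, offset=33):
    """Turn a FASTQ quality string into one integer score per base."""
    return [ord(ch) - offset for ch in quality_line]

def error_probability(q):
    """Phred definition: Q = -10 * log10(P_error), so P_error = 10^(-Q/10)."""
    return 10 ** (-q / 10)

# A made-up four-line FASTQ record: header, sequence, separator, qualities.
record = ["@read_1", "ACC", "+", "I5#"]
sequence, qualities = record[1], record[3]

# Pair every base with its own score.
per_base = list(zip(sequence, phred_scores(qualities)))
```

If this picture is right, the scores belong to the individual letters of the short sequenced string, not to the genome, but please correct me.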
These files are processed to (1) trim the 3' barcoded adapter sequence from each read and assign the read to a specific subsample according to the barcode,
What is a read? I suspect it is a piRNA string to be tested against the genome.
Why are reads assigned to a subsample? Is it something like you try to put piRNA1, piRNA2, piRNA3 next to each other to create a longer substring to test against the genome (to find a "piRNA-cluster"?)
What is a barcode?
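Here is my best guess at step (1), written as code. I am assuming that each read is a short string that ends with an adapter string, and that the first few letters of that adapter (the "barcode") identify the subsample. That is only my reading of the sentence. All sequences and names below are made up.

```python
# Sketch of my guess at step (1). Adapter layout, barcodes and sample names
# are hypothetical, not taken from the paper.

COMMON_PART = "TCGTATGCCG"  # made-up shared part of the 3' adapter
BARCODED_ADAPTERS = {
    "sample_A": "ACGTA" + COMMON_PART,  # made-up 5-letter barcode
    "sample_B": "TTGCA" + COMMON_PART,
}

def trim_and_assign(read):
    """Return (sample, trimmed_read) if a known adapter occurs, else None."""
    for sample, adapter in BARCODED_ADAPTERS.items():
        pos = read.find(adapter)
        if pos != -1:
            # Keep only what comes before the adapter.
            return sample, read[:pos]
    return None

reads = [
    "TGAGGTAGTAGGTTGTATAGTT" + BARCODED_ADAPTERS["sample_A"],
    "TGAGGTAGTAGGTTGTATAGTT" + BARCODED_ADAPTERS["sample_B"],
    "GGGGGGGGGG",  # no adapter found, so it is not assigned
]

by_sample = {}
for read in reads:
    hit = trim_and_assign(read)
    if hit is not None:
        sample, trimmed = hit
        by_sample.setdefault(sample, []).append(trimmed)
```

I only handle exact adapter matches here. I assume real tools tolerate sequencing errors in the adapter, but I do not know how.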
(2) generate files with unique (non-redundant) reads for each subsample listing the times each unique read is encountered
So you don't want to test reads that are too similar against the genome to save time. Think I got this one.
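As a sketch of how I picture step (2): I am assuming "unique" means exactly identical strings, and that the count is kept so no information is lost. The summary does not say whether near-identical reads are merged too, so that part is my assumption. The reads are made up.

```python
from collections import Counter

def collapse_reads(reads):
    """Map each distinct read string to the number of times it was seen."""
    return Counter(reads)

# Made-up trimmed reads from one subsample.
reads = ["ACGT", "ACGT", "TTGA", "ACGT", "TTGA", "GGCA"]
unique_reads = collapse_reads(reads)

# List the unique reads, most frequent first.
for seq, count in unique_reads.most_common():
    print(seq, count)
```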
(3) remove low complexity sequences and adapter-adapter ligation products
So you don't want to test for the string "AAAAAAAA", which presumably is an example of "low complexity". Do not understand what the second part is about at all.
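To show what I mean by "low complexity", here is a small sketch that scores a string by the Shannon entropy of its letters. This is my own stand-in. I do not know which criterion the paper actually uses, and the threshold of 1.0 bit is an arbitrary value I picked.

```python
import math
from collections import Counter

def shannon_entropy(seq):
    """Entropy in bits of the letter distribution (0 for a single repeated letter)."""
    if not seq:
        return 0.0
    n = len(seq)
    return -sum((c / n) * math.log2(c / n) for c in Counter(seq).values())

def is_low_complexity(seq, min_entropy=1.0):
    """Flag sequences whose letter entropy is below a hypothetical threshold."""
    return shannon_entropy(seq) < min_entropy

candidates = ["AAAAAAAA", "AAAAAAAT", "ACGTACGT"]
kept = [s for s in candidates if not is_low_complexity(s)]
```

A letter-frequency measure like this would miss repeats such as "ACACACAC", so I assume real filters look at longer patterns as well.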
(4) map the unique reads to the genome
Whatever a read is.
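In my CS-student head, step (4) is just substring search, so here is the naive version. It assumes exact matches only and also searches the reverse complement, since as far as I understand a read can come from either DNA strand. I assume real mapping tools use an index and allow mismatches, so this is not what the paper does. The genome and read are made up.

```python
# Naive exact-match "mapping" sketch; real aligners work differently.

_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Complement each base, then reverse the string."""
    return seq.translate(_COMPLEMENT)[::-1]

def find_all(genome, pattern):
    """Return every 0-based start position of pattern in genome (overlaps allowed)."""
    hits = []
    start = genome.find(pattern)
    while start != -1:
        hits.append(start)
        start = genome.find(pattern, start + 1)
    return hits

def map_read(genome, read):
    """Positions of the read on the given strand (+) and the opposite strand (-)."""
    return {
        "+": find_all(genome, read),
        "-": find_all(genome, reverse_complement(read)),
    }

genome = "AAACGGTTCCGTAA"  # made-up toy genome
positions = map_read(genome, "ACGG")
```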
and (5) annotate the reads with a specific hierarchy of small RNA annotation databases.
So you look up the suspected piRNA-candidates in a database. Why? To try to find their possible function?
What would be some relevant literature that deals with similar bioinformatics topics?