I recently received three miRNA-seq datasets (human, rat and mouse; Solexa data) without any adapter information. We are trying to contact the people who sequenced the samples to get this information, but a reply will take some time. As I want to proceed with the analysis, is there a way to find the 3' adapter (essential) and the 5' adapter (optional) of the small RNA reads so that I can start trimming the data?
Have you tried FastQC? Adapter dimers might show up in the overrepresented sequences section of the report, and the overrepresented k-mer section may give you some additional clues.
In the FastQC source, FastQC/Contaminants/contaminant_list.txt has a convenient list of adapters that you can use as a starting point for further analysis.
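If you want to screen reads against such a list yourself, here is a minimal Python sketch. It assumes each entry of the list is a name followed by one or more tabs and then the sequence, with `#` marking comment lines; check that against your copy of the file. The function names, the entry names in the demo and the 12-base prefix length are my own choices for illustration, not anything FastQC does internally.

```python
def parse_contaminants(text):
    """Parse 'name<TAB(s)>sequence' lines, skipping blank lines and '#' comments."""
    adapters = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the last tab so names containing spaces stay intact.
        name, _, seq = line.rpartition("\t")
        if name.strip() and seq.strip():
            adapters[name.strip()] = seq.strip().upper()
    return adapters


def count_adapter_hits(reads, adapters, min_len=12):
    """Count reads that contain the first min_len bases of each adapter."""
    hits = {name: 0 for name in adapters}
    for read in reads:
        for name, seq in adapters.items():
            if seq[:min_len] in read:
                hits[name] += 1
    return hits


# Hypothetical entry names; the sequences are the two adapters quoted in this thread.
demo_list = (
    "# demo list\n"
    "Hypothetical adapter A\t\tTGGAATTCTCGGGTGCCAAGG\n"
    "Hypothetical adapter B\tATCTCGTATGCCGTCTTCTGCTTG\n"
)
demo_reads = [
    "TGACAGAAGAGAGTGAGCACTGGAATTCTCGGGTGC",
    "ACGTACGTACGTACGTACGTACGTACGTACGTACGT",
]
print(count_adapter_hits(demo_reads, parse_contaminants(demo_list)))
```

Only exact matches to the adapter prefix are counted, so sequencing errors in the adapter will be missed. For a first look at which adapter is present, that is usually good enough.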
What do you mean when you say your pipeline doesn't produce such files? You can download FastQC and check for the known adapters as @Daler mentioned. Or you can just give your FASTQ file to FastQC as input and it gives you back a report. In this report (HTML format) you can check whether there are over-represented sequences. If they match known Illumina adapters, FastQC also marks them as such.
I have a cheap little perl script called find_3p_adapter.pl that I wrote for precisely this purpose. It takes the sequence of a microRNA that you know must be present at pretty high levels in your sample, along with the raw untrimmed FASTQ file. For instance, in human brain, miR-124 would be a good bet. The script searches for all occurrences of the miRNA query sequence and tracks the 'suffix' (or suffixes) that comes after it. That will tell you the adapter sequence.
Here's an example with some plant data (so I've used miR156 as the query, b/c it's pretty abundant in most plant tissues):
Algonquin:raw michaelaxtell$ gzip -d -c Apr1A_R1.fastq.gz | ./find_3p_adapter.pl -m ugacagaagagagugagcac
./find_3p_adapter.pl version 0.2
Thu Aug 8 16:12:12 EDT 2013
directory: /Users/michaelaxtell/data/sRNAseq_data/Physcomitrella_patens/HiSeq2500_Apr22_2013/raw
Query sequence: TGACAGAAGAGAGTGAGCAC
Searching...Done
20754 out of 23706986 reads matched query TGACAGAAGAGAGTGAGCAC (875 reads per million)
Here are the top four adapters found:
Sequence Frequency
TGGAATTCTCGGGTGCCAAGGAACTCCAGT 17771 85.627 %
CTGGAATTCTCGGGTGCCAAGGAACTCCAG 937 4.515 %
ATGGAATTCTCGGGTGCCAAGGAACTCCAG 911 4.390 %
TTGGAATTCTCGGGTGCCAAGGAACTCCAG 358 1.725 %
So in the example above, it's a good bet that the adapter starts with "TGGAATTCTCG".
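For anyone who wants to try the same idea without the perl script, here is a minimal Python sketch of the suffix-counting approach described above. It is not the code of find_3p_adapter.pl, just my reading of the description: find the query miRNA in each read and tally whatever follows it. The function names and the `suffix_len` parameter are made up for illustration.

```python
from collections import Counter


def to_dna(seq):
    """Upper-case a query and convert RNA 'U' to DNA 'T'."""
    return seq.upper().replace("U", "T")


def tally_suffixes(reads, query, suffix_len=30):
    """Return (number of reads matching query, Counter of the sequences following it)."""
    query = to_dna(query)
    counts = Counter()
    matched = 0
    for read in reads:
        pos = read.find(query)
        if pos == -1:
            continue
        matched += 1
        start = pos + len(query)
        suffix = read[start:start + suffix_len]
        if suffix:  # skip reads that end exactly at the query
            counts[suffix] += 1
    return matched, counts


# Toy reads built from the miR156 query and the adapter reported above.
q = to_dna("ugacagaagagagugagcac")
demo_reads = [
    q + "TGGAATTCTCGGGTGC",
    q + "TGGAATTCTCGGGTGC",
    q + "CTGGAATTCTCGGGTG",
    "ACGT" * 9,
]
matched, counts = tally_suffixes(demo_reads, "ugacagaagagagugagcac", suffix_len=16)
for suffix, n in counts.most_common(4):
    print(suffix, n, f"{100 * n / matched:.3f} %")
```

On real data you would feed it the sequence lines of the untrimmed FASTQ file. The most frequent suffix is the adapter candidate.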
You can just take a look at the SAM or FASTQ and see if you can identify a common sequence occurring in all the reads. The human brain is pretty great at pattern finding.
updated 5.2 years ago by Ram • written 12.0 years ago by Houkto
It doesn't look like you have any identifiable adapters, but you do have some over-represented sequences in this small sample. It would still be useful to run FastQC on these files.
The regular expression does not find common sequences. grep -o outputs only matching strings. [GATCN] matches one of either G,A,T,C, or N in the string. {30,300} repeats this matching pattern a minimum of 30 times and a max of 300 times. These are values that correspond to common ranges of read lengths you would observe with Solexa or Illumina. The backslashes are just to escape the curly braces.
This prints the entire reads from the SAM output. It's up to the human to determine whether the 3' ends contain any sequence representing an adapter.
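The same pattern can be used from Python, as a rough sketch. In Python's `re` module the curly braces need no backslashes, so the pattern is simply `[GATCN]{30,300}`. Tallying the last few bases of each read is my own addition to make a shared 3' end easier to spot; the function names and the default `k=8` are arbitrary.

```python
import re
from collections import Counter

# One of G, A, T, C or N, repeated 30 to 300 times.
READ_RE = re.compile(r"[GATCN]{30,300}")


def extract_reads(lines):
    """Pull read-like strings out of SAM or FASTQ lines."""
    reads = []
    for line in lines:
        reads.extend(READ_RE.findall(line))
    return reads


def tally_3p_ends(reads, k=8):
    """Count the last k bases of every read."""
    return Counter(read[-k:] for read in reads)


# Two made-up SAM-like records; only the sequence column matches the pattern.
demo_lines = [
    "read1\t4\t*\t0\t0\t*\t*\t0\t0\tACGTACGTACGTACGTACGTACGTTGGAATTCTCGG\t" + "I" * 36,
    "read2\t4\t*\t0\t0\t*\t*\t0\t0\tGGGGCCCCAAAATTTTGGGGCCCCTGGAATTCTCGG\t" + "I" * 36,
]
reads = extract_reads(demo_lines)
print(tally_3p_ends(reads).most_common(5))
```

This only helps when the insert is short enough that the read runs into the adapter at a fixed position. With variable-length small RNAs the adapter starts at different offsets, so you still need to eyeball the reads or use a known miRNA as an anchor.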
Thanks Daler, Matt Shirley and Arun for the help. I misunderstood the bit about FastQC (I thought my pipeline should produce QC files that I could check). I downloaded the FastQC tool and ran it on one of my FASTQ files; a screenshot of the result is here
My sample was sequenced with 36 bp reads. I want to trim the 3' end, and I do not know which of the overrepresented sequences is the true adapter. Any suggestions?
Seems like you used the Small RNA v1.5 Sample Preparation kit. Try to clip the adapter sequence 'ATCTCGTATGCCGTCTTCTGCTTG'. To verify that it worked, you can check the length distribution of your reads after clipping. You should see most of the reads having a length of around 24 nt. To clip your adapter sequence, I would recommend cutadapt. (cutadapt -e 0.15 -O 7 -m 15 -a ATCTCGTATGCCGTCTTCTGCTTG input.fastq -o input.clipped.fastq)
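To check the length distribution after clipping, something like the following Python sketch should do. It assumes a plain, uncompressed FASTQ file with exactly four lines per record; the function names are made up for illustration.

```python
from collections import Counter


def fastq_sequences(lines):
    """Yield the sequence line (2nd of every 4 lines) of each FASTQ record."""
    for i, line in enumerate(lines):
        if i % 4 == 1:
            yield line.strip()


def length_distribution(lines):
    """Count how many reads have each length."""
    return Counter(len(seq) for seq in fastq_sequences(lines))


# Toy records; on real data pass an open file handle instead, e.g.
#   with open("input.clipped.fastq") as fh:
#       dist = length_distribution(fh)
demo_fastq = [
    "@r1", "TGAGGTAGTAGGTTGTATAGTT", "+", "I" * 22,
    "@r2", "TGAGGTAGTAGGTTGTATAGT", "+", "I" * 21,
    "@r3", "TGAGGTAGTAGGTTGTATAGTT", "+", "I" * 22,
]
dist = length_distribution(demo_fastq)
for length in sorted(dist):
    print(length, dist[length])
```

If the adapter is right, the counts should pile up in a narrow range of small RNA lengths rather than at the full read length.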
Thanks David, I suspected that this was the adapter and had already removed it; most of the reads are now around 22 to 23 nt long. However, I was not sure, so thanks for confirming that. I used a tool called Reaper to remove the adapter, but I would like to know about the -e and -o settings of cutadapt: what do they do? Thanks again
Hi, I have a question.
I have a huge FASTA file and I want to extract all sequences whose ID contains c0_g1_i1. Can you tell me how? It may be simple, but I am new. I do not
This question is unrelated to the original post so you should start a new thread/post and then come back and delete this post.
Asking unrelated/new questions by using the Submit Answers option on an existing thread is not going to get you the answers you need.