Question

Removing Illumina Adapters From RNA-Seq Data

12.9 years ago

Agatha ▴ 350

Hi,

I would like to remove the adapters from raw RNA-seq libraries. I have tried cutadapt (http://code.google.com/p/cutadapt/), which should allow mismatches. However, when I specify the adapter exactly as it was used on the sequencing machine, P-UCGUAUGCCGUCUUCUGCUUGUidT, no sequences are trimmed. When I tried the default FASTX Galaxy dummy adapter, TGTAGGCC, more than 70,000 sequences were trimmed.

I have also tried the trimLRPatterns function from Biostrings/Bioconductor, but I have the same issue as with cutadapt, so I suspect I am not specifying the correct string to clip.

Also, I cannot do any data manipulation in Galaxy because the file (approx. 4.5 GB) has been loading for two days, so I need to find another solution.

What adaptor substrings should be used when dealing with RNA seq data? (not the entire default Illumina adapters)

Which tool is best for this step of the quality control process?

Sample sequences from the unprocessed FASTQ file:

GTCTGTGATGAATTGCNTTGACTTCTGNNNNNNNNN

CGGACAGGATTGACAGNTTGATAGCTCNNNNNNNNN

AGTCTGTGATGAATTGNTTTGACTTCTNNNNNNNNN

CAGGAACGGTGCACCANTCTCGTATGCNNNNNNNNN

Edit for anyone reading this post:

I have used FAR successfully. It makes it easy to specify particular subsequences of the adapter, and it uses a pwa algorithm to score the best match in the read.

illumina rna adaptor next-gen sequencing fastq • 23k views

ADD COMMENT • link updated 12.8 years ago by Malachi Griffith 20k • written 12.9 years ago by Agatha ▴ 350

Please provide some example sequences from the data you are interested in that contain the adapter sequences you wish to remove.

ADD REPLY • link 12.9 years ago by Malachi Griffith 20k

@malachig - I have updated my question with the requested info.

ADD REPLY • link 12.9 years ago by Agatha ▴ 350

score 7 · Answer 1 · 2012-01-08

There are probably many tools in addition to those that you list. How about the flexible adapter remover, 'FAR'?

From the SourceForge description:

FAR is an ideal tool for preprocessing sequencing data
FAR removes adapter sequences from sequencing runs via global alignment (exact)
FAR can be used to demultiplex barcoded sequencing runs (illumina sequencing runs)
FAR supports basic trimming of reads before/after alignment global alignment
FAR supports colorspace and basepairspace sequencing data
FAR supports phred quality trimming
FAR runs in parallel on multiple cpu's and supports Linux/Windows 32 and 64 bit
FAR gives detailed reports (e.g. length distribution of the reads trimmed) in the output
FAR significantly improves mapping rates and genome/transcriptome assemblies

FAR allows mismatches (see the --cut-off parameter) and can identify partial adapter sequences, where the read runs only some number of bases into the adapter at the end of the read (see the --min-overlap parameter).
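To make those two ideas concrete (a mismatch budget, and a minimum overlap at the end of the read), here is a minimal Python sketch. It is not FAR's algorithm: FAR scores a global alignment, whereas this only counts mismatches at each candidate position and ignores insertions and deletions. The function names and the `min_overlap` / `max_error_rate` parameters are illustrative choices and are not meant to reproduce FAR's --min-overlap or --cut-off semantics. The adapter string is an assumption too: it is the sequence from the question with the P- and idT modifications dropped and U written as T.

```python
def find_adapter_start(read, adapter, min_overlap=8, max_error_rate=0.1):
    """Return the leftmost index in `read` where `adapter` begins, or None.

    The adapter may run off the 3' end of the read, in which case only the
    overlapping prefix of the adapter is compared. Assumes the adapter is at
    least `min_overlap` bases long.
    """
    for start in range(len(read) - min_overlap + 1):
        overlap = min(len(adapter), len(read) - start)
        allowed = int(max_error_rate * overlap)  # mismatch budget for this overlap
        mismatches = 0
        for i in range(overlap):
            if read[start + i] != adapter[i]:
                mismatches += 1
                if mismatches > allowed:
                    break
        else:
            # Inner loop finished without exceeding the budget: accept.
            return start
    return None


def trim_read(read, adapter, **kwargs):
    """Drop trailing N calls, then cut the read at the adapter start if found."""
    read = read.rstrip("N")
    start = find_adapter_start(read, adapter, **kwargs)
    return read if start is None else read[:start]


# Assumed DNA form of the adapter quoted in the question.
ADAPTER = "TCGTATGCCGTCTTCTGCTTGT"

# Fourth sample read from the question.
print(trim_read("CAGGAACGGTGCACCANTCTCGTATGCNNNNNNNNN", ADAPTER))
```

With the defaults above, the fourth sample read is cut just before TCGTATGC, leaving CAGGAACGGTGCACCANTC. Lowering `min_overlap` finds shorter adapter remnants but also raises the chance of trimming real sequence that happens to look like the start of the adapter.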

Remember that it is also possible that the majority of your reads do not contain any adapter sequence at all. In many Illumina libraries, sequencing starts at the end of the adapter sequence and the first base of reported sequence is actual genome/transcriptome sequence. A variety of library types do not follow this pattern and may contain adapter sequences that interfere with aligners that do not perform substring alignment (which is most next-gen sequence aligners).

Have you considered flipping this problem? Instead of searching for an unknown adapter in your reads, try aligning your reads to the genome/transcriptome without trimming anything. What proportion align? Take a small subset of reads and align them with a substring-capable aligner such as BLAST or BLAT. Do you see a pattern in the alignments? Does the entire read align, or do you get X bases at the beginning or end of the read that fail to align? If so, is there a pattern to the sequence that does not align? Does it look like a known Illumina adapter?

score 3 · Answer 2 · 2012-01-08

12.9 years ago

Steve Lianoglou 5.2k

It's not clear from your question, but perhaps you are not having any luck w/ adapter clipping software because you might be mis-specifying the adapter sequence itself?

For example, you say that you are specifying the adapter as P-UCGUAUGCCGUCUUCUGCUUGUidT, but are you really including the P- prefix and the idT suffix? That would be your first problem; you should remove them.

Second: are there really Us in your adapter? Maybe you should replace them with Ts?

Third, you ask:

What adaptor substrings should be used when dealing with RNA seq data? (not the entire default Illumina adapters)

No one can really answer that question for you without knowing the details of the library prep. You'll have to talk to the people preparing the library to know for sure.
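The first two points can be sketched in a few lines of Python. This is only an illustration: the helper names are hypothetical, and it assumes that "P-" and "idT" are chemical modifications written around the oligo rather than bases. The reverse complement helper is included because which strand appears in the reads depends on the library prep, so it can be worth trying both orientations.

```python
def oligo_to_dna(spec, prefix="P-", suffix="idT"):
    """Strip the (assumed) modification labels and write the RNA oligo as DNA."""
    seq = spec.strip()
    if seq.startswith(prefix):
        seq = seq[len(prefix):]
    if seq.endswith(suffix):
        seq = seq[:-len(suffix)]
    return seq.upper().replace("U", "T")


def reverse_complement(seq):
    """Reverse complement of a DNA string made of A, C, G, T and N."""
    return seq.translate(str.maketrans("ACGTN", "TGCAN"))[::-1]


adapter = oligo_to_dna("P-UCGUAUGCCGUCUUCUGCUUGUidT")
print(adapter)                      # TCGTATGCCGTCTTCTGCTTGT
print(reverse_complement(adapter))  # ACAAGCAGAAGACGGCATACGA
```

Either string can then be given to the trimming tool as a plain DNA sequence.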

Other tools you can look at for adapter trimming are:

I'm sure others will offer more. I've only ever really used cutadapt and the fastx-toolkit (more the latter than the former), but I have had good success with them.

ADD COMMENT • link 12.9 years ago by Steve Lianoglou 5.2k

@Steve Lianoglou - I am not including the suffixes, but it is still not working: no sequence is trimmed. I will try to reverse complement the sequence and then remove the adapter; you might have a point. Thank you. Regarding the third point, I am just using some libraries from NCBI, and in the associated paper I could find brief details regarding the library preparation. I do not have any experience with sequencing data, so at this point I am not sure how to work out from that which substrings I could use. So you are saying that the only way to do this is to contact the people who did the sequencing? Thanks.

ADD REPLY • link 12.9 years ago by Agatha ▴ 350

@agatha: if this is from an already published paper, you can give us the reference and we can help you identify the appropriate adapter sequence.

ADD REPLY • link 12.9 years ago by Steve Lianoglou 5.2k

@Steve Lianoglou [1] C. E. Joyce et al., “Deep sequencing of small RNAs from human skin reveals major alterations in the psoriasis miRNAome,” Human Molecular Genetics, vol. 20, no. 20, pp. 4025-4040, Aug. 2011.

ADD REPLY • link 12.9 years ago by Agatha ▴ 350

@ Steve Lianoglou - this is the paper - any help would be greatly appreciated

ADD REPLY • link 12.9 years ago by Agatha ▴ 350

@agatha: I can't seem to download the small RNA prep kit from Illumina: there is a link to download it, but it just redirects to their "order me" web front. Do you have the PDF?

ADD REPLY • link 12.9 years ago by Steve Lianoglou 5.2k

@agatha: One thing you can do is run FastQC on your input FASTQ file. It will detect enriched sequences in the library and try to match them against a list of "known" contaminants. It will list any adapters it finds by name; you can then get each adapter's full-length sequence from the contaminant_list.txt file that comes with FastQC itself.
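If it helps to see the idea, here is a rough Python sketch of "find enriched sequences, then look them up in a list of known contaminants". It is not how FastQC is implemented: it simply counts fixed-length k-mers per read. The function names, the thresholds, and the one-entry contaminant dictionary are all hypothetical stand-ins; a real run would use the entries from FastQC's contaminant list.

```python
from collections import Counter


def enriched_kmers(reads, k=8, min_fraction=0.2):
    """Return {kmer: fraction of reads containing it} for frequent k-mers.

    Each k-mer is counted at most once per read, and k-mers containing an
    N call are skipped.
    """
    counts = Counter()
    total = 0
    for read in reads:
        total += 1
        kmers = {read[i:i + k] for i in range(len(read) - k + 1)}
        counts.update(km for km in kmers if "N" not in km)
    if total == 0:
        return {}
    return {km: n / total for km, n in counts.items() if n / total >= min_fraction}


def match_contaminants(kmers, contaminants):
    """Map contaminant name -> sorted enriched k-mers found inside its sequence."""
    hits = {}
    for name, seq in contaminants.items():
        found = sorted(km for km in kmers if km in seq)
        if found:
            hits[name] = found
    return hits


# Toy reads and a made-up contaminant entry, for illustration only.
reads = ["AAAATCGTATGC", "CCCCTCGTATGC", "GGGGTCGTATGC", "ACGTACGTACGT"]
known = {"hypothetical small RNA 3' adapter": "TCGTATGCCGTCTTCTGCTTGT"}

enriched = enriched_kmers(reads, k=8, min_fraction=0.75)
print(match_contaminants(enriched, known))
```

On real data you would stream reads from the FASTQ file (every fourth line, starting at the second) rather than hold them in a list, and a short k will also pick up abundant genuine sequences such as highly expressed miRNAs.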

ADD REPLY • link 12.9 years ago by Steve Lianoglou 5.2k

@Steve Lianoglou - Yes, I do have the PDF. How can I send it to you?

ADD REPLY • link 12.9 years ago by Agatha ▴ 350

@Steve Lianoglou - I am not sure what you mean by the known contaminants, but I will try to find out. PS: I have run the program with the complement of the adapter sequence and cutadapt crashed; probably the adapter sequence is too long.

ADD REPLY • link 12.9 years ago by Agatha ▴ 350

@agatha: Did you get this sorted out?

ADD REPLY • link 12.9 years ago by Steve Lianoglou 5.2k

@Steve Lianoglou - Yes, I think so. I will see how correct it is by checking whether I find any isomiRs after mapping :-) Thank you for your help anyway!

ADD REPLY • link 12.9 years ago by Agatha ▴ 350