Hi,
I would like to remove the adapters from raw RNA-seq libraries. I have tried cutadapt (http://code.google.com/p/cutadapt/), which apparently allows mismatches. However, when I specify the adapter exactly as it was used on the sequencing machine, P-UCGUAUGCCGUCUUCUGCUUGUidT, no sequence is trimmed. When I tried the default FASTX Galaxy dummy adapter, TGTAGGCC, more than 70,000 sequences were trimmed.
I have also tried the trimLRPatterns function from Biostrings/Bioconductor, but I have the same issue as with cutadapt, so I suspect I am not specifying the correct string to clip.
Also, I cannot do any data manipulation in Galaxy, since the file (approx. 4.5 GB) has been loading for two days, so I need to find another solution.
What adapter substrings should be used when dealing with RNA-seq data (rather than the entire default Illumina adapters)?
Which is the best tool for this step in the quality control process?
Sample sequences from the unprocessed FASTQ file:
GTCTGTGATGAATTGCNTTGACTTCTGNNNNNNNNN
CGGACAGGATTGACAGNTTGATAGCTCNNNNNNNNN
AGTCTGTGATGAATTGNTTTGACTTCTNNNNNNNNN
CAGGAACGGTGCACCANTCTCGTATGCNNNNNNNNN
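One thing that may be worth checking (an assumption, not something confirmed in this thread): the reads in a FASTQ file are written in the DNA alphabet, so the RNA letters and the end-modification labels in P-UCGUAUGCCGUCUUCUGCUUGUidT would have to be converted before a trimmer can match them. The sketch below assumes "P-" denotes a 5' phosphate and "idT" a 3' inverted dT, strips both, replaces U with T, and then looks for the first 8 bases of the result in the four sample reads. The helper name and the 8-base seed length are illustrative choices only.

```python
# Adapter exactly as written in the question (RNA alphabet plus end labels).
RAW_ADAPTER = "P-UCGUAUGCCGUCUUCUGCUUGUidT"

# The four sample reads from the question.
READS = [
    "GTCTGTGATGAATTGCNTTGACTTCTGNNNNNNNNN",
    "CGGACAGGATTGACAGNTTGATAGCTCNNNNNNNNN",
    "AGTCTGTGATGAATTGNTTTGACTTCTNNNNNNNNN",
    "CAGGAACGGTGCACCANTCTCGTATGCNNNNNNNNN",
]


def adapter_to_dna(raw):
    """Hypothetical helper: drop the assumed end labels and convert U to T.

    Assumption: "P-" is a 5' phosphate and "idT" is a 3' inverted dT,
    i.e. chemical modifications that do not appear in the reads.
    """
    core = raw
    if core.startswith("P-"):
        core = core[len("P-"):]
    if core.endswith("idT"):
        core = core[:-len("idT")]
    return core.upper().replace("U", "T")


dna_adapter = adapter_to_dna(RAW_ADAPTER)
seed = dna_adapter[:8]  # short adapter prefix; the length 8 is arbitrary

# str.find returns -1 when the seed is absent from a read.
positions = [read.find(seed) for read in READS]

print("DNA adapter:", dna_adapter)
print("Seed:", seed)
for read, pos in zip(READS, positions):
    print(read, pos)
```

With these four reads, only the last one contains the seed, starting at 0-based position 19, just before the run of Ns.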
Edit, for anyone reading this post:
I have used FAR successfully. It makes it easy to specify sub-sequences of the adapter, and it uses a pairwise alignment (pwa) algorithm to score the best match in the read.
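To illustrate the general idea of scoring an adapter against the 3' end of a read while tolerating mismatches, here is a minimal sketch. It is not FAR's or cutadapt's actual algorithm: it does an ungapped comparison only (no insertions or deletions) and skips N bases instead of counting them as mismatches. The function name, the minimum overlap of 5 and the 10% error rate are all assumptions picked for illustration. The adapter string is the one from the question written in the DNA alphabet, with the P- and idT labels dropped.

```python
# Adapter from the question in DNA letters (U -> T), with the "P-" and "idT"
# labels removed; treating those labels as non-sequence is an assumption.
ADAPTER = "TCGTATGCCGTCTTCTGCTTGT"


def find_3prime_adapter(read, adapter, min_overlap=5, max_error_rate=0.1):
    """Return the leftmost 0-based position where the adapter appears to
    start in the read, or None if no position passes the thresholds.

    Ungapped comparison: the adapter prefix is laid over the read from
    each start position to the end of the read. 'N' bases in the read are
    skipped (neither match nor mismatch).
    """
    for start in range(len(read) - min_overlap + 1):
        overlap = min(len(adapter), len(read) - start)
        informative = 0  # compared bases that are not N
        mismatches = 0
        for read_base, adapter_base in zip(read[start:start + overlap], adapter):
            if read_base == "N":
                continue
            informative += 1
            if read_base != adapter_base:
                mismatches += 1
        if informative >= min_overlap and mismatches <= max_error_rate * informative:
            return start
    return None


# Two of the sample reads from the question.
read_with_adapter = "CAGGAACGGTGCACCANTCTCGTATGCNNNNNNNNN"
read_without_adapter = "GTCTGTGATGAATTGCNTTGACTTCTGNNNNNNNNN"

hit = find_3prime_adapter(read_with_adapter, ADAPTER)
miss = find_3prime_adapter(read_without_adapter, ADAPTER)

# Keep only the part of the read before the adapter when one is found.
trimmed = read_with_adapter[:hit] if hit is not None else read_with_adapter

print(hit, trimmed)
print(miss)
```

A real aligner-based trimmer would also handle indels and quality values, so this is only meant to show why a short adapter prefix at the read end can still be detected.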
Please provide some example sequences from the data you are interested in that contain the adapter sequences you wish to remove.
@malachig: I have updated my question with the requested info.