I have multiple unaligned paired-end RNA-seq FASTQ files that I would like to trim by known adapters and by base quality score.
Because I do not know which sequencer was used (most likely Illumina), I first run the files through SolexaQA++ to determine the format. If the data came from Illumina, I pick the adapter list that matches the version and pass it to cutadapt.
I have visited Illumina's website, which has a PDF of all adapters, but I am wondering how I can convert that list into a FASTA file. Is there a publicly available adapter_list.fasta for RNA-seq samples?
Thank you very much for the help.
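For anyone with the same question, converting a name/sequence list into FASTA can be sketched in a few lines of Python. This is only a sketch under assumptions: the function name adapters_to_fasta and the record names are made up for illustration, and the two sequences are the commonly cited TruSeq read 1 / read 2 adapter-trimming sequences, which should be checked against Illumina's own document for the kit actually used.

```python
from typing import Dict


def adapters_to_fasta(adapters: Dict[str, str]) -> str:
    """Format a {name: sequence} mapping as FASTA text.

    Hypothetical helper for illustration: each record is a '>name'
    header line followed by the sequence on the next line.
    """
    records = []
    for name, seq in adapters.items():
        seq = seq.strip().upper()
        # Reject empty entries and anything that is not a plain DNA base.
        if not seq or set(seq) - set("ACGTN"):
            raise ValueError(f"not a plain DNA sequence: {name!r}")
        records.append(f">{name}\n{seq}\n")
    return "".join(records)


# Assumption: example entries only; verify against Illumina's adapter PDF.
EXAMPLE_ADAPTERS = {
    "TruSeq_Adapter_Read1": "AGATCGGAAGAGCACACGTCTGAACTCCAGTCA",
    "TruSeq_Adapter_Read2": "AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT",
}

if __name__ == "__main__":
    # Redirect to a file to save it, e.g. `python make_fasta.py > adapter_list.fasta`.
    print(adapters_to_fasta(EXAMPLE_ADAPTERS), end="")
```

Copying the sequences out of the PDF by hand into the dictionary is the tedious part; the FASTA formatting itself is trivial.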
EDIT: I found that Trimmomatic supplies a set of Illumina adapters for PE and SE in FASTA format. Today I learned that .fa files are human readable, and that the list is nothing more than each adapter sequence with a ">insert sequencer here" header line above it. What I don't understand is how these clippers (specifically cutadapt) know which adapters are 5' and which are 3', and trim accordingly. Also, is there a conventional way of identifying the sequencer that created a FASTQ file, other than how I am doing it now via SolexaQA++?
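On the 5'/3' point: as far as cutadapt is concerned, the FASTA file itself carries no orientation. The user states the adapter type through the option used: -a (3' adapter, read 1), -g (5' adapter, read 1), and -A / -G for read 2 of a pair; a FASTA file can be passed as file:adapters.fasta after any of them. The sketch below only assembles such a command line in Python. The function name, file names, and the default cutoffs of 20 are assumptions for illustration, not recommended settings.

```python
import shlex
from typing import List, Optional


def build_cutadapt_command(
    r1_in: str,
    r2_in: str,
    r1_out: str,
    r2_out: str,
    adapter_3p_r1: str,
    adapter_3p_r2: str,
    adapter_5p_r1: Optional[str] = None,
    quality_cutoff: int = 20,
    min_length: int = 20,
) -> List[str]:
    """Assemble a paired-end cutadapt call (hypothetical helper).

    The adapter type comes from the option, not from the FASTA file:
    -a / -A are 3' adapters for read 1 / read 2, -g is a 5' adapter.
    """
    cmd = ["cutadapt", "-a", adapter_3p_r1, "-A", adapter_3p_r2]
    if adapter_5p_r1 is not None:
        cmd += ["-g", adapter_5p_r1]
    cmd += [
        "-q", str(quality_cutoff),  # trim low-quality 3' ends
        "-m", str(min_length),      # drop reads shorter than this after trimming
        "-o", r1_out,
        "-p", r2_out,
        r1_in,
        r2_in,
    ]
    return cmd


if __name__ == "__main__":
    # Assumption: file names are placeholders.
    command = build_cutadapt_command(
        "sample_R1.fastq", "sample_R2.fastq",
        "trimmed_R1.fastq", "trimmed_R2.fastq",
        adapter_3p_r1="file:adapters_3p.fasta",
        adapter_3p_r2="file:adapters_3p.fasta",
    )
    print(shlex.join(command))
```

Building the argument list separately makes it easy to inspect the command before handing it to subprocess.run.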
EDIT: I also found that FastQC can determine the sequencer type/version. Now that I know my RNA-seq was sequenced by Sanger/Illumina 1.9, what would be a relevant list of all adapters?
EDIT: SolexaQA++ uses the code below to determine the sequencer for an unknown.fastq.
#!/usr/bin/perl
use strict;
use warnings;

my $format = "";

# set regular expressions: one character class per quality-score range
# ('$' is escaped so that Perl does not interpolate the special variable $%)
my $sanger_regexp = qr/[!"#\$%&'()*+,-.\/0123456789:]/;                # ASCII 33-58
my $solexa_regexp = qr/[\;<=>\?]/;                                     # ASCII 59-63
my $solill_regexp = qr/[JKLMNOPQRSTUVWXYZ\[\]\^\_\`abcdefgh]/;         # ASCII 74-104
my $all_regexp    = qr/[\@ABCDEFGHI]/;                                 # ASCII 64-73, valid in every format

# set counters
my $sanger_counter = 0;
my $solexa_counter = 0;
my $solill_counter = 0;

my $i = 0;
while (<>) {
    $i++;

    # retrieve qualities: every fourth line of a FASTQ record
    next unless $i % 4 == 0;    # numeric comparison, not string 'eq'
    chomp;

    # check qualities; a single Sanger-only character settles the question
    if (m/$sanger_regexp/) {
        $sanger_counter = 1;
        last;
    }
    if (m/$solexa_regexp/) {
        $solexa_counter = 1;
    }
    if (m/$solill_regexp/) {
        $solill_counter = 1;
    }
}

# determine format
if ($sanger_counter) {
    $format = "sanger";
}
elsif (!$sanger_counter && $solexa_counter) {
    $format = "solexa";
}
elsif (!$sanger_counter && !$solexa_counter && $solill_counter) {
    $format = "illumina";
}

print "$format\n";
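The same check can be sketched in Python using ASCII code ranges instead of character classes. This is an illustrative port of the logic above, not SolexaQA++ code; the function name guess_quality_format is made up, and like the Perl version it returns an empty string when only characters shared by all formats (ASCII 64-73) are seen.

```python
import sys
from typing import Iterable


def guess_quality_format(lines: Iterable[str]) -> str:
    """Guess the quality encoding from the quality lines of a FASTQ stream.

    Mirrors the Perl logic: any character in ASCII 33-58 means "sanger";
    otherwise 59-63 means "solexa"; otherwise 74-104 means "illumina".
    """
    saw_solexa = False
    saw_solill = False
    for i, line in enumerate(lines, start=1):
        if i % 4 != 0:  # only every fourth line holds quality characters
            continue
        codes = [ord(c) for c in line.rstrip("\r\n")]
        if any(33 <= c <= 58 for c in codes):
            return "sanger"  # these characters occur in no other format
        if any(59 <= c <= 63 for c in codes):
            saw_solexa = True
        if any(74 <= c <= 104 for c in codes):
            saw_solill = True
    if saw_solexa:
        return "solexa"
    if saw_solill:
        return "illumina"
    return ""  # ambiguous: nothing format-specific was seen


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage (hypothetical script name): python guess_format.py unknown.fastq
    with open(sys.argv[1]) as handle:
        print(guess_quality_format(handle))
```

Note that this identifies the quality-score encoding of the file, which is what the "sanger", "solexa" and "illumina" labels refer to here.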
Thank you very much for the reply. The description.txt will be extremely helpful. I should have known that the pipeline used for RNA-seq would be thoroughly documented in TCGA.
I have seen BBMap mentioned here and there but was hesitant to use it because I am new to bioinformatics and cannot yet weigh tool algorithms against each other. Regardless, I will definitely look into it; the tool seems robust, well documented, and guided. The FASTA was exactly what I was looking for.
Your input has been very helpful. Any additional feedback or recommendations for an RNA-seq pipeline/analysis would be greatly appreciated.
Upvoted, bookmarked, accepted.