Hi everyone. I want to find microsatellites in publicly available transcriptome data, using sequencing data from the NCBI SRA archive. I am processing the data in CLC Bio Workbench. I am having trouble at the very first step: are the sequence reads in an SRA file adapter-trimmed or not? Another problem I ran into with Illumina paired-end data is that fastq-dump from the SRA Toolkit does not split the reads into two files.
To check whether the sequences are adapter-trimmed, you can use the FastQC tool: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
Passing the data through a trimming program is not a bad idea either. If the reads were already trimmed, they should come through unchanged; the only thing you have to invest is some time. The same applies to the trimming tool in CLC.
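If you want a quick sanity check before running FastQC or a trimmer, something like the sketch below counts how many reads still contain an adapter sequence. It assumes an uncompressed four-line-per-record FASTQ file and uses `AGATCGGAAGAGC`, the prefix shared by the common Illumina TruSeq adapters, as the default. The function name `adapter_fraction` is made up for this example, and your library may have used a different adapter.

```python
from typing import Iterable

# Assumption: the library was built with standard Illumina TruSeq adapters,
# which share this 13-base prefix. Swap in your own adapter if needed.
ILLUMINA_ADAPTER_PREFIX = "AGATCGGAAGAGC"


def adapter_fraction(fastq_lines: Iterable[str],
                     adapter: str = ILLUMINA_ADAPTER_PREFIX) -> float:
    """Return the fraction of reads whose sequence contains `adapter`."""
    total = 0
    hits = 0
    for i, line in enumerate(fastq_lines):
        if i % 4 == 1:  # second line of each 4-line FASTQ record is the sequence
            total += 1
            if adapter in line.strip().upper():
                hits += 1
    return hits / total if total else 0.0


# Example use on a file (the file name is a placeholder):
#   with open("reads_1.fastq") as handle:
#       print(adapter_fraction(handle))
```

A fraction close to zero suggests the reads were already trimmed; a noticeable fraction means trimming is still worth doing. This is only an exact-match check, so FastQC's adapter content module remains the more thorough test.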
As for fastq-dump, you need to use the --split-files option to get the two reads in two separate files. While you are at it, you may as well use the --origfmt option to recover the original fastq file headers.
What SRA # are you looking at?
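For reference, here is a minimal sketch of how the fastq-dump call with both options could be scripted. It assumes the SRA Toolkit is installed and `fastq-dump` is on your PATH. The helper names `build_fastq_dump_cmd` and `run_fastq_dump` are invented for this example, and the accession is whatever run you are working with.

```python
import shutil
import subprocess
from typing import List


def build_fastq_dump_cmd(accession: str, outdir: str = ".") -> List[str]:
    """Build a fastq-dump command that writes each mate to its own file."""
    return [
        "fastq-dump",
        "--split-files",   # one output file per read of a paired run
        "--origfmt",       # keep only the original sequence name in the header
        "--outdir", outdir,
        accession,
    ]


def run_fastq_dump(accession: str, outdir: str = ".") -> None:
    """Run fastq-dump, failing early if the SRA Toolkit is not installed."""
    if shutil.which("fastq-dump") is None:
        raise RuntimeError("fastq-dump not found on PATH; install the SRA Toolkit")
    subprocess.run(build_fastq_dump_cmd(accession, outdir), check=True)


# Example use (substitute your own run accession):
#   run_fastq_dump("ERR1203908", outdir="fastq")
```

For a paired-end run this should produce two files, named after the accession with `_1` and `_2` suffixes. If only a `_1` file appears, the run is probably single-end.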
Get the fastq files from EBI: http://www.ebi.ac.uk/ena/data/view/ERR1203908
Did you check the project metadata? Whether the reads have been trimmed may be recorded there. Also, maybe fastq-dump is not splitting the files because it is a single-end run?
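One way to check the layout without downloading anything is to ask ENA for the run's metadata. The sketch below assumes the ENA Portal API `filereport` endpoint with the `read_run` result type and its `library_layout` field (reported as SINGLE or PAIRED). The helper names are made up for this example, so treat it as a starting point rather than a tested client.

```python
import csv
import io
from typing import Dict
from urllib.parse import urlencode

# Assumption: ENA Portal API file report endpoint.
ENA_FILEREPORT = "https://www.ebi.ac.uk/ena/portal/api/filereport"


def filereport_url(accession: str) -> str:
    """Build a file report URL asking for the library layout of a run."""
    params = {
        "accession": accession,
        "result": "read_run",
        "fields": "run_accession,library_layout,fastq_ftp",
        "format": "tsv",
    }
    return ENA_FILEREPORT + "?" + urlencode(params)


def parse_layout(tsv_text: str) -> Dict[str, str]:
    """Map each run accession in a TSV file report to its library layout."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return {row["run_accession"]: row["library_layout"] for row in reader}


# Example use (needs network access):
#   from urllib.request import urlopen
#   with urlopen(filereport_url("ERR1203908")) as response:
#       print(parse_layout(response.read().decode()))
```

If the layout comes back as SINGLE, there is nothing for `--split-files` to split. Whether trimming was done is not part of this report and would have to come from the study or experiment description.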