I am puzzled about how to convert the quality encoding of my data to Sanger format (Phred+33) and how to trim the adapter sequence. I have also used the FastQC tool
to look for overrepresented sequences, but found nothing useful. There was also no information in the paper or in NCBI GEO (http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE37686).
$ head -n8 SRR493015.fastq
@SRR493015.1 HWI-ST667_0105:1:1101:1543:1997 length=37
CCCCCTGGGCCTCTCTGTAGGCACCATCAATCTGATC
+SRR493015.1 HWI-ST667_0105:1:1101:1543:1997 length=37
FFFFFHFHHHJIJIJIIHJJJJIJJJJJJJIJJJIIJ
@SRR493015.2 HWI-ST667_0105:1:1101:2733:1999 length=37
TCGTACGACTCTTATCTCTGTAGGCACCATCAATCTG
+SRR493015.2 HWI-ST667_0105:1:1101:2733:1999 length=37
FDFDFHHHHGIIGGIHJJJJJIEHEHGGHHIIJJJIG
$ awk 'NR % 4 ==0' SRR493015.fastq |python guess-encoding.py
# reading qualities from stdin
no encodings for range: (43, 74)
FastQC will also predict the encoding of the data at the beginning of its report. There are many tools for trimming/removing adaptors. I like Trimmomatic. It handles paired-end reads well and will trim adaptors; you need to know the platform used to generate the sequences, because the adaptors differ between platforms.
The odds are good that one line is either malformed or you concatenated two files with different encodings together. What that error is telling you is that the quality strings contain a minimum character code of 43 (meaning either Sanger or Illumina 1.8+ encoding) and a maximum character code of 74 (meaning either Solexa, Illumina 1.3+, or Illumina 1.5+ encoding). That combination isn't valid under any encoding the script knows about. I should note that the snippet you posted looks like Illumina 1.3+ encoding (that's also what the guess-encoding.py script returns). You might gradually increase the number of reads fed to the Python script until you hit this error. Then you'll know where in your fastq file the problem read(s) occur.
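The incremental search described above can also be automated. A rough sketch, assuming a well-formed FASTQ with four lines per record; `first_unexplained_record` and the `ranges` argument are hypothetical names, and the bounds in the demo are illustrative only:

```python
def first_unexplained_record(lines, ranges):
    """Return (record_number, min_code, max_code) for the first record at which
    the cumulative quality range fits no entry in `ranges`, or None if all fit.

    `lines` is any iterable of FASTQ lines; `ranges` maps name -> (low, high)
    in ASCII character codes.
    """
    seen_min = None
    seen_max = None
    for index, line in enumerate(lines):
        if index % 4 != 3:  # the quality string is the 4th line of each record
            continue
        codes = [ord(ch) for ch in line.rstrip("\r\n")]
        if not codes:
            continue
        seen_min = min(codes) if seen_min is None else min(seen_min, min(codes))
        seen_max = max(codes) if seen_max is None else max(seen_max, max(codes))
        fits_any = any(
            low <= seen_min and seen_max <= high for low, high in ranges.values()
        )
        if not fits_any:
            return index // 4 + 1, seen_min, seen_max
    return None


# Illustrative bounds and records (assumptions, not real data).
demo_ranges = {"Phred+33": (33, 73), "Phred+64": (64, 104)}
demo_records = ["@r1", "ACGT", "+", "IIII", "@r2", "ACGT", "+", "JJ+J"]
print(first_unexplained_record(demo_records, demo_ranges))  # (2, 43, 74)
```

Passing an open file handle instead of the demo list would scan a real file the same way, since the function only iterates over lines.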
Edit: Actually, that script seems to have been written before Illumina 1.8 was introduced. If you edit the definition of RANGES at the top so that it also covers Illumina 1.8, then it'll work.
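A minimal sketch of what such a definition and the check it feeds could look like. The numeric bounds are assumptions based on the commonly documented ASCII ranges for each encoding, not a copy of the script's source, and `guess_encodings` is a hypothetical helper name:

```python
# Assumed (min, max) ASCII codes of quality characters for each encoding.
RANGES = {
    "Sanger": (33, 73),
    "Illumina-1.8": (33, 74),   # assumed upper bound 'J'; a looser bound also works
    "Solexa": (59, 104),
    "Illumina-1.3": (64, 104),
    "Illumina-1.5": (66, 104),  # assumed: 'B' (66) is the lowest character used
}


def guess_encodings(observed_min, observed_max, ranges=RANGES):
    """Return the names of all encodings whose range contains the observed one."""
    return [
        name
        for name, (low, high) in ranges.items()
        if low <= observed_min and observed_max <= high
    ]


print(guess_encodings(43, 74))  # ['Illumina-1.8']
```

With these assumed bounds, the range (43, 74) reported in the question matches only the Illumina-1.8 entry, which is Phred+33.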
I also corrected the Illumina-1.5 definition, which was wrong, though there's no practical reason to differentiate 1.3 from 1.5 (or Sanger from 1.8, for that matter).
For the moment, but inevitably Illumina will make it (33, 75) and then (33, 76), so this just future-proofs things a bit. The only really important things are Solexa vs. Sanger Phred scores and then the offset (33 or 64). The remainder doesn't really matter much. In fact, were I to write that script, the outputs would just be Solexa, Phred+33 and Phred+64.
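To make the offset point concrete: moving from Phred+64 to Phred+33 is a fixed shift of 31 in each character code. A minimal sketch, assuming genuine Phred+64 input (Solexa-scaled scores need a different transformation); `phred64_to_phred33` is a hypothetical helper name:

```python
def phred64_to_phred33(quality):
    """Shift a Phred+64 quality string to Phred+33 (subtract 31 per character)."""
    shifted = []
    for ch in quality:
        code = ord(ch) - (64 - 33)
        if code < 33:
            # A character below '@' (64) cannot occur in Phred+64 data.
            raise ValueError(f"{ch!r} is not a valid Phred+64 quality character")
        shifted.append(chr(code))
    return "".join(shifted)


print(phred64_to_phred33("hhhgB"))  # III H #  ->  prints IIIH#
```

A file whose quality characters span (43, 74), as in the question, cannot be Phred+64 because 43 is below 64, so it would not need this conversion.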
Do you wish to change the encoding of the fastq file to Sanger encoding (Phred+33)?
Yes. More importantly, I want to remove the adapters, but I don't know the adapter sequences.
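When the adapter is not documented, the reads themselves can suggest a candidate, because inserts shorter than the read length run into the adapter at the 3' end. A rough sketch using only the Python standard library, comparing the two reads shown in the question; `longest_shared` is a hypothetical helper name, and a match between just two reads is only a hint that should be checked against many more reads:

```python
from difflib import SequenceMatcher


def longest_shared(read_a, read_b):
    """Return the longest contiguous stretch present in both reads."""
    matcher = SequenceMatcher(None, read_a, read_b, autojunk=False)
    match = matcher.find_longest_match(0, len(read_a), 0, len(read_b))
    return read_a[match.a:match.a + match.size]


# The two reads from the head output in the question.
read_1 = "CCCCCTGGGCCTCTCTGTAGGCACCATCAATCTGATC"
read_2 = "TCGTACGACTCTTATCTCTGTAGGCACCATCAATCTG"
print(longest_shared(read_1, read_2))
```

Both reads contain CTGTAGGCACCATCAATC towards the 3' end, so the printed stretch includes it. A candidate confirmed across many reads can then be supplied to an adapter trimmer.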