How To Understand The Encoding And Trim Their Adapters
10.9 years ago
bmechuangye ▴ 20

Hi,

I am puzzled about how to convert the quality encoding of this data to Sanger format (Phred+33) and how to trim the adapter sequences. I also ran FastQC to look for overrepresented sequences, but found nothing informative. Neither the paper nor the NCBI GEO record (http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE37686) gives any information about the adapters.

$ head -n8 SRR493015.fastq
@SRR493015.1 HWI-ST667_0105:1:1101:1543:1997 length=37
CCCCCTGGGCCTCTCTGTAGGCACCATCAATCTGATC
+SRR493015.1 HWI-ST667_0105:1:1101:1543:1997 length=37
FFFFFHFHHHJIJIJIIHJJJJIJJJJJJJIJJJIIJ
@SRR493015.2 HWI-ST667_0105:1:1101:2733:1999 length=37
TCGTACGACTCTTATCTCTGTAGGCACCATCAATCTG
+SRR493015.2 HWI-ST667_0105:1:1101:2733:1999 length=37
FDFDFHHHHGIIGGIHJJJJJIEHEHGGHHIIJJJIG

$ awk 'NR % 4 ==0' SRR493015.fastq |python guess-encoding.py 
# reading qualities from stdin
no encodings for range: (43, 74)

Could you give me some advice? Thanks.


Do you want to convert the FASTQ file to Sanger encoding (Phred+33)?


Yes. More importantly, I want to remove the adapters, but I don't know the adapter sequences.

10.9 years ago
Ian 6.1k

FastQC will also predict the encoding of the data at the beginning of its report. There are many tools for trimming or removing adaptors; I like Trimmomatic. It handles paired-end reads well and will trim adaptors, but you need to know the platform used to generate the sequences, as the adaptors differ between platforms.
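When the adaptor is not documented, one rough way to get a candidate is to look for sequence shared between otherwise unrelated reads. The sketch below is only an illustration of that idea, not how FastQC or Trimmomatic work: it finds the longest substring common to the two reads shown in the question. The function name is hypothetical, and a shared substring is only a hint (it could also be biological sequence), so it should be checked against the adaptor lists for the platform or library kit before trimming.

```python
def longest_common_substring(a: str, b: str) -> str:
    """Return the longest substring present in both a and b (dynamic programming)."""
    best_len = 0   # length of the best match found so far
    best_end = 0   # end position (exclusive) of that match in a
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the match that ended at a[i-2], b[j-2].
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len = curr[j]
                    best_end = i
        prev = curr
    return a[best_end - best_len:best_end]


# The two reads posted in the question.
READ_1 = "CCCCCTGGGCCTCTCTGTAGGCACCATCAATCTGATC"
READ_2 = "TCGTACGACTCTTATCTCTGTAGGCACCATCAATCTG"

shared = longest_common_substring(READ_1, READ_2)
print(shared, len(shared))
```

With only two reads this proves little; running the same comparison over a larger sample of reads gives a more trustworthy candidate.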

10.9 years ago
bmechuangye ▴ 20

Thanks.

Yes, FastQC reports that the encoding of the data is Sanger / Illumina 1.9. Ian, should I set -phred33 or -phred64 when using Trimmomatic?

10.9 years ago

The odds are good that either one line is malformed or you concatenated two files with different encodings. What that error is telling you is that the quality strings contain a minimum ASCII value of 43 (meaning either Sanger or Illumina 1.8+ encoding) and a maximum of 74 (meaning either Solexa, Illumina 1.3+, or Illumina 1.5+ encoding). That combination isn't valid. I should note that the snippet you posted looks like Illumina 1.3+ encoding (that's also what the guess-encoding.py script returns for it). You might gradually increase the number of reads fed to the Python script until you hit this error; then you'll know where in your FASTQ file the problem read(s) occur.
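Rather than rerunning the script by hand with more and more reads, the same search can be done in one pass. The following is a minimal sketch, not part of guess-encoding.py: the function name and the EXAMPLE_RANGES table are assumptions made for illustration, and the table is not the script's actual definition. It tracks the running minimum and maximum ASCII value and reports the first record at which no encoding in the table fits any more.

```python
from typing import Dict, Iterable, Optional, Tuple

# Illustrative table only (an assumption, not the script's real RANGES).
EXAMPLE_RANGES: Dict[str, Tuple[int, int]] = {
    "Sanger": (33, 73),
    "Solexa": (59, 104),
    "Illumina-1.3": (64, 104),
}


def first_incompatible_record(
    quality_lines: Iterable[str],
    ranges: Dict[str, Tuple[int, int]],
) -> Optional[Tuple[int, int, int]]:
    """Return (record_number, running_min, running_max) for the first record
    at which the running ASCII range fits no encoding, or None if none does."""
    running_min, running_max = 256, -1
    for record_number, line in enumerate(quality_lines, start=1):
        codes = [ord(ch) for ch in line.strip()]
        if not codes:
            continue  # skip empty lines
        running_min = min(running_min, min(codes))
        running_max = max(running_max, max(codes))
        fits = any(lo <= running_min and running_max <= hi
                   for lo, hi in ranges.values())
        if not fits:
            return record_number, running_min, running_max
    return None


# The two quality strings from the question, plus one made-up line
# containing '+' (ASCII 43) to trigger the failure.
qualities = [
    "FFFFFHFHHHJIJIJIIHJJJJIJJJJJJJIJJJIIJ",
    "FDFDFHHHHGIIGGIHJJJJJIEHEHGGHHIIJJJIG",
    "FFF+FHH",
]
print(first_incompatible_record(qualities, EXAMPLE_RANGES))  # (3, 43, 74)
```

On a real file, the quality lines could come from the same `awk 'NR % 4 == 0'` filter used in the question.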

Edit: Actually, that script seems to have been written before Illumina 1.8 was introduced. If you edit the definition of RANGES at the top to be as follows, it will work:

RANGES = {
    'Sanger': (33, 73),
    'Solexa': (59, 104),
    'Illumina-1.3': (64, 104),
    'Illumina-1.5': (66, 104),
    'Illumina-1.8': (33, 94)
}

I also corrected the Illumina-1.5 definition, which was wrong, though there's no practical reason to differentiate 1.3 from 1.5 (or Sanger from 1.8, for that matter).
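To see why the edited table resolves the error, the range check can be reproduced in a few lines. This is a sketch under the assumption that an encoding is accepted when the observed range lies inside its (min, max) pair; `compatible_encodings` is a hypothetical helper, not a function from guess-encoding.py.

```python
from typing import Dict, List, Tuple

# The edited table from the post above.
RANGES: Dict[str, Tuple[int, int]] = {
    "Sanger": (33, 73),
    "Solexa": (59, 104),
    "Illumina-1.3": (64, 104),
    "Illumina-1.5": (66, 104),
    "Illumina-1.8": (33, 94),
}


def compatible_encodings(observed_min: int, observed_max: int) -> List[str]:
    """List every encoding whose range contains the observed ASCII range."""
    return [
        name
        for name, (lo, hi) in RANGES.items()
        if lo <= observed_min and observed_max <= hi
    ]


# Range reported in the error message from the question.
print(compatible_encodings(43, 74))  # ['Illumina-1.8']
```

The same check on the two reads shown in the question alone (ASCII 68 to 74) matches several entries, which is why a small sample of high-quality reads is ambiguous.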


Thanks. But I think RANGES should be:

RANGES = {
    'Sanger': (33, 73),
    'Solexa': (59, 104),
    'Illumina-1.3': (64, 104),
    'Illumina-1.5': (67, 104),
    'Illumina-1.8': (33, 74)
}


For the moment, yes, but inevitably Illumina will make it (33, 75) and then (33, 76), so this just future-proofs things a bit. The only really important distinctions are Solexa vs. Sanger Phred scores and then the offset (33 or 64). The remainder doesn't matter much. In fact, were I to write that script, the outputs would just be Solexa, Phred+33 and Phred+64.
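A three-way classifier along those lines might look like the sketch below. It is an assumption about what such a script could do, not existing code: the function name is hypothetical, and the thresholds are taken from the lower bounds in the RANGES tables above (33 for Phred+33, 59 for Solexa, 64 for Phred+64). It only looks at the smallest ASCII value seen, so it needs a sample large enough to include low-quality bases.

```python
def classify_quality_scale(min_code: int) -> str:
    """Classify a FASTQ quality scale from the smallest ASCII value observed."""
    if min_code < 33:
        raise ValueError(f"ASCII {min_code} is below any FASTQ quality range")
    if min_code < 59:
        # Characters below ';' (59) cannot occur in Solexa or Phred+64 data.
        return "Phred+33"
    if min_code < 64:
        # 59-63 encode negative Solexa scores (offset 64).
        return "Solexa"
    # Caveat: Phred+33 data sampled only from high-quality bases also lands here.
    return "Phred+64"


# Minimum value from the error message in the question.
print(classify_quality_scale(43))  # Phred+33
```

For the file in the question, a result of Phred+33 agrees with FastQC's "Sanger / Illumina 1.9" report and corresponds to the -phred33 setting asked about for Trimmomatic.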
