Question

non-ACGT characters in simulated RNA-seq reads

0

8.5 years ago

Fadel ▴ 20

I am using Flux Simulator to generate RNA-seq reads, and I found that some reads contain lower-case characters [acgtn].

What is the difference between these lower-case letters and the IUPAC ambiguity codes?

I would highly appreciate it if anyone could clarify what they represent.

@I:3796153-3796696W:C32E8.1:2:490:6:489:S
CAAAAAAATGAAGCAAGAGGATTGCAGAAGCAAGAGGATTGCAGAGCAAGTAGGACACGATGCGcacgggGcacAAcgAcaTgCagccncAGAGCggncG
+
IIIIIGHIHFIIFIIIHHEGGDD8@GGDIDIIGGHEBBEBGIHHH@@??EEG<BBB?BHHH25#####################################

Update 1: I noticed that these characters are associated with a low quality score: # corresponds to Q2 in Phred+33 encoding, i.e. P_error = 0.63096.
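The error probability quoted above follows from the standard Phred relation P = 10^(-Q/10). Here is a minimal sketch of that conversion, assuming the qualities are Phred+33 encoded (the usual FASTQ convention, though the thread does not state which encoding the simulator uses). The function name `phred33_to_error` is made up for illustration.

```python
def phred33_to_error(qual_char: str) -> float:
    """Convert one FASTQ quality character to an error probability.

    Assumes Phred+33 encoding: Q = ASCII code - 33, P = 10 ** (-Q / 10).
    """
    q = ord(qual_char) - 33
    return 10 ** (-q / 10)


# '#' is ASCII 35, so Q = 2
print(round(phred33_to_error("#"), 5))  # prints 0.63096
```

So a base with quality # is more likely wrong than right under this encoding.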

P.S. The reference genome is Caenorhabditis_elegans.WBcel235.dna.toplevel.fa

RNA-Seq flux • 2.1k views

ADD COMMENT • link 8.5 years ago by Fadel ▴ 20

1

Generally, lower-case letters are used to represent uncertainty or low confidence. Fastq format expresses this more precisely through per-base quality scores, but fasta is older, and lower-case fasta letters probably predate fastq. N usually indicates you have no idea what the base is. Other IUPAC codes are typically used to indicate a degree of degeneracy (such as a SNP that could be one of two bases) rather than an error probability due to signal levels.

So, I don't recommend using lower-case or non-ACGTN characters in fastq reads. There's no reason for it and it can break a lot of tools.

BBMap's reformat has a couple of flags for dealing with this kind of read, if they are a problem:

reformat.sh in=reads.fq out=fixed.fq touppercase

...will convert lowercase to uppercase, or...

reformat.sh in=reads.fq out=fixed.fq lowercaseton

...will convert them to N. It works with fasta or fastq. Also, the "iupacton" flag will convert degenerate (non-ACGTN) codes to N.
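For anyone who wants to see what these conversions amount to, here is a rough pure-Python sketch. It is not BBMap's implementation; the function name `normalize_bases` and its `lowercase_to` parameter are invented for illustration, and it simply treats anything outside ACGTN as N after handling case.

```python
def normalize_bases(seq: str, lowercase_to: str = "N") -> str:
    """Return seq restricted to the alphabet ACGTN.

    lowercase_to="N"     : lower-case bases become N
    lowercase_to="upper" : lower-case bases are upper-cased
    Any remaining non-ACGTN symbol (e.g. IUPAC R, Y) becomes N.
    """
    out = []
    for base in seq:
        if base.islower():
            base = "N" if lowercase_to == "N" else base.upper()
        if base not in "ACGTN":
            base = "N"
        out.append(base)
    return "".join(out)


print(normalize_bases("ACgtnRY", lowercase_to="N"))      # prints ACNNNNN
print(normalize_bases("ACgtnRY", lowercase_to="upper"))  # prints ACGTNNN
```

The quality string is untouched in both modes, so the record stays a valid fastq entry.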

ADD REPLY • link 8.5 years ago by Brian Bushnell 20k

0

Thanks @Brian. The fasta file is the reference genome from which the reads were generated. The reference genome doesn't contain any lower-case characters, or even N; the lower-case characters are generated by the simulator. To rephrase my question: should I treat 'a' as 'A' or as 'N'? I ask especially because, in the SAM specification, a read sequence may contain the range of characters *|[A-Za-z=.]+

ADD REPLY • link 8.5 years ago by Fadel ▴ 20

1

Regardless of the sam specification, I have never seen a sequencing machine produce lower-case letters (bear in mind that I have never worked with Sanger, just high-throughput platforms). Since the point of read simulators is to mimic the output of a sequencing machine, they should not put out lower-case letters; and, of course, it does not make much sense to put out lower-case sometimes and upper-case sometimes both with the same quality score of 2.

Given that these bases have a quality score of 2 and are thus probably wrong, it's best to assign them N. Better yet, if you quality-trim to a threshold of Q3 they will all disappear, so it won't matter if you did a->A or a->N. Retaining base calls that are probably wrong does not lead to better results.
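To illustrate why trimming makes the a->A versus a->N choice moot, here is a minimal sketch of a 3'-end quality trim. It assumes Phred+33 qualities and uses a deliberately simple rule (drop trailing bases while their quality is below the threshold); real trimmers, including BBDuk, use more refined algorithms. The name `trim_3prime` is hypothetical.

```python
def trim_3prime(seq: str, qual: str, min_q: int = 3, offset: int = 33):
    """Drop trailing bases whose quality is below min_q.

    Assumes Phred+33 by default. Returns the trimmed (seq, qual) pair.
    """
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], qual[:end]


# Toy read: four good bases followed by four Q2 ('#') bases
print(trim_3prime("ACGTacgn", "IIII####"))  # prints ('ACGT', 'IIII')
```

After trimming, no lower-case base is left in the toy read, so whichever conversion was applied beforehand has no effect on the result.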

ADD REPLY • link 8.5 years ago by Brian Bushnell 20k

0

You could use randomreads.sh from BBMap to generate the data and not have lower case bases.

Note, though, that quality-trimming the read above would remove ~40% of it from the 3' end.

ADD REPLY • link 8.5 years ago by GenoMax 152k