Converting Solexa Reads To Fastq And Identifying Mismatched Bases
2
0
Entering edit mode
11.5 years ago
soosus ▴ 90

I would like to write a simple script or single line perl to convert the following two Solexa sequence reads (assume these reads in a text file "Solexa.export.txt") into fastq format. (Note: the quality scores are coded as ASCII offset of 64):

ILLUMINA-A93428 66      1       1       7848    1267    CGATGT  1       ATGTTGGCTGGCGGTGAAATAAATCTCAAACGTACCTGTTACAAAATTGTGTTAATCCTTTCAGATTCGCAG   ggggggggggggggcfdffdgggfefffffcfcffcccacBeddddded_cffefgggggBBBBBBBBBBBB chrX.fa         122287709       R       40C21GAC7       269                Y

ILLUMINA-A93428 66      1       1       8782    1281    CGATGT  1       GACTCCGGGAAAGCAAGCTGAAGTGTTTTGTGGGACAGGCACATCATTTGGTCAACTCCTTTTCCCTGGTTG   hhhhhhhhhhhhhhhhhhhhghfehhhhhhghfhf_ffffghhhhhhhhdaaghhghhhhgfeghfghhe`] chrX.fa         144709339       F       72      335                        Y

Also need help identifying mismatched bases in "Solexa.export.txt"

r fastq • 3.3k views
ADD COMMENT
0
Entering edit mode
11.5 years ago

You might try this script to convert to SAM/BAM. Then, you can use tools like picard Sam2Fastq and samtools calmd to get the information you need.

https://github.com/samtools/samtools/blob/master/misc/export2sam.pl

ADD COMMENT
0
Entering edit mode
11.5 years ago
Mitch Bekritsky ★ 1.3k

Try this:

awk '/./ {print "@"$1"\n"$9"\n+\n"$10}' Solexa.export.txt

The '/./' in the awk command skips any blank lines.

When I run this one-liner on your sample script, it produces the following output:

@ILLUMINA-A93428
ATGTTGGCTGGCGGTGAAATAAATCTCAAACGTACCTGTTACAAAATTGTGTTAATCCTTTCAGATTCGCAG
+
ggggggggggggggcfdffdgggfefffffcfcffcccacBeddddded_cffefgggggBBBBBBBBBBBB
@ILLUMINA-A93428
GACTCCGGGAAAGCAAGCTGAAGTGTTTTGTGGGACAGGCACATCATTTGGTCAACTCCTTTTCCCTGGTTG
+
hhhhhhhhhhhhhhhhhhhhghfehhhhhhghfhf_ffffghhhhhhhhdaaghhghhhhgfeghfghhe`]

When you say that the encoding is offset by 64, does that mean you want to change it, or is that just an FYI?

Also, what do you mean that you want to identify mismatched bases? Do you just want to find reads that don't match perfectly? In that case, try this:

awk '$14 !~ /^[0-9]+$/ && /./' Solexa.export.txt

I'm assuming that the 14th field in Solexa.export.txt reports the alignment of the read to the reference genome. In that case, any entry where field 14 isn't only digits would imply that there are mismatched bases. In your second sequence, field 14 reports 72, the length of the read. In the first sequence, it reports 40C21GAC7, implying 40 matched bases, then a mismatched C, etc.

ADD COMMENT
0
Entering edit mode

I mean the chromosomal position of locations where the reference base does not match the alternative base. Also, thanks for your reply!

ADD REPLY
0
Entering edit mode

Not sure about the exact math, but you can parse the 14th field and use the start position of the read to get the exact position of the mismatch in the reference genome. Not sure which field represents position in your output since it doesn't look like it's in SAM format, but you could take the start site, add 41, and that should be the position of the C mismatch. Should be a little perl script.

Happy to help!

ADD REPLY

Login before adding your answer.

Traffic: 1803 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6