The Number Substitute The Character
12.4 years ago
camelbbs ▴ 710

Hi,

In the 1000 Genomes data, I found that the FASTQ for the exome sequencing looks like this:

@341035 0PEBCSOLiDPEP20110531001BSOLiDPEP20110531001B1323_187/1
T232102322030210120120010000120000330003.320..03032
+
!'(%%%)%%'%&&&%&%&%%%,)''%&%%%)%.&%%(%.%!&/&!!&()%%

Can reads in this format be mapped to the genome directly with bwa, or do I need to convert the 0, 1, 2, 3 to T, C, G, A first?

How should I interpret it?

Thanks,

Chunjiang

exome seq • 2.2k views
12.4 years ago
matted 7.8k

This is SOLiD colorspace data. There are a variety of approaches to working with it, but they all require a bit of reading and thought. Look through this forum and other sources for relevant information for your current task. This post might be a good place to start.
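On the narrower question of turning 0, 1, 2, 3 into bases: the digits are not bases in disguise. Each color encodes the transition between two adjacent bases, starting from the leading primer base (the `T` in the read above). Here is a minimal Python sketch of the standard two-base decoding, where a color is the XOR of the 2-bit codes of neighbouring bases (A=0, C=1, G=2, T=3). The function name `decode_colorspace` is made up for illustration and is not part of bwa or any other tool.

```python
# Illustrative sketch only: naive SOLiD colorspace -> base-space decoding.
# decode_colorspace is a hypothetical helper, not part of any aligner.
BASES = "ACGT"  # index = 2-bit code: A=0, C=1, G=2, T=3


def decode_colorspace(read: str) -> str:
    """Decode a read given as primer base + colors (e.g. 'T2321') to bases.

    Each color is the XOR of the codes of two adjacent bases, so the next
    base is BASES[code(previous base) XOR color].
    """
    prev = read[0].upper()
    if prev not in BASES:
        raise ValueError("read must start with a primer base (A, C, G or T)")
    decoded = []
    colors = read[1:]
    for i, color in enumerate(colors):
        if color not in "0123":
            # '.' marks a missing color call; every later base depends on it,
            # so the rest of the read cannot be determined.
            decoded.append("N" * (len(colors) - i))
            break
        prev = BASES[BASES.index(prev) ^ int(color)]
        decoded.append(prev)
    return "".join(decoded)


if __name__ == "__main__":
    print(decode_colorspace("T232102322030210120120010000120000330003.320..03032"))
```

Because every base depends on all the colors before it, a single miscalled or missing color corrupts the rest of the read; in the read above everything from the first `.` onward comes out as `N`. That is the reason colorspace reads are normally aligned in colorspace by a tool that supports it rather than translated first.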


+1 for "... they all require a bit of reading and thought."


thanks.........

12.4 years ago
camelbbs ▴ 710

Thanks. But are you sure that's the SOLiD colorspace format?

I just downloaded it from 1000 Genomes. The information in the SRA database for it is:

==============================

Accession: ERX024730

Experiment design: Solexa sequencing of Human individual HG00881 random pair end library

Submission: ERA062402 by BGI

Study summary: Exome sequencing of the Chinese Dai in Xishuangbanna, China (CDX) (SRP004062)

Library: HUMaghXGZAAAPEI-9

Platform: Illumina

Instrument model: Illumina HiSeq 2000

==================================

I also checked bwa. There is a script called solid2fastq.pl, but it needs two files to work (a csfasta file and a quality file).

Are there any other tools that can do it? Thanks.


I think you are mixing something up. Where exactly did you get the FASTQ file above, and how do you know it belongs to this run? The run you indicate is indeed an Illumina run, and an example record from the ENA is:

@ERR047667.1 FCB09RWABXX:3:1101:1137:2055/1
ANTTACTGATAATAGTTATATCACTAATTTCAGTTTAACAAAAAGGTTCACTATAACTTATTTTAATCTCTGTAATAACTTCAAATTAAA
+
C#1ADFFFHHHHHJJIIIJIJJJJJJJJJJJJJIJJJJIIIJJJJJGHIFIJJJJJJJJJJJJJJJJIIJJJJIIGIJJJJJJJJHHHHH

But it doesn't match your example, which is definitely colorspace. Furthermore, your example has the string "SOLiD" in the read name.
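The difference between the two records can also be checked mechanically. A rough Python sketch, assuming a colorspace read is one primer base followed only by the digits 0-3 and `.`; the helper name `classify_read` is hypothetical, and this is a heuristic for eyeballing files, not a validator:

```python
# Illustrative heuristic only; classify_read is a hypothetical helper.
import re

# Colorspace: one primer base, then color calls 0-3 ('.' = missing call).
COLORSPACE_RE = re.compile(r"[ACGT][0-3.]+")
# Base space: nucleotides, with N for uncalled positions.
BASESPACE_RE = re.compile(r"[ACGTN]+", re.IGNORECASE)


def classify_read(seq: str) -> str:
    """Return 'colorspace', 'basespace' or 'unknown' for one sequence line."""
    if COLORSPACE_RE.fullmatch(seq):
        return "colorspace"
    if BASESPACE_RE.fullmatch(seq):
        return "basespace"
    return "unknown"


if __name__ == "__main__":
    # Sequence lines taken from the two records in this thread.
    print(classify_read("T232102322030210120120010000120000330003.320..03032"))
    print(classify_read("ANTTACTGATAATAGTTATATCACTAATTTCAGTTTAACAAAAAGGTTCACTATAACT"))
```

Running it over the sequence line of the first few records of a FASTQ file is usually enough to tell which kind of data you actually downloaded.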


I have a guess for your error:

For the human sample this corresponds to (HG00881), the exome sequencing was on Illumina and the low-coverage sequencing was on SOLiD.

There are three Illumina runs: ERX024730, ERX024731, ERX024732

And three SOLiD runs: ERX016841, ERX016842, ERX016843

So I assume you mixed up the two groups.

The SOLiD FASTQ files, as expected, look like your original example, e.g.:

@ERR039668.341035 solid0738_20110610_PE_BC_SOLiDPEP20110531001_B_SOLiDPEP20110531001_B_13_23_187/1
T232102322030210120120010000120000330003.320..03032
+
!'(%%%)%%'%&&&%&%&%%%,)''%&%%%)%.&%%(%.%!&/&!!&()%%

EDIT: I realized this read is your exact example above. It's the first read from run ERX016841, which indeed is annotated as SOLiD, not the one you report above. Not sure where that came from.


Yes, I mixed them up. Thanks a lot. But I am not sure what they mean by "low coverage". Is that not whole-genome sequencing? And I still don't know how to align these reads:

@ERR039668.341035 solid0738_20110610_PE_BC_SOLiDPEP20110531001_B_SOLiDPEP20110531001_B_13_23_187/1
T232102322030210120120010000120000330003.320..03032
+
!'(%%%)%%'%&&&%&%&%%%,)''%&%%%)%.&%%(%.%!&/&!!&()%%
