Question

Converting Gene Expression Omnibus Format To Fasta

12.1 years ago

I have found a number of reads that I want to align against a genome using Bowtie. They are located here: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM113418

The problem is that the data is in the format shown below:

> ID_REF = SEQUENCE
> VALUE = NUMBER OF READS 
> ID_REF    VALUE
> AGGCAGTGTAGTTAGCTGATTGC    197 
> TCCCTGGTCTAGTGGTTAGGATTCGGC    177
> TCACAACAACTGTGTGGAGGTATAGGTGT    149 
> TATTTATTGAGGGCCTACTATGTGCCGGG    125

While Bowtie wants reads in this format:

> @r0/2
> GAATACTGGCGGATTACCGGGGAAGCTGGAGC
> +
> EDCCCBAAAA@@@@?>===<;;9:99987776
> @r1/2
> AATGTGAAAACGCCATCGATGGAACAGGCAAT
> +
> EDCCCBAAAA@@@@?>===<;;9:99987776
> @r2/2
> AACGCGCGTTATCGTGCCGGTCCATTACGCGG
> +
> EDCCCBAAAA@@@@?>===<;;9:99987776

Is there a standard way of converting the first format into the second? Or should the data be processed in some other way? Thanks.

bowtie fasta • 2.0k views

updated 12.1 years ago by Damian Kao

Edited for readability. Note how your data format was not displayed correctly in the original post; indenting lines with 4 spaces was required.

12.1 years ago by Neilfws

score 2 · Answer 1 · 2012-10-14

It looks like that cDNA library was sequenced with a 454 platform back in 2006. The raw files are located at the bottom of the page: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE5026

They don't contain read quality information, which you would need in order to produce FASTQ files.

I suggest you just take the raw sequences and convert them to a simple FASTA file. Bowtie accepts FASTA input without quality scores via the -f option. With this option, Bowtie assumes that every base has a quality of 40.
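
As a rough illustration of that conversion, here is a minimal Python sketch. It assumes the GEO table has one read per row, with the sequence and its read count separated by whitespace, as in the question. The function names and the `seqN_xCOUNT` header scheme are my own choices for the example, not anything GEO or Bowtie requires. Header and description rows such as `ID_REF = SEQUENCE` are skipped because they do not look like a sequence followed by an integer.

```python
import re

# A row counts as data only if its first field is made of nucleotide letters.
SEQ_RE = re.compile(r"^[ACGTUN]+$", re.IGNORECASE)


def geo_table_to_fasta(lines):
    """Yield (header, sequence) pairs for each 'SEQUENCE COUNT' row.

    The header keeps the read count (e.g. 'seq1_x197') so the abundance
    information from the GEO table is not lost. This naming is hypothetical.
    """
    index = 0
    for line in lines:
        fields = line.split()
        if len(fields) != 2:
            continue  # description lines have more (or fewer) fields
        seq, count = fields
        if not SEQ_RE.match(seq) or not count.isdigit():
            continue  # e.g. the 'ID_REF VALUE' column header
        index += 1
        yield f"seq{index}_x{count}", seq.upper()


def to_fasta_text(lines):
    """Return the whole table as a FASTA-formatted string."""
    return "".join(f">{header}\n{seq}\n" for header, seq in geo_table_to_fasta(lines))


# Small in-memory example using rows from the question.
sample = """ID_REF = SEQUENCE
VALUE = NUMBER OF READS
ID_REF\tVALUE
AGGCAGTGTAGTTAGCTGATTGC\t197
TCCCTGGTCTAGTGGTTAGGATTCGGC\t177
"""
print(to_fasta_text(sample.splitlines()), end="")
```

Writing the returned text to a file (say `reads.fa`, a placeholder name) gives something you can pass to Bowtie together with the -f option mentioned above. Each distinct sequence is written once here. If you would rather have one FASTA record per observed read, you could repeat each record `count` times instead.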