Question

Compairing A 454 Assemblage With A Gene Expression Analysis In Python

2

Entering edit mode

13.8 years ago

Nico ▴ 30

HELP? i need to map 20bp sequences produced by a gene expression analysis (fastq)to the contigs produced by a 454 assemblage (fasta), here is the kicker i dont care about the tags mapped to the open reading frame (ORF) range of the contigs. i have a file with the ORF ranges for each contig in the assemblage i just can’t combine all 3 ideas, im a noob so sorry if this is super cakie i have an idea but am a poor programmer, match the names of the contigs in the assemblage file and the ORF finder file then delete all the sequence in the ORF range. taking this new ORF free assemblage and mapping the short tags to it, only allowing for one (20bp) tag to contig(ORF free) match .

so i want to locate the 20bp tags in the 454 assembalage out side of the ORF range.

example data in files. (files up to 2GB)

20bp tags

GATCCAGGATGGAGTCTTTG
GATCCTGAAGATGGAGTCTC
GATCCTTCCACACAAAGTGG
GATCACTGGACCGGCGGCCA
GATCTTTAGTCTAAAGTGTC

the 454 assemblage

GD4HFQN03F001A tcagacgagtgcgtGCGCACCGGACCAGAAAGGCCAGGCTGGGCCTGAGCAAGGGGAGGCAGTGGGTACGACCACGGCCagcggcgcggccaattccgggaccgagccaaggcagcggcgggaggcggctttggtcacgggacgtcttcctgcggaggtgggcccgtcgacagtcatggaaatgacctcttccccaccggaggaggacgagaccgacgggtggcggcggat
GD4HFQN03F00HK tcagacgagtgcgtCCATTACGGCCGAAAACCTCACCTTACCTCAACATAAAACAATACAACACAGCAACACTCAAACCCCACACCCTCTCAATAACCAACAACAAATCTTGAAAAACATGTCTTCCGGCCTCACCAACGAAAAGGGCCAAGGCGCCTCGCACGCAACCCAGTCCTCCGTCCCCGGCAAAGTCCAGGACAAGGCCCCCAAGTCC
GD4HFQN03F00IW tcagacgagtgcgtGCTCTCGCCGGATCTAGATTGGTGGTTCCATGGTCCGCCGTCACACCAGTTCTTGATGCGATTGCTCACCGGTGTTTCATCACCAGACGACGTCGATTTCCAATTTCAGCCCCAATCCTTCGCCTCATTCGGACCCACCGTCCTTGTCGAAGGCTGCGATTCCGATCTGTCCATAGCTTGGGTTCACGCCTGGACCGTGACGGATGGGATAATTACCCAGGTTAGGGAGTACTTCAACACTTCCCTCACCGTCACCCGCCTCGGCCCGTTACGTAGCGTATCGTTGacctgagactgccaaggcacacaggggataggn

the ORF ranges

GD4HFQN03F001A 176,445
GD4HFQN03F00HK 119,409
GD4HFQN03F00IW 70,303

______________________________________NEW_______________________________________________

the 454 assembly with the ORF number appended at end csv format

Fiesta accessions,GHA sequence,Field3
>GD4HFQN03F001A,tcagacgagtgcgtGCGCACCGGACCAGAAAGGCCAGGCTGGGCCTGAGCAAGGGGAGGCAGTGGGTACGACCACGGCCagcggcgcggccaattccgggaccgagccaaggcagcggcgggaggcggctttggtcacgggacgtcttcctgcggaggtgggcccgtcgacagtcatggaaatgacctcttccccaccggaggaggacgagaccgacgggtggcggcggatcttcgcgtcctcgttggtcttcttgccttcggcgaggagggaggcttacggaactcctcgcagtcgccgaggacgaggttcgatgtggcggtcgaaggccgatgaacttgccgacaagctggcggaccatcctggatggtcgacccgccatgcggtagttgatgtactggaggcatcttggagctctctgagactgccaaggcacacaggggagtaggnnn,445
>GD4HFQN03F00HK,tcagacgagtgcgtCCATTACGGCCGAAAACCTCACCTTACCTCAACATAAAACAATACAACACAGCAACACTCAAACCCCACACCCTCTCAATAACCAACAACAAATCTTGAAAAACATGTCTTCCGGCCTCACCAACGAAAAGGGCCAAGGCGCCTCGCACGCAACCCAGTCCTCCGTCCCCGGCAAAGTCCAGGACAAGGCCCCCAAGTCCGTCGAAGACAAGCTCCCCGACGCCGTCCACGACGCCGGCAGCAACCCTCACACCGGCAAGGTTTCGCACGCCACCGGCGACTCCAAGGTGCCGCAGACGCTGCAGGAGGGCCTGCCGGAGAAGGTCGAGAAGATCGTGCCCAACAAGCTGCACGACACCAAGGGCGCGACTTTCAGCGATGGATCTGTTGGCAAGTAAAAAGACCAATTCCATTCGGTGGGTTTAGTATGACTATGATTTGTATGATAGAAATTATgacttgaaatggaagatttgacgaacgactgagactgccaaggcacacaggggataggnnn,409

__________________________FINAL CODES THAT WORK___________________________________________________ the code that deleted all sequence but 5'UTRs

import csv
gseq=0;orfn=0
lines = csv.reader(open('test2.1GHA'))
lines.next()
for line in lines:
    ID=line[0]
    gseq=line[1]
    orfn=line[2]
    lseq=list(gseq)
    utrstart=int(orfn)
    del lseq [:utrstart]
    utrseq=''.join(lseq).upper()
    utrseq.upper()
    print ID, '\n', utrseq,'\n', orfn
    #i then copy and pasted results into a  Notepad txt file
    #i need help with the output to a file, seemed to only write one line when i tried

this finds the short reads in the 5'UTRs

from Bio import SeqIO               # Bio Python
fh=open('test2GHA.txt')
for record in SeqIO.parse(fh, 'fasta'):
    id=record.id
    seq=record.seq
    useq=seq.upper()
    tfh=open('test5tags.txt')
    taglines=tfh.readlines()
    for line in taglines:
        tag= line.upper()
        lntag=tag.replace('\n',',')
        seplntag=lntag.split(',')
        for x in seplntag:
            sonetag=''.join(x)
            if useq.find(sonetag) >=1:
                print id, sonetag
            else:
                pass

i hope this might help some other newbies out there.

thanks again guys, for the help

FINAL TWEAK i need it to output the position of the tag in the UTR (index dosent work) or just find the tag closest to the end of the UTR sequence instead of all of them.
suggestions?

got it sorry..

   if useq.find(sonetag) >=1:
                ftag=useq.find(sonetag)
                print id, sonetag, ftag+1  ##position would be nice
            else:
                pass

python gene orf fasta • 5.1k views

ADD COMMENT • link updated 9.1 years ago by Biostar 20 • written 13.8 years ago by Nico ▴ 30

1

Entering edit mode

Hi Nico. One pre-requisite of doing anything in Python or any other language is to actually know that language. I suggest you decide which language would be beneficial for you and that you take time to learn it. People on a forum can be extremely helpful, but they won't do the work for you, only make suggestions to help you get moving. Cheers

ADD REPLY • link 13.8 years ago by Eric Normandeau 11k

1

Entering edit mode

Nico, not at all. If you have code already, it would be a very helpful starting point for answers; no worries about how presentable it is. Learning to program is hard work; having people look at your code and offer suggestions is a valuable way to get better.

ADD REPLY • link 13.8 years ago by Brad Chapman 9.7k

1

Entering edit mode

Your 454 'assemblage' looks a lot like individual reads. The read ID is typical of 454 reads, the first four bases (tcag) are the key sequence and the next ten lower case bases are one of the MIDs from 454...

ADD REPLY • link 13.8 years ago by lexnederbragt ★ 1.3k

0

Entering edit mode

Your approach is correct: trim the 454 contigs by removing the ORF regions; then map the tags to these trimmed contigs with a short read aligner. Can you include the code you've written so far to do the trimming? This will help someone provide you with a useful answer.

ADD REPLY • link 13.8 years ago by Brad Chapman 9.7k

0

Entering edit mode

I apologize for coming off this way in my post, I have been attempting to learn python, I started with “Bucky’s” youtube tutorial. I then read Allen Downey’s book. Yet, I still think like a human biochemist, ha! Then I found Paulo Nuin’s blog and I started Sebastian Bassi’s book. I am feeling some pressure from my committee and was discouraged, maybe a little desperate when I posted this. I meant no disrespect to this form. I have some code written up and when I feel like it is more presentable I will post it. I look forward to your suggestions. Thanks, nico

ADD REPLY • link 13.8 years ago by Nico ▴ 30

0

Entering edit mode

Hi. From what you describe, I guess you have already covered much ground in your learning of Python. Do show us the code! :)

ADD REPLY • link 13.8 years ago by Eric Normandeau 11k

0

Entering edit mode

Ok, after a little work I have added some code to my post. I found that I only need the 5'UTR, so I appended the highest ORF number to the 454 assembly in a csv format (using an make table Access Query, sorry…) then I ran the first code I posted. This output the 454 assembly 5’UTR only. I was pumped!(needs work on the out) As for the second code that finds the 20bp tags in the newly modified assembly, I wrote a version of this that works when tags are entered one at a time. I have 6milloin-ish so I’ve tried to modify it to take my file, to no prevail, and help would be very much appreciated!

ADD REPLY • link 13.8 years ago by Nico ▴ 30

0

Entering edit mode

Ok, after a little work I have added some code to my post. I found that I only need the 5'UTR, so I appended the highest ORF number to the 454 assembly in a csv format (using an make table Access Query, sorry…) then I ran the first code I posted. This output the 454 assembly 5’UTR only. I was pumped! As for the second code that finds the 20bp tags in the newly modified assembly, I wrote a version of this that works when tags are entered one at a time. I have 6milloin-ish so I’ve tried to modify it to take my file, to no prevail gonna keep at it let me know if you guys have any ideas. cheers

ADD REPLY • link 13.8 years ago by Nico ▴ 30

0

Entering edit mode

Ok, after a little work I have added some code to my post. I found that I only need the 5'UTR, so I appended the highest ORF number to the 454 assembly in a csv format then I ran the first code I posted. This output the 454 assembly 5’UTR only. I was pumped! As for the second code that finds the 20bp tags in the newly modified assembly, I wrote a version of this that works when tags are entered one at a time. I have 6milloin-ish so I’ve tried to modify it to take my files, to no prevail.. im going to keep at it, but let me know if you guys have any ideas. cheers

ADD REPLY • link 13.8 years ago by Nico ▴ 30

0

Entering edit mode

I’m having a problem now with several tags showing up in one UTR, I need only the one nearest to the end of the UTR sequence, and .index doesn’t work because Seq (embedded in biopython) doesn’t have attribute .index any suggestions? Possibly just an addition to the print call?

ADD REPLY • link 13.7 years ago by Nico ▴ 30

0

Entering edit mode

nevermind got it sorry

ADD REPLY • link 13.7 years ago by Nico ▴ 30

Ram · Answer 1 · 2011-02-22

5

Entering edit mode

13.8 years ago

Brad Chapman 9.7k

Nico, thanks for posting your code. I didn't have a chance to go through the code in detail, but for your current 'yep/nope' issue, you want to change:

if tag in useq:

to

if useq.find(tag) >= 0:

The 'a in b' syntax tests whether the item a is contained within the list b. If I understand your intention correctly, you want to determine if the tag is found within the larger useq. For this, use the string.find(c) syntax to search through the string for any instances of c.

ADD COMMENT • link updated 5.2 years ago by Ram 44k • written 13.8 years ago by Brad Chapman 9.7k

0

Entering edit mode

This seems to be a step in the right direction, output didn’t change but I’m looking further in to this change. (tag) represents a file of~2million tags

ADD REPLY • link 13.8 years ago by Nico ▴ 30

0

Entering edit mode

Nico, glad this helps. Beyond the actual code, it would be worth trying a different approach to solve your problem. It sounds like you are re-writing a short-read aligner. Once you have your ORFs prepared as a FASTA file, use an aligner like bowtie or bwa to map the reads to the genome. These you can use Python with pysam to get the information you want from the alignments.

ADD REPLY • link updated 5.2 years ago by Ram 44k • written 13.8 years ago by Brad Chapman 9.7k

0

Entering edit mode

Yes but far less complex, the 454 data is not a genome ie. It has no chromosomal data with it, I am going to explore these suggestions Thanks so much for your input so far

ADD REPLY • link 13.8 years ago by Nico ▴ 30

0

Entering edit mode

Short-read aligners are designed to rapidly search a large number of small sequences against a base sequence. The base does not have to be "traditional" chromosome assemblies to take advantage of this. I'm just mentioning this because you will find Python slow for searching millions of tags against lots of base sequences.

ADD REPLY • link 13.8 years ago by Brad Chapman 9.7k

0

Entering edit mode

I got it!! Thanks a so much for your direction, I did rewrite my own short-read aligner but being as newbish as I am it was actually easier for me then manipulating someone else’s code. I have posted the code, still would be nice to have one scrip to run it all and the needs some work on the output but these things will come.

ADD REPLY • link 13.8 years ago by Nico ▴ 30

0

Entering edit mode

I’m having a problem now with several tags showing up in one UTR, I need only the one nearest to the end of the UTR sequence, and .index doesn’t work because Seq (embedded in biopython) doesn’t have attribute .index any suggestions? Possibly just an addition to the print call?