HELP? i need to map 20bp sequences produced by a gene expression analysis (fastq)to the contigs produced by a 454 assemblage (fasta), here is the kicker i dont care about the tags mapped to the open reading frame (ORF) range of the contigs. i have a file with the ORF ranges for each contig in the assemblage i just can’t combine all 3 ideas, im a noob so sorry if this is super cakie i have an idea but am a poor programmer, match the names of the contigs in the assemblage file and the ORF finder file then delete all the sequence in the ORF range. taking this new ORF free assemblage and mapping the short tags to it, only allowing for one (20bp) tag to contig(ORF free) match .
so i want to locate the 20bp tags in the 454 assembalage out side of the ORF range.
example data in files. (files up to 2GB)
20bp tags
GATCCAGGATGGAGTCTTTG
GATCCTGAAGATGGAGTCTC
GATCCTTCCACACAAAGTGG
GATCACTGGACCGGCGGCCA
GATCTTTAGTCTAAAGTGTC
the 454 assemblage
GD4HFQN03F001A tcagacgagtgcgtGCGCACCGGACCAGAAAGGCCAGGCTGGGCCTGAGCAAGGGGAGGCAGTGGGTACGACCACGGCCagcggcgcggccaattccgggaccgagccaaggcagcggcgggaggcggctttggtcacgggacgtcttcctgcggaggtgggcccgtcgacagtcatggaaatgacctcttccccaccggaggaggacgagaccgacgggtggcggcggat
GD4HFQN03F00HK tcagacgagtgcgtCCATTACGGCCGAAAACCTCACCTTACCTCAACATAAAACAATACAACACAGCAACACTCAAACCCCACACCCTCTCAATAACCAACAACAAATCTTGAAAAACATGTCTTCCGGCCTCACCAACGAAAAGGGCCAAGGCGCCTCGCACGCAACCCAGTCCTCCGTCCCCGGCAAAGTCCAGGACAAGGCCCCCAAGTCC
GD4HFQN03F00IW tcagacgagtgcgtGCTCTCGCCGGATCTAGATTGGTGGTTCCATGGTCCGCCGTCACACCAGTTCTTGATGCGATTGCTCACCGGTGTTTCATCACCAGACGACGTCGATTTCCAATTTCAGCCCCAATCCTTCGCCTCATTCGGACCCACCGTCCTTGTCGAAGGCTGCGATTCCGATCTGTCCATAGCTTGGGTTCACGCCTGGACCGTGACGGATGGGATAATTACCCAGGTTAGGGAGTACTTCAACACTTCCCTCACCGTCACCCGCCTCGGCCCGTTACGTAGCGTATCGTTGacctgagactgccaaggcacacaggggataggn
the ORF ranges
GD4HFQN03F001A 176,445
GD4HFQN03F00HK 119,409
GD4HFQN03F00IW 70,303
______________________________________NEW_______________________________________________
the 454 assembly with the ORF number appended at end csv format
Fiesta accessions,GHA sequence,Field3
>GD4HFQN03F001A,tcagacgagtgcgtGCGCACCGGACCAGAAAGGCCAGGCTGGGCCTGAGCAAGGGGAGGCAGTGGGTACGACCACGGCCagcggcgcggccaattccgggaccgagccaaggcagcggcgggaggcggctttggtcacgggacgtcttcctgcggaggtgggcccgtcgacagtcatggaaatgacctcttccccaccggaggaggacgagaccgacgggtggcggcggatcttcgcgtcctcgttggtcttcttgccttcggcgaggagggaggcttacggaactcctcgcagtcgccgaggacgaggttcgatgtggcggtcgaaggccgatgaacttgccgacaagctggcggaccatcctggatggtcgacccgccatgcggtagttgatgtactggaggcatcttggagctctctgagactgccaaggcacacaggggagtaggnnn,445
>GD4HFQN03F00HK,tcagacgagtgcgtCCATTACGGCCGAAAACCTCACCTTACCTCAACATAAAACAATACAACACAGCAACACTCAAACCCCACACCCTCTCAATAACCAACAACAAATCTTGAAAAACATGTCTTCCGGCCTCACCAACGAAAAGGGCCAAGGCGCCTCGCACGCAACCCAGTCCTCCGTCCCCGGCAAAGTCCAGGACAAGGCCCCCAAGTCCGTCGAAGACAAGCTCCCCGACGCCGTCCACGACGCCGGCAGCAACCCTCACACCGGCAAGGTTTCGCACGCCACCGGCGACTCCAAGGTGCCGCAGACGCTGCAGGAGGGCCTGCCGGAGAAGGTCGAGAAGATCGTGCCCAACAAGCTGCACGACACCAAGGGCGCGACTTTCAGCGATGGATCTGTTGGCAAGTAAAAAGACCAATTCCATTCGGTGGGTTTAGTATGACTATGATTTGTATGATAGAAATTATgacttgaaatggaagatttgacgaacgactgagactgccaaggcacacaggggataggnnn,409
__________________________FINAL CODES THAT WORK___________________________________________________ the code that deleted all sequence but 5'UTRs
import csv
gseq=0;orfn=0
lines = csv.reader(open('test2.1GHA'))
lines.next()
for line in lines:
ID=line[0]
gseq=line[1]
orfn=line[2]
lseq=list(gseq)
utrstart=int(orfn)
del lseq [:utrstart]
utrseq=''.join(lseq).upper()
utrseq.upper()
print ID, '\n', utrseq,'\n', orfn
#i then copy and pasted results into a Notepad txt file
#i need help with the output to a file, seemed to only write one line when i tried
this finds the short reads in the 5'UTRs
from Bio import SeqIO # Bio Python
fh=open('test2GHA.txt')
for record in SeqIO.parse(fh, 'fasta'):
id=record.id
seq=record.seq
useq=seq.upper()
tfh=open('test5tags.txt')
taglines=tfh.readlines()
for line in taglines:
tag= line.upper()
lntag=tag.replace('\n',',')
seplntag=lntag.split(',')
for x in seplntag:
sonetag=''.join(x)
if useq.find(sonetag) >=1:
print id, sonetag
else:
pass
i hope this might help some other newbies out there.
thanks again guys, for the help
FINAL TWEAK
i need it to output the position of the tag in the UTR (index dosent work) or
just find the tag closest to the end of the UTR sequence instead of all of them.
suggestions?
got it sorry..
if useq.find(sonetag) >=1:
ftag=useq.find(sonetag)
print id, sonetag, ftag+1 ##position would be nice
else:
pass
Hi Nico. One pre-requisite of doing anything in Python or any other language is to actually know that language. I suggest you decide which language would be beneficial for you and that you take time to learn it. People on a forum can be extremely helpful, but they won't do the work for you, only make suggestions to help you get moving. Cheers
Nico, not at all. If you have code already, it would be a very helpful starting point for answers; no worries about how presentable it is. Learning to program is hard work; having people look at your code and offer suggestions is a valuable way to get better.
Your 454 'assemblage' looks a lot like individual reads. The read ID is typical of 454 reads, the first four bases (tcag) are the key sequence and the next ten lower case bases are one of the MIDs from 454...
Your approach is correct: trim the 454 contigs by removing the ORF regions; then map the tags to these trimmed contigs with a short read aligner. Can you include the code you've written so far to do the trimming? This will help someone provide you with a useful answer.
I apologize for coming off this way in my post, I have been attempting to learn python, I started with “Bucky’s” youtube tutorial. I then read Allen Downey’s book. Yet, I still think like a human biochemist, ha! Then I found Paulo Nuin’s blog and I started Sebastian Bassi’s book. I am feeling some pressure from my committee and was discouraged, maybe a little desperate when I posted this. I meant no disrespect to this form. I have some code written up and when I feel like it is more presentable I will post it. I look forward to your suggestions. Thanks, nico
Hi. From what you describe, I guess you have already covered much ground in your learning of Python. Do show us the code! :)
Ok, after a little work I have added some code to my post. I found that I only need the 5'UTR, so I appended the highest ORF number to the 454 assembly in a csv format (using an make table Access Query, sorry…) then I ran the first code I posted. This output the 454 assembly 5’UTR only. I was pumped!(needs work on the out) As for the second code that finds the 20bp tags in the newly modified assembly, I wrote a version of this that works when tags are entered one at a time. I have 6milloin-ish so I’ve tried to modify it to take my file, to no prevail, and help would be very much appreciated!
Ok, after a little work I have added some code to my post. I found that I only need the 5'UTR, so I appended the highest ORF number to the 454 assembly in a csv format (using an make table Access Query, sorry…) then I ran the first code I posted. This output the 454 assembly 5’UTR only. I was pumped! As for the second code that finds the 20bp tags in the newly modified assembly, I wrote a version of this that works when tags are entered one at a time. I have 6milloin-ish so I’ve tried to modify it to take my file, to no prevail gonna keep at it let me know if you guys have any ideas. cheers
Ok, after a little work I have added some code to my post. I found that I only need the 5'UTR, so I appended the highest ORF number to the 454 assembly in a csv format then I ran the first code I posted. This output the 454 assembly 5’UTR only. I was pumped! As for the second code that finds the 20bp tags in the newly modified assembly, I wrote a version of this that works when tags are entered one at a time. I have 6milloin-ish so I’ve tried to modify it to take my files, to no prevail.. im going to keep at it, but let me know if you guys have any ideas. cheers
I’m having a problem now with several tags showing up in one UTR, I need only the one nearest to the end of the UTR sequence, and .index doesn’t work because Seq (embedded in biopython) doesn’t have attribute .index any suggestions? Possibly just an addition to the print call?
nevermind got it sorry