3' UTR sequence
7.8 years ago

Hello, this is my first post here!

I have a pair of FASTA files: one contains several CDS sequences (file X) and the other contains the corresponding full-length transcript sequences (file Y).

The CDS file (file X) has entries as follows:

>ABCD (40..120)
…………the sequence……………….

This goes on for another 200 different gene entries

The full-length transcript file (file Y) has entries as follows:

>ABCD (1..700)
…………the sequence……………….

And this also goes on for another 200 different gene entries.

ABCD is a gene name, which is of course identical in both FASTA files, and the transcript sequence between coordinates 40..120 is identical to the CDS sequence. I want to extract the 3' UTR sequence for every entry in the full transcript file, that is, everything in the transcript after the point where the CDS ends. In the example above, that is the sequence between coordinates 121..700 for gene ABCD. The script should loop through the whole file and do this for all 200 gene entries.

I am primarily using bash and a little Perl. Any help will be most appreciated.

Thanks and regards!


Hello leo1985.arnab!

It appears that your post has been cross-posted to another site: http://seqanswers.com/forums/showthread.php?p=203861#post203861

This is typically not recommended as it runs the risk of annoying people in both communities.


Ok, thanks! I appreciate your comment.


Hello, I am trying to do the same for a bunch of ORFs. Were you successful in extracting the 3' UTRs? I have FASTA sequences of my transcripts and I have run TransDecoder to identify the 3' UTR regions. I'm wondering how to extract these sequences. Please help!


Please use ADD COMMENT or ADD REPLY to respond to earlier posts, so that the thread remains logically structured and easy to follow. I have now moved your post, but as you can see the result is not optimal. Adding an answer should only be used for providing a solution to the question asked.

7.8 years ago

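Do they always follow the same pattern, like >GENE (start..end), and do both files contain the same genes? If so, you may want to look into bedtools getfasta.

You would need a BED file (tab-separated columns) that says:

ABCD    120    700
EFGH    500    900
....
....

Then run:

bedtools getfasta -fi transcript.fasta -bed in.bed

So you are asking bedtools to get the sequence from position 121 to 700 of the full-length transcript. Be careful about 0-based versus 1-based coordinates: BED starts are 0-based and BED ends are exclusive, so the start column is the last CDS base (120) and the end column is the transcript length (700).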
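
If bedtools is not available, the same slice could be taken directly with awk. This is only a minimal sketch: it assumes every header looks exactly like ">ABCD (40..120)", that gene names match between the two files, and that transcript coordinates start at 1. The function name utr_fasta and the file name cds.fasta are made up for illustration.

```shell
# Print the 3' UTR of each transcript as FASTA.
# $1 = CDS FASTA, $2 = full-length transcript FASTA (sequences may span lines).
utr_fasta() {
    awk '
        function flush_record() {
            # Emit everything after the CDS end for the stored transcript.
            if (name != "" && (name in cds_end)) {
                start = cds_end[name] + 1
                len = length(seq)
                utr = substr(seq, start)
                if (utr != "") {
                    print ">" name " (" start ".." len ")"
                    print utr
                }
            }
        }
        FILENAME == ARGV[1] {
            # CDS file: remember the end coordinate found in each header.
            if (/^>/) {
                coords = $2
                gsub(/[()]/, "", coords)
                split(coords, part, /\.\./)
                cds_end[substr($1, 2)] = part[2] + 0
            }
            next
        }
        /^>/ {
            # Transcript file: finish the previous record, start a new one.
            flush_record()
            name = substr($1, 2)
            seq = ""
            next
        }
        { seq = seq $0 }
        END { flush_record() }
    ' "$1" "$2"
}
```

It would be called as: utr_fasta cds.fasta transcript.fasta > utr.fasta. Transcripts with no matching CDS entry, or with nothing after the CDS end, are skipped silently.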


Goutham,

Thanks so much for your response. Sorry I could not follow up sooner. To answer your question: yes, the genes appear in the same order in the two FASTA files. However, where you wrote "you should get a bed format:....", I looked into bedtools getfasta but I am a bit confused about how a BED file comes into play here, since both of my files are FASTA files. I could try bowtie to generate a BAM file and then use bedtools to convert the BAM to BED. Is that what you meant?


You need to convert the FASTA headers of your CDS file to BED format. No alignment is needed.
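
For headers like the ones in the question, that conversion could be sketched with awk as follows. It assumes every header looks exactly like ">ABCD (40..120)" in the CDS file and ">ABCD (1..700)" in the transcript file, with the same gene names in both. The function name utr_bed and the file name cds.fasta are made up for illustration.

```shell
# Write one BED line per gene covering the region after the CDS end.
# $1 = CDS FASTA, $2 = full-length transcript FASTA.
utr_bed() {
    awk -v OFS='\t' '
        /^>/ {
            # Split a header such as ">ABCD (40..120)" into name and coordinates.
            name = substr($1, 2)
            coords = $2
            gsub(/[()]/, "", coords)
            n = split(coords, part, /\.\./)
            if (n != 2) next
            if (FILENAME == ARGV[1]) {
                # CDS file: keep the 1-based end of the CDS.
                cds_end[name] = part[2] + 0
            } else if ((name in cds_end) && part[2] + 0 > cds_end[name]) {
                # Transcript file: a 1-based CDS end equals the 0-based start
                # of the next base, and the transcript end is the BED end.
                print name, cds_end[name], part[2] + 0
            }
        }
    ' "$1" "$2"
}
```

Running utr_bed cds.fasta transcript.fasta > in.bed on the example headers gives the tab-separated line ABCD 120 700, which can then be passed to bedtools getfasta with -bed in.bed. Whether bedtools matches the name column against the whole header line or only its first word can depend on the version, so it is worth checking on one entry first.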


Okay, thanks.
