Hello everyone!!
I have a question whose answer, I guess, is simple:
I have a fasta file of human cDNA that looks like this:
(for example these 2 genes)
>ENST00000415118.1 cdna chromosome:GRCh38:14:22438547:22438554:1 gene:ENSG00000223997.1 gene_biotype:TR_D_gene transcript_biotype:TR_D_gene gene_symbol:TRDD1 description:T-cell receptor delta diversity 1 [Source:HGNC Symbol;Acc:HGNC:12254]
GAAATAGT
>ENST00000631435.1 cdna chromosome:GRCh38:CHR_HSCHR7_2_CTG6:142847306:142847317:1 gene:ENSG00000282253.1 gene_biotype:TR_D_gene transcript_biotype:TR_D_gene gene_symbol:TRBD1 description:T-cell receptor beta diversity 1 [Source:HGNC Symbol;Acc:HGNC:12158]
GGGACAGGGGGC
My goal is to find the introns of these genes.
How can I do it?
Thanks!!!
Thanks for the quick answer! My question is a general one: I have to do this for all the genes, so I need to solve it programmatically.
You might not be able to get the intron sequences from a fasta file with only cDNA. You need a bed/gff file with gene boundaries for the cDNA you are interested in and then you extract the regions that are not cDNA. Just keep in mind that your gene boundaries need to start and end with the CDS you are interested in and not include UTR regions or other adjoining CDS.
But if you just have a handful, then it would be easier to do it manually as finswimmer77 suggested.
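If you do have exon coordinates from a gff/bed file, the "regions that are not cDNA" are simply the gaps between consecutive exons of one transcript. A minimal sketch in Python 3, assuming 1-based inclusive coordinates as in GFF; the function name and the example coordinates are made up for illustration:

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end), 1-based and inclusive, as in GFF


def introns_from_exons(exons: List[Interval]) -> List[Interval]:
    """Return the gaps between consecutive exons of a single transcript."""
    ordered = sorted(exons)
    introns = []
    for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:]):
        # There is a gap only if the next exon starts past the base after prev_end.
        if next_start > prev_end + 1:
            introns.append((prev_end + 1, next_start - 1))
    return introns


# Invented exon coordinates, for illustration only.
example_exons = [(100, 200), (301, 400), (451, 500)]
example_introns = introns_from_exons(example_exons)
```

A transcript with a single exon gives an empty list, which is what you would expect for very short genes like the two in the question.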
Doesn't it help that the fasta header has the coordinates? (Just asking... maybe I'm wrong.)
The question is, what exactly do you want? Are you really just interested in the intron sequence? Or do you want the whole sequence for the given transcript, including the introns? What is your goal?
fin swimmer
I only want the introns - preference to the second one.
Ok, here is a quick-and-dirty solution:
First we need to extract the transcript IDs from your fasta file.
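The commands for this step are not shown here, so the following is only one possible way to do it, sketched in Python 3. It assumes the ID is the first whitespace-separated token after ">" (as in the headers in the question) and that you want to drop the version suffix (".1"); the function name is hypothetical:

```python
import io
from typing import Iterable, Iterator


def transcript_ids(fasta_lines: Iterable[str], keep_version: bool = False) -> Iterator[str]:
    """Yield the transcript ID from every header line of a fasta file."""
    for line in fasta_lines:
        if not line.startswith(">"):
            continue  # sequence line, not a header
        fields = line[1:].split()  # split() also discards a trailing \r\n
        if not fields:
            continue  # bare ">" with no ID
        identifier = fields[0]
        if not keep_version:
            identifier = identifier.split(".", 1)[0]  # ENST00000415118.1 -> ENST00000415118
        yield identifier


# The two records from the question, with shortened headers.
demo = io.StringIO(
    ">ENST00000415118.1 cdna chromosome:GRCh38:14:22438547:22438554:1\n"
    "GAAATAGT\n"
    ">ENST00000631435.1 cdna chromosome:GRCh38:CHR_HSCHR7_2_CTG6:142847306:142847317:1\n"
    "GGGACAGGGGGC\n"
)
demo_ids = list(transcript_ids(demo))
```

For a real file you would pass the open file handle instead of the `io.StringIO` demo.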
Now we can use Ensembl's REST API to fetch the genomic sequence for each transcript. If we set the mask_feature parameter, the introns are returned in lower case and the exons in upper case. You can then split on, or regex for, the lower-case parts to extract the introns.
For example, in Python 3 and for a BRCA transcript:
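The script itself is not reproduced here, so this is a minimal sketch of the approach just described rather than the original code. It assumes the `sequence/id` endpoint of rest.ensembl.org with `type=genomic` and `mask_feature=1`, requested as plain text. The function names are made up, and the toy sequence at the end is invented so that the regex part can be tried without network access:

```python
import re
import urllib.request
from typing import List

SERVER = "https://rest.ensembl.org"


def sequence_url(transcript_id: str) -> str:
    """Build the request URL; strip() removes stray whitespace and line breaks."""
    return f"{SERVER}/sequence/id/{transcript_id.strip()}?type=genomic;mask_feature=1"


def fetch_masked_sequence(transcript_id: str, timeout: float = 60.0) -> str:
    """Fetch the genomic sequence with introns in lower case (needs network access)."""
    request = urllib.request.Request(
        sequence_url(transcript_id),
        headers={"Content-Type": "text/plain"},  # ask for the bare sequence
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("ascii").strip()


def extract_introns(masked_sequence: str) -> List[str]:
    """Return every run of lower-case letters, i.e. every masked intron."""
    return re.findall(r"[a-z]+", masked_sequence)


# Invented sequence: upper case = exon, lower case = intron.
toy = "ATGGCCgtaagtttcagGGTACCgtgagcctagAATAA"
toy_introns = extract_introns(toy)
```

With a real transcript you would call `extract_introns(fetch_masked_sequence(your_id))`; which transcript ID to use is up to you.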
Thanks so much!! I have one more question: what does "mask_feature" mean? And is there a short way to generate a fasta file of all the introns?
Take a look at the link to the REST API manual I gave you above. Typically the sequence is given in upper-case letters. With the parameter mask_feature set to 1, the intron sequence is given in lower-case letters.
In my script re.findall returns a list of all introns. Iterate over it and format your output as you like.
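One way the "iterate and format" step could look, as a rough Python 3 sketch. The header scheme (`<transcript>_intron_<n>`) and the function name are my own assumptions; change them to whatever you need:

```python
from typing import Iterable


def introns_to_fasta(transcript_id: str, introns: Iterable[str], width: int = 60) -> str:
    """Format a list of intron sequences as fasta records, one per intron."""
    records = []
    for number, intron in enumerate(introns, start=1):
        header = f">{transcript_id}_intron_{number}"  # hypothetical naming scheme
        # Wrap the sequence into lines of at most `width` characters.
        lines = [intron[i:i + width] for i in range(0, len(intron), width)]
        records.append("\n".join([header] + lines))
    return "".join(record + "\n" for record in records)


# Invented intron sequences, for illustration only.
demo_fasta = introns_to_fasta("ENST_EXAMPLE", ["gtaagtttcag", "gtgagcctag"], width=8)
```

Writing the result for all transcripts into one file is then just a matter of opening the output file once and calling `write()` with the string for each transcript.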
fin swimmer
Thank you so much!!
I saw you wrote another solution, but I had already used the first one (mainly because I use Python 2.7).
The problem is when I run this transcript:
I get an error:
I guess it might be because the sequence is too long.
Is there a way to fix it?
Hello,
the error results from the %0D%0A in the URL. %0D%0A encodes a line break (carriage return plus line feed). Remove it and it should work.
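To see where such a %0D%0A can come from: if an ID is read from a file with Windows line endings and the line break is not stripped, it gets percent-encoded into the URL. A small Python 3 illustration using the standard library's `urllib.parse.quote`; the helper `read_ids` is a made-up name:

```python
from urllib.parse import quote

# An ID as it might come out of a file with Windows (\r\n) line endings.
raw_id = "ENST00000415118\r\n"

encoded_raw = quote(raw_id)            # the line break ends up in the URL as %0D%0A
encoded_clean = quote(raw_id.strip())  # strip() removes it before encoding


def read_ids(lines):
    """Return the IDs from an iterable of lines, without whitespace or empty lines."""
    return [line.strip() for line in lines if line.strip()]
```

Calling `strip()` on every ID before building the URL should be enough to avoid the stray characters.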
fin swimmer
O.K., thanks. How can I make sure this error won't repeat? After all, I downloaded these transcripts from Ensembl, so I don't get why the URL is wrong.
It depends on how you generate the url. Without knowing it, it is quite hard to help.
Sorry for all this mess, but now I see that the error was this:
Gateway Time-out normally means something took too long. Just retry it. I cannot reproduce it here this time, even though your link still contains characters that should not be there (%0A).
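If you run this for many transcripts, "just retry it" can be automated. A minimal Python 3 sketch, assuming the request is made with `urllib` so that a time-out shows up as an `HTTPError` with code 504; the function name, the number of attempts and the waiting time are arbitrary choices of mine:

```python
import time
import urllib.error
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(
    action: Callable[[], T],
    attempts: int = 3,
    wait_seconds: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call action(); on HTTP 504 wait and try again, up to `attempts` times."""
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except urllib.error.HTTPError as error:
            # Re-raise anything that is not a gateway time-out, or the last failure.
            if error.code != 504 or attempt == attempts:
                raise
            sleep(wait_seconds * attempt)  # wait a little longer each time
    raise ValueError("attempts must be at least 1")
```

You would wrap the download in a function without arguments, for example `with_retries(lambda: urllib.request.urlopen(url).read())` after importing `urllib.request`.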
Again the question: how do you create these URLs?
fin swimmer
I copied the transcripts to a new file and now it works.
I have no idea why it didn't work before....
As for your question: I created these URLs by running some commands (as you wrote above) on the cDNA fasta file to extract the transcript IDs.
It worked fine until this specific transcript. I don't know why...
An intron needs to be defined relative to the boundaries of the gene the cDNA falls in, so you would need the gene boundaries along with the cDNA coordinates.