Help with finding introns

0

Entering edit mode

7.8 years ago

elisheva ▴ 120

Hello everyone!!

I have a question that I guess the answer is simple:
I have a fasta file of human cDNA that looks like this:
(for example these 2 genes)

>ENST00000415118.1 cdna chromosome:GRCh38:14:22438547:22438554:1 gene:ENSG00000223997.1 gene_biotype:TR_D_gene transcript_biotype:TR_D_gene gene_symbol:TRDD1 description:T-cell receptor delta diversity 1 [Source:HGNC Symbol;Acc:HGNC:12254]
GAAATAGT
>ENST00000631435.1 cdna chromosome:GRCh38:CHR_HSCHR7_2_CTG6:142847306:142847317:1 gene:ENSG00000282253.1 gene_biotype:TR_D_gene transcript_biotype:TR_D_gene gene_symbol:TRBD1 description:T-cell receptor beta diversity 1 [Source:HGNC Symbol;Acc:HGNC:12158]
GGGACAGGGGGC

My target is to find the introns of these genes.
How can I do it?
Thanks!!!

cDNA genome introns • 3.6k views

ADD COMMENT • link updated 5.9 years ago by Biostar 20 • written 7.8 years ago by elisheva ▴ 120

1

Entering edit mode

7.8 years ago

finswimmer 16k

Hello,

here is a more comfortable solution with python using ensembl's REST-Api.

You will need a file with one transcript id per line. You can get it the way I described before.

Save the following code as intron2fasta.py

	import requests
	import sys
	import argparse


	def set_args():
	parser = argparse.ArgumentParser()
	group = parser.add_mutually_exclusive_group()
	group.add_argument("-t", "--transcript", help="ensembl's transcript id")
	group.add_argument("-l", "--list", help="file which contains list of transcript id's")
	parser.add_argument("-o", "--out", help="write output to given file instead of stdout")
	return parser.parse_args()


	def read_transcript_file(list):
	with open(list) as f:
	for line in f:
	yield line.strip()


	def lookup_transcript(transcript):
	server = "http://rest.ensembl.org"
	ext = "/lookup/id/"+transcript+"?expand=1"

	r = requests.get(server+ext, headers={"Content-Type": "application/json"})

	if not r.ok:
	r.raise_for_status()
	sys.exit()

	lookup = r.json()

	return lookup


	def intron_coords(exons, i):
	if exons[i]['strand'] == -1:
	intron_start = str(exons[i+1]['end']+1)
	intron_end = str(exons[i]['start']-1)
	else:
	intron_start = str(exons[i]['end']+1)
	intron_end = str(exons[i+1]['start']-1)

	coords = str(exons[i]['seq_region_name'])+":"+intron_start+".."+intron_end+":"+str(exons[i]['strand'])

	return coords


	if __name__ == "__main__":
	args = set_args()
	transcripts = []

	if args.transcript:
	transcripts.append(args.transcript)
	elif args.list:
	transcripts = read_transcript_file(args.list)

	for t in transcripts:
	lookup = lookup_transcript(t)

	for i in range(0, len(lookup['Exon'])-1):
	coords = intron_coords(lookup['Exon'], i)
	server = "http://rest.ensembl.org"
	ext = "/sequence/region/human/"+coords

	r = requests.get(server+ext, headers={"Content-Type": "text/x-fasta"})

	if not r.ok:
	r.raise_for_status()
	sys.exit()
	# create a header like >ENST00000412061.3.Intron_1 chromosome:GRCh38:17:43094861:43095845:-1
	# use headline = ">"+t+".Intron_"+str(i+1) if you don't want the chromosomal position
	headline = ">"+t+".Intron_"+str(i+1)+" "+r.text.split("\n")[0][1:]
	sequence = r.text.split("\n", 1)[1]

	if args.out:
	outfile = open(args.out, "a")
	outfile.write(headline+"\n"+sequence)
	outfile.close()
	else:
	print(headline, sequence, sep="\n")

view raw intron2fasta.py hosted with ❤ by GitHub

Now you can call the script in this way (python3 is required):

python intron2fasta.py -l transcripts.txt -o introns.fa

This produces the file introns.fa containing all intron sequences of the transcripts given in transcripts.txt.

But be careful. A lot of error handling is still missing. Also the REST-Api is called for every single transcript and intron sequence instead of doing this once.

fin swimmer

ADD COMMENT • link 6.9 years ago by finswimmer 16k

0

Entering edit mode

7.8 years ago

finswimmer 16k

Hello,

is it more a general question like "I have an Transcript XY and I like to find out the introns (sequence?)?" Or do you want to know the introns of the two genes you gave us? Should this solved automaticly by a program or do you want to do this by hand.

For the two transcript above you can just go to ensembl an can search for the transcript id. Than you will find out, that these two transcripts have no Introns.

fin swimmer

ADD COMMENT • link 7.8 years ago by finswimmer 16k

0

Entering edit mode

Thanks for the quick answer! And my question is a general question' I have to do it for all the genes. So I need to solve it by program.

ADD REPLY • link 7.8 years ago by elisheva ▴ 120

0

Entering edit mode

You might not be able to get the intron sequences from a fasta file with only cDNA. You need a bed/gff file with gene boundaries for the cDNA you are interested in and then you extract the regions that are not cDNA. Just keep in mind that your gene boundaries need to start and end with the CDS you are interested in and not include UTR regions or other adjoining CDS.

But if you just have a handful, then it would be easier to do it manually as finswimmer77 suggested.

ADD REPLY • link 7.8 years ago by Rohit ★ 1.5k

0

Entering edit mode

doesn't it help that the fasta header has the coordinates? (just asking....... maybe I'm wrong)

ADD REPLY • link 7.8 years ago by elisheva ▴ 120

0

Entering edit mode

The questions is, what do you excatly want. Are you realy just interested in the intron sequence? Or do you want the whole sequence for the given transcript including the intron? What is your goal?

fin swimmer

ADD REPLY • link 7.8 years ago by finswimmer 16k

0

Entering edit mode

I'm only want the introns - preference to the second one.

ADD REPLY • link 7.8 years ago by elisheva ▴ 120

1

Entering edit mode

Ok, here is a quick-and-dirty solution:

First we need to extract the transcript numbers in your fasta file.

grep "^>" input.fa|cut -f1 -d " "|sed -r 's/^.{1}//' > transcripts.txt

Now we can use ensembls REST-API to fetch the sequence for the transcript. If we set the mask_feature parameters the introns are in lower case and the exons in upper case. You can than split/regex for the lower case parts to extract the introns.

In example for python3 and a BRCA Transcript:

import re
import requests, sys

server = "http://rest.ensembl.org"
ext = "/sequence/id/ENST00000415118.1?mask_feature=1"

r = requests.get(server+ext, headers={ "Content-Type" : "text/plain"})

if not r.ok:
  r.raise_for_status()
  sys.exit()

sequence = r.text
introns = re.findall('[atgc]+', sequence)
print(introns[0])

ADD REPLY • link 7.8 years ago by finswimmer 16k

0

Entering edit mode

Thanks so much!! I have one more question: What does it mean: "mask_feature"? And Is there is a short way to generate fasta file of all the introns:

>transcript.index_of_intron    
sequence....

.

ADD REPLY • link 7.8 years ago by elisheva ▴ 120

0

Entering edit mode

What does it mean: "mask_feature"?

Take a look at the link to the REST-API manual I gave to you above. Typically the sequence is given in upper case letters. With the parameter mask_feature set to 1 the introns sequence is in lower case letters.

And Is there is a short way to generate fasta file of all the introns

In my script re.findall returns a list of all introns. Iterate over it and format your output as you like.

fin swimmer

ADD REPLY • link 7.8 years ago by finswimmer 16k

0

Entering edit mode

Thank you so much!!
I saw you wrote another solution But I allready used th first one.(Mainly because I use python 2.7)
The problem is when I run this transcript:

ENST00000537694.1

I get an error:

Traceback (most recent call last):
  File "introns.py", line 44, in <module>
    main(line)
  File "introns.py", line 15, in main
    r.raise_for_status()
  File "/usr/lib/python2.7/dist-packages/requests/models.py", line 844, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 400 Client Error: Bad Request for url: http://rest.ensembl.org/sequence/id/ENST00000537694.1%0D%0A?mask_feature=1

I guess it might be because the sequence is to long.
Is there is a way to fix it?

ADD REPLY • link 7.8 years ago by elisheva ▴ 120

0

Entering edit mode

Hello,

the error results from the %0D%0A in the url. %0D%0A encodes a line break. Remove it and it should work.

fin swimmer

ADD REPLY • link 7.8 years ago by finswimmer 16k

0

Entering edit mode

O.K. thanks. How can I ensure That it wont repet this error? After all I downloaded these transcripts from ensembl. So I don't get why the url is wrong.

ADD REPLY • link 7.8 years ago by elisheva ▴ 120

0

Entering edit mode

O.K. thanks. How can I ensure That it wont repet this error?

It depends on how you generate the url. Without knowing it, it is quite hard to help.

ADD REPLY • link 7.8 years ago by finswimmer 16k

0

Entering edit mode

Sorry for all this mess, but now I see that the error was this:

Traceback (most recent call last):
  File "introns.py", line 44, in <module>
    main(line)
  File "introns.py", line 15, in main
    r.raise_for_status()
  File "/usr/lib/python2.7/dist-packages/requests/models.py", line 844, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 504 Server Error: Gateway Time-out for url: http://rest.ensembl.org/sequence/id/ENST00000529724.1%0A?mask_feature=1

ADD REPLY • link 7.8 years ago by elisheva ▴ 120

0

Entering edit mode

Gateway Time-Out normaly means something take to long. Just retry it. I cannont reproduce it here this time, even if your linke still contains characters that should not be there (%0A).

Again the question: How do yoi create these urls?

fin swimmer

ADD REPLY • link 7.8 years ago by finswimmer 16k

0

Entering edit mode

I copied the transcripts to a new file and now it works.
I have no idea why it didn't work before....
As for your question I created these urls using some commands - (as you wrote above) on cDNAs fasta file to extract the transcripts ids.
It worked fine till this specific transcript. I don't know why..........

ADD REPLY • link 7.8 years ago by elisheva ▴ 120

0

Entering edit mode

Intron needs to be defined according to the gene boundary the cdna falls in, so you would need the gene boundary along with the cdna coordinates

ADD REPLY • link 7.8 years ago by Rohit ★ 1.5k

Login before adding your answer.