Question

How to Run the Script for Extracting Splice Sites from the Gene Annotation File

1

Entering edit mode

7.6 years ago

LP ▴ 20

Hi Biostars Community,

I struggle with running the script extract_splice_sites.py Sbicolor_313_v3.1.gene.gtf > Sbicolor_313_v3.1.splicesites.txt) for extracting the splice sites from my gene annotation file. I was initially advidsed to convert my gene annotation file (Sbicolor_313_v3.1.gff3) to gtf format using gffread script, but still the script did not run. I really have no idea where the problem lies in the script. Please help!

Another problem I have is how to view the output file of HISAT. When I ran HISAT (hisat2 -p 8 -x Sbicolor_313_v3.0.index -1 RSL200mMMn0_R1_paired.fastq -2 RSL200mMmM0_R2_paired.fastq -S RSL200mMMn0.sam | samtools -bS -> RSL200mMMn0.bam) directly after creating the index files from my sorghum (Sbicolor_313_v3.0) genome, I was unable to view or locate my HISAT output file. How does one view the HISAT output file? Is there a command/script for viewing HISAT output files to see how many reads mapped?

I am still new in this field - thanks. Lekgolwa

RNA-Seq software error alignment • 6.2k views

ADD COMMENT • link 7.6 years ago by LP ▴ 20

0

Entering edit mode

Where did you get extract_splice_sites.py? Do you get an error when you run it? What does your command look like?

Is -> how you're trying to create the output file in your HISAT command? If you have a bam file you can view it with samtools view or IGV.

ADD REPLY • link 7.6 years ago by st.ph.n ★ 2.7k

score 1 · Answer 1 · 2017-09-29

1

Entering edit mode

7.6 years ago

Renesh ★ 2.2k

For your first question, the extract_splice_sites.py looks for the term gene_id or transcript_id for exons in attribute section (last description column) of off or gtf file. You need to check the file if these terms are present in attribute section. Generally, gff3 file from Phytozome database has ID and Parent term instead of gene_id and transcript_id. Therefore, if you run the code, you will not get any output.

Developers of HISAT need to fix this issue and should include ID and Parent terms for extracting splice sites.

Alternative,

You can download gtf file for S. biocolor from EnsemblPlants database ftp://ftp.ensemblgenomes.org/pub/release-37/plants/gtf/sorghum_bicolor/

ADD COMMENT • link 7.6 years ago by Renesh ★ 2.2k

1

Entering edit mode

This is technically an issue with a malformed GTF due to the GFF-->GTF conversion, rather than with the script per se - the GTF format is defined here, and should always include gene_id and transcript_id (it should also end each line with a ';' character).

It'll be simplest to just replace the ID and Parent definitions with sed: sed 's/ID=/transcript_id /;s/Parent=/gene_id /' yourfile > yournewfile

If the lines don't end with ';' then also: sed 's/$/;/' somefile > properlyformatted

ADD REPLY • link 7.6 years ago by george.ry ★ 1.2k

0

Entering edit mode

Thanks for your input @ george.ry. So how does sed work? Does it mean direct subtitution of "_ID" with "_id". I was intending to just manually replace "_ID" with "_id" hoping the implemented changes would enable the script to work. I haven't done that yet. So I am really not sure what to do right now, because even the gene annotation file with the gtf format downloaded from Ensembl plants is old version. The

ADD REPLY • link 7.6 years ago by LP ▴ 20

0

Entering edit mode

sed is a stream editor. It is replacing pattern delimited in first set of / with the one in second (e.g. ID= is being replaced by transcript_id in example above). This should probably be done as sed 's/ID=/transcript_id /g;s/Parent=/gene_id /g' yourfile > yournewfile (note the g at the end of those patterns) to replace all instances.

ADD REPLY • link 7.6 years ago by GenoMax 150k

0

Entering edit mode

Thanks genomax for your explanation of what "sed" is. So let me try to use it and see what happens.

ADD REPLY • link 7.6 years ago by LP ▴ 20

score 1 · Answer 2 · 2017-10-02

Thanks guys for your answers - very much appreciated.

st.ph.n - to answer your question as to where I got the extract_splice_sites.py, I would say the script comes with the HISAT package. I think it basically helps to improve your mapping. One has to perfom this step although in my opinion performing this step is not a prerequisite, but I might be wrong. I wish I could go further to explain how it works, but I do not want to say this that I am not sure about - I am still new in this field of bioinformatics. I am just using bioinformatics tools to understand molecular biology at another level or from a different perspective. Thanks st.ph.n again for answering my other question about how to view the ouput file of hisat. Viewing the ouputfile is not just the only problem I have. I still struggle with production of the output file itself after running the script. In other words I still need to be able to instruct the script to produce the output file from which I can then be able to use IGV or samtools to view the mapped reads. So I really don't know how to deal with this (first) problem. Why is my script not producing the output file (please see my very first post).

To Renesh, thanks so much for answering the question that I never hoped I could overcome one day. So I will look at my gene annotation file (.gff3) to see if things are in order. If not, I will then have to change the "ID" to "id" and so on, and hopefully the script will run. If however the script will not run, I then drop the use of my .gff3 file downloaded from phytozome in favour of the .gtf file from EnsemblPlants database. So I have plenty of options and it's all thanks to you.

Just one more thing; what do I have to do in order to receive posts directly from my registerd e-mail so that I can make follow-ups almost immediately. I was lucky to come across your reply by just clicking on "messages". Thanks once again.