Question

Viewing the tophat output (mapped.bam and junctions) along with genome coordinates/bases and mRNA annotations

3

9.7 years ago

kirannbishwa01 ★ 1.6k

I very much like IGB and its features. While I have been able to make good use of it, I have run into a problem that I can't solve no matter how much I try. I am trying to view aligned TopHat output (mapped.bam and junction files from RNA-seq data aligned to the reference A. lyrata genome). When I load the lyrata genome in IGB, I can see the genome coordinates and the TAIRmRNA database (the annotated .gff file). But after I load a mapped.bam and junctions file, I am not able to see the aligned reads together with the reference and the annotation.

I did figure out that the mapped.bam and junctions files create their own set of scaffolds below the default set (a one-to-one copy of the default set, though I am not sure why). So if I select a scaffold that the mapped.bam file created, I can see the mapped reads and the junctions, but not the coordinate bases or the annotations. With the A. thaliana genome there is no such problem: I can view the mapped output and junctions from RNA-seq data along with genome coordinates and bases, the TAIR10 mRNA database, and several other databases from other labs.

Also, I see that an updated version of the Phytozome data is available (V10.2). Is the A. lyrata data available in IGB (V7) the same as V10.2?

Thanks,
Bishwa K.

alignment RNA-Seq igb • 6.0k views

ADD COMMENT • link updated 2.2 years ago by Ram 44k • written 9.7 years ago by kirannbishwa01 ★ 1.6k

3

If you performed the alignment yourself, it might be a good idea to load the fasta file together with the gtf file you used for the alignment and visualize the mapping in IGB. Most of the time, the problem is just a mismatch in sequence names. And just in case, you might want to read this if you want to know how to index your fasta file

ADD REPLY • link updated 2.2 years ago by Ram 44k • written 9.7 years ago by Sam ★ 4.8k

1

This also works. To open your fasta file in IGB and use it as the reference, select File > Open Genome from File. (Or click the blue and red DNA icon in the toolbar.)

IGB will then display a window that lets you select a fasta or 2bit format file to use as the reference sequence. (Better to use 2bit - it's much faster to read :-)

You can also enter a genome version and species names. It's optional, but if you do that, then IGB will display the names you selected in the Species and Genome Version menus of the Current Genome tab. Otherwise IGB will assign a default name.

Then, click OK.

What happens at that point is that IGB will scan your reference sequence file, make a list of all the chromosomes and their sizes, and then list them in the Sequence table in the Current Genome tab.
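To preview what that scan should find, here is a minimal Python sketch that lists sequence names and sizes from a fasta file. It is not IGB's own code; the function name `fasta_sizes` and the demo sequences are made up for illustration, and it assumes the sequence name is the first word of each header line.

```python
import io

def fasta_sizes(handle):
    """Return a list of (name, length) pairs in file order."""
    sizes = []
    name, length = None, 0
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                sizes.append((name, length))
            # Assumption: the sequence name is the first word of the header.
            parts = line[1:].split()
            name, length = (parts[0] if parts else ""), 0
        else:
            length += len(line)
    if name is not None:
        sizes.append((name, length))
    return sizes

# Tiny made-up fasta, only to show the output shape (name, tab, size).
demo = io.StringIO(">scaffold_1 demo header\nACGT\nAC\n>scaffold_2\nGGG\n")
for seq_name, seq_size in fasta_sizes(demo):
    print(f"{seq_name}\t{seq_size}")
```

For a real file, pass `open("your_reference.fa")` instead of the demo handle and compare the printed names with the ones in your BAM header.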

At that point, you can open your files as you would normally, including your GTF file. IGB can read GTF files produced by Cufflinks.

It can also read some GFF3 files. However, GFF3 files are sometimes not read correctly because different groups interpret the GFF3 specification differently and it's hard to make sure that all GFF3 files will work with IGB. For this reason, we recommend using BED or BED-Detail to represent gene models in IGB.

If you use BED-Detail, make a regular BED file. Then add a column 13 with whatever you want the gene title to be (e.g., TP53) and add a column 14 with whatever descriptive text you'd like to see in the Selection Info tab when you click on the gene. For column 4, insert the name of the gene model, e.g., AT1G07350.1 if it's Arabidopsis. For examples, see the "bed" files on the QuickLoad site - there are many examples from many different species. The text you insert into columns 4, 13, and 14 will be available for searching under the Advanced Search tab, so it's useful to add text you think will be helpful for search, like gene name and gene function.
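As a rough illustration of the layout described above, this Python sketch appends a title (column 13) and a description (column 14) to a regular 12-column BED line. The helper name `to_bed_detail` is hypothetical, and the coordinates, title, and description in the demo are made up; only the gene model name AT1G07350.1 comes from the text above.

```python
def to_bed_detail(bed12_line, title, description):
    """Append title (column 13) and description (column 14) to a BED12 line."""
    fields = bed12_line.rstrip("\n").split("\t")
    if len(fields) != 12:
        raise ValueError(f"expected 12 BED columns, got {len(fields)}")
    return "\t".join(fields + [title, description])

# Made-up coordinates and blocks; column 4 holds the gene model name.
bed12 = "\t".join([
    "chr1", "100", "500", "AT1G07350.1", "0", "+",
    "150", "450", "0", "2", "100,150", "0,250",
])
print(to_bed_detail(bed12, "example title", "example description for Selection Info"))
```

Keeping the first 12 columns untouched means the same file can still be read as plain BED by tools that ignore the extra columns.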

ADD REPLY • link updated 2.2 years ago by Ram 44k • written 9.7 years ago by Ann ★ 2.4k

1

9.7 years ago

alolex ▴ 960

IGB has released a new version, v8.3.4, which does contain A. lyrata, but with gene models from Phytozome v7. However, from the release notes at http://phytozome.jgi.doe.gov/pz/portal.html it seems A. lyrata has not changed and is still at v1.0. You need to make sure the aligned data you are loading uses the same sequence names as the genome version in IGB; it looks like they are named scaffold_1, scaffold_2, etc. The sequence is only visible when you are zoomed in quite far, but if it doesn't show up when you are zoomed in, try clicking the "Load Sequence" button. If the sequence still doesn't load, then you probably have mismatching names in the files you are using. You should be able to just drag and drop your bam and junctions files onto IGB, zoom in to the area you want to study, and click "Load Data" and "Load Sequence".

ADD COMMENT • link updated 2.2 years ago by Ram 44k • written 9.7 years ago by alolex ▴ 960

1

9.7 years ago

Ann ★ 2.4k

Yes, do please put the files into a folder on Google Drive or into a Public Dropbox folder, along with a short document giving the URL or other information indicating where the file came from.

I can then take a look at the files and use them to get a better idea of how to update the IGB QuickLoad site.

Most useful would be:

fasta file from Ensembl/iPlant
fasta file from JGI/Phytozome
annotations file (gtf) from Ensembl
annotations file (gtf) from iPlant community folder

Also helpful would be a few references to recent RNA-Seq (or other *Seq) papers featuring data from A. lyrata - anything where the authors would have run some type of alignment program against a reference genome.

Most genome projects do the same basic pipeline, but with variations. So usually I like to look at recent papers where researchers used the genome - from that, I can get a sense of what resources are available and how (or if) to re-format them for use with IGB.

Also, when you get to this point, you might want to set up your own IGB QuickLoad site to share data with other people in your lab and/or collaborators in other labs. You can distribute the data on a Web site or use a "Public" Dropbox folder.

The Dropbox folder approach is a bit of a hack in that I'm almost positive Dropbox has no idea scientists are using their service in this way, but so far it seems to work great. If your lab can afford $100/year for 1 TB of storage on Dropbox, then this could be a great option. At least in my experience, most *Seq data sets from smaller scale experiments (3 reps, 2 conditions) are 30 GB or smaller. And since 1 TB = 1,000 GB, you could probably share a lot of data sets this way.

Details are here: https://wiki.transvar.org/display/igbman/Set+up+a+Quickload+in+Dropbox

You can also do something similar with iPlant, but IGB connectivity to iPlant is still a bit experimental, and we hope to work with the iPlant engineering team to make it easier. So that will also be an option in the near future.

Let me know when the files are ready and I will take a look!

Best,
Ann

ADD COMMENT • link updated 2.2 years ago by Ram 44k • written 9.7 years ago by Ann ★ 2.4k

0

Hi Ann,

Thanks a lot for your input.

I like the idea of setting up a QuickLoad server on Dropbox. I have about 20 GB of space on Dropbox, so I think that should suffice for now.

I am sharing the releases that I downloaded as bulk data from Ensembl/iPlant and JGI/Phytozome. They contain fasta, gtf, and several other feature and annotation files, which I left in the shared folder in case they are helpful in some way.

https://www.dropbox.com/sh/0j6ja1je56epgl7/AAAo_-RjnsmZBtrFnUVh46oYa?dl=0

Regarding research articles, we have not found any labs that have done RNA-seq analysis using the lyrata reference. The paper that describes the lyrata genome assembly and its comparison with its relative A. thaliana is Hu et al. (2011): http://www.nature.com/ng/journal/v43/n5/full/ng.807.html?WT.ec_id=NG-201105 Hope this is helpful.

Please let me know if you have any questions.

Thanks again,
Bishwa K.

ADD REPLY • link updated 2.2 years ago by Ram 44k • written 9.7 years ago by kirannbishwa01 ★ 1.6k

0

Just a quick note:

There are two different folders of JGI releases which contain different files, and I shared both of them.

Thanks,

ADD REPLY • link 9.7 years ago by kirannbishwa01 ★ 1.6k

0

9.7 years ago

kirannbishwa01 ★ 1.6k

Thank you all for your help. It all makes sense to me, and I now think there is more than one problem. The first problem I found is that the size of the lyrata genome file differs between sites. JGI (Joint Genome Institute) and Phytozome have the lyrata reference fasta file at 199 MB, while the downloads from Ensembl and iPlant have it at 200 MB. I am not sure, but it seems this slight variation in file size (and in the data within it) could be the first source of variation in the index files created from them. If I am wrong, please correct me.

IGB has the lyrata reference sequence in 2bit format, so I am not sure which release it matches. But the alignment problem seems to be mainly due to naming: the same number of scaffolds is created when loading the default IGB lyrata genome as when loading the reference fasta file from iPlant (community data folder), but the scaffolds are named differently. So for now I will see whether creating a personal synonyms file helps.

Also, the .gtf file from the Ensembl release (83 MB) differs in size from the one in the iPlant community data folder (67.1 MB), which makes me think that Ensembl now has more annotated genes than the iPlant .gtf file.

I have just begun analyzing my data and am new to bioinformatics, so some of my assumptions could be wrong. I would be happy to hear any feedback. Also, it would be great if we could have an updated version of the lyrata genome sequence and gene model annotation in IGB.

I could share the files that I have downloaded from the various servers. Please let me know.

Thank you so much all of you.

- Bishwa K.

ADD COMMENT • link 9.7 years ago by kirannbishwa01 ★ 1.6k

0

You should not conclude that two fasta files are different just based on a 1 MB difference in file size. The content that matters could be exactly the same, but the larger file may have more meta information or more details in the fasta headers, or may have been generated by a different program, any of which could result in a slightly different file size. For example, just adding 1 million spaces will increase the file size by about 1 MB. Have you looked at the actual content of the fasta files? Do the headers look the same? The gtf file size difference does look significant though. Can you give us more details on the sequence of steps and the exact files you are trying to load into IGB?

ADD REPLY • link updated 2.2 years ago by Ram 44k • written 9.7 years ago by alolex ▴ 960

0

Good point. The fasta headers could be identical, but the number of bases per line might be different, which could also make the file sizes different thanks to different number of newline characters.

ADD REPLY • link 9.7 years ago by Ann ★ 2.4k

0

Yep! Another possibility. Bottom line is any number of formatting changes could be altering the file size without actually changing the content. For fasta files I found this python script that looks like it will tell you if anything is different in content (https://www.cgat.org/downloads/public/cgat/documentation/scripts/diff_fasta.html), but I've not needed to use it yet. For GTF files I'm not aware of a tool that does a direct comparison, but I found this post (http://r.789695.n4.nabble.com/Comparing-two-gff-gtf-files-de-novo-transcripts-v-s-reference-td3934629.html) that points to rtracklayer in Bioconductor.
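For a quick check without extra tools, a minimal Python sketch along these lines could tell whether two fasta files hold the same sequences regardless of header text, line wrapping, or letter case. This is not the linked diff_fasta script; the function names and demo data are made up, and uppercasing means soft-masking differences are deliberately ignored.

```python
import hashlib
import io

def fasta_digests(handle):
    """Map each sequence name to a SHA-256 digest of its uppercased bases."""
    digests = {}
    name, digest = None, None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                digests[name] = digest.hexdigest()
            parts = line[1:].split()
            name = parts[0] if parts else ""
            digest = hashlib.sha256()
        elif digest is not None:
            # Line breaks are already stripped, so wrapping width does not matter.
            digest.update(line.upper().encode("ascii"))
    if name is not None:
        digests[name] = digest.hexdigest()
    return digests

def same_sequences(handle_a, handle_b):
    """True if both files hold the same sequences, ignoring names and order."""
    a = sorted(fasta_digests(handle_a).values())
    b = sorted(fasta_digests(handle_b).values())
    return a == b

# Made-up demo: different names, wrapping, and case, same bases.
file_a = io.StringIO(">scaffold_1 some description\nACGTAC\nGT\n")
file_b = io.StringIO(">1\nacgt\nACGT\n")
print(same_sequences(file_a, file_b))
```

Comparing digests rather than names is a deliberate choice here, since the two sources may name the same scaffold differently.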

ADD REPLY • link updated 2.2 years ago by Ram 44k • written 9.7 years ago by alolex ▴ 960

0

9.6 years ago

kirannbishwa01 ★ 1.6k

Thank you all for the inputs. The information you all provided has been helpful.

Hi Ann,

Could you please let me know if the files/folders that I shared were of any help in updating the annotation in IGB? There was a recent IGB release, but for the A. lyrata genome (and its annotation) I don't see any difference. Could you please update me on this?

Thanks much,

Bishwa K.

ADD COMMENT • link 9.6 years ago by kirannbishwa01 ★ 1.6k

0

Yes, the data is very useful - thank you! It looks like the gene annotations in IGB currently are up-to-date, but we need to change the names to make them match the other data sets. Once we do that, it will be much easier to use IGB to view your data. We need to double-check a few things and hope to have a new release ready very soon.

ADD REPLY • link 9.6 years ago by Ann ★ 2.4k

1

Phytozome and Ensembl both report 695 scaffolds and 32,667 gene models for genome assembly version 1.0 for A. lyrata. This is also what is present in IGB QuickLoad. So I think IGB QuickLoad is up-to-date. However, I noticed that the gene model names for IGB QuickLoad are numbers like "311229," which are not very informative. IGB provides a google search feature that lets you search google using gene model ids as a query, and I doubt this would be very useful unless we use better gene model ids. I noticed that the files you provided from JGI include a synonyms file that maps these numeric ids onto ids like: fgenesh1_pm.C_scaffold_1000009, and google searching with these ids in some cases turned up useful information. So I am going to modify the IGB gene models file to use these ids instead of the numeric ids being used now. I'll post again when that's done.
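A minimal Python sketch of this kind of id swap is below. It is not the code actually used for IGB QuickLoad. It assumes, hypothetically, that the synonyms file is tab-delimited with the numeric id in the first column and the longer id in the second, and that the gene models are in BED format with the id in column 4. The pairing of the two ids in the demo is for illustration only.

```python
def load_id_map(lines):
    """Build {numeric_id: longer_id} from tab-delimited lines (assumed layout)."""
    id_map = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2:
            id_map[fields[0]] = fields[1]
    return id_map

def rename_models(bed_lines, id_map):
    """Yield BED lines with column 4 replaced when a synonym is known."""
    for line in bed_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) > 3:
            # Ids without a synonym are left unchanged.
            fields[3] = id_map.get(fields[3], fields[3])
        yield "\t".join(fields)

# Illustration only: this pairing and these coordinates are made up.
synonyms = ["311229\tfgenesh1_pm.C_scaffold_1000009"]
bed = ["scaffold_1\t100\t500\t311229\t0\t+"]
for renamed in rename_models(bed, load_id_map(synonyms)):
    print(renamed)
```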

ADD REPLY • link 9.6 years ago by Ann ★ 2.4k

1

I also noticed that iPlant and Ensembl are using names 1, 2, 3, ... 8 in place of the names scaffold_1, scaffold_2, ... , scaffold_8, which are the names JGI and Phytozome are using. These scaffolds (1 through 8) appear to correspond to physical chromosomes of A. lyrata. I think it would be useful for IGB QuickLoad to synchronize with Ensembl and also iPlant, which appears to be getting its data from Ensembl. So I will change the genome files in IGB QuickLoad to use names 1, 2, 3, etc instead of scaffold_1, scaffold_2, etc.
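The renaming could be sketched in Python roughly as follows. This is an illustration rather than the code actually used; it assumes genome.txt rows are tab-delimited with the sequence name first, that only scaffold_1 through scaffold_8 are renamed (the `n_chromosomes` parameter is a hypothetical knob), and the sizes in the demo are made up.

```python
import re

def ensembl_style_name(name, n_chromosomes=8):
    """Map scaffold_N to N for N in 1..n_chromosomes; leave other names alone."""
    match = re.fullmatch(r"scaffold_(\d+)", name)
    if match and 1 <= int(match.group(1)) <= n_chromosomes:
        return str(int(match.group(1)))
    return name

def rename_genome_rows(rows, n_chromosomes=8):
    """Rename the first tab-delimited column of each row."""
    renamed = []
    for row in rows:
        fields = row.rstrip("\n").split("\t")
        fields[0] = ensembl_style_name(fields[0], n_chromosomes)
        renamed.append("\t".join(fields))
    return renamed

# Made-up sizes, for illustration only.
for row in rename_genome_rows(["scaffold_1\t1000", "scaffold_9\t500"]):
    print(row)
```

The same name function would need to be applied to every annotation file that refers to these sequences, so that names stay consistent across the data set.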

ADD REPLY • link 9.6 years ago by Ann ★ 2.4k

0

Hi Ann, Thanks a lot for fixing these issues. A. lyrata has 8 physical chromosomes, so scaffolds 1-8 should correspond to Chr 1-8, while 9 and 10 should represent the mitochondrial and chloroplast genomes. All the other scaffolds should be unmapped regions.

Thanks a lot again,

ADD REPLY • link 9.6 years ago by kirannbishwa01 ★ 1.6k

1

Quick followup: Looks like scaffold_9 is bigger than scaffold_10. Also, scaffold_9 has many gene annotations and scaffold_10 has none. The gene models on scaffold_9 look a lot like genes from nuclear chromosomes - they have introns, exons, splicing.

ADD REPLY • link 9.6 years ago by Ann ★ 2.4k

0

Another followup:

Code I used to change gene model names and scaffold names is (mostly) here: https://bitbucket.org/aloraine/alyrata

ADD REPLY • link updated 2.3 years ago by Ram 44k • written 9.6 years ago by Ann ★ 2.4k

1

The new files are now available on IGBQuickLoad. IGB is now using the longer, non-numeric names for gene models.

ADD REPLY • link 9.6 years ago by Ann ★ 2.4k

Ram · Accepted Answer · 2015-06-24

Hello,

It sounds to me like the reference genome you used for the alignment step is using different names for scaffolds than the version of the sequence IGB is using.

Some useful info:

IGB is getting reference genome sequence and gene model annotations from a publicly accessible IGB QuickLoad site located at:

http://www.igbquickload.org/quickload

The various genomes we support are contained in folders for each genome, named for the species and the month and year of the genome assembly release.

It looks like our latest A. lyrata genome is in here:

http://www.igbquickload.org/quickload/A_lyrata_Apr_2011/

IGB uses a file called "genome.txt" to populate the list of chromosome/scaffold sequences you see in the "Current Genome" table (right side tab):

http://igbquickload.org/quickload/A_lyrata_Apr_2011/genome.txt

If you download that file and open it in Excel or a text editor, you can see all the names of the chromosomes and their sizes.

The sequence data, which IGB will load when you click the "Load Sequence" button, is in a "2bit" format file called A_lyrata_Apr_2011.2bit. We are using the "2bit" format because it's very compact and there are many utilities for working with it - mostly available from Jim Kent and the UCSC Genome Bioinformatics group. The 2bit format has some nice features that makes accessing sequence data fast and easy for IGB.

If you load a BAM file into IGB and notice that all-new sequence names are getting added to the Current Sequence table and when you click the "Load Sequence" button, no sequence gets loaded, then that usually means: the genome.txt and 2bit files don't contain the sequences you used to run your alignment. This could happen if your genome version is different or if it's the same version but is just using different names.
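One way to check this before loading anything is to compare the sequence names in the BAM header with the names in genome.txt. The Python sketch below assumes you have the header as text (for example from `samtools view -H mapped.bam`) and that genome.txt is tab-delimited with the sequence name in the first column. The function names and the demo names and sizes are made up for illustration.

```python
def sam_header_names(header_text):
    """Collect the SN: values from @SQ lines of a SAM/BAM text header."""
    names = []
    for line in header_text.splitlines():
        if line.startswith("@SQ"):
            for field in line.split("\t")[1:]:
                if field.startswith("SN:"):
                    names.append(field[3:])
    return names

def genome_txt_names(text):
    """Take the first tab-delimited column of each non-empty line."""
    return [line.split("\t")[0] for line in text.splitlines() if line.strip()]

def unmatched(bam_names, genome_names):
    """Names used in the alignment that the genome listing does not know."""
    known = set(genome_names)
    return [name for name in bam_names if name not in known]

# Made-up header and genome.txt content, for illustration only.
header = "@HD\tVN:1.0\tSO:coordinate\n@SQ\tSN:1\tLN:1000\n@SQ\tSN:scaffold_9\tLN:500\n"
genome = "scaffold_1\t1000\nscaffold_9\t500\n"
print(unmatched(sam_header_names(header), genome_txt_names(genome)))
```

Any names printed here are the ones IGB would add as new sequences, and they are the candidates for a synonyms entry.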

If using different names is the problem, then you can give IGB a list of synonyms that IGB can use to match names. So for example, if the reference genome sequence you used to do your alignments contains a sequence called "FooBar" which is the same sequence that IGB calls "foobar123", then you can tell IGB the two names mean the same thing by adding a personal synonyms "chromosomes.txt" file to IGB.
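As a rough sketch, the lines for such a file could be generated in Python as below. It assumes the synonyms file is tab-delimited with one line per sequence listing all equivalent names; please check the Personal Synonyms documentation for the exact format IGB expects. The helper name `synonym_lines` is hypothetical.

```python
def synonym_lines(groups):
    """Return one tab-delimited line per group of equivalent sequence names."""
    lines = []
    for group in groups:
        names = [name.strip() for name in group if name.strip()]
        if len(names) < 2:
            raise ValueError(f"a synonym group needs at least two names: {group!r}")
        lines.append("\t".join(names))
    return lines

# The two names from the example above; add one group per sequence.
for line in synonym_lines([("foobar123", "FooBar")]):
    print(line)
```

Saving the printed lines to a text file gives a starting point for the personal synonyms file.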

For more info about that, see: https://wiki.transvar.org/display/igbman/Personal+Synonyms

Let us know if you need any help with this.

Also, if there is a more recent version of A. lyrata genome, we'd be happy to add it to the IGB QuickLoad system - including sequence and gene model annotations.

So