Question

Custom Genome Curation in IGV

0

4 months ago

kerfuffle ▴ 20

Hello,

I have a series of HERV transcripts (in FASTA format) that I'd like to arrange into a usable reference genome for IGV. When I try to index the FASTA and pass in the .fa file as input to the "Load genome..." prompt, the chromosome denominations at the top of the UI disappear and the HERVs themselves don't show up as genes in the bottom track (where the "RefSeq" would normally be).

With this in mind, I'm wondering how I can curate a custom genome exclusively containing this set of 15-20 HERVs. What files would I have to prepare? Would this entail constructing a generic genome on the hg38 build and then constraining it to the loci I want to investigate? Any advice or feedback would be really appreciated.

For reference, here's a sample of what one of the entries in my FASTA looks like:

>HERVL74_2q11.2::chr2:100276116-100280003(+)

TATACTGAAACATTTAACCAAAACATAAAAGGGTGCC...

Please let me know what you think, thanks!

IGV • 953 views

ADD COMMENT • link updated 4 months ago by rfran010 ★ 1.3k • written 4 months ago by kerfuffle ▴ 20

score 2 · Accepted Answer · 2024-07-13

2

4 months ago

rfran010 ★ 1.3k

You could try simplifying the FASTA header names.

Otherwise, the way you did it is the correct way to load a custom genome into IGV. The chr:x-y box is replaced by the header and location of fasta records.

So you could type in something like,

HERVL74_2q11.2:1-100

If you want annotations, load a bed or gtf file.
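As a rough sketch of the header simplification suggested above: the snippet below keeps only the part of each FASTA header before the `::` separator (or before the first whitespace). The function names are hypothetical, and it assumes headers look like the example in the question. The FASTA would need to be re-indexed after renaming.

```python
import re


def simplify_header(header: str) -> str:
    """Return the record name before '::' or the first whitespace."""
    name = header.lstrip(">").strip()
    return re.split(r"::|\s", name, maxsplit=1)[0]


def simplify_fasta(lines):
    """Yield FASTA lines with shortened headers; sequence lines pass through."""
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            yield ">" + simplify_header(line)
        else:
            yield line


if __name__ == "__main__":
    demo = [
        ">HERVL74_2q11.2::chr2:100276116-100280003(+)\n",
        "TATACTGAAACATTTAACC\n",  # shortened demo sequence
    ]
    print("\n".join(simplify_fasta(demo)))
```

Shorter names without colons are also easier to type into the locus box, since IGV reads a locus as `name:start-end`.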

ADD COMMENT • link 4 months ago by rfran010 ★ 1.3k

0

Thanks for the feedback! I've tried to change the nomenclature of the FASTA headers but the issue still persists. I suspect it has to do with the HERV prefix, since I imagine if it were something like "chr19:1-100" the software would respond better. Here's a screenshot of what I'm seeing right now:

[Screenshot not available]

If you have any ideas, feel free to reach out! I appreciate your help :)

ADD REPLY • link 4 months ago by kerfuffle ▴ 20

0

Thanks, so what is the issue? It seems you are viewing 1,060 bp of one of your HERVs. What else are you trying to do?

ADD REPLY • link 4 months ago by rfran010 ★ 1.3k

0

Sorry, I should have clarified -- when I try to open up my BAM file (with the actual reads), I get an error message saying "[file] doesn't contain any sequence names which match the current genome". The file contains prefixes like chr1, chr2, chr3, etc., hence why I wonder how to rectify that mismatch.

ADD REPLY • link 4 months ago by kerfuffle ▴ 20

1

Oh, I see! You need to align the reads to your custom genome.

If the genome you load into IGV differs in any way from the genome you aligned the reads to, the displayed information will be inaccurate.
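One way to check for this kind of mismatch before opening IGV is to compare the sequence names in the genome's `.fai` index with the reference names reported by `samtools idxstats` for the BAM. This is only a sketch: the helper names are hypothetical, and the demo text is made up to mirror the situation in this thread.

```python
def fai_names(fai_text: str) -> set:
    """Sequence names from a FASTA .fai index (first tab-separated column)."""
    return {line.split("\t")[0] for line in fai_text.splitlines() if line.strip()}


def idxstats_names(idxstats_text: str) -> set:
    """Reference names from `samtools idxstats` output, skipping the '*' row."""
    names = set()
    for line in idxstats_text.splitlines():
        if not line.strip():
            continue
        name = line.split("\t")[0]
        if name != "*":  # '*' is the row for unplaced reads
            names.add(name)
    return names


def shared_contigs(fai_text: str, idxstats_text: str) -> set:
    """Names present in both the genome index and the BAM."""
    return fai_names(fai_text) & idxstats_names(idxstats_text)


if __name__ == "__main__":
    # Hypothetical demo data: a HERV-only genome vs. a BAM aligned to hg38.
    fai = "HERVL74_2q11.2\t3887\t16\t60\t61\n"
    idx = "chr1\t248956422\t100\t0\nchr2\t242193529\t50\t0\n*\t0\t0\t5\n"
    print(shared_contigs(fai, idx) or "no shared sequence names")
```

An empty intersection corresponds to the "doesn't contain any sequence names which match the current genome" message.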

Maybe it would be easier if you just loaded your HERVs as annotations so that they show up like the refseq genes?
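If the FASTA headers all follow the `name::chrom:start-end(strand)` pattern shown in the question, a BED annotation can be derived from them directly. A minimal sketch, assuming the coordinates in the header are already BED-style (0-based start, exclusive end), which may not hold depending on how the FASTA was produced; `header_to_bed` is a hypothetical helper name.

```python
import re

# Matches headers such as ">HERVL74_2q11.2::chr2:100276116-100280003(+)"
HEADER_RE = re.compile(
    r"^>?(?P<name>[^:]+)::(?P<chrom>[^:]+):(?P<start>\d+)-(?P<end>\d+)"
    r"\((?P<strand>[+-])\)$"
)


def header_to_bed(header: str) -> str:
    """Turn one FASTA header into a BED6 line (score set to 0)."""
    m = HEADER_RE.match(header.strip())
    if m is None:
        raise ValueError(f"unrecognised header: {header!r}")
    # Assumption: start/end are used as-is, i.e. already 0-based half-open.
    return "\t".join(
        [m["chrom"], m["start"], m["end"], m["name"], "0", m["strand"]]
    )


if __name__ == "__main__":
    print(header_to_bed(">HERVL74_2q11.2::chr2:100276116-100280003(+)"))
```

The resulting file can be loaded on top of hg38 alongside a BAM aligned to hg38, so the sequence names match.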

ADD REPLY • link 4 months ago by rfran010 ★ 1.3k

0

Great, loading them as annotations has worked! I do still have one concern - the genes show up as continuous blocks (shaded blue), whereas the RefSeq genes have different structures for exons, introns, and UTRs. Do you know if there's a way to incorporate sequence information when loading a transcript from annotation? Or is this something I'd have to do through the genome file specification? Thanks for your help!

ADD REPLY • link 4 months ago by kerfuffle ▴ 20

1

What's the format of your annotation file?

If you format it the way a gene GTF file would be, with gene/exon entries, it may work the way you want.

It might also work with BED12 format, which specifies exon/intron locations through its block fields.

If you're not sure how, I can try to format one of your entries as an example. Would just need the details, like exon locations etc.
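For illustration, a BED12 line can be assembled from a list of exon intervals roughly as follows. The exon coordinates in the demo are made up, `bed12_line` is a hypothetical helper, and exons are assumed to be 0-based half-open `(start, end)` pairs on one chromosome.

```python
def bed12_line(chrom, name, strand, exons):
    """Build one BED12 line from 0-based half-open exon intervals."""
    exons = sorted(exons)
    chrom_start = exons[0][0]
    chrom_end = exons[-1][1]
    block_sizes = ",".join(str(end - start) for start, end in exons)
    # blockStarts are offsets relative to chromStart
    block_starts = ",".join(str(start - chrom_start) for start, _ in exons)
    fields = [
        chrom, chrom_start, chrom_end, name,
        0,                       # score
        strand,
        chrom_start, chrom_end,  # thickStart/thickEnd: whole feature drawn thick
        0,                       # itemRgb
        len(exons),              # blockCount
        block_sizes, block_starts,
    ]
    return "\t".join(str(f) for f in fields)


if __name__ == "__main__":
    # Hypothetical two-exon feature
    print(bed12_line("chr2", "demo_HERV", "+", [(100, 200), (300, 450)]))
```

In this layout the gaps between blocks are what get drawn as thin intron lines.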

ADD REPLY • link 4 months ago by rfran010 ★ 1.3k

0

I've currently got my annotation file set up as a BED with pretty minimal info (chrom, chromStart, chromEnd, name, strand); here's a screenshot of a few entries to help you get a better idea of what it looks like:

[Screenshot not available]

Regarding exon/intron locations, I believe I might actually have that information. There is a reference HERV GTF file I used in the past, which looks something like this (first five rows):

[Screenshot not available]

Is it possible to account for the exon information in the BED or would I have to switch over to GTF? If so, what changes would I need to make for the GTF format (I know 0-based vs. 1-based indexing is a big difference)? Perhaps even taking a subset of this GTF with my HERVs of interest would do the trick. Let me know if you have any ideas.

Thanks again for your help, it really is very much appreciated!
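On the indexing point, for reference: BED intervals are 0-based with an exclusive end, while GTF intervals are 1-based with an inclusive end, so only the start value shifts when converting. A minimal sketch with hypothetical helper names:

```python
def bed_to_gtf_coords(bed_start: int, bed_end: int) -> tuple:
    """0-based half-open (BED) -> 1-based inclusive (GTF)."""
    return bed_start + 1, bed_end


def gtf_to_bed_coords(gtf_start: int, gtf_end: int) -> tuple:
    """1-based inclusive (GTF) -> 0-based half-open (BED)."""
    return gtf_start - 1, gtf_end


if __name__ == "__main__":
    # A 100-base feature starting at the first base of a sequence
    print(bed_to_gtf_coords(0, 100))
    print(gtf_to_bed_coords(1, 100))
```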

ADD REPLY • link 4 months ago by kerfuffle ▴ 20

1

Oh, never mind -- it seems that just subsetting my original GTF worked!! Thanks so much for your help.
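For anyone wanting to do the same, subsetting a GTF to a set of loci can be sketched like this. It keeps any line whose attribute column contains one of the wanted IDs as a quoted value. The attribute keys in the demo lines are assumptions, since the real file's layout is not shown here, and `subset_gtf` is a hypothetical helper.

```python
import re


def subset_gtf(lines, wanted):
    """Yield GTF lines whose attributes quote any ID in `wanted`.

    Comment lines starting with '#' are passed through unchanged.
    """
    wanted = set(wanted)
    for line in lines:
        if line.startswith("#"):
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # skip malformed lines
        quoted_values = set(re.findall(r'"([^"]*)"', fields[8]))
        if quoted_values & wanted:
            yield line


if __name__ == "__main__":
    # Hypothetical demo records
    demo = [
        'chr2\trmsk\texon\t100276117\t100280003\t.\t+\t.\tgene_id "HERVL74_2q11.2";\n',
        'chr1\trmsk\texon\t1\t10\t.\t+\t.\tgene_id "OTHER_1p1";\n',
    ]
    for kept in subset_gtf(demo, {"HERVL74_2q11.2"}):
        print(kept, end="")
```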

ADD REPLY • link 4 months ago by kerfuffle ▴ 20

1

Yes, that's certainly the easier way to do it!

I'm curious where you got that reference GTF file. I only work a little with transposons (and in mouse), but I haven't seen a reference that contained gene/exon structure.

ADD REPLY • link 4 months ago by rfran010 ★ 1.3k

1

Yeah, so I used a tool called Telescope for transposable read alignment, and its authors provide an annotation database of curated GTF files from RepeatMasker and L1Base that can be used to run the software. I believe the files are built on hg19 and hg38, unfortunately, but I imagine you could apply their methodology to mouse data as well. Here are some links that might help: Telescope, Annotation DB.