Question

Issues adding GTF track to split genome on IGV

0

Entering edit mode

4 months ago

Daisy • 0

Hello all,

I am currently attempting to visualise a genome on IGV, which I can successfully load my sequencing reads to, but am unable to add the GTF track to the reference genome due to discrepancies with the chromosome IDs and coordinates.

The genome file contains chromosomes which are split into two parts per chromosome, while the gtf file uses coordinates of the entire chromosome. I used this genome file to map my sequencing reads using HISAT2.

What I do have is a bed file which contains information on how the two parts of the chromosome map to the whole chromosome.

Is there a way to reformat my GTF file such that the coordinates and chromosome names match those of the split chromosome genome file, perhaps using the BED file?

Alternatively, is there a way to successfully load my reads, which were mapped using the split chromosome genome, to the intact reference genome (which I have no issues adding the GTF track to in IGV, but my reads won't map to it).

Many thanks in advance

**Genome file:** 
>chr1A_part1
ACCTCGACCTA

**GTF file:** 
chr1A   GenomeAnnotation       exon    10092   17001   .       -       .       transcript_id "Gene1.1"; gene_id "Gene1"; gene_name "Gene1";

**BED file**
chr1A_part1 0   100000000   chr1A   0   100000000 
chr1A_part2 0   50000000    chr1A   100000000   150000000

IGV • 679 views

ADD COMMENT • link updated 4 months ago by Michael 55k • written 4 months ago by Daisy • 0

0

Entering edit mode

Is there a way to reformat my GTF file such that the coordinates and chromosome names match those of the split chromosome genome file, perhaps using the BED file?

Are you asking how to do this programmatically since there are many entries in GTF?

A "not so smart" way to do this would be to copy the part of the GTF that represents part1 into a separate file. Change the chr1A to chr1A_part1 (with a find and replace in Notepad++ should do it) and the replace the section back in your original GTF.

ADD REPLY • link 4 months ago by GenoMax 150k

0

Entering edit mode

Thanks for your response. Yes, there are many entries in the genome and GTF file. The problem with altering the names in this way is that chr1a could map to either chr1A_part1 or chr1A_part2 depending on its position, which would be a lengthy process to figure out, especially for each chromosome. This would also not update the gene coordinates to match the split genome. I think the BED file provides the information I need for conversion, but I don't know how to utilise it

ADD REPLY • link 4 months ago by Daisy • 0

0

Entering edit mode

If your GTF file is not sorted based on the coordinates then it should be easy to sort using AGAT toolkit. Then it should be a matter of figuring out where the boundaries of your part 1 and part 2 etc and edit the file.

ADD REPLY • link 4 months ago by GenoMax 150k

score 0 · Answer 1 · 2024-11-29

The problems you face after splitting the reference exemplify why this scenario should be avoided if possible. I assume you had excellent reasons for performing the alignment with split chromosomes, however, you should consider if it is possible to run the alignments against the full chromosomes.

Should that not be possible, I recommend mapping the aligned read coordinates from the SAM/BAM files back to the real genome coordinates. Even though computationally more intensive, this is safer and easier than mapping the annotation to the "broken" genome. If you used a script to break up the chromosomes, there might exist another script to fix the alignments after that.

The procedure is very straightforward:

Parse the BED file to determine the lengths of all chromosome fragments,
calculate offsets for each chromosome part as the sum of lengths of all previous parts (in your case only two parts, but this will work for all)
adjust each alignment, line by line: look up the offset for the chromosome_part and add to the coordinates (e.g. +100000000 for chr1A_part2)
replace the chromosome name with the first part (e.g.chr1A_part2 -> chr1A)
optionally adjust the SAM header with the real chromosome names and lengths or discard the header

If you have a constant split length and only two parts per chromosome, the process is even simpler and doesn't require parsing the bed file:

Add SPLITSIZE to all coordinates/POS fields if $chromosome =~ /_part2$/ (perl syntax)
Replace the chromosome names with the real chromosome name

The advantages over other procedures:

You can browse the genome like anyone else and have the same coordinates as anyone else. Therefore, you don't need to convert coordinates each time you report about a region.
You don't need to care about features bridging the splits as there won't be alignments outside of the reference. You only need to remember where those splits were as there won't be any reads or pairs bridging them.