MISO Sorted BAM file header mismatch
0
0
Entering edit mode
5 days ago
bimeva5703 • 0

I am working with a set of BAM files that have been aligned using the following minimap2 command:

minimap2 -a -t 20 -ax splice -uf -k14 --junc-bed gencode.v46.chr_patch_hapl_scaff.annotation.bed GRCh38.p14.genome.fa [fastq_file].fastq.gz

The fastq reads have been trimmed using fastp using default settings. I am using the comprehensive human genome from gencode downloaded as a GFF file, converted to a BED file using bed2gtf (because minimap2 only supports BED annotations).

The GFF version is indexed by Miso using this command (because Miso requires annotations to be indexed into its own format):

index_gff --index gencode.v46.chr_patch_hapl_scaff.annotation.gff gencode.v46.chr_patch_hapl_scaff.annotation_miso_indexed.gff3/

Miso also requires all sequences within a bam file to be the same length, so I wrote a script that extracts sequences of the same length from bam files passed as arguments (I've confirmed this works properly; doing a diff when merging all bam files into a single bam file and extracting the sequences of the same length from that file and running the script on individual results in the exact same file). Miso correctly detects that the file contains all sequences of the same length.

Here is an example of the miso command I am using:

miso --run gencode.v46.chr_patch_hapl_scaff.annotation_miso_indexed.gff3 fastp_default_hg38_tissue_1421_filtered.bam  --output-dir miso/ --read-len 1421

The issue is that I am getting this error:

Found reads of length 1421 in BAM.
11/18/2024 10:35:57 AM - miso_main - WARNING - It looks like your GFF annotation file and your BAM file might not    have matching headers (chromosome names.) If this is the case, your run will fail as no reads from the BAM could be matched up with your annotation.
11/18/2024 10:35:57 AM - miso_main - WARNING - Please see:
http://genes.mit.edu/burgelab/miso/docs/
for more information.
11/18/2024 10:35:57 AM - miso_main - WARNING - It looks like your GFF annotation (gencode.v46.chr_patch_hapl_scaff.annotation_miso_indexed.gff3) contains 'chr' chromosomes (UCSC-style) while your BAM file (fastp_default_hg38_tissue_1421_filtered.bam) does not.

My script merges the bam files that contain sequences of the same length into a temporary combined bam file which I then extract the header from, a prepend to the file where the actual sequences are extracted. Here is a truncated example:

@HD     VN:1.6  SO:coordinate
@SQ     SN:chr1 LN:248956422
@SQ     SN:chr2 LN:242193529
@SQ     SN:chr3 LN:198295559
....
@PG     ID:minimap2     PN:minimap2     VN:2.17-r941    CL:minimap2 -a -t 20 -ax splice -uf -k14 --junc-bed gencode.v46.chr_patch_hapl_scaff.annotation.bed GRCh38.p14.genome.fa FAX47061_pass_5429d939_37e9d431_0_fastp_default.fastq.gz
@PG     ID:samtools     PN:samtools     PP:minimap2     VN:1.19.2       CL:samtools view -bS FAX47061_pass_5429d939_37e9d431_0_fastp_default.fastq.gz.sam
....
@PG     ID:samtools.2   PN:samtools     PP:samtools.1   VN:1.21 CL:samtools merge -b 178_filter_temp.txt fastp_default_hg38_tissue_178_temp_merged.bam
@PG     ID:samtools.3   PN:samtools     PP:samtools.1-2410E84F  VN:1.21 CL:samtools merge -b 178_filter_temp.txt fastp_default_hg38_tissue_178_temp_merged.bam

(All of the individual commands are listed for every file included in the merged file)

The same gencode hg38 annotation file is used for the minimap2 alignment and is the indexed miso file as well, so you wouldn't necessarily expect a header mismatch. The gtf2bed conversion seemed to function perfectly fine so I don't think that's the issue.

I'm trying to get the isoforms from the bam files using miso. Maybe there's a better way of going about getting sorted files with sequences of the same read length and there's better software to use now (I know miso hasn't been updated in like a decade). Maybe there's an issue with the header merging, I'm not sure. I would appreciate any help on this.

bam miso sam minimap2 • 251 views
ADD COMMENT
1
Entering edit mode

the error message seems very clear about the problem:

Your chromosome names do not match.

You show a lot of code above, but what are the chromosome names in the files you are using?

Maybe you have extra chromosomes in one file vs the other, and so on.

Chromosome name mismatch is a common error and vexing at that (hence the detailed error message), so I would look at that first.

ADD REPLY
0
Entering edit mode

Which files do you mean specifically? The bam header vs the gff annotations?

Is there any software or tools out there that can help with this?

ADD REPLY
1
Entering edit mode

Extract the names of the chromosome for the reference file and for your GFF file and do a comparison.

This would only require a few command line commands with some little caveats:

# Extract chromosome names from the reference genome
samtools faidx genome.fa
cat genome.fai | cut -f 1 | sort > a

# Extract the unique chromosome names from the annotation file
cat annotations.gff  | grep -v '#' | cut -f 1 | sort -u > b

at this point a and b are sorted files with chromosome names - you can compare these with comm to extract common and different elements

# Chromosomes that are present only in file b would be
comm -1 -3 a b 

and so on

ADD REPLY
0
Entering edit mode

ok, doing this on GRCh38.p14.genome.fa.fai and gencode.v46.chr_patch_hapl_scaff.annotation.gff3 (https://www.gencodegenes.org/human/) yields:

comm -1 -3 a b

Nothing

comm -2 -3 a b
GL000008.2
GL000208.1
....
KI270864.1
KN196482.1
....
KQ031386.1
....
MU273350.1

comm -3 a b yields the same output as comm -2 -3 a b.

Not sure if I can upload the full output.

ADD REPLY

Login before adding your answer.

Traffic: 1409 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6