Forum:Stranger Things: unexpected limitations of popular tools
9
19
Entering edit mode
7.5 years ago

Twice in just as many days, we were left scratching our heads as mainstream tools broke unexpectedly and spectacularly on what seemed to be normal tasks.


Cigar Break

For example take the SAM file that we now fittingly named cigarbreak.sam

 samtools view -b http://data.biostarhandbook.com/strange/cigarbreak.sam > output.bam

thanks it's close, but no cigar,

E::bam_write1] too many CIGAR operations (95243 >= 64K for QNAME "placed")
samtools view: writing to standard output failed: Value too large to be stored in data type

Turns out this particular SAM file cannot be turned into a BAM file. Suprised? Rightfully so. Why does the BAM format limit how many operations may be present in the alignment?

EDIT:

Summarizing the comments/answers:

  1. This is limitation is caused by the way the BAM format stores the value internally 16 bits short = 2^16 = 65536 max operations
  2. The CRAM format does not have this limitation.

Align it like it's still 1999

Here is another unexpected behavior. The hisat2 aligner will, occasionally, refuse to align a read that has an insertion in it.

wget http://data.biostarhandbook.com/strange/reference.fa
wget http://data.biostarhandbook.com/strange/query.fa

bwa index reference.fa
hisat2-build reference.fa reference.fa

The two sequences are identical except the query has an insertion after position 49. Let's do the alignments:

bwa mem reference.fa query.fa | cut -f 1,2,6

among other things it prints:

sequence    0   49M1I134M

This is the correct answer. The query sequence has 49 matches, 1 insertion followed by 134 matches. There are no mismatches by the way. Now let's run the same sequence via hisat2:

hisat2 -f -x reference.fa query.fa | cut -f 1,2,6

the output is:

sequence    4   *

the query sequence is not aligned at all. Make no mistake hisat2 is supposed to be an indel aware aligner. It fails for this particular sequence. Amusingly we detected this when we used hisat2 to "check and validate" results produced via bwa mem.


Post your own "stranger things" observations as followup answers.

samtools hisat2 • 9.8k views
ADD COMMENT
0
Entering edit mode

Turns out the cigar string limitation was already mentioned here: Splitting SAM/BAM alignments

ADD REPLY
0
Entering edit mode

The hisat2 example seems like a bug to me. Did you fill in a bug report?

edit: turns out the hisat2 indel issue has been reported before (Q: HISAT2 (graph index) alignment over InDels), and vivekbhr reported the issue to the authors.

edit 2: which version of hisat2? There is one fresh out of the oven - HISAT2 2.1.0 release 6/8/2017 - which maybe fixes the issue?

ADD REPLY
0
Entering edit mode

Fascinating looks like there is nothing new under the Biostar. Both issues have already been reported

In my observation, in general, insertions and deletions are handled correctly - the failure only occurs in some cases.

I've checked out the latest source of hisat2, the problem still persists.

Filed an issue here: https://github.com/infphilo/hisat2/issues/129

ADD REPLY
0
Entering edit mode

Figuring out what to put effort into allocating memory for can be a hard problem, at the start, especially for tools written when memory and disk resources were considerably more constrained. Programming is hard work! Especially when feature creep sets in and people expect more from a format than it initially allows.

ADD REPLY
8
Entering edit mode
7.5 years ago

R has a weird bug in which counting of arrays starts with index 1.

ADD COMMENT
1
Entering edit mode

The only language that gets this correct is Fortran, where arrays can start at any number you want :) Yes, knowing this and having used Fortran makes me old.

ADD REPLY
0
Entering edit mode

As far as I know also Julia does this. Which is nice, because initially, it started at 1...

ADD REPLY
3
Entering edit mode
7.5 years ago

Why does the BAM format limit how many operations may be present in the alignment?

currently the number of cigar operations is stored in a 16 bits short = 2^16 = 65536 max operations

https://github.com/samtools/htslib/issues/437

ADD COMMENT
3
Entering edit mode

Good to know - but it does demonstrate very well what Donald Knuth called "premature optimization is the root of all evil" (in software development perspective that is).

The SAM/BAM formats are choke full of premature optimizations, from the weird and semi useless flag system, to the two kinds of (different) CIGAR strings (column 6 and the MD tag) the weird tagging system, that the index is not part of the file but a separate one, etc .

It was all developed in the era when they thought shuffling back and forth the tens of millions of 76bp long Illumina sequencing is what bioinformatics will always be about.

ADD REPLY
7
Entering edit mode

I always think the biggest mistake in SAM (not BAM) is the integer flag because it is not human readable. I proposed letter flags, using one letter for one flag bit, but the proposal fell through. Probably I should have acted more firmly at the time. On the other hand, I am not sure how binary flag in BAM is a big issue. How would you encode strand, mate strand, unmapped, read1/2, etc efficiently and compactly?

The second major mistake is the 65535 cigar limitation. The decision was short-sighted. A pull request is pending to fix it for future tools.

On CIGAR+MD, MD can be regenerated and is unimportant. Furthermore, there is not an efficient and convenient way to encode CIGAR and MD together – you would need variable-length array and heap allocation, which is slow, error prone and hard to work with.

A separate index is due to stream-ability. It was not clear how to achieve both at the beginning. Later after we found the EOF marker, we realized it is possible to append the index to the end of BAM, but apparently not many were interested in this.

SAM was designed with full genome alignment in mind. That's why it has hard clipping and uses binning index, which would not be necessary for 36bp reads at the time. Due to the lack of real-world long-read data and applications at the time, SAM did not work well with long reads; the situation is not much improved even with the addition of supplementary alignment and SA several years later. I asked myself from time to time how we could design a better format for long reads if we had started over, but I could not with all the corners we have to cut here and there. It is easier to complain than to come up with a good solution.

The internal of a program can be continuously optimized without interfering with users. A format can't because later optimization may break compatibility, arguably the most important factor to a format. You have to bake early optimization decisions upfront even for imaginary use cases (e.g. long reads). A few are mistakes but some are still amendable (e.g. long cigar). The rest make BAM survive today.

ADD REPLY
2
Entering edit mode

First of all, I do recognize that criticizing and commenting - especially in hindsight is trivial. The premature optimization problem is that it leaks into the solution - we are only looking at things that we think we can optimize today.

The main problem with the SAM format is that it relies too much on the existing paradigms - where the order of columns carry information, where we must think in terms of a fixed number of columns plus a variable number of unordered tags etc. It is not an extensible format that allows it to grow seamlessly and allow new information in the future, it is not a format that could replace other formats - it only adds one extra burden.

The right way to design SAM would have been to make it so that it also offers a viable replacement to all interval type data - BED and GFF (as well as to the and the endless variety of various tab delimited formats). A format where no conversion to another similar format would ever be necessary.

For that a shift in thinking is required - it is like "duck-typing" in high level languages. Any file should work as a SAM file as long as it has the columns that a SAM format requires. Adding 100 more columns to a SAM format should not matter. Adding columns in a different order should not matter. Since the index is a separate entity anyway, there is a lot of room to implement indexing and optimization that allows for fast access to the underlying data.

I think the universal bioinformatics interval format is a header and column defined, sortable (offering multiple indices), bgzippable tab delimited format - working the way similar to how tabix works. Making this file compact, efficient etc should be independent of what the file actually contains.

The job of the "standardization" committee would be to prescribe what each column means what: for example that the header specified as start has to be a long, positive, non-zero integer in a 1-based coordinate system. Then, in addition, the committee could state that for a file to be a valid SAM it has to have a certain number of columns and what each of those columns must contain.

Note how much simpler life has just become. A tool can look at a file and provide meaningful feedback on what might be missing from a file and how to add that information. There is no need to "convert" files - this is the key - no conversion would ever be needed. You'd only need to add information to a file without losing what is already there.

ADD REPLY
2
Entering edit mode

My understanding is that to an alignment format, you still propose to have fixed columns and optional columns. The major difference from SAM would be that each line ought to have a fixed number of optional columns, as are defined in the header. This could work, but I am not sure how much it improves over SAM. The committee can't freely add fixed columns. Doing this pushes the compatibility issues to the downstream tools. In addition, the several problems with BAM you talked about (flag, cigar and MD) are due to binary encoding. Your proposal will be exposed to the same potential pitfalls.

ADD REPLY
0
Entering edit mode

Can't you achieve most of what you propose by just adding user-defined SAM tags? These are similar to "optional" columns, and have the benefit of namespacing and static typing. I guess one big loss here would be the ability to naively sort the file on tags, but sorting shouldn't be too hard, and would be just another reason to rely on a samtools API.

I think you have three features: format simplicity, ease of parsing, and ease of validation, but it's hard to optimize all three.

ADD REPLY
2
Entering edit mode

Tags are the problem - they turn a column based format into a record based format.

The file is now a mix of vertical columns and unordered horizontal records where you have to fully parse a row to figure out what type of information it might contain.

Tags should be the last resort of adding information into a file. Used when we need to store a piece of information that is very peculiar and it is still important to add that. Note what they are called: tags. Yet tags are first class citizens today, essential information that characterizes every single alignment has to be a tag, AS, MD etc because there is no other way to add a new column.

As I said before the real problem is the paradigm: we need to disentangle the order of a column from what the columns contain. If we could do that all pieces fall in place. But as long people think in terms of: "the first column will contain this, the second columns will contain that..." we'll keep ending up with files where the utility diminishes greatly over time.

It is similar to how databases work - you don't create an SQL query of column 1 to column 2 just because you know what is in column 1. The content of a column and its order should be immaterial to the action that we perform.

ADD REPLY
0
Entering edit mode

"the first column will contain this, the second columns will contain that..."

Isn't a definition like that required to codify a format though?

ADD REPLY
1
Entering edit mode

the reason that we have some many formats is that scientists confuse the "information" that a file contains with the order of the columns

start    end 
100      200

and

 end     start
 200     100

are identical from the point of view of the useful information that they contain. But somehow, in bioinformatics, it is "impossible" to even think about that these two files as being identical. Hence the paradigm shift that is needed.

Encoding external information into the order of the columns is like saying that the same file will mean something else when it is in a different folder. It kind of works - until it all breaks down.

ADD REPLY
1
Entering edit mode

While I agree with your point, during analytical process one needs to be able to examine a file/dataset to make sure that things are on track. Having ordered data (at least during the manual verification) makes this simple for mortals (like me). As an end user I don't care how the information is internally stored as long as it is done right.

ADD REPLY
1
Entering edit mode

Why would it be simpler to validate a file format where you have to know beforehand that column 2 is start versus a tabular file where the header tells you that you have a column named start?

What is simpler to communicate to a "mere mortal" that their file is missing the column named transcript id versus saying they need a correct GTF file format?

You can still validate and enforce formats to be GTF and BED and GFF if they have the right columns, but now you have the advantage that you could store other columns.

You could have a file that is BED and GFF at the same time. Once you have genes.bed.gff now all of a sudden we don't need to convert we don't need to keep the same information in two different files.

ADD REPLY
0
Entering edit mode

I was referring to visual examination of the contents (where it is easier to follow left-to-right/top-to-bottom order) and not validating file format (which would be hard to do visually).

A side note. While this order may work for many it must be equally foreign to those who use languages that are written from right-to left.

Would the order-free columns have a key at the beginning of the file that tells where a data column will be located in the file so the headers do need to be parsed?

ADD REPLY
0
Entering edit mode

Having ordered data is also nice for interoperability with normal unix tools, since you don't need to tweak your awk commands after looking at headers.

ADD REPLY
1
Entering edit mode

that's why bioawk exists to do just this - Heng actually has had all the right ideas all along - bioawk, tabix these are the tools we should build bioinformatics around - and not a new formats.

ADD REPLY
0
Entering edit mode

yeah ! there's only RDF ! let's use RDF to encode BAM !! :-)

enter image description here

ADD REPLY
0
Entering edit mode

I guess I just don't see too much of a problem with record-based formats. If you need to efficiently index columns, you'll need a database. For most unix tools you're relying on readline and so any operation on columns has to read the entire row anyway. With a record-based format you're at least validating the content of each row/column as you iterate through the file, and I gather this is your main issue - that parsing a record takes more time than splitting a line on fixed delimiters. For most of my use cases it's not the record parsing efficiency that needs optimizing.

ADD REPLY
0
Entering edit mode

I can understand why you say yours is a paradigm shift: optional columns unify multiple formats, but optional tags don't. If you had implemented this 20 years ago and a toolbox working with it as efficiently as other unix tools, your proposal might have greatly influenced bio format design – I can't really predict. If we look at it today, unfortunately, it is not as appealing. As I said above, as an alignment format, it is not really better than SAM. When you go down to binary and implementation, your proposal will be less efficient than BAM/htslib, both speed and memory wise, and will probably have less convenient C APIs (e.g. you can't access data directly via a plain C variable). These are the trade-offs you pay for flexibility. That said, I am actually interested in trying some of your ideas, but won't be soon.

ADD REPLY
0
Entering edit mode

The UCSC bigBed format is something that comes very close: it has only very few defined minimal columns (chrom, start, end, score, strand) and you can add what you want. If the file has a certain structure, it's a PSL (called bigPsl) alignment. It can also be a gene model file when it has certain columns (bigGenePred).

ADD REPLY
0
Entering edit mode

As far as I know CRAM can handle the longer CIGARs. It's not the first time that I read about this issue... See for example also this thread Splitting SAM/BAM alignments

ADD REPLY
1
Entering edit mode

yes indeed it does.

efetch -id NZ_AP014854 -format fasta -db nuccore > NZ_AP014854.fa
samtools view -O cram -T NZ_AP014854.fa -b http://data.biostarhandbook.com/strange/cigarbreak.sam > output.cram

If there ever was a reason to switch to cram this is it.

ADD REPLY
0
Entering edit mode

weird and semi useless flag system

Amen to that.

ADD REPLY
0
Entering edit mode

It is greatest flaw beyond all that is that is both confusing and plain wrong. The forward strand should not be thought of as the opposite of reverse, first in pair is not the opposite of second in pair and so on. This generates endless confusion, an immense waste of effort and money that is omnipresent yet hard to capture, demonstrate and quantify.

ADD REPLY
2
Entering edit mode
7.5 years ago

SOAPaligner hasn't been updated since 2011 and still replaces N by G in alignments: SOAPaligner 2.21 - does it replace all Ns by Gs in reads?
But then again, I don't think people still use this aligner.

OrthoMCL seems to have a bug when calculating paralogs where it keeps self-hits - PorthoMCL (reimplementation without MySQL) has a fix but also a flag to keep that bug for 'better' reproducibility: https://github.com/etabari/PorthoMCL/wiki/Manual#step-5-finding-best-hits
Description of the bug: https://github.com/etabari/PorthoMCL/wiki/Comparison#orthomcl-bug

ADD COMMENT
2
Entering edit mode
7.5 years ago
h.mon 35k

A very annoying (and wasteful, really) feature of several of the public databases of raw reads is that uploaders are not required to provide the phenotype metadata associated with the reads.

edit: coming from a evolutionary biology background, I often think about "phenotypes", but what I really meant was metadata, be it biological, or technical.

ADD COMMENT
1
Entering edit mode

Having uploaded data to the SRA I never had any phenotypes for that data in the first place (but this is, in my opinion, one of the biggest problems in bioinformatics - mountains of genomic data, hills of phenotype data)

ADD REPLY
0
Entering edit mode

Could not agree more! I would not even call it hills...

ADD REPLY
0
Entering edit mode
7.5 years ago
microfuge ★ 2.0k

Use of gff/gtf by so many tools. Although an efficient format and hierarchical but still a nightmare. Bed12 is so much easy. Never managed to cleanly convert gff/gtf-> bed12.

ADD COMMENT
2
Entering edit mode

You can't store a full GTF worth of info in bed12, the entries are always transcript-centric.

ADD REPLY
1
Entering edit mode

The main problem with the BED format is that you can only store a very limited and predetermined type of information.

ADD REPLY
0
Entering edit mode
7.5 years ago

This may be a question in itself but maybe it fits well in this thread...

What's the purpose of the SRA format? Every question about it seems to be about converting from sra to fastq which makes me wonder whether the sra format was at all necessary.

(Then of course, in hindsight it's easy to be critical and say this should have been done better)

ADD COMMENT
2
Entering edit mode

What's the purpose of the SRA format?

To annoy the rest of the world, seemingly.

ADD REPLY
2
Entering edit mode

The purpose of SRA format is to have a column-oriented datastore where each column can have a separate datatype-specific compression applied. Sequences, sequence names, quality scores, alignment information... all get compressed more efficiently than just gzipping all the data. Rows in SRA are individually addressable, so you can get a specific subset of the data without reading the whole file. The reference assembly is also linked to the data, so any operations that require comparison to the reference, e.g CRAM-like compression or calculating SAM tags, will be done against the correct genome.

@genomax is right that it's for NCBI's internal use, and I totally agree that they should have provided an easier way to dump FASTQ files without requiring their client library.

ADD REPLY
1
Entering edit mode

It perhaps made sense in how the data is handled internally at NCBI. They should have made it easy to recover fastq files transparently though. Fortunately EBI-ENA provides that functionality for most SRA accessions.

If I recall right, implementation of SRA database was done by external contractors. Rest you can imagine.

WikiPedia has the following quote:

Internally the SRA relies on the NCBI SRA Toolkit, used at all three INSDC member databases, to provide flexible data compression, API access and conversion to other formats such as FASTQ

ADD REPLY
0
Entering edit mode
7.2 years ago
ATpoint 86k

It does not relate to a certain tool, but why are databases like dbGaP, IHEC and DEEP access restricted, requiring to fill out these extensive application forms, including abstracts of your project. It is not that these databases contain manuals to build biological weapons, so why all this restriction? DEEP data are even encrypted after download, requiring additional processing time.

ADD COMMENT
1
Entering edit mode

The restricted access is meant to protect personally identifiable information. We don't yet understand what this data could be used for.

ADD REPLY
0
Entering edit mode
7.2 years ago

Cut in GNU-Linux: If you cut multiple fields or single field (from a tsv, csv or a patterned data), output will be tabular from a single file. If you cut from multiple files, it would be serial instead of tabular. At least an option to print in columns would be nice.

ADD COMMENT
2
Entering edit mode

You can cut multiple files and paste their output in columnar form easily with paste, e.g.:

$ paste <(cut -f1-5 A.txt) <(cut -f4,8 B.txt) ... <(cut -f3-5,7 N.txt) > result.txt
ADD REPLY
0
Entering edit mode
7.2 years ago
Charles Plessy ★ 2.9k

The index in BWA stores the reference sequences in a contiguous manner, therefore when the first read of a pair aligns very near the end of a chromosome and the second read aligns very near the beginning of the next chromosome, the pair may be marked as proper. Now I always postprocess bwa aln with samtools fixmate.

ADD COMMENT
0
Entering edit mode

I believe that this is a "feature" not a bug - that is was made to function like that by design, to detect chromosomal fusions.

ADD REPLY
0
Entering edit mode

I do not see how it can be made useful. For instance, in the example below, let's imagine an index containing the sequence of chromosomes a, b, c and d (in that order), and two pairs (-> / <- and => / <=), then only the pair -> / <- can be marked proper, most of the times wrongly. Even if the pair => / <= were coming from a real rearrangement, the two reads would be too far apart to be marked proper.

        ->            =>
aaaaaaaaaabbbbbbbbbcccccccccccddddddddd
          <-                        <=
ADD REPLY
0
Entering edit mode

I don't disagree with the sentiment that this behavior can be troublesome and ought to be documented.

But I will say that the meaning of the word "proper-pair" is undefined in the specification. Anything can be proper-pair as long as the aligner considers it as such. We end users often assume that "proper pair" is what we think is right.

So it is all about documentation (and in this case lack of it).

In this case, I believe that chromosomal fusions are indicated by a read pair that otherwise aligns within the expected fragment distance (proper pair) with the mates on different chromosomes.

ADD REPLY

Login before adding your answer.

Traffic: 1762 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6