
Big Dictionaries I: query/update tradeoffs

Andy Twigg | February 22, 2011

I’ve been asked to write a short series of technical posts about the problem of efficiently implementing a big dictionary — the fundamental data structure underlying file systems, databases and the Acunu Storage Core — and in the process hopefully shed some light on our technology.

In this first post I want to discuss query/update tradeoffs for unversioned dictionaries. That might be a bit of a mouthful at first, but it’s not too complicated. The plan for future posts looks something like this: in part 2 we’ll examine versioned dictionaries and some tradeoffs there. In part 3, we’ll consider the problem of building a cache-oblivious versioned dictionary, and why you’d want such a beast.

Before diving in, let me summarise a bit where Acunu sits in this landscape. We’ve developed the first suite of cache-oblivious (and “cache-aware”) algorithms that support fully-versioned dictionaries with almost-optimal query/update tradeoffs. They also keep these bounds when applied to SSDs, which have a somewhat different access model to traditional disks. Efficient versioning is a very powerful tool, and our schemes are the first that allow dynamic query/update tradeoffs for versioned data, in particular allowing very high update rates for fully versioned metadata.

Most of the algorithms I’ll cover in these posts aren’t in textbooks (yet), although several can be found in research papers. We plan to publish papers with more details and proofs; we’ll post links when they become available.

The Dictionary

An ordered dictionary supports two basic operations:
* update(key,value) associates value with key (overwriting any previous value, if any);
* query(start, end) returns all (k,v) pairs where k is between start and end.
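
To pin down these semantics, here is a minimal in-memory sketch in Python (the class and method layout is my own, purely illustrative; the interesting problem, discussed below, is supporting this interface when the data doesn’t fit in memory):

```python
# Minimal in-memory sketch of the ordered dictionary interface (illustrative only).
import bisect

class OrderedDictionary:
    def __init__(self):
        self._keys = []    # sorted list of keys
        self._values = {}  # key -> value

    def update(self, key, value):
        """Associate value with key, overwriting any previous value."""
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def query(self, start, end):
        """Return all (key, value) pairs with start <= key <= end, in key order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_right(self._keys, end)
        return [(k, self._values[k]) for k in self._keys[lo:hi]]
```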

We’re interested in ‘big’ dictionaries, by which I mean that there are far too many keys to fit into internal memory, so we must resort to using external memory, such as disk. For the past couple of decades, the B-tree has been the data structure of choice, largely because queries are fast — indeed, asymptotically optimal. But what does this mean? In order to talk more precisely about bounds on query and update cost, we need a cost model. The ‘disk access model’ (DAM) assumes we have a disk and some memory, and that data is always read or written in blocks of B elements (in other words, it’s not worth reading or writing any less than this). In practice, a large SATA disk typically has an effective block size of around 256KB-512KB (but this view of the world is quite oversimplified, as we’ll discuss in part 3!).

A trivial tradeoff

With N elements, a B-tree has height O(log_B N), so updates and queries both run in O(log_B N) IOs. It’s been shown that queries in the DAM model cannot be done faster than this on average. However, it’s clear with a little bit of thought that update cost can be significantly lower if we’re willing to sacrifice query cost for the resulting structure. Consider an unordered append-only log. Inserts are super-fast: 1/B block accesses on average. However, because it’s unordered, queries must scan through the entire log, and are doomed to be horrendously slow: O(N/B) IOs.
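
As an illustration of that second extreme (my own toy code, not anything from Acunu), here is the unordered log: updates only touch a block when an in-memory buffer fills, but a range query has no choice but to read every block.

```python
# Toy sketch of an unordered append-only log (illustrative only).
# A "block" here is just a list of B (key, value) pairs.
B = 4  # toy block size; in the DAM discussion above, B is more like a few thousand

class AppendOnlyLog:
    def __init__(self):
        self.blocks = []   # simulated on-disk blocks, in append order
        self.buffer = []   # in-memory buffer of fewer than B entries

    def update(self, key, value):
        # Amortized ~1/B block writes per update: we only write out full blocks.
        self.buffer.append((key, value))
        if len(self.buffer) == B:
            self.blocks.append(self.buffer)
            self.buffer = []

    def query(self, start, end):
        # No ordering to exploit, so every block must be read: O(N/B) IOs.
        # Scanning in append order means later updates overwrite earlier ones.
        latest = {}
        for block in self.blocks:
            for k, v in block:
                if start <= k <= end:
                    latest[k] = v
        for k, v in self.buffer:   # finally, the unwritten in-memory tail
            if start <= k <= end:
                latest[k] = v
        return sorted(latest.items())
```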

So technically update cost has been improved, but only at the expense of making queries ridiculously slow. This raises the question: are there better tradeoffs between query and update performance?

Having one’s cake and eating it

There is a very interesting tradeoff between query and update performance, and B-trees achieve only one point on this tradeoff.
Data structures that achieve much better update performance without resorting to linear scans on queries are known, for example log-structured merge (LSM) trees. Over the last decade, the algorithms community has developed a much better understanding of these tradeoffs, many of them with matching upper and lower bounds. In particular, consider the following (folklore?) scheme, based on the ‘logarithmic method’.


Maintain a set of geometrically-growing sorted arrays arranged in ‘levels’: level k contains at most 2 arrays of at most 2^k B entries each, indexed via an embedded B-tree. Updates go to an in-memory buffer of size B; when it fills, its contents are sorted and written out into level 0 along with an index, and then, for each level in turn, if there are two arrays in the level, we merge them and write out a new array (with its associated B-tree index) into the level above, deleting the input arrays when done. Since the two input arrays to this merge process at level k each have at most 2^k B entries, the output has at most 2^(k+1) B entries, as required for membership in level k+1. Queries are answered by querying the B-tree index of each array and returning the most recent result (if any).
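
Here is a toy Python sketch of that scheme, with in-memory sorted lists standing in for the on-disk arrays and bisect standing in for their embedded B-tree indexes (the names and structure are mine; it is meant to show the mechanics, not to be an efficient implementation):

```python
# Toy sketch of the 'logarithmic method' scheme described above (illustrative only).
import bisect

B = 4  # toy buffer size; think "a few thousand" for a real disk block

class LogStructuredDict:
    def __init__(self):
        self.buffer = {}   # in-memory buffer of at most B entries
        self.levels = []   # levels[k]: at most 2 sorted arrays of at most 2^k * B pairs

    def update(self, key, value):
        self.buffer[key] = value
        if len(self.buffer) >= B:
            self._flush()

    def _flush(self):
        array = sorted(self.buffer.items())
        self.buffer = {}
        k = 0
        # Insert the new array at level 0; whenever a level holds two arrays,
        # merge them into a single array one level up.
        while True:
            if k == len(self.levels):
                self.levels.append([])
            self.levels[k].append(array)
            if len(self.levels[k]) < 2:
                return
            older, newer = self.levels[k]
            self.levels[k] = []
            merged = dict(older)
            merged.update(dict(newer))     # the newer array's values win
            array = sorted(merged.items())
            k += 1

    def query(self, start, end):
        # Newest data first: the buffer, then level 0 upwards; within a level,
        # the later of the two arrays is newer. First hit per key wins.
        result = {}
        for k, v in self.buffer.items():
            if start <= k <= end:
                result[k] = v
        for level in self.levels:
            for array in reversed(level):
                i = bisect.bisect_left(array, (start,))   # stands in for the B-tree index
                while i < len(array) and array[i][0] <= end:
                    result.setdefault(array[i][0], array[i][1])
                    i += 1
        return sorted(result.items())
```

Note that an update never modifies an existing array in place: all writes are sequential writes of whole new arrays, which is exactly where the IO savings come from.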

After inserting N elements there are at most about log_2(N/B) levels, so queries cost O(log_2(N/B) * log_B N) IOs (one B-tree lookup per array). The update cost can be calculated by considering the merges: every merge involves a sequential merge of two arrays of size at least B, so on average costs O(1/B) IOs per element. At worst, every element is involved in log_2(N/B) merges, so the average (or amortized) update cost is O(log_2(N/B) / B) IOs per insert.

Let’s put some real numbers to this. B is typically 256KB/(100 bytes per element), say 2500, and log_2 N is around 38 for 25TB of data. Then this scheme does at most around 38/2500 ≈ 0.015 IOs/insert, while a B-tree needs around log_B N ≈ 3-4 IOs. A factor of over 200x! Meanwhile, queries have slowed down by a factor of roughly log_2(N/B), a few dozen. With some work, this can be improved a lot, but it shows what is possible.
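
For the sceptical, here is the back-of-envelope arithmetic behind those figures (my own check, reusing the 25TB and 100-bytes-per-element assumptions from above):

```python
# Back-of-envelope check of the figures above (my arithmetic, not Acunu's code).
import math

element_bytes = 100                        # assumed size of one key/value pair
B = 2500                                   # ~256KB block / 100-byte elements, rounded
N = (25 * 1024**4) // element_bytes        # ~2.7e11 elements in 25TB

lsm_insert = math.log2(N) / B              # ~38/2500 ≈ 0.015 IOs per insert
btree_io   = math.log(N, B)                # ~3.4 IOs per B-tree update or query
lsm_query  = math.log2(N / B) * btree_io   # ~27 B-tree lookups' worth per query

print(f"LSM insert ≈ {lsm_insert:.3f} IOs ({btree_io / lsm_insert:.0f}x fewer than a B-tree)")
print(f"LSM query  ≈ {lsm_query:.0f} IOs ({lsm_query / btree_io:.0f}x more than a B-tree)")
```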

This scheme is far from perfect (e.g. the avid reader will have noticed that the worst-case update cost is far worse than the average case), but it gives a flavour of how these kinds of data structures work; this sort of data structure is roughly what underpins Google’s BigTable, and its open source relatives including Cassandra, HBase, and so on. More recently, Tokutek’s fractal tree data structure shows how one can improve the lookup performance of this idea to O(log_2 N) by applying a technique known as fractional cascading. Indeed, one can prove that their query/update bounds are asymptotically optimal. Thus, we have two points on the optimality curve: B-trees and fractal trees.

As good as it gets

Tradeoffs like these are ubiquitous in fundamental computer science problems, so you won’t be surprised to hear that a whole range of possible tradeoffs has been worked out. The following is possible, and known to be asymptotically optimal: for any constant 0 < ε ≤ 1, we can achieve updates in O(log_B N / (ε B^(1-ε))) IOs, and lookups in O(log_B N / ε) IOs.

I’m not going to go into any more detail than this, but it’s interesting to note that the tradeoff is exponentially asymmetric – if we’re willing to tolerate a small increase in lookup cost, we can potentially obtain a much larger reduction in update cost. To see this, set ε = 1/2, which, for the values of B used earlier, makes updates around √B/2 = 25 times faster than a B-tree while making queries slower by a factor of 2.
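
To make the asymmetry concrete, here is a quick numerical evaluation of those bounds (constant factors ignored) for a few values of ε, using the same B and N as before; again, this is just my own illustration.

```python
# Evaluate the optimal tradeoff curve (ignoring constant factors) for a few epsilons.
import math

B, N = 2500, 2.7e11
logB_N = math.log(N, B)   # the B-tree cost, i.e. the epsilon = 1 point

for eps in (1.0, 0.5, 0.25):
    update = logB_N / (eps * B ** (1 - eps))
    lookup = logB_N / eps
    print(f"eps = {eps:<4}: update ≈ {update:.4f} IOs, lookup ≈ {lookup:.1f} IOs")
```

The ε = 1 row is the B-tree; the ε = 1/2 row is the point discussed above, with roughly 25x cheaper updates for 2x more expensive lookups.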

In Part II, I look at what happens if we want to add versioning to our dictionary – that’s where it starts to get really interesting!