Bin-arrays

Introduction

Bin-arrays are a compact way of storing a large collection of lists while allowing modifications in O(1) time. They share many of the characteristics of doubly-linked lists and are applicable to many of the situations where doubly-linked lists would be used. Bin-arrays are most useful, however, when dealing with large amounts of data, and when using a minimal amount of memory is more important than having extremely fast lookups, insertions, or deletions.

Structure and Algorithms

A precise statement of the problem being solved is, how do you compactly store a large collection of homogeneous lists of varying lengths while allowing deletions of arbitrary elements and insertions at arbitrary positions in constant time?

Figure 1. Minimal Representation

As a basis for evaluating the effectiveness of our proposed solution, consider the simpler problem of storing a known set of values in an immutable structure.
Figure 1 above shows the most compact way of storing the pair of lists [1, 2, 3] and [4, 5]. The general scheme is to place all elements of each list in a contiguous block of slots in a single array. If this is done carefully, we can even do without explicitly recording the index of the last element of each list, and thereby save space. Specifically, we can deduce the index of the end of a given list from the index of the start of the following list.

Figure 2. Linked-List Representation

The most straightforward solution to the stated problem is shown in Figure 2 above. The general scheme is to allocate a node for each element in each list and to link together the elements using forward and backward pointers. This approach is easy to implement, easy to understand, and reasonably compact. However, the memory usage is not as low as it could be for the following reasons. First, allocating a large number of small objects can cause an undesirable amount of overhead in the general purpose memory allocation routines of a given language (for example, malloc() in C). Some implementations of memory allocators are better able to deal with this situation than others, but it is probably not unreasonable to assume that there will be undesirable overhead. (Note that in C++, a programmer can write a class-specific operator new() to dramatically reduce, though perhaps not entirely eliminate, the overhead.) Second, using a separate object for each element entails a nontrivial amount of fixed overhead per object in Java. This is so because every class in Java must derive from Object, which means that even an empty object takes up around 16 bytes.

Figure 3. Array-based Linked-List Representation

Figure 3 above shows an approach which avoids both these problems by allocating only three arrays rather than lots of individual objects. The general scheme is to store each element in an array and to use a pair of array indices to link together the elements. Memory usage, however, may still not be as low as could be desired. If the user data consists simply of 4-byte integers and each link data is a 4-byte integer, then there is a constant 200% overhead in this representation. When storing a million elements, this amounts to 8MBs of overhead. This may seem like a tolerable amount of overhead given the large amounts of memory common in even desktop systems these days, but it still seems worthwhile to try to reduce the memory usage even further. It is quite likely that we may want to deal with not just a million elements, but say ten million elements, or even fifty million elements. These numbers imply overheads of 80MBs and 400MBs, which are sizable amounts even today.

Figure 4. Bin-arrays Representation

The bin-arrays data structure, shown in Figure 4 above, is a modification of the array-based linked-list approach. It reduces memory usage by reducing the overhead of the link data. In this scheme, the elements of a given list are stored in one of four arrays. The array chosen depends on the length of the list. A list of length 1 (for example, [6]) is stored in the 1-bins array; one of length 2 (for example, [4, 5]) is stored in the 2-bins array; one of length 3 (for example, [1, 2, 3]) is stored in the 3-bins array; and finally, one of length 4 or longer (for example, [7, 8, 9, 10, 11, 12, 13]) is stored in the threaded 4-bins array. Inserting to, or deleting from, a list may cause that list to move to a different array. For example, appending to the list [6] will cause it to move successively to the 2-bins array, then the 3-bins array, and finally to the 4-bins array. Similarly, deleting from the list [1, 2, 3] will cause it to move successively to the 2-bins array and then to the 1-bins array. Two bits are reserved in the index stored in the index array for indicating which of these four arrays actually contains the elements of a given list.

The 1-bins, 2-bins, and 3-bins arrays store user data contiguously in fixed-size blocks. Because each stored list fits entirely within a block, we can do without storing any link data. Hence, the memory usage in this scheme for a collection of lists containing fewer than 4 elements is near optimal.

The 4-bins array stores the lists that have 4 or more elements. The approach is similar to the array-based linked-list representation except that there are now four data elements per a pair of forward and backward links. There are also a few other slight differences. The "previous" link of the first bin of a given list stores the index of the last bin of the same list. This makes it possible to find the last bin in constant time, in order to do appends, and to determine when the end of the chain of bins has been reached during traversals. Another difference is that the "next" link of the last bin gives the size of the list. In combination with the "previous" link of the first bin, this makes it possible to determine the length of any list in constant time. Finally, because each bin has room for four elements, and because not all of the slots may be occupied at a given moment, two bits in the "previous" link of each bin is reserved for recording the current occupancy (assuming that all data items are packed at the beginning of the bin).

There are two distinct algorithms for modifying the lists in the 4-bins array. One algorithm assumes that the lists are unordered, and the other that they are ordered. All lists in a given 4-bins array must be of the same type, either ordered or unordered.

The algorithm for unordered lists is the simpler of the two and always ensures that the 4-bins array is used in the most optimal way. The procedures for inserting and deleting an element is as follows: (i) to insert an element, simply append it to the end of the list, and (ii) to delete an element, overwrite the slot of the element with the last element in the list. See Figure 5 below.

Figure 5. Modification Algorithm for Unordered Lists

The algorithm for ordered lists is only a little more involved. We introduce some notation to ease the discussion. Number the bins that make up a list sequentially from zero, and let u(i) be the number of data items actually occupying bin i; that is, u_i is the usage of bin i.

The procedure for inserting an element into bin i is as follows:

  1. if u(i) < 4, then insert into bin i;
  2. else if u(i-1) < 4, then insert into bin i-1;
  3. else if u(i+1) < 4, then insert into bin i+1;
  4. else split bin i into two bins; redistribute the elements accordingly; and insert into one of the bins, ensuring the proper ordering of elements.

The procedure for deleting an element from bin i is as follows:

  1. remove the element from bin i, decrementing the usage count;
  2. if u(i) = 0, then unlink bin i from the chain of bins;
  3. else if u(i-1) + u(i) <= 4, then combine bins i-1 and i;
  4. else if u(i) + u(i+1) <= 4, then combine bins i and i+1;
  5. else nothing further need be done.

Analysis of Memory Usage

We assume the minimum amount of memory needed to store a dataset is determined directly by the number of bytes in the dataset itself. So we consider that a dataset of one million 4-byte integers requires a minimum of 4 million bytes, even if the values are sufficiently limited in their range to make it possible to store the same data without loss in 1 million bytes. That is, we are excluding the quite different definition of minimum memory usage that would arise in a discussion about compression. The purpose of the following analysis is to examine the degree of overhead beyond this straightforward calculation of the minimum amount.

As a basis for comparison, we note that the minimal representation uses exactly the minimum amount of memory. Hence, the ratio waste/data is zero for this scheme. To determine this ratio for the object-based linked-list representation, suppose that each data element takes up 4 bytes, that each link pointer takes up 4 bytes, and that each object takes up a further 16 bytes as a consequence of descending from Object in Java. Then waste/data = (2*4 + 16) / 4 = 6. To determine this ratio for the array-based linked-list representation, suppose the same sizes as given above for data and links. Then waste/data = 2*4 / 4 = 2. The determination of this ratio for the bin-arrays scheme is more complicated; however, we can see immediately that if all lists have fewer than 4 elements, then waste/data = 0.

Therefore, without a loss of generality, we can restrict our analysis to lists with 4 or more elements. Moreover, since the insertion and deletion algorithms work on each list independently of all others, we can further restrict our analysis to the memory usage of a single list. So suppose that we have a list of n >= 4 elements. There are two analyses to perform, one for the case of unordered lists and another for ordered lists.

We first introduce some notation. Let k be the number of slots available for user data each bin. Let l be the number of slots taken up by links (or other administrative data) in each bin. Let n = kq + r, 0 <= r < k be the number of elements in a list. Finally, let m be the number of bins taken up by a list. In our case, k is equal to 4, and l is equal to 2. In the following analysis, assume that each data element and each link takes up 4 bytes.

It is easy to see that when a list is unordered all the bins except the last will be fully occupied. Hence, if a list has n = kq + r, 0 <= r < k, elements, then the overhead equals the sum of l slots of link data per bin and k-r data slots wasted in the last bin; that is,

             { lq + l + k - r   l + (l + k - r) / q
             { -------------- = -------------------, if r > 0    (1)
     waste   {     kq + r             k + r/q
     ----- = {
     data    { lq   l
             { -- = -, if r = 0                                  (2)
             { kq   k
      

As n goes to infinity, q goes to infinity, so that Eq. (1) approaches l/k, which is Eq. (2). In fact, Eq. (2) is the lower bound of Eq. (1) as Figure 6 below shows for the case when k = 4 and l = 2. The overhead is lowest whenever n is a multiple of k. The overhead is highest when n = k + 1. In Figure 6, we can see that the overhead peaks at 1.4 when n = 5, remains less than or equal to 1.0 for all other values of n, and sinks to a minimum of 0.5 whenever n is a multiple of 4.

Figure 6. Overhead in Unordered Lists

The analysis for ordered lists is a little more complicated. The restriction that the elements be kept in order means that the bins will not in general stay as well-packed as unordered lists in the face of arbitrary insertions and deletions. However, the insertion and deletion procedures ensure that the worst-case memory usage remains bound. They do this by maintaining the following two invariants on the structure of the chain of bins that make up a list:

  1. if m = 1, then u(0) = k;
  2. for each 0 <= i < m - 1, u(i) + u(i+1) > k.
The first says that if a list takes up only one bin, then that bin must be fully occupied. The second says that the usage of any two adjacent bins must add up to more than k. The second therefore implies that u(i) > 0 for all i. It is immediately obvious that these invariants hold when a list is first put into the 4-bins array, which happens exactly when the list has k elements. We can easily verify that these invariants hold as arbitrary elements are inserted and deleted.

First, consider the insertion algorithm when inserting an element into bin i. By hypothesis, the invariants hold true, and so u(i-1) + u(i) > k and u(i) + u(i+1) > k. It is easy to see that these relations will continue to hold if any of the u(j) values is increased by one, which is what would happen if an element was inserted into bin i-1, i, or i+1. So suppose that none of the three adjacent bins has room. Then u(i-1) = u(i) = u(i+1) = k, and bin i will be split into two bins, i' and i'', with the elements distributed in some manner between the two. The result will be that u(i') > 0, u(i'') > 0, and u(i') + u(i'') = u(i) + 1 = k + 1. This implies that u(i-1) + u(i') = k + u(i') > k and u(i'') + u(i+1) = u(i'') + k > k. Hence the invariants will be maintained.

Next, consider the deletion algorithm when deleting an element from bin i. First, consider the case where bin i has only one element. By hypothesis, the invariants hold true, and so u(i-1) + u(i) > k and u(i) + u(i+1) > k. That is, u(i-1) > k - 1 and u(i+1) > k - 1. This means that u(i-1) = u(i+1) = k, and hence u(i-1) + u(i+1) > k. Therefore the invariants will hold after bin i is unlinked from the chain. Next, consider the case where bin i has more than one element. Let i' be bin i after the element has been deleted. If u(i-1) + u(i') > k and u(i') + u(i+1) > k, then we are done. So suppose that u(i-1) + u(i') <= k. Then the deletion algorithm will combine bins i-1 and i' into bin i'' (see Figure 7 below). In the resulting chain, bins i-2 and i'' are adjacent, and bins i'' and i+1 are adjacent. We need to show that the second invariant holds in each pair. First, note that u(i'') > u(i-1) and u(i'') > u(i'), since u(i'') = u(i-1) + u(i'). Now consider the first pair. By hypothesis, u(i-2) + u(i-1) > k. Therefore, u(i-2) + u(i'') > u(i-2) + u(i-1) > k. Now consider the second pair. Similar reasoning shows that u(i'') + u(i+1) > u(i') + u(i+1) > k. Therefore the invariants hold in each pair. The other possibility when bin i has more than one element--namely, u(i') + u(i+1) <= k--is similar. Hence, in all cases, the invariants are maintained after deletion.

Figure 7. Steps in the Deletion of an Element

The invariants ensure that the worst-case memory usage is bounded for any given list length. Intuitively, the invariants impose a lower limit on how thinly the elements may be spread out in a list by a sequence of insertions and deletions. The worst-case memory usage is determined only by the number of bins used, and in particular is independent of the arrangement of the elements within any bin. It is easy to see that the arrangement of the elements does not matter, since a list of n elements taking up m bins always has a waste of (k+l)m - n slots, regardless of the arrangement. Hence, using the maximum number of bins possible for a given list length leads to the most waste. Conversely, if a list is stored in such a way that it creates the most waste, then it will necessarily span the maximum number of bins possible for its length. For suppose not. Then a list of n elements is stored with the maximum waste in x bins, and yet it is possible to store the same list in y > x bins. The waste for y bins is (k+l)y - n >= (k+l)(x+1) - n = (k+l)x - n + (k+1) which is greater than the waste for x bins--a contradiction.

Therefore, the worst-case memory usage for a list of a given length is determined precisely by the maximum number of bins that the list can span under the invariants. Suppose a list has length n = (k+1)q + r, 0 <= r < k+1. Then the maximum number of bins z that the list can span is given by:

                   { 2q + 1, if r > 0                            (3)
               z = {
                   { 2q, if r = 0                                (4)
      
This is a consequence of the second invariant. The reasoning is as follows. For a list spanning m bins, the invariant requires that u(i) + u(i+1) > k, 0 <= i < m-1. In particular, it requires that the sum of the usage of each successive pair of bins be greater than k; that is, u(0) + u(1) >= k+1, u(2) + u(3) >= k+1, and so on (see
Figure 8 below). In light of this, consider the case where r = 0, that is, where n = (k+1)q. Suppose on the contrary that the list spans z' > z = 2q bins. Then the list spans the bins 0, 1, ... , 2q-1, 2q, ... , z'-1. In particular, it must be the case that u(2q) >= 1. Now, as we saw, the second invariant requires that u(0) + u(1) >= k+1, u(2) + u(3) >= k+1, ... , u(2q-2) + u(2q-1) >= k+1. Hence, u(0) + u(1) + ... + u(2q-1) + u(2q) >= (k+1)q + u(2q) > (k+1)q. This implies that the list has more than (k+1)q elements, which is a contradiction. Therefore, if n = (k+1)q, then the list can span no more than z bins. Next, consider the case where r > 0. For this case also, suppose the contrary. Then the list spans z' > z = 2q + 1 bins. Reasoning as in the previous case establishes that u(0) + u(1) + ... + u(2q-1) >= (k+1)q. This means that the remainder of the elements, which number at most 0 < r < k+1, must be in the bins 2q, ... , z'-1. Since z' > z, bins 2q and 2q+1 at least are a part of the list. They are also adjacent bins, and so the second invariant requires that u(2q) + u(2q+1) >= k+1. However, there are only r elements left with which to fill them, and r < k+1. Therefore z' cannot be greater than z, and this finally shows that z is an upper bound in all cases.

Figure 8. Usage of Successive Pairs of Bins

We have shown that z is an upper bound on the maximum number of bins that a list can span. It is in fact exactly the maximum value. There is a distribution of elements allowed by the invariants which causes a list to span exactly z bins. The distribution is as follows. Suppose a list has length n = (k+1)q + r, 0 <= r < k+1. Then distribute the leading (k+1)q elements in q adjacent pairs of bins such that the first bin of the pair has a usage of 1, and the second has usage k. If there are any remaining elements, that is, if r > 0, then put them all in a final bin (note that r <= k). See Figure 9 below.

Figure 9. Worst-case Distribution of Elements

Knowing now the maximum number of bins z that a list may span, and that the waste for a list of length n spanning z bins is (k + l)z - n, we can compute the ratio of waste to data as follows:

     waste    (k + l)z - n   (k + l)z
     ----- = ------------- = -------- - 1                        (5)
     data          n            n

             { (k + l)(2q + 1)       2(k + l) + (k + l)/q
             { --------------- - 1 = -------------------- - 1, if r > 0 (6)
             {  (k + 1)q + r             k + 1 + r/q
           = {
             { (k + l)2q       2(k + l)
             { --------- - 1 = -------- - 1, if r = 0            (7)
             {  (k + 1)q         k + 1
      
Note how the ratio depends only on k and l in Eq. (7), and how Eq. (6) approaches Eq. (7) as q goes to infinity, which happens when n goes to infinity.
Figure 10 below shows the worst-case memory usage plotted alongside the best-case memory usage. The best-case memory usage of ordered lists is the same as the memory usage of unordered lists of the same length. For lists of length 4, 5 or 9, the figures for the best and worst cases coincide. The worst-case memory usage peaks at length 6 to 2.0 and stays less than 2.0 for all other lengths. We can also see the worst-case memory usage converging to the value of Eq. (7), namely 1.4.

Figure 10. Overhead in Ordered Lists

Conclusion

Bin-arrays can store large amounts of data more compactly than in common implementations of doubly-linked lists, while still allowing modifications in constant time. The best straightforward implementation of doubly-linked lists--the array-based linked-list implementation--has a fixed waste-to-data ratio of 2. Bin-arrays, even in the worst case, do no worse. The amount of overhead in real use depends too much on the actual dataset and the actual sequence of insertions and deletions to be estimated. However, we can still make a few observations. If all the lists have 3 or fewer elements, then memory usage will be just a negligible amount above the minimum. If the lists are unordered, then memory usage will always be better than the array-based linked-list implementation. In this case, the waste-to-data ratio reaches a maximum of 1.4 for a list of 5 elements and remains at or below 1.0 for all other lengths. If the lists are ordered, and they are only appended to, then memory usage will be the same as for the case of unordered lists. If the lists are ordered, and elements are inserted into arbitrary positions, and elements at arbitrary positions are deleted, then worst-case memory usage can equal that of array-based linked lists in the case of lists of length 6. For lists of all other lengths, even the worst-case memory usage will be better than that of array-based linked lists. Hence, in situations where reducing memory usage is critical, bin-arrays may be a promising approach to implementing a collection of doubly-linked lists.


Young Hyun
Last modified: Mon Dec 11 11:05:32 PST 2000