# Filter experiments
This page documents a new kind of bucket digest that lets us efficiently (in time and space) determine the amount of overlap (shared users) between two buckets. The data record takes a little less than 1kB per bucket. The interface is:
```
digest = makeDigest(bucket)
overlap = computeOverlap(digest1, digest2)
```
makeDigest() doesn't require any sorting, so it runs in O(N) time, where N is the number of users in the bucket. computeOverlap() costs around a dozen bitwise compare operations.
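In terms of the reference code given later on this page, the two pseudo-interface calls map onto the reference routines roughly as follows. This mapping is illustrative only; getBucketFromQuery() is a hypothetical stand-in for however buckets are obtained:

```c
// Illustrative mapping only: bucket, compare, and the routines called
// here are defined in the reference code later on this page.
// getBucketFromQuery() is a hypothetical stand-in.
bucket *b1 = getBucketFromQuery(/* query 1 */);
bucket *b2 = getBucketFromQuery(/* query 2 */);

// makeDigest(): pick sampling rates, then build the three filters
setLevelsBasedOnBucketSize(b1);
setLevelsBasedOnBucketSize(b2);
for (int i = 0; i < FILTERS_PER_BUCKET; i++) {
  makeOneFilter(b1, i);
  makeOneFilter(b2, i);
}

// computeOverlap(): find a matching sampling rate and compare filters
compare c;
if (compareFullFilters(b1, b2, &c)) {
  // with the final (two-table) version below, c.overlap holds the
  // estimated percent overlap
}
```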
computeOverlap() can compare buckets of different sizes. In this case, the reported overlap is the percentage of users in the smaller bucket that are also in the larger bucket. computeOverlap() is only guaranteed to work if the larger bucket is no more than 16X larger than the smaller bucket, and never works if the larger bucket is more than 64X larger than the smaller.
computeOverlap() is quite accurate for buckets that have nearly full overlap, or nearly zero overlap. It is less accurate for buckets with partial overlap. A pair of buckets with, say, 80% overlap might be computed as having anywhere between 50% and 100% overlap. The average of 15 or so such compares, however, converges pretty well at around 80%. (More measurements to come.)
The reason we need to measure overlap between buckets is that there are a number of sophisticated attacks that involve buckets that overlap, but are not identical matches (and so won't be found by the bucket hash compare). The attacks are summarized here (note that these attacks require only the ability to isolate the victim... no other knowledge of the database is required):
Attack 1: Create two buckets. Both buckets have the same set of "base" users B1 (it doesn't matter which users or how many, as long as they are the same), but one bucket contains the victim if the victim has a given attribute, and the other bucket contains the victim if the victim does not have the attribute. Generate 40 or 50 of these pairs, each with a different set of base users. Then take the sum of the counts of all the first buckets, and the sum of the counts of all the second buckets. The set with the higher sum contains the victim:
| Set of first buckets | Set of second buckets |
|---|---|
| [B1+V] | [B1+(!V)] |
Note that creating a set of base users is as simple as hashing some attribute (say, location), and selecting the users where the last X bits of the hash value are Y. Each different Y generates a different set of orthogonal (non-overlapping) base users.
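As a minimal sketch of how such base sets can be formed (hashAttribute() and selectBaseUsers() are hypothetical names, not part of the reference code; the mixer is a splitmix64-style stand-in):

```c
#include <stdint.h>
#include <stddef.h>

// Hypothetical stand-in for a uniform hash of some attribute value.
static uint64_t hashAttribute(uint64_t a)
{
  a ^= a >> 30; a *= 0xbf58476d1ce4e5b9ULL;   // splitmix64-style mixer
  a ^= a >> 27; a *= 0x94d049bb133111ebULL;
  a ^= a >> 31;
  return a;
}

// Select the users whose hashed attribute ends in the X-bit pattern y.
// Each distinct y yields an orthogonal (non-overlapping) base set.
size_t selectBaseUsers(const uint64_t *attrs, size_t n,
                       uint64_t y, int x_bits, uint64_t *out)
{
  uint64_t mask = (1ULL << x_bits) - 1;   // low X bits
  size_t count = 0;
  for (size_t i = 0; i < n; i++) {
    if ((hashAttribute(attrs[i]) & mask) == y) {
      out[count++] = attrs[i];
    }
  }
  return count;
}
```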
To defend against this attack, we identify bucket pairs that have high overlap (with pretty high probability), regenerate the full buckets by repeating the queries, then compare the full buckets. If they differ by one or a few users, we can remove those users.
Attack 2: Create a bucket with base users B1, and another bucket with base users B2, where B1 and B2 are orthogonal. Then create a third bucket that contains both B1 and B2, as well as the victim if the victim has the desired attribute (and without the victim otherwise). Generate 40 or 50 of these three-bucket tuples. Take the sum of the counts of the individual B1 and B2 buckets. Take the sum of the counts of the combined B1+B2 buckets. Divide each sum by the number of tuples. The latter average will be roughly 1 greater than the former if the victim has the attribute:
| Set of first buckets | Set of second buckets |
|---|---|
| [B1],[B2] | [B1+B2+V] |
Here we take V to mean V OR !V, depending on whether or not V has the attribute. Alternatively, we could put V in the first set if it has the attribute, and in the second set if not:
| Set of first buckets | Set of second buckets |
|---|---|
| [B1+V],[B2+V] | [B1+B2+(!V)] |
(Moving forward, we'll assume that the attack can take either approach to inserting the victim, so we won't mention it again.)
Note that for this attack, B1 and B2 do not have to be the same size. We can detect this attack by finding tuples of buckets A, B, and C, whereby:
- both A and B have complete or nearly complete overlap with C,
- A and B have no overlap or almost no overlap with each other, and
- the sum of the sizes of A and B is very nearly the same as the size of C.
I say 'nearly' and 'almost' in the above because the attacker can introduce some chaff into the buckets if he can isolate additional users to use as plants. The more plants, however, the more obvious the attack looks in the query code.
When we find this situation, we can re-construct the buckets, and look for the attack.
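A sketch of what that detection test might look like (the 90/10 overlap thresholds and the 5% size tolerance are illustrative assumptions, not calibrated values):

```c
/*
 * Hypothetical sketch: flag a candidate Attack-2 tuple (A, B, C).
 * Each overlapXY argument is an estimated percent overlap (0..100)
 * as reported by computeOverlap(); the thresholds are illustrative.
 */
int looksLikeAttack2(int overlapAC, int overlapBC, int overlapAB,
                     int sizeA, int sizeB, int sizeC)
{
  int sizeDiff = (sizeA + sizeB) - sizeC;
  if (sizeDiff < 0) sizeDiff = -sizeDiff;
  return ((overlapAC >= 90) &&           // A nearly contained in C
          (overlapBC >= 90) &&           // B nearly contained in C
          (overlapAB <= 10) &&           // A and B nearly disjoint
          (sizeDiff <= (sizeC / 20)));   // |A|+|B| within ~5% of |C|
}
```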
Attack 3: This is a simple variant of attack 2, but where more than 2 bases are used:
| Set of first buckets | Set of second buckets |
|---|---|
| [B1],[B2],...[Bn] | [B1+B2+...+Bn+V] |
This approach is harder for us to detect. There is less overlap between any given pair of buckets, which means that we might miss some of the pairs. When this happens, it will be harder to assemble a complete N-tuple, because the sum of users on the left won't equal the sum of users on the right. On the other hand, it isn't necessary to assemble the complete N-tuple. Indeed, if we find only two buckets on the left and the bucket on the right, we'll have a very strong signal: zero overlap between left-hand buckets, and complete overlap left to right.
Note that this attack requires more samples from the attacker, because the left-hand buckets produce more noise (one noise component per bucket).
Attack 4: Another variant of the above is to have combinations of base groups on both the left and the right. For instance:
| Set of first buckets | Set of second buckets |
|---|---|
| [B1+B2],[B3+B4] | [B1+B3+V],[B2+B4+V] |
Here the signal is partial overlap left-to-right, no overlap among left-side buckets, and no overlap among right-side buckets. This is pretty messy, because one can expect a legitimate set of buckets to often have partial overlap and no overlap. What I think we'll have to do is to generate a graph, whereby a link in the graph denotes partial overlap, and then try to discover bipartite sub-graphs.
Note that this attack can be taken to an extreme. One could, for instance, make 16 buckets on each side, each with 1/16 overlap with each bucket on the other side (using 256 base groups). I don't think we'll be able to use overlap to detect something like this. Rather, we'll have to look for some other signal. (For instance, in this specific case, we would see some set of buckets with roughly the same size and little overlap.)
The basic mechanism is a variant on the Bloom filter. The principle is simple: we set certain bits in the filter according to the set of users in the corresponding bucket. The more bits that match between two filters, the higher the overlap in the corresponding buckets.
I want to keep the filters relatively small. Of course, if every user in a large bucket is placed into a small filter, then the large majority of bits will be set to 1, and the filter won't work well. Therefore, what we do is sample from the users in a deterministic way, and build our filters from the sampled users. We can sample by taking only the users whose last X bits of the User ID are zero (here assuming that the User IDs are uniformly distributed; otherwise we hash them first). The larger the bucket, the bigger the X used, resulting in a smaller proportion of sampled users.
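A minimal sketch of this sampling rule follows (note that the reference code below actually tests the high bits of the user ID rather than the last bits; for uniformly distributed IDs the two are equivalent):

```c
#include <stdint.h>

/*
 * Sketch of the deterministic sampling rule: keep a user when the top
 * 2*(level-1) bits of the (uniform) user ID are zero, for a sampling
 * rate of 1/4^(level-1). Level 1 keeps every user. This mirrors the
 * mask test in makeOneFilter() below.
 */
int isSampled(uint64_t uid, int level)
{
  if (level <= 1) {
    return 1;                            // rate 1: every user sampled
  }
  int bits = 2 * (level - 1);            // e.g. level 2 -> 2 bits -> rate 1/4
  uint64_t mask = ~0ULL << (64 - bits);  // mask of the top `bits` bits
  return ((uid & mask) == 0);
}
```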
In fact, we generate three filters per bucket, each filter 1024 bits wide, and each using a different sampling rate. We use the following numbers:
| Bucket Size | Sampling Rates |
|---|---|
| < 128 | 1 |
| >= 128, < 512 | 1, 1/4 |
| >= 512, < 2048 | 1, 1/4, 1/16 |
| >= 2048, < 8192 | 1/4, 1/16, 1/64 |
| >= 8192, < 32768 | 1/16, 1/64, 1/256 |
| >= 32768, < 131072 | 1/64, 1/256, 1/1024 |
| . . . | . . . |
Two filters can only be compared if they have the same sampling rate (and therefore are sampling from the same subset of users). What this means in practice is that if one bucket is more than 16X larger than another bucket, it may not be possible to compare them. As an example, a bucket with 200 users can be compared with a bucket with 3000 users, because they share a sampling rate of 1/4. The 200-user bucket, however, cannot be compared with a 9000-user bucket.
Reference code for selecting sampling rates is as follows (C language... sorry):
```c
#include <stdint.h>

/*
 * level=0 means no filter
 * level=1 means all bucket entries are put into the filter
 * level=2 means 1/4 the bucket entries are put into the filter
 * level=3 means 1/16 the bucket entries are put into the filter
 * level=N means that 1/(4^(N-1)) bucket entries are put into the filter
 */
#define EMPTY_FILTER 0
#define FILTER_LEN 16          // 16 * 64 = 1024 bits per filter
#define FILTERS_PER_BUCKET 3   // three filters per bucket

typedef struct one_filter_t {
  uint8_t level;
  uint64_t filter[FILTER_LEN];   // 1024 bits wide.
} one_filter;

typedef struct bucket_t {
  one_filter filters[FILTERS_PER_BUCKET];
  int bsize;        // number of entries
  uint64_t *list;   // pointer to first entry
} bucket;
```
```c
#define F_ONE_MAX 0x80               // 128
#define F_TWO_MAX (F_ONE_MAX<<2)     // 512
#define F_THREE_MAX (F_TWO_MAX<<2)   // 2048
#define MAX_LEVEL 24   // absurdly large bucket (must match the 24 mask_presets entries)

void
setLevelsBasedOnBucketSize(bucket *bp)
{
  int min, max, level;

  if (bp->bsize < F_ONE_MAX) {
    // special case, only one level possible
    bp->filters[0].level = 1;
    bp->filters[1].level = 0;
    bp->filters[2].level = 0;
  }
  else if (bp->bsize < F_TWO_MAX) {
    // special case, only two levels possible
    bp->filters[0].level = 1;
    bp->filters[1].level = 2;
    bp->filters[2].level = 0;
  }
  else {
    min = F_TWO_MAX;
    max = F_THREE_MAX;
    for (level = 1; level < MAX_LEVEL; level++) {
      if ((bp->bsize >= min) && (bp->bsize < max)) {
        bp->filters[0].level = level;
        bp->filters[1].level = level+1;
        bp->filters[2].level = level+2;
        break;
      }
      min <<= 2;
      max <<= 2;
    }
  }
  // MAX_LEVEL is so high, we'll never get here without
  // setting the levels
}
```
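For example, under the table above, a 3000-entry bucket falls in the range [2048, 8192), so it gets levels 2, 3, and 4 (sampling rates 1/4, 1/16, and 1/64):

```c
bucket b = { .bsize = 3000 };
setLevelsBasedOnBucketSize(&b);
// b.filters[0].level == 2  (rate 1/4)
// b.filters[1].level == 3  (rate 1/16)
// b.filters[2].level == 4  (rate 1/64)
```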
The reference code for building a single filter (one of the three) is as follows:
```c
void
makeOneFilter(bucket *bp, int i)
{
  // mask_presets[level] selects the sampled users: a user is sampled
  // when the top 2*(level-1) bits of its ID are zero (rate 1/4^(level-1)).
  uint64_t mask_presets[MAX_LEVEL] = {
    0,                       // level 0: unused (empty filter)
    0,                       // level 1: all users sampled
    0xc000000000000000,
    0xf000000000000000,
    0xfc00000000000000,
    0xff00000000000000,
    0xffc0000000000000,
    0xfff0000000000000,
    0xfffc000000000000,
    0xffff000000000000,
    0xffffc00000000000,
    0xfffff00000000000,
    0xfffffc0000000000,
    0xffffff0000000000,
    0xffffffc000000000,
    0xfffffff000000000,
    0xfffffffc00000000,
    0xffffffff00000000,
    0xffffffffc0000000,
    0xfffffffff0000000,
    0xfffffffffc000000,
    0xffffffffff000000,
    0xffffffffffc00000,
    0xfffffffffff00000};
  uint64_t mask;
  int j;
  uint64_t *lp;

  mask = mask_presets[bp->filters[i].level];
  lp = bp->list;
  for (j = 0; j < bp->bsize; j++) {
    if ((*lp & mask) == 0) {
      // the low 10 bits of the user ID pick the filter bit (0..1023)
      setFilterBit(&(bp->filters[i]), (*lp & 0x3ff));
    }
    lp++;
  }
}
```
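The reference code calls setFilterBit() without showing it. A straightforward implementation, assuming the bit position is the low 10 bits of the user ID (0-1023) as passed above, would be:

```c
/*
 * Assumed implementation (not shown in the original reference code):
 * set bit `pos` (0..1023) in the 1024-bit filter, which is stored as
 * FILTER_LEN (16) 64-bit words.
 */
void setFilterBit(one_filter *of, uint64_t pos)
{
  of->filter[pos >> 6] |= (1ULL << (pos & 0x3f));  // word index, bit within word
}
```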
Given that two filters may have multiple matching sampling rates, we have the question of which matching rate to use. In particular, we want to avoid filters with too few bits set, and filters with too many bits set. After some experimentation, I settled on favoring a sampling rate where the bucket size of the larger bucket is between 300 and 800 (after sampling), and the smaller bucket is at least 100 (after sampling). Note that, although I settled on this, in fact the system is pretty robust to these numbers, as long as the basic principle is followed. Here is the reference code:
```c
typedef struct compare_t {
  int common;      // bits common between both filters
  int first;       // total bits the first filter has
  int second;      // total bits the second filter has
  int numFirst;    // number of entries in first bucket
  int numSecond;   // number of entries in second bucket
  int level;       // the level used for the comparison
                   // 0 means no comparison made
  int index1;      // index of the filter used from the first bucket
  int index2;      // index of the filter used from the second bucket
  int overlap;     // estimated percent overlap (set by the table lookup below)
} compare;
```
```c
/*
 * Returns 1 if a pair of filters with matching level were
 * found. Else returns 0.
 */
int
compareFullFilters(bucket *bp1, bucket *bp2, compare *c)
{
  int i, j;
  int save_i=0, save_j=0, firstTime=1;
  int high, low, temp;

  c->level = 0;   // indicates that no comparison was made
  c->numFirst = bp1->bsize;
  c->numSecond = bp2->bsize;
  for (i = 0; i < FILTERS_PER_BUCKET; i++) {
    for (j = 0; j < FILTERS_PER_BUCKET; j++) {
      if (bp1->filters[i].level == bp2->filters[j].level) {
        if (bp1->filters[i].level == 0) {goto done;}
        // Ideally, the largest user count is between 300 and 800,
        // and the smallest is at least 100.
        // This avoids excessive set bits due to simply large numbers
        // (should consider avoiding this by having a 4th filter????)
        high = bp1->bsize >> ((bp1->filters[i].level * 2) - 2);
        low = bp2->bsize >> ((bp2->filters[j].level * 2) - 2);
        if (low > high) {
          temp = low;
          low = high;
          high = temp;
        }
        if (firstTime == 1) {
          firstTime = 0;
          save_i = i;
          save_j = j;
          if ((high >= 300) && (high <= 800) && (low >= 100)) {goto done;}
        }
        else if ((high >= 300) && (high <= 800) && (low >= 100)) {
          save_i = i;
          save_j = j;
          goto done;
        }
      }
    }
  }
done:
  if (firstTime == 0) {
    c->level = bp1->filters[save_i].level;
    c->index1 = save_i;
    c->index2 = save_j;
    compareFilterPair(&(bp1->filters[save_i]), &(bp2->filters[save_j]), c);
    return(1);
  }
  return(0);
}
```
Finally, the reference code for comparing two filters, once they are selected, is:
```c
/*
 * Finds 1) the number of bit positions in common between the
 * two filters, and 2) the total bits set in the first and second
 * filters. Happily (erroneously) compares two filters of different
 * levels. Up to the calling routine to ensure this doesn't
 * happen. Never fails.
 */
void
compareFilterPair(one_filter *of1, one_filter *of2, compare *c)
{
  unsigned int i;

  c->common = 0; c->first = 0; c->second = 0;
  for (i = 0; i < FILTER_LEN; i++) {
    c->first += __builtin_popcountll(of1->filter[i]);    // bits set in first
    c->second += __builtin_popcountll(of2->filter[i]);   // bits set in second
    c->common += __builtin_popcountll(of1->filter[i] & of2->filter[i]);   // AND'd bits
  }
}
```
We ran an experiment where we created 10M pairs of buckets, where the smaller bucket ranged in size from 1 user to 10K users, and the larger bucket was between 1X and 16X the size of the smaller bucket. For each pair, we varied the overlap between the two buckets between 0% and 100%. All random numbers are uniformly distributed.
For each pair, we measured a number of things:
- The number of bits set in the filter of each bucket.
- The number of "common bits", which means the number of bits set in both filters for each bit position (the bits that are set when the two filters are AND'd).
- The number of bits that are 1 in one filter, and 0 in the other filter.
- The number of bits that are zero in both filters for each bit position.
Following is a scatterplot with the total number of set bits in the filter for the smaller bucket (the smaller filter) on the X axis, and the total number of bit positions that are set in both filters on the Y axis. The color is based on the amount of overlap (ground truth). (The X-axis label of "Number One / Star Bits" means the number of bits that are one for the small bucket, and * (anything) for the large bucket.)

This is a beautiful picture, because it shows that there is a strong correlation between the two measures (number of smaller-filter bits and number of common bits). This correlation exists across different bucket sizes, and different amounts of overlap. This means that we can use this method to get reasonable estimates on overlap for pretty much any two buckets (as long as they are within 16X of each other in size).
Further, we see that for complete overlap and zero overlap, we get clear signals. (The X=Y diagonal is pure blue, and the bottom of the graph is pure green.) In the middle, however, it is a bit noisy, with a range of overlap values interspersed.
The following is a similar scatter plot for the same experiment, except now each dot corresponds to the average overlap for a set of up to 16 bucket pairs. Here we see that much, though not all, of the noise of the pure scatter plot is removed:

The following is a scatter plot where each dot gives the standard deviation for the set of up-to-sixteen samples.

Here again, we see that the complete-overlap and no-overlap edges of the plot are clean, but in the middle there can be substantial variation between measurements (upwards of 25%!).
The following two graphs show the max and min of the up-to-16 samples:


This shows pretty clearly that there is a substantial middle range where a pair with some overlap may be reported as full or no overlap, in the extreme. This is unfortunate in that it means we might often re-query and compare buckets when we really shouldn't have to. It remains to be seen whether we can minimize this (by looking for averages over a number of buckets).
Note that I tried hard to find out whether any particular factor is causing this variation, for instance bucket size, bucket size ratio, sampling rate, or number of set bits, but I couldn't find anything. For now, this is the best I can do.
Note also that I couldn't find any better basis on which to make the measurement (for instance, using the number of bit positions that are 1 in one filter and 0 in the other). I think there is substantial work still to be done on finding a good filter, but my hope is that this is good enough for practical purposes.
To complete the picture, we need to be able to go from two buckets with unknown overlap to an estimate of overlap. To do this, we can use the values in the average-overlap scatter plot above as a lookup table to get from filter bits to overlap estimate.
One way to do this would be to simply build a 1024 X 1024 array, each cell in the array representing a possible combination of small-bucket bits and common bits, and then index that array after computing the two values. A slightly more space-efficient approach is to recognize that most of the cells in that array are either impossible or rarely reached, and to build an array with small-bucket bits on one axis, and the difference between small-bucket bits and common bits on the other.
The following graph shows that array, assuming the range 0-800 for the small-bucket bits, and the range 0-256 for the difference between small-bucket bits and common bits, where each cell is the average of up to 16 samples.

The array isn't completely filled in, because many of the cells are rarely if ever reached. We can artificially fill them in. To do this, everything to the left of the populated region is given the value '0' (for no overlap), and every other empty cell is filled in with the nearest populated value to its left (see the sketch below). The graph following the sketch shows how the empty cells would be filled in using this procedure:
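A sketch of this fill-in pass, one row at a time (the row orientation and the EMPTY marker are assumptions; the stated rule is a simple left-to-right carry):

```c
#define EMPTY (-1)   // assumed marker for a never-populated cell

/*
 * Fill one row of the lookup table: cells left of the first populated
 * cell inherit 0 (no overlap), and every other empty cell inherits the
 * nearest populated value to its left.
 */
void fillRow(int *row, int len)
{
  int last = 0;   // value carried from the left; starts at 0 (no overlap)
  for (int k = 0; k < len; k++) {
    if (row[k] == EMPTY) {
      row[k] = last;
    } else {
      last = row[k];
    }
  }
}
```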

The final array (combining initially filled cells and artificially filled cells) looks like this:

This 800 X 256 2-dimensional array can be used as the lookup table for estimating overlap.
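A minimal lookup sketch (the array name here is assumed; the modified reference code at the end of this page does exactly this, with two calibrated tables):

```c
// Assumed name for the calibrated 800 x 256 table described above.
extern int overlap_array[800][256];

int estimateOverlap(compare *c)
{
  int i = c->first;               // bits set in the smaller bucket's filter
  int j = c->first - c->common;   // small-filter bits not matched in the other
  if (i >= 800) i = 799;          // clamp to the table bounds
  if (j >= 256) j = 255;
  return overlap_array[i][j];     // estimated percent overlap
}
```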
In some testing of this array as a lookup table, I found that often the X,0 position in the array is not 63 (a perfect match) as I had expected. Indeed, sometimes the table tells me that a pair of perfect matches is, say, 91% or 92% matching. The reason for this is that I populated the table with many bucket pairs that differ by 16X in size. So, for instance, we can have cases where the small bucket's filter has, say, 25 bits set, and the large bucket's filter has 400 bits set. If these two buckets have 90% overlap, then 22 or 23 of the matching bits will be due to the overlap, but the remaining 2 or 3 bits might match by chance (since there is roughly a 1/3 chance that any given bit is set in the larger filter). I think this may be pulling the average down. The following scatter plot has the number of 1 bits in the smaller bucket's filter (numFirst) against the number of 1 bits in the larger bucket's filter (numSecond). The color shows the size ratio between the two buckets. You can see that pairs with a large size ratio contribute to cases where we have a small number of filter bits against a large number of filter bits.

I'm not sure whether I need to do anything about this (other than understanding in the defense mechanism that this can be the case).
Working with the above design, I found that for buckets of similar size and relatively low overlap, the amount of overlap was being under-estimated, making it hard to detect attacks. This case is particularly interesting because attacks where all the buckets have relatively low overlap also tend to use buckets of similar size.
As a result, I calibrated a second overlap table. Unlike the first table, which covers a wide range of bucket sizes and size ratios (up to 16X), this second table only deals with buckets where the smaller bucket has 2000 users or fewer, and the larger bucket is no more than 50% larger than the smaller bucket. Within these limits, the second table is much more accurate than the first. Specifically, the overlap estimates look like this:

and the standard deviation among the samples is much smaller than the previous case:

As a result of having two tables, the routine that compares two digests first checks whether the conditions stated above hold (smaller bucket <= 2000 entries, larger bucket <= 1.5X the smaller bucket), and if so chooses the new table to look up the overlap estimate. The resulting modified reference code is here:
```c
// Calibrated lookup tables (built as described above): full_overlap_array
// covers the wide range of bucket sizes, partial_overlap_array covers the
// restricted similar-size case.
extern int full_overlap_array[800][256];
extern int partial_overlap_array[800][256];

/*
 * Doesn't matter which bucket is bigger. Returns 1 if a pair of
 * filters with matching level were found, else returns 0.
 */
int
compareFullFilters(bucket *bp1, bucket *bp2, compare *c)
{
  int i, j;
  int save_i=0, save_j=0, firstTime=1;
  int high, low;
  bucket *bl, *bs;   // large and small

  // force bs to point at the smaller bucket
  if (bp1->bsize > bp2->bsize) {
    bl = bp1;
    bs = bp2;
  }
  else {
    bl = bp2;
    bs = bp1;
  }
  c->level = 0;   // indicates that no comparison was made
  c->numFirst = bs->bsize;
  c->numSecond = bl->bsize;
  for (i = 0; i < FILTERS_PER_BUCKET; i++) {
    for (j = 0; j < FILTERS_PER_BUCKET; j++) {
      if (bs->filters[i].level == bl->filters[j].level) {
        if (bs->filters[i].level == 0) {goto done;}
        // Ideally, the largest user count is between 300 and 800,
        // and the smallest is at least 100.
        // This avoids excessive set bits due to simply large numbers
        // (should consider avoiding this by having a 4th filter????)
        // Since the levels match, the smaller bucket always yields the
        // smaller sampled count, so no swap is needed here.
        low = bs->bsize >> ((bs->filters[i].level * 2) - 2);
        high = bl->bsize >> ((bl->filters[j].level * 2) - 2);
        if (firstTime == 1) {
          firstTime = 0;
          save_i = i;
          save_j = j;
          if ((high >= 300) && (high <= 800) && (low >= 100)) {goto done;}
        }
        else if ((high >= 300) && (high <= 800) && (low >= 100)) {
          save_i = i;
          save_j = j;
          goto done;
        }
      }
    }
  }
done:
  if (firstTime == 0) {
    c->level = bs->filters[save_i].level;
    c->index1 = save_i;
    c->index2 = save_j;
    compareFilterPair(&(bs->filters[save_i]), &(bl->filters[save_j]), c);
    // clamp the lookup indices to the table bounds
    if ((i = c->first) >= 800) {
      i = 799;
    }
    if ((j = (c->first - c->common)) >= 256) {
      j = 255;
    }
    // to avoid floating point math in this operation, we use
    // (bs->bsize + (bs->bsize>>1)) instead of (bs->bsize * 1.5)
    if ((bs->bsize <= 2000) && ((bs->bsize + (bs->bsize >> 1)) >= bl->bsize)) {
      c->overlap = partial_overlap_array[i][j];   // the restricted, more accurate table
    }
    else {
      c->overlap = full_overlap_array[i][j];      // the original wide-range table
    }
    return(1);
  }
  return(0);
}
```