PyNNDescent with different metrics

In the initial tutorial we looked at how to get PyNNDescent running on your data, and how to query the indexes it builds. Implicit in all of that was the measure of distance used to determine what counts as the “nearest” neighbors. By default PyNNDescent uses the euclidean metric (because that is what people generally expect when they talk about distance). This is not the only way to measure distance, however, and euclidean distance is often a poor choice for very high dimensional data, for example. Let’s look at how to use PyNNDescent with other metrics.

First we’ll need some libraries, and some test data. As before we will use ann-benchmarks for data, so we will reuse the data download function from the previous tutorial.

[1]:
import pynndescent
import numpy as np
import h5py
from urllib.request import urlretrieve
import os
[2]:
def get_ann_benchmark_data(dataset_name):
    if not os.path.exists(f"{dataset_name}.hdf5"):
        print(f"Dataset {dataset_name} is not cached; downloading now ...")
        urlretrieve(f"http://ann-benchmarks.com/{dataset_name}.hdf5", f"{dataset_name}.hdf5")
    hdf5_file = h5py.File(f"{dataset_name}.hdf5", "r")
    return np.array(hdf5_file['train']), np.array(hdf5_file['test']), hdf5_file.attrs['distance']

Built-in metrics

Let’s grab some data where euclidean distance doesn’t make sense. We’ll use the NY-Times dataset, which is a TF-IDF matrix of data generated from NY-Times news stories. The particulars are less important here, but what matters is that the most sensible way to measure distance on this data is with an angular metric, such as cosine distance.

[3]:
nytimes_train, nytimes_test, distance = get_ann_benchmark_data('nytimes-256-angular')
nytimes_train.shape
[3]:
(290000, 256)

Now that we have the data we can check the distance measure suggested by ann-benchmarks:

[4]:
distance
[4]:
'angular'

So an angular measure of distance is called for; cosine distance will suffice. How do we get PyNNDescent working with cosine distance (which isn’t even a true metric, since it violates the triangle inequality) instead of standard euclidean distance?

[5]:
%%time
index = pynndescent.NNDescent(nytimes_train, metric="cosine")
index.prepare()
/Users/leland/anaconda3/envs/umap_0.5dev/lib/python3.8/site-packages/scipy/sparse/_index.py:126: SparseEfficiencyWarning: Changing the sparsity structure of a csr_matrix is expensive. lil_matrix is more efficient.
  self._set_arrayXarray(i, j, x)
CPU times: user 5min 29s, sys: 6.4 s, total: 5min 35s
Wall time: 2min 9s

That’s right, it follows the scikit-learn convention of a metric keyword argument that accepts a string naming the metric. We can now query the index, and it will use that metric in the query as well.

[6]:
%%time
neighbors = index.query(nytimes_train)
CPU times: user 20.3 s, sys: 387 ms, total: 20.7 s
Wall time: 21.2 s
/Users/leland/anaconda3/envs/umap_0.5dev/lib/python3.8/site-packages/pynndescent/pynndescent_.py:1628: RuntimeWarning: invalid value encountered in correct_alternative_cosine
  dists = self._distance_correction(dists)

It is worth noting at this point that these results will probably be a little sub-optimal: angular distances are harder to index, so to get the same level of accuracy in the nearest neighbor approximation we should use a larger value than the default 30 for n_neighbors. Beyond that, however, nothing else changes from the earlier tutorial, except that we can’t use kd-trees to find the true neighbors, since kd-trees require distances that respect the triangle inequality.
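To illustrate the n_neighbors point, here is a minimal sketch of building the index with a larger neighbor graph (the value 60 is purely illustrative; the right choice depends on how much accuracy you need and how much construction time you can afford):

# Sketch: a larger n_neighbors improves the accuracy of the approximate
# neighbor graph for angular metrics, at the cost of extra construction time.
index = pynndescent.NNDescent(nytimes_train, metric="cosine", n_neighbors=60)
index.prepare()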

How many metrics does PyNNDescent support out of the box? Quite a few actually:

[7]:
for dist in pynndescent.distances.named_distances:
    print(dist)
euclidean
l2
sqeuclidean
manhattan
taxicab
l1
chebyshev
linfinity
linfty
linf
minkowski
seuclidean
standardised_euclidean
wminkowski
weighted_minkowski
mahalanobis
canberra
cosine
dot
correlation
hellinger
haversine
braycurtis
spearmanr
kantorovich
wasserstein
tsss
true_angular
hamming
jaccard
dice
matching
kulsinski
rogerstanimoto
russellrao
sokalsneath
sokalmichener
yule

Some of these are repeats or alternate names for the same metric, and some are fairly simple, but others, such as spearmanr or hellinger, are useful statistical measures not often implemented elsewhere, and still others, such as wasserstein, are complex and expensive-to-compute metrics. Having all of these readily available in a fast approximate nearest neighbor library is one of PyNNDescent’s strengths.
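A few of these metrics also take extra parameters. As a minimal sketch (using the same data as above; p=3 is just an illustrative choice of Minkowski power), such parameters can be passed through the metric_kwds dictionary:

# Sketch: extra metric parameters are supplied via metric_kwds.
index = pynndescent.NNDescent(
    nytimes_test, metric="minkowski", metric_kwds={"p": 3}
)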

Custom metrics

We can go even further in terms of interesting metrics, however: you can write your own custom metrics and hand them to PyNNDescent to use on your data. There are, of course, a few caveats with this. Many nearest neighbor libraries allow for the possibility of user defined metrics. If you are using Python this often ends up coming in two flavours:

  1. Write some C, C++ or Cython code and compile it against the library itself

  2. Write a Python distance function, but lose almost all performance

With PyNNDescent we get a different trade-off. Because we use Numba for just-in-time compilation of Python code instead of a C or C++ backend, you don’t need an offline compilation step; your custom Python distance function can be compiled and used on the fly. The cost is that the custom distance function you write must be a numba jitted function. This, in turn, means that you can only use Python functionality that is supported by numba. That is still a fairly large amount of functionality, especially for numerical work, but it is a limit. It also means that you will need to import numba and decorate your custom distance function accordingly. Let’s look at how to do that.

[8]:
import numba

Let’s start by simply implementing euclidean distance where \(d(\mathbf{x},\mathbf{y}) = \sqrt{\sum_i (\mathbf{x}_i - \mathbf{y}_i)^2}\). This is already implemented in PyNNDescent, but it is a simple distance measure that everyone knows and will serve to illustrate the process. First let’s write the function – using numpy functionality this will be fairly short:

[9]:
def euclidean(x, y):
    return np.sqrt(np.sum((x - y)**2))

Now we need to get the function compiled so PyNNDescent can use it. That is actually as easy as adding a decorator to the top of the function telling numba that it should compile the function when it gets called.

[10]:
@numba.jit
def euclidean(x, y):
    return np.sqrt(np.sum((x - y)**2))

We can now pass this function directly to PyNNDescent as a metric and everything will “just work”. We’ll build the index on the smaller test set, since the full training set would take a while with this metric.

[11]:
%%time
index = pynndescent.NNDescent(nytimes_test, metric=euclidean)
CPU times: user 21.5 s, sys: 220 ms, total: 21.7 s
Wall time: 12 s

This is a little slower than we might have expected, and that’s because a great deal of the computation time is spent evaluating the metric. While numba will compile what we wrote, we can make it a little faster if we follow the numba performance tips documentation. The two main things to note are that we can use explicit loops instead of numpy routines, and that we can add arguments to the decorator, such as fastmath=True, to speed things up a little. Let’s rewrite it:

[12]:
@numba.jit(fastmath=True)
def euclidean(x, y):
    result = 0.0
    for i in range(x.shape[0]):
        result += (x[i] - y[i])**2
    return np.sqrt(result)
[13]:
%%time
index = pynndescent.NNDescent(nytimes_test, metric=euclidean)
CPU times: user 12.3 s, sys: 116 ms, total: 12.4 s
Wall time: 8.53 s

That is faster! If we are really on the hunt for performance, however, you might note that for the purposes of finding nearest neighbors the exact values of the distances matter less than their ordering. In other words, we could use the square of the euclidean distance and get all the same neighbors, since the square root is a monotonic, order-preserving function of the squared euclidean distance. That would save us a square root computation per distance evaluation; we could apply square roots afterwards, to just the distances of the nearest neighbors found (a sketch of doing exactly that appears after the performance comparison below). Let’s reproduce what PyNNDescent actually uses internally for euclidean distance:

[14]:
@numba.njit(
    [
        "f4(f4[::1],f4[::1])",
        numba.types.float32(
            numba.types.Array(numba.types.float32, 1, "C", readonly=True),
            numba.types.Array(numba.types.float32, 1, "C", readonly=True),
        ),
    ],
    fastmath=True,
    locals={
        "result": numba.types.float32,
        "diff": numba.types.float32,
        "dim": numba.types.uint32,
        "i": numba.types.uint16,
    },
)
def squared_euclidean(x, y):
    r"""Squared euclidean distance.

    .. math::
        D(x, y) = \sum_i (x_i - y_i)^2
    """
    result = 0.0
    dim = x.shape[0]
    for i in range(dim):
        diff = x[i] - y[i]
        result += diff * diff

    return result

That is definitely more complicated! Most of it, however, is arguments to the decorator giving numba extra typing information so that it can squeeze out every last drop of performance. By default numba will infer types, or even compile different versions for the different types it sees. With a little extra information, however, it can make smarter decisions and optimizations during compilation. Let’s see how fast that goes:

[15]:
%%time
index = pynndescent.NNDescent(nytimes_test, metric=squared_euclidean)
CPU times: user 10.6 s, sys: 96.2 ms, total: 10.7 s
Wall time: 7.64 s

Definitely faster again, so there are significant gains to be had if you are willing to put in some work on your custom function. Still, the naive approach we started with (just decorating the obvious implementation) did very well, so unless you desperately need top-tier performance for your custom metric, a straightforward approach will suffice. For comparison, here is the tailored C++ implementation that libraries like nmslib and hnswlib use:

static float
L2SqrSIMD16Ext(const void *pVect1v, const void *pVect2v, const void *qty_ptr) {
    float *pVect1 = (float *) pVect1v;
    float *pVect2 = (float *) pVect2v;
    size_t qty = *((size_t *) qty_ptr);
    float PORTABLE_ALIGN32 TmpRes[8];
    size_t qty16 = qty >> 4;

    const float *pEnd1 = pVect1 + (qty16 << 4);

    __m256 diff, v1, v2;
    __m256 sum = _mm256_set1_ps(0);

    while (pVect1 < pEnd1) {
        v1 = _mm256_loadu_ps(pVect1);
        pVect1 += 8;
        v2 = _mm256_loadu_ps(pVect2);
        pVect2 += 8;
        diff = _mm256_sub_ps(v1, v2);
        sum = _mm256_add_ps(sum, _mm256_mul_ps(diff, diff));

        v1 = _mm256_loadu_ps(pVect1);
        pVect1 += 8;
        v2 = _mm256_loadu_ps(pVect2);
        pVect2 += 8;
        diff = _mm256_sub_ps(v1, v2);
        sum = _mm256_add_ps(sum, _mm256_mul_ps(diff, diff));
    }

    _mm256_store_ps(TmpRes, sum);
    return TmpRes[0] + TmpRes[1] + TmpRes[2] + TmpRes[3] + TmpRes[4] + TmpRes[5] + TmpRes[6] + TmpRes[7];
}

Comparatively, the Python code, even with its extra numba decorations, looks pretty straightforward. Notably (as of the last testing) the numba code and this C++ code (when suitably compiled with AVX flags and so on) have essentially indistinguishable performance. Numba is awfully good at finding optimizations for numerical code.
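One practical note if you use the squared-distance trick yourself: the distances PyNNDescent returns will be in the transformed scale. Here is a minimal sketch, assuming the squared_euclidean index built above, of converting back afterwards:

# neighbor_graph holds (indices, distances) for the constructed graph; with
# the squared_euclidean metric the distances are squared, so take square
# roots afterwards, over only the returned neighbors, to recover true
# euclidean distances.
neighbor_indices, neighbor_dists = index.neighbor_graph
true_euclidean_dists = np.sqrt(neighbor_dists)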

Beware of bounded distances

There is one remaining caveat on custom distance functions that is important. Many distances, such as cosine distance and jaccard distance, are bounded: the values always fall in some fixed finite range (in these cases between 0 and 1). When querying new data points against an index, PyNNDescent bounds the search by some multiple (1 + epsilon) of the most distant of the top k neighbors found so far. This allows a limited amount of backtracking and avoids getting stuck in local minima. It does not, however, play well with bounded distances: a small but non-zero epsilon can end up failing to bound the search at all (suppose epsilon is 0.2 and the most distant of the top k neighbors has cosine distance 0.8, for example).

The trick to getting around this is the same trick described above when we decided not to bother taking the square root of the euclidean distance: we can apply a transform to the distance values that preserves the ordering. This means that, for example, internally PyNNDescent uses the negative log of the cosine similarity instead of cosine distance (and converts the distance values back when done). You will want to use a similar trick if your distance function has a strict finite upper bound.
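As a rough sketch of the idea (an illustration of the transform trick, not PyNNDescent’s exact internal code): instead of returning cosine distance directly from a custom metric, return the negative log of the cosine similarity, which is unbounded and order-preserving, and convert the handful of returned distances back at the end.

@numba.njit(fastmath=True)
def log_transformed_cosine(x, y):
    # Unbounded, order-preserving surrogate for cosine distance:
    # -log(cosine similarity). Assumes non-zero vectors with non-negative
    # entries (true for TF-IDF data).
    dot_product = 0.0
    norm_x = 0.0
    norm_y = 0.0
    for i in range(x.shape[0]):
        dot_product += x[i] * y[i]
        norm_x += x[i] * x[i]
        norm_y += y[i] * y[i]
    similarity = dot_product / np.sqrt(norm_x * norm_y)
    # Guard against zero similarity (e.g. documents sharing no terms).
    if similarity < 1e-12:
        similarity = 1e-12
    return -np.log(similarity)

index = pynndescent.NNDescent(nytimes_test, metric=log_transformed_cosine)
neighbor_indices, transformed_dists = index.neighbor_graph
# Convert back to ordinary cosine distance, for the returned neighbors only.
cosine_dists = 1.0 - np.exp(-transformed_dists)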