{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# PyNNDescent with different metrics\n", "\n", "In the initial tutorial we looked at how to get PyNNDescent running on your data, and how to query the indexes it builds. Implicit in all of that was the measure of distance used to determine what counts as the \"nearest\" neighbors. By default PyNNDescent uses the euclidean metric (because that is what people generally expect when they talk about distance). This is not the only way to measure distance however, and is often not the right choice for very high dimensional data for example. Let's look at how to use PyNNDescent with other metrics.\n", "\n", "First we'll need some libraries, and some test data. As before we will use ann-benchmarks for data, so we will reuse the data download function from the previous tutorial." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "pycharm": { "is_executing": false } }, "outputs": [], "source": [ "import pynndescent\n", "import numpy as np\n", "import h5py\n", "from urllib.request import urlretrieve\n", "import os" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "pycharm": { "is_executing": false } }, "outputs": [], "source": [ "def get_ann_benchmark_data(dataset_name):\n", " if not os.path.exists(f\"{dataset_name}.hdf5\"):\n", " print(f\"Dataset {dataset_name} is not cached; downloading now ...\")\n", " urlretrieve(f\"http://ann-benchmarks.com/{dataset_name}.hdf5\", f\"{dataset_name}.hdf5\")\n", " hdf5_file = h5py.File(f\"{dataset_name}.hdf5\", \"r\")\n", " return np.array(hdf5_file['train']), np.array(hdf5_file['test']), hdf5_file.attrs['distance']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Built in metrics\n", "\n", "Let's grab some data where euclidean distance doesn't make sense. We'll use the NY-Times dataset, which is a [TF-IDF](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) matrix of data generated from NY-Times news stories. The particulars are less important here, but what matters is that the most sensible way to measure distance on this data is with an angular metric, such as cosine distance." ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "pycharm": { "is_executing": false } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Dataset nytimes-256-angular is not cached; downloading now ...\n" ] }, { "data": { "text/plain": [ "(290000, 256)" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "nytimes_train, nytimes_test, distance = get_ann_benchmark_data('nytimes-256-angular')\n", "nytimes_train.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that we have the data we can check the distance measure suggested by ann-benchmarks" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "pycharm": { "is_executing": false } }, "outputs": [ { "data": { "text/plain": [ "'angular'" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "distance" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So an angular measure of distance -- cosine distance will suffice. How do we manage to get PyNNDescent working with cosine distance (which isn't even a real metric! it violates the triangle inequality) instead of standard euclidean?" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "is_executing": false } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 5min 2s, sys: 1min 22s, total: 6min 24s\n", "Wall time: 30.6 s\n" ] } ], "source": [ "%%time\n", "index = pynndescent.NNDescent(nytimes_train, metric=\"cosine\", parallel_batch_queries=True)\n", "index.prepare()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "That's right, it uses the scikit-learn standard of the ``metric`` keyword and accepts a string that names the metric. We can now query the index, and it will use that metric in the query as well." ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "pycharm": { "is_executing": false } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 26.7 s, sys: 106 ms, total: 26.8 s\n", "Wall time: 26.8 s\n" ] } ], "source": [ "%%time\n", "neighbors = index.query(nytimes_train)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It is worth noting at this point that these results will probably be a little sub-optimal since angular distances are harder to index, and as a result to get the same level accuracy in the nearest neighbor approximation we should be using a larger value than the default ``30`` for ``n_neighbors``. Beyond that, however, nothing else changes from the tutorial earlier -- except that we can't use kd-trees to learn the true neighbors, since they require distances that respect the triangle inequality.\n", "\n", "How many metrics does PyNNDescent support out of the box? Quite a few actually:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "pycharm": { "is_executing": false } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "euclidean\n", "l2\n", "sqeuclidean\n", "manhattan\n", "taxicab\n", "l1\n", "chebyshev\n", "linfinity\n", "linfty\n", "linf\n", "minkowski\n", "seuclidean\n", "standardised_euclidean\n", "wminkowski\n", "weighted_minkowski\n", "mahalanobis\n", "canberra\n", "cosine\n", "dot\n", "inner_product\n", "correlation\n", "haversine\n", "braycurtis\n", "spearmanr\n", "tsss\n", "true_angular\n", "hellinger\n", "kantorovich\n", "wasserstein\n", "wasserstein_1d\n", "wasserstein-1d\n", "kantorovich-1d\n", "kantorovich_1d\n", "circular_kantorovich\n", "circular_wasserstein\n", "sinkhorn\n", "jensen-shannon\n", "jensen_shannon\n", "symmetric-kl\n", "symmetric_kl\n", "symmetric_kullback_liebler\n", "hamming\n", "jaccard\n", "dice\n", "matching\n", "kulsinski\n", "rogerstanimoto\n", "russellrao\n", "sokalsneath\n", "sokalmichener\n", "yule\n", "bit_hamming\n", "bit_jaccard\n" ] } ], "source": [ "for dist in pynndescent.distances.named_distances:\n", " print(dist)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Some of these are repeats or alternate names for the same metric, and some of these are fairly simple, but others, such as ``spearmanr``, or ``hellinger`` are useful statistical measures not often implemented elsewhere, and others, such as ``wasserstein`` are complex and hard to compute metrics. Having all of these readily available in a fast approximate nearest neighbor library is one of PyNNDescent's strengths." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Custom metrics\n", "\n", "We can go even further in terms of interesting metrics however. You can write your own custom metrics and hand them to PyNNDescent to use on your data. There, of course, a few caveats with this. Many nearest neighbor libraries allow for the possibility of user defined metrics. If you are using Python this often ends up coming in two flavours:\n", "\n", " 1. Write some C, C++ or Cython code and compile it against the library itself\n", " 2. Write a python distance function, but lose almost all performance\n", " \n", "With PyNNDescent we get a different trade-off. Because we use [Numba](http://numba.pydata.org/) for just-in-time compiling of Python code instead of a C or C++ backend you don't need to do an offline compilation step and can instead have your custom Python distance function compiled and used on the fly. The cost for that is that the custom distance function you write must be a numba jitted function. This, in turn, means that you can only use Python functionality that is [supported by numba](). That is still a fairly large amount of functionality, especially when we are talking about numerical work, but it is a limit. It also means that you will need to import numba and decorate your custom distance function accordingly. Let's look at how to do that." ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "pycharm": { "is_executing": false } }, "outputs": [], "source": [ "import numba" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's start by simply implementing euclidean distance where $d(\\mathbf{x},\\mathbf{y}) = \\sqrt{\\sum_i (\\mathbf{x}_i - \\mathbf{y}_i)^2}$. This is already implemented in PyNNDescent, but it is a simple distance measure that everyone knows and will serve to illustrate the process. First let's write the function -- using numpy functionality this will be fairly short:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "pycharm": { "is_executing": false } }, "outputs": [], "source": [ "def euclidean(x, y):\n", " return np.sqrt(np.sum((x - y)**2))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we need to get the function compiled so PyNNDescent can use it. That is actually as easy as adding a decorator to the top of the function telling numba that it should compile the function when it gets called." ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "pycharm": { "is_executing": false } }, "outputs": [], "source": [ "@numba.jit\n", "def euclidean(x, y):\n", " return np.sqrt(np.sum((x - y)**2))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To ensure our timing doesn't include the numba compile time of compiling this function, let's run it a few times." ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [], "source": [ "[euclidean(nytimes_train[0], nytimes_train[i]) for i in range(100)];" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can now pass this function directly to PyNNdescent as a metric and everything will \"just work\". We'll just train on the smaller test set since it will take a while." ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "pycharm": { "is_executing": false } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 8.72 s, sys: 2.19 s, total: 10.9 s\n", "Wall time: 253 ms\n" ] } ], "source": [ "%%time\n", "index = pynndescent.NNDescent(nytimes_test, metric=euclidean)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is a little slower than we might have expected, and that's because a great deal of the computation time is spent evaluating that metric. While numba will compile what we wrote we can make it a little faster if we look through the [numba performance tips documentation](https://numba.readthedocs.io/en/stable/user/performance-tips.html). The two main things to note are that we can use explicit loops instead of numpy routines, and we can add arguments to the decorator such as ``fastmath=True`` to speed things up a little. Let's rewrite it:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "is_executing": false } }, "outputs": [], "source": [ "@numba.jit(fastmath=True)\n", "def euclidean(x, y):\n", " result = 0.0\n", " for i in range(x.shape[0]):\n", " result += (x[i] - y[i])**2\n", " return np.sqrt(result)\n", "\n", "[euclidean(nytimes_train[0], nytimes_train[i]) for i in range(100)];" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "pycharm": { "is_executing": false } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 6.39 s, sys: 2.46 s, total: 8.85 s\n", "Wall time: 202 ms\n" ] } ], "source": [ "%%time\n", "index = pynndescent.NNDescent(nytimes_test, metric=euclidean)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "That is faster! If we are really on the hunt for performance however, you might note that, for the purposes of finding nearest neighbors the exact values of the distance are not as important as the ordering on distances. In other words we could use the square of euclidean distance and we would get all the same neighbors (since the square root is a monotonic order preserving function of squared euclidean distance). That would, for example, save us a square root computation. We could do the square roots afterwards to just the distances to the nearest neighbors. Let's reproduce what PyNNDescent actually uses internally for euclidean distance:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "is_executing": false } }, "outputs": [], "source": [ "@numba.njit(\n", " [\n", " \"f4(f4[::1],f4[::1])\",\n", " numba.types.float32(\n", " numba.types.Array(numba.types.float32, 1, \"C\", readonly=True),\n", " numba.types.Array(numba.types.float32, 1, \"C\", readonly=True),\n", " ),\n", " ],\n", " fastmath=True,\n", " locals={\n", " \"result\": numba.types.float32,\n", " \"diff\": numba.types.float32,\n", " \"dim\": numba.types.uint32,\n", " \"i\": numba.types.uint16,\n", " },\n", ")\n", "def squared_euclidean(x, y):\n", " r\"\"\"Squared euclidean distance.\n", "\n", " .. math::\n", " D(x, y) = \\sum_i (x_i - y_i)^2\n", " \"\"\"\n", " result = 0.0\n", " dim = x.shape[0]\n", " for i in range(dim):\n", " diff = x[i] - y[i]\n", " result += diff * diff\n", "\n", " return result\n", "\n", "[squared_euclidean(nytimes_train[0], nytimes_train[i]) for i in range(100)];" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "That is definitely more complicated! Most of it, however, is arguments to the decorator giving it extra typing information to let it squeeze out every drop of performance possible. By default numba will infer types, or even compile different versions for the different types it sees. With a little extra information, however, it can make smarter decisions and optimizations during compilation. Let's see how fast that goes:" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "pycharm": { "is_executing": false } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 5.5 s, sys: 2.02 s, total: 7.52 s\n", "Wall time: 173 ms\n" ] } ], "source": [ "%%time\n", "index = pynndescent.NNDescent(nytimes_test, metric=squared_euclidean)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Definitely faster again -- so there are significant gains to be had if you are willing to put in some work to write your function. Still, the naive approach we started with, just decorating the obvious implementation, did very well, so unless you desperately need top tier performance for your custom metric a straightforward approach will suffice. And for comparison here is the tailored C++ implementation that libraries like [nmslib](https://github.com/nmslib/nmslib) and [hnswlib](https://github.com/nmslib/hnswlib) use:\n", "\n", "```C++\n", "static float\n", "L2SqrSIMD16Ext(const void *pVect1v, const void *pVect2v, const void *qty_ptr) {\n", " float *pVect1 = (float *) pVect1v;\n", " float *pVect2 = (float *) pVect2v;\n", " size_t qty = *((size_t *) qty_ptr);\n", " float PORTABLE_ALIGN32 TmpRes[8];\n", " size_t qty16 = qty >> 4;\n", "\n", " const float *pEnd1 = pVect1 + (qty16 << 4);\n", "\n", " __m256 diff, v1, v2;\n", " __m256 sum = _mm256_set1_ps(0);\n", "\n", " while (pVect1 < pEnd1) {\n", " v1 = _mm256_loadu_ps(pVect1);\n", " pVect1 += 8;\n", " v2 = _mm256_loadu_ps(pVect2);\n", " pVect2 += 8;\n", " diff = _mm256_sub_ps(v1, v2);\n", " sum = _mm256_add_ps(sum, _mm256_mul_ps(diff, diff));\n", "\n", " v1 = _mm256_loadu_ps(pVect1);\n", " pVect1 += 8;\n", " v2 = _mm256_loadu_ps(pVect2);\n", " pVect2 += 8;\n", " diff = _mm256_sub_ps(v1, v2);\n", " sum = _mm256_add_ps(sum, _mm256_mul_ps(diff, diff));\n", " }\n", "\n", " _mm256_store_ps(TmpRes, sum);\n", " return TmpRes[0] + TmpRes[1] + TmpRes[2] + TmpRes[3] + TmpRes[4] + TmpRes[5] + TmpRes[6] + TmpRes[7];\n", "}\n", "```\n", "\n", "Comparatively, the python code, even with its extra numba decorations, looks pretty straightforward. Notably (at last testing) the numba code and this C++ code (when suitably compiled with AVX flags etc.) have essentially indistinguishable performance. Numba is awfully good at finding optimizations for numerical code." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Beware of bounded distances\n", "\n", "There is one remaining caveat on custom distance functions that is important. Many distances, such as cosine distance and jaccard distance are bounded: the values always fall in some fixed finite range (in these cases between 0 and 1). When querying new data points against an index PyNNDescent bounds the search by some multiple (1 + epsilon) of the most distant of the the top k neighbors found so far. This allows a limited amount of backtracking and avoids getting stuck in local minima. It does, however, not play well with bounded distances -- a small but non-zero epsilon can end up failing to bound the search at all (suppose epsilon is 0.2 and the most distant of the the top k neighbors has cosine distance 0.8 for example). The trick to getting around this is the same trick described above when we decided not to bother taking the square root of the euclidean distance -- we can apply transform to the distance values that preserves all ordering. This means that, for example, internally PyNNDescent uses the *negative log* of the cosine *similarity* instead of cosine distance (and converts the distance values when done). You will want to use a similar trick if your distance function has a strict finite upper bound." ] } ], "metadata": { "kernelspec": { "display_name": "pynndescent_dev", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.12" } }, "nbformat": 4, "nbformat_minor": 4 }