PyNNDescent API Guide

PyNNDescent has only two classes NNDescent and PyNNDescentTransformer.

PyNNDescent

class pynndescent.pynndescent_.PyNNDescentTransformer(n_neighbors=30, metric='euclidean', metric_kwds=None, n_trees=None, leaf_size=None, search_epsilon=0.1, pruning_degree_multiplier=1.5, diversify_prob=1.0, n_search_trees=1, tree_init=True, random_state=None, n_jobs=None, low_memory=True, max_candidates=None, n_iters=None, early_termination_value=0.001, parallel_batch_queries=False, verbose=False)

PyNNDescentTransformer for fast approximate nearest neighbor transformer. It uses the NNDescent algorithm, and is thus very flexible and supports a wide variety of distances, including non-metric distances. NNDescent also scales well against high dimensional graph_data in many cases.

Transform X into a (weighted) graph of k nearest neighbors

The transformed graph_data is a sparse graph as returned by kneighbors_graph.

Parameters

n_neighbors: int (optional, default=5)

The number of neighbors to use in k-neighbor graph graph_data structure used for fast approximate nearest neighbor search. Larger values will result in more accurate search results at the cost of computation time.

metric: string or callable (optional, default=’euclidean’)

The metric to use for computing nearest neighbors. If a callable is used it must be a numba njit compiled function. Supported metrics include:

euclidean

manhattan

chebyshev

minkowski

canberra

braycurtis

mahalanobis

wminkowski

seuclidean

cosine

correlation

haversine

hamming

jaccard

dice

russelrao

kulsinski

rogerstanimoto

sokalmichener

sokalsneath

yule

hellinger

wasserstein-1d

Metrics that take arguments (such as minkowski, mahalanobis etc.) can have arguments passed via the metric_kwds dictionary. At this time care must be taken and dictionary elements must be ordered appropriately; this will hopefully be fixed in the future.

metric_kwds: dict (optional, default {})

Arguments to pass on to the metric, such as the p value for Minkowski distance.

n_trees: int (optional, default=None)

This implementation uses random projection forests for initialization of searches. This parameter controls the number of trees in that forest. A larger number will result in more accurate neighbor computation at the cost of performance. The default of None means a value will be chosen based on the size of the graph_data.

leaf_size: int (optional, default=None)

The maximum number of points in a leaf for the random projection trees. The default of None means a value will be chosen based on n_neighbors.

pruning_degree_multiplier: float (optional, default=1.5)

How aggressively to prune the graph. Since the search graph is undirected (and thus includes nearest neighbors and reverse nearest neighbors) vertices can have very high degree – the graph will be pruned such that no vertex has degree greater than pruning_degree_multiplier * n_neighbors.

diversify_prob: float (optional, default=1.0)

The search graph get “diversified” by removing potentially unnecessary edges. This controls the volume of edges removed. A value of 0.0 ensures that no edges get removed, and larger values result in significantly more aggressive edge removal. A value of 1.0 will prune all edges that it can.

n_search_trees: int (optional, default=1)

The number of random projection trees to use in initializing searching or querying.

Deprecated since version 0.5.5.

search_epsilon: float (optional, default=0.1)

When searching for nearest neighbors of a query point this values controls the trade-off between accuracy and search cost. Larger values produce more accurate nearest neighbor results at larger computational cost for the search. Values should be in the range 0.0 to 0.5, but should probably not exceed 0.3 without good reason.

tree_init: bool (optional, default=True)

Whether to use random projection trees for initialization.

random_state: int, RandomState instance or None, optional (default: None)

If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random.

n_jobs: int or None (optional, default=None)

The maximum number of parallel threads to be run at a time. If none this will default to using all the cores available. Note that there is not perfect parallelism, so at several pints the algorithm will be single threaded.

low_memory: boolean (optional, default=False)

Whether to use a lower memory, but more computationally expensive approach to index construction. This defaults to false as for most cases it speeds index construction, but if you are having issues with excessive memory use for your dataset consider setting this to True.

max_candidates: int (optional, default=20)

Internally each “self-join” keeps a maximum number of candidates ( nearest neighbors and reverse nearest neighbors) to be considered. This value controls this aspect of the algorithm. Larger values will provide more accurate search results later, but potentially at non-negligible computation cost in building the index. Don’t tweak this value unless you know what you’re doing.

n_iters: int (optional, default=None)

The maximum number of NN-descent iterations to perform. The NN-descent algorithm can abort early if limited progress is being made, so this only controls the worst case. Don’t tweak this value unless you know what you’re doing. The default of None means a value will be chosen based on the size of the graph_data.

early_termination_value: float (optional, default=0.001)

Controls the early abort due to limited progress. Larger values will result in earlier aborts, providing less accurate indexes, and less accurate searching. Don’t tweak this value unless you know what you’re doing.

parallel_batch_queries: bool (optional, default=False)

Whether to use parallelism of batched queries. This can be useful for large batches of queries on multicore machines, but results in performance degradation for single queries, so is poor for streaming use.

verbose: bool (optional, default=False)

Whether to print status graph_data during the computation.

Examples

>>> from sklearn.manifold import Isomap
>>> from pynndescent import PyNNDescentTransformer
>>> from sklearn.pipeline import make_pipeline
>>> estimator = make_pipeline(
...     PyNNDescentTransformer(n_neighbors=5),
...     Isomap(neighbors_algorithm='precomputed'))

fit(X, compress_index=True)

Fit the PyNNDescent transformer to build KNN graphs with neighbors given by the dataset X.

Parameters

Xarray-like, shape (n_samples, n_features): Sample graph_data

Returns

transformerPyNNDescentTransformer: The trained transformer

fit_transform(X, y=None, **fit_params)

Fit to graph_data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters

Xnumpy array of shape (n_samples, n_features): Training set.

y : ignored

Returns

XtCSR sparse matrix, shape (n_samples, n_samples): Xt[i, j] is assigned the weight of edge that connects i to j. Only the neighbors have an explicit value. The diagonal is always explicit.

set_fit_request(*, compress_index: bool | None | str = '$UNCHANGED$') → PyNNDescentTransformer

Configure whether metadata should be requested to be passed to the fit method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to fit.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters

compress_indexstr, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED: Metadata routing for compress_index parameter in fit.

Returns

selfobject: The updated object.

transform(X, y=None)

Computes the (weighted) graph of Neighbors for points in X

Parameters

Xarray-like, shape (n_samples_transform, n_features): Sample graph_data

Returns

XtCSR sparse matrix, shape (n_samples_transform, n_samples_fit): Xt[i, j] is assigned the weight of edge that connects i to j. Only the neighbors have an explicit value.

A number of internal functions can also be accessed separately for more fine tuned work.

Distance Functions

pynndescent.distances.alternative_cosine(x, y)

Alternative cosine distance using log transform.

A transformed version of cosine distance suitable for the bounded-radius search algorithm. Uses negative log of the cosine similarity.

\[D_{alt}(x, y) = \log_2\left(\frac{\|x\| \|y\|}{\langle x, y \rangle}\right)\]

Returns FLOAT32_MAX for non-positive cosine similarities (treating them as infinitely far). Use correct_alternative_cosine to convert back to standard cosine distance.

pynndescent.distances.alternative_dot(x, y)

Alternative dot product distance using log transform.

A transformed version of dot product distance suitable for the bounded-radius search algorithm. Uses negative log of the dot product.

\[D_{alt}(x, y) = -\log_2(\langle x, y \rangle)\]

Returns FLOAT32_MAX for non-positive dot products (treating them as infinitely far). Use correct_alternative_cosine to convert back to standard dot distance.

pynndescent.distances.alternative_hellinger(x, y)

Alternative Hellinger distance using log transform.

A transformed version of Hellinger distance suitable for the bounded-radius search algorithm.

\[D_{alt}(x, y) = \log_2\left(\frac{\sqrt{\sum_i x_i \cdot \sum_i y_i}}{\sum_i \sqrt{x_i y_i}}\right)\]

Use correct_alternative_hellinger to convert back to standard Hellinger distance.

pynndescent.distances.alternative_inner_product(x, y)

Alternative inner product distance using reciprocal transform.

This transforms the inner product into a positive distance suitable for the bounded-radius search algorithm. The transform is:

\[D_{alt}(x, y) = \frac{1}{\langle x, y \rangle}\]

This maps positive inner products to positive distances: - High inner product → small positive distance - Low positive inner product → large positive distance - Non-positive inner product → FLOAT32_MAX (treated as infinitely far)

In high-dimensional nearest neighbor search, we expect true neighbors to have positive inner products. Pairs with non-positive inner products are treated as maximally distant, similar to how alternative_cosine handles negative cosine similarities.

The correction function correct_alternative_inner_product converts back to the negative inner product.

pynndescent.distances.alternative_jaccard(x, y)

Alternative Jaccard distance using log transform.

A transformed version of Jaccard distance suitable for the bounded-radius search algorithm. Uses negative log of the Jaccard similarity coefficient.

\[D_{alt}(x, y) = -\log_2\left(\frac{|x \cap y|}{|x \cup y|}\right)\]

Use correct_alternative_jaccard to convert back to standard Jaccard distance.

pynndescent.distances.bit_hamming(x, y)

Hamming distance for bit-packed binary vectors.

Counts the number of differing bits between two uint8 arrays, where each byte contains 8 packed binary features. More efficient than standard Hamming for binary data.

\[D(x, y) = \sum_i \text{popcount}(x_i \oplus y_i)\]

Returns the total count of differing bits (not normalized).

pynndescent.distances.bit_jaccard(x, y)

Jaccard distance for bit-packed binary vectors.

Computes Jaccard distance for uint8 arrays where each byte contains 8 packed binary features. Uses negative log transform for compatibility with the bounded-radius search algorithm.

\[D(x, y) = -\log\left(\frac{\text{popcount}(x \land y)}{\text{popcount}(x \lor y)}\right)\]

More efficient than standard Jaccard for binary data.

pynndescent.distances.bray_curtis(x, y)

Bray-Curtis distance.

A distance measure commonly used in ecology to quantify the compositional dissimilarity between two samples.

\[D(x, y) = \frac{\sum_i |x_i - y_i|}{\sum_i |x_i + y_i|}\]

pynndescent.distances.canberra(x, y)

Canberra distance.

A weighted version of Manhattan distance where each term is divided by the sum of absolute values.

\[D(x, y) = \sum_i \frac{|x_i - y_i|}{|x_i| + |y_i|}\]

pynndescent.distances.chebyshev(x, y): Chebyshev or l-infinity distance.

\[D(x, y) = \max_i |x_i - y_i|\]

pynndescent.distances.circular_kantorovich(x, y, p=1)

Circular Kantorovich distance.

The Wasserstein distance for distributions on a circle (periodic domain). Useful for cyclic data like angles, time of day, or periodic histograms.

Parameters

x, yarray-like: Input vectors treated as probability distributions (will be normalized).
pint: The order of the Wasserstein distance (default 1).

pynndescent.distances.correlation(x, y)

Correlation distance.

One minus the Pearson correlation coefficient. Measures how linearly related two vectors are after centering (subtracting their means).

\[D(x, y) = 1 - \frac{\langle x - \bar{x}, y - \bar{y} \rangle}{\|x - \bar{x}\| \|y - \bar{y}\|}\]

Equivalent to cosine distance on mean-centered data.

pynndescent.distances.cosine(x, y)

Cosine distance.

One minus the cosine of the angle between two vectors. Measures the angular difference between vectors, independent of their magnitudes.

\[D(x, y) = 1 - \frac{\langle x, y \rangle}{\|x\| \|y\|}\]

Returns 0 if both vectors are zero, 1 if one is zero.

pynndescent.distances.dice(x, y)

Dice distance (Sørensen-Dice dissimilarity).

One minus twice the intersection divided by the sum of cardinalities. Commonly used for comparing the similarity of two samples.

\[D(x, y) = \frac{|x \oplus y|}{2|x \cap y| + |x \oplus y|}\]

where $\oplus$ denotes symmetric difference.

pynndescent.distances.dot(x, y)

Dot product distance for normalized vectors.

One minus the dot product. This is equivalent to cosine distance when vectors are normalized to unit length. For unnormalized vectors, use inner_product distance instead.

\[D(x, y) = 1 - \langle x, y \rangle\]

Returns 1.0 for non-positive dot products.

pynndescent.distances.euclidean(x, y): Standard euclidean distance.

\[\begin{split}D(x, y) = \\sqrt{\sum_i (x_i - y_i)^2}\end{split}\]

pynndescent.distances.hamming(x, y)

Hamming distance.

The proportion of elements that differ between two vectors.

\[D(x, y) = \frac{1}{n} \sum_i \mathbf{1}_{x_i \neq y_i}\]

pynndescent.distances.haversine(x, y)

Haversine (great circle) distance.

The angular distance between two points on a sphere, given their latitudes and longitudes in radians. Only valid for 2D data where x[0], y[0] are latitudes and x[1], y[1] are longitudes.

\[D(x, y) = 2 \arcsin\left(\sqrt{\sin^2\left(\frac{\phi_1 - \phi_2}{2}\right) + \cos(\phi_1)\cos(\phi_2)\sin^2\left(\frac{\lambda_1 - \lambda_2}{2}\right)}\right)\]

where $\phi$ is latitude and $\lambda$ is longitude.

pynndescent.distances.hellinger(x, y)

Hellinger distance.

A distance for probability distributions, based on the Bhattacharyya coefficient. Input vectors are treated as (unnormalized) probability distributions.

\[D(x, y) = \sqrt{1 - \frac{\sum_i \sqrt{x_i y_i}}{\sqrt{\sum_i x_i \cdot \sum_i y_i}}}\]

Returns values in [0, 1].

pynndescent.distances.inner_product(x, y)

Inner product distance (negative inner product).

This is useful for retrieval tasks where the inner product represents similarity (higher = more similar). The distance is simply the negation of the inner product, so that higher similarity becomes lower distance.

Note: Unlike dot product distance, this does NOT assume normalized vectors. For normalized vectors, use the dot distance instead which is bounded [0, 1].

\[D(x, y) = -\sum_i x_i y_i\]

pynndescent.distances.jaccard(x, y)

Jaccard distance.

One minus the Jaccard similarity coefficient. For binary vectors this is the size of the symmetric difference divided by the size of the union.

\[D(x, y) = 1 - \frac{|x \cap y|}{|x \cup y|}\]

For continuous vectors, non-zero values are treated as set membership.

pynndescent.distances.jensen_shannon_divergence(x, y)

Jensen-Shannon divergence.

A symmetrized and smoothed version of KL divergence. Measures the similarity between two probability distributions.

\[D(x, y) = \frac{1}{2} \left( D_{KL}(x \| m) + D_{KL}(y \| m) \right)\]

where $m = \frac{1}{2}(x + y)$ and $D_{KL}$ is KL divergence. Input vectors are normalized to probability distributions.

pynndescent.distances.kantorovich(x, y, cost=array([[0., 0.], [0., 0.]]), max_iter=100000)

Kantorovich distance (Earth Mover’s Distance / Wasserstein distance).

The optimal transport distance between two probability distributions. Computes the minimum cost to transform one distribution into another, given a cost matrix.

Parameters

x, yarray-like: Input vectors treated as probability distributions (will be normalized).
costarray-like: Cost matrix where cost[i,j] is the cost of moving mass from bin i to bin j.
max_iterint: Maximum number of iterations for the network simplex algorithm.

Returns

float: The optimal transport distance.

pynndescent.distances.kulsinski(x, y)

Kulsinski distance.

A variant of Jaccard distance that includes a count of all dimensions. For binary vectors, gives more weight to dimensions where both are false.

\[D(x, y) = \frac{|x \oplus y| - |x \cap y| + n}{|x \oplus y| + n}\]

where n is the number of dimensions.

pynndescent.distances.mahalanobis(x, y, vinv=array([[1., 0.], [0., 1.]], dtype=float32))

Mahalanobis distance.

\[D(x, y) = \sqrt{(x - y)^T V^{-1} (x - y)}\]

where V is the covariance matrix. This is equivalent to Euclidean distance after transforming the space by the inverse square root of the covariance.

pynndescent.distances.manhattan(x, y): Manhattan, taxicab, or l1 distance.

\[D(x, y) = \sum_i |x_i - y_i|\]

pynndescent.distances.matching(x, y)

Matching distance (simple matching dissimilarity).

The proportion of elements that differ in their boolean state. For binary vectors, counts positions where one is non-zero and the other is zero.

\[D(x, y) = \frac{1}{n} \sum_i \mathbf{1}_{(x_i \neq 0) \neq (y_i \neq 0)}\]

pynndescent.distances.minkowski(x, y, p=2)

Minkowski distance.

\[D(x, y) = \left(\sum_i |x_i - y_i|^p\right)^{\frac{1}{p}}\]

This is a general distance. For p=1 it is equivalent to manhattan distance, for p=2 it is Euclidean distance, and for p=infinity it is Chebyshev distance. In general it is better to use the more specialised functions for those distances.

pynndescent.distances.proxy_circular_kantorovich(x, y)

A proxy for circular Kantorovich distance.

Uses mean-shifted CDF L1 distance instead of the more expensive median-shifted Minkowski distance. The mean is a reasonable approximation to the median for most distributions and avoids the expensive median computation.

Results should be reranked with true circular_kantorovich distance.

pynndescent.distances.proxy_inner_product(x, y)

A proxy for inner product distance (negative inner product).

Inner product distance has undesirable properties for nearest neighbor graph based search, and NNDescent in general. This is a proxy function that behaves similarly to inner product distance for ranking neighbors, but avoids some of the pitfalls.

This is to be used internally, and results should use reranking with true inner product distance.

pynndescent.distances.proxy_jensen_shannon(x, y)

A proxy for Jensen-Shannon divergence.

Jensen-Shannon requires computing logs and the mixture distribution, which is expensive. This proxy uses squared Hellinger distance, which is also a proper divergence on probability distributions and much cheaper to compute (no logs required).

\[D_{proxy}(x, y) = 1 - \left(\sum_i \sqrt{p_i q_i}\right)^2\]

where p, q are the normalized distributions.

Results should be reranked with true jensen_shannon_divergence.

pynndescent.distances.proxy_kantorovich(x, y)

A proxy for Kantorovich (Earth Mover’s) distance.

The full Kantorovich distance requires solving an optimal transport problem via network simplex, which is expensive. This proxy uses a combination of: 1. Total variation distance (L1 on normalized distributions) 2. Hellinger-like term for better correlation

This is much cheaper to compute and correlates reasonably well with true optimal transport distance for nearest neighbor search.

Results should be reranked with true kantorovich distance.

pynndescent.distances.proxy_sinkhorn(x, y)

A proxy for Sinkhorn (entropy-regularized optimal transport) distance.

Sinkhorn distance requires iterative matrix scaling which is expensive. This proxy uses the same combination as proxy_kantorovich since Sinkhorn approximates Kantorovich.

Results should be reranked with true sinkhorn distance.

pynndescent.distances.proxy_symmetric_kl(x, y)

A proxy for symmetric KL divergence.

Symmetric KL requires computing logs which is expensive. This proxy uses triangular discrimination (symmetric chi-squared divergence), which is a second-order approximation to KL divergence and much cheaper.

\[D_{proxy}(x, y) = \sum_i \frac{(p_i - q_i)^2}{p_i + q_i}\]

Results should be reranked with true symmetric_kl_divergence.

pynndescent.distances.proxy_wasserstein_1d(x, y)

A proxy for 1D Wasserstein distance.

Uses L1 distance on the cumulative distribution functions, which is exactly equal to Wasserstein-1 distance for 1D distributions. This avoids the more expensive Minkowski computation with allocation.

For Wasserstein-p with p > 1, this is a lower bound and correlates well for nearest neighbor search.

\[D_{proxy}(x, y) = \sum_i |F_x(i) - F_y(i)|\]

where $F_x, F_y$ are the cumulative distribution functions.

Results should be reranked with true wasserstein_1d distance if p > 1.

pynndescent.distances.quantized_uint4_alternative_cosine(x, y, quantized_values): Alternative cosine distance between a float vector x and a quantized uint8 vector y. The uint8 values in y are mapped back to floats using upper and lower nibbles and via the provided quantized_values array.

pynndescent.distances.quantized_uint4_alternative_dot(x, y, quantized_values): Alternative dot product distance between a float vector x and a quantized uint8 vector y. The uint8 values in y are mapped back to floats using upper and lower nibbles and via the provided quantized_values array. x and y are assumed to be normalized.

pynndescent.distances.quantized_uint4_sq_euclidean(x, y, quantized_values): Squared Euclidean distance between a float vector x and a quantized uint8 vector y. The uint8 values in y are mapped back to floats using upper and lower nibbles and via the provided quantized_values array.

pynndescent.distances.quantized_uint8_alternative_cosine(x, y, quantized_values): Alternative cosine distance between a float vector x and a quantized uint8 vector y. The uint8 values in y are mapped back to floats using the provided quantized_values array.

pynndescent.distances.quantized_uint8_alternative_dot(x, y, quantized_values): Alternative dot product distance between a float vector x and a quantized uint8 vector y. The uint8 values in y are mapped back to floats using the provided quantized_values array. x and y are assumed to be normalized.

pynndescent.distances.quantized_uint8_sq_euclidean(x, y, quantized_values): Squared Euclidean distance between a float vector x and a quantized uint8 vector y. The uint8 values in y are mapped back to floats using the provided quantized_values array.

pynndescent.distances.rogers_tanimoto(x, y)

Rogers-Tanimoto distance.

A distance measure for binary vectors that gives double weight to disagreements.

\[D(x, y) = \frac{2|x \oplus y|}{n + |x \oplus y|}\]

where n is the number of dimensions.

pynndescent.distances.russellrao(x, y)

Russell-Rao distance.

The proportion of dimensions where at least one vector has a false value.

\[D(x, y) = \frac{n - |x \cap y|}{n}\]

where n is the number of dimensions.

pynndescent.distances.sinkhorn(x, y, cost=array([[0., 0.], [0., 0.]]), regularization=1.0)

Sinkhorn distance (entropy-regularized optimal transport).

An approximation to the Kantorovich distance using entropy regularization. Faster to compute than exact optimal transport for large distributions.

Parameters

x, yarray-like: Input vectors treated as probability distributions (will be normalized).
costarray-like: Cost matrix where cost[i,j] is the cost of moving mass from bin i to bin j.
regularizationfloat: Entropy regularization parameter. Smaller values give results closer to exact Kantorovich distance but may be less stable.

Returns

float: The entropy-regularized optimal transport distance.

pynndescent.distances.sokal_michener(x, y)

Sokal-Michener distance.

Equivalent to Rogers-Tanimoto distance. A distance measure for binary vectors that gives double weight to disagreements.

\[D(x, y) = \frac{2|x \oplus y|}{n + |x \oplus y|}\]

where n is the number of dimensions.

pynndescent.distances.sokal_sneath(x, y)

Sokal-Sneath distance.

A binary distance that gives double weight to agreements (both true).

\[D(x, y) = \frac{|x \oplus y|}{0.5|x \cap y| + |x \oplus y|}\]

where $\oplus$ denotes symmetric difference.

pynndescent.distances.spearmanr(x, y)

Spearman rank correlation distance.

One minus the Spearman rank correlation coefficient. Measures the monotonic relationship between two vectors by computing correlation on their ranks.

\[D(x, y) = 1 - \rho(\text{rank}(x), \text{rank}(y))\]

where $\rho$ is Pearson correlation.

pynndescent.distances.squared_euclidean(x, y): Squared euclidean distance.

\[D(x, y) = \sum_i (x_i - y_i)^2\]

pynndescent.distances.standardised_euclidean(x, y, sigma=array([1., 1.], dtype=float32)): Euclidean distance standardised against a vector of standard deviations per coordinate.

\[D(x, y) = \sqrt{\sum_i \frac{(x_i - y_i)**2}{v_i}}\]

pynndescent.distances.symmetric_kl_divergence(x, y)

Symmetric Kullback-Leibler divergence.

The sum of KL divergences in both directions, making it symmetric.

\[D(x, y) = D_{KL}(x \| y) + D_{KL}(y \| x)\]

where $D_{KL}(p \| q) = \sum_i p_i \log(p_i / q_i)$. Input vectors are normalized to probability distributions.

pynndescent.distances.true_angular(x, y)

True angular distance.

The actual angle between two vectors, normalized to [0, 1]. Unlike cosine distance which uses 1 - cos(θ), this returns 1 - θ/π.

\[D(x, y) = 1 - \frac{\arccos\left(\frac{\langle x, y \rangle}{\|x\| \|y\|}\right)}{\pi}\]

Returns 0 for identical directions, approaches 1 for opposite directions.

pynndescent.distances.tsss(x, y)

Triangle Area Similarity - Sector Area Similarity (TS-SS) distance.

A distance metric that combines both magnitude and angular information. It multiplies a triangle area (capturing angular difference) by a sector area (capturing both angular and magnitude differences).

Useful when both the direction and magnitude of vectors are important.

pynndescent.distances.wasserstein_1d(x, y, p=1)

1-dimensional Wasserstein distance.

The p-Wasserstein distance for 1D distributions, computed efficiently via the CDF. Input vectors are treated as histograms over ordered bins.

\[W_p(x, y) = \left( \sum_i |F_x(i) - F_y(i)|^p \right)^{1/p}\]

where $F_x, F_y$ are the cumulative distribution functions.

Parameters

x, yarray-like: Input vectors treated as probability distributions (will be normalized).
pint: The order of the Wasserstein distance (default 1).

pynndescent.distances.weighted_minkowski(x, y, w=array([1., 1.], dtype=float32), p=2)

A weighted version of Minkowski distance.

\[D(x, y) = \left(\sum_i w_i |x_i - y_i|^p\right)^{\frac{1}{p}}\]

If weights w_i are inverse standard deviations of graph_data in each dimension then this represented a standardised Minkowski distance (and is equivalent to standardised Euclidean distance for p=1).

pynndescent.distances.yule(x, y)

Yule distance.

A binary distance based on the Yule Q coefficient of association.

\[D(x, y) = \frac{2 \cdot n_{TF} \cdot n_{FT}}{n_{TT} \cdot n_{FF} + n_{TF} \cdot n_{FT}}\]

where $n_{TF}$ is the count of positions where x is true and y is false, etc.