PyNNDescent API Guide

PyNNDescent has only two classes NNDescent and PyNNDescentTransformer.

PyNNDescent

class pynndescent.pynndescent_.PyNNDescentTransformer(n_neighbors=30, metric='euclidean', metric_kwds=None, n_trees=None, leaf_size=None, search_epsilon=0.1, pruning_degree_multiplier=1.5, diversify_prob=1.0, n_search_trees=1, tree_init=True, random_state=None, n_jobs=None, low_memory=True, max_candidates=None, n_iters=None, early_termination_value=0.001, parallel_batch_queries=False, verbose=False)

PyNNDescentTransformer for fast approximate nearest neighbor transformer. It uses the NNDescent algorithm, and is thus very flexible and supports a wide variety of distances, including non-metric distances. NNDescent also scales well against high dimensional graph_data in many cases.

Transform X into a (weighted) graph of k nearest neighbors

The transformed graph_data is a sparse graph as returned by kneighbors_graph.

Parameters

n_neighbors: int (optional, default=5)

The number of neighbors to use in k-neighbor graph graph_data structure used for fast approximate nearest neighbor search. Larger values will result in more accurate search results at the cost of computation time.

metric: string or callable (optional, default=’euclidean’)

The metric to use for computing nearest neighbors. If a callable is used it must be a numba njit compiled function. Supported metrics include:

  • euclidean

  • manhattan

  • chebyshev

  • minkowski

  • canberra

  • braycurtis

  • mahalanobis

  • wminkowski

  • seuclidean

  • cosine

  • correlation

  • haversine

  • hamming

  • jaccard

  • dice

  • russelrao

  • kulsinski

  • rogerstanimoto

  • sokalmichener

  • sokalsneath

  • yule

  • hellinger

  • wasserstein-1d

Metrics that take arguments (such as minkowski, mahalanobis etc.) can have arguments passed via the metric_kwds dictionary. At this time care must be taken and dictionary elements must be ordered appropriately; this will hopefully be fixed in the future.

metric_kwds: dict (optional, default {})

Arguments to pass on to the metric, such as the p value for Minkowski distance.

n_trees: int (optional, default=None)

This implementation uses random projection forests for initialization of searches. This parameter controls the number of trees in that forest. A larger number will result in more accurate neighbor computation at the cost of performance. The default of None means a value will be chosen based on the size of the graph_data.

leaf_size: int (optional, default=None)

The maximum number of points in a leaf for the random projection trees. The default of None means a value will be chosen based on n_neighbors.

pruning_degree_multiplier: float (optional, default=1.5)

How aggressively to prune the graph. Since the search graph is undirected (and thus includes nearest neighbors and reverse nearest neighbors) vertices can have very high degree – the graph will be pruned such that no vertex has degree greater than pruning_degree_multiplier * n_neighbors.

diversify_prob: float (optional, default=1.0)

The search graph get “diversified” by removing potentially unnecessary edges. This controls the volume of edges removed. A value of 0.0 ensures that no edges get removed, and larger values result in significantly more aggressive edge removal. A value of 1.0 will prune all edges that it can.

n_search_trees: int (optional, default=1)

The number of random projection trees to use in initializing searching or querying.

Deprecated since version 0.5.5.

search_epsilon: float (optional, default=0.1)

When searching for nearest neighbors of a query point this values controls the trade-off between accuracy and search cost. Larger values produce more accurate nearest neighbor results at larger computational cost for the search. Values should be in the range 0.0 to 0.5, but should probably not exceed 0.3 without good reason.

tree_init: bool (optional, default=True)

Whether to use random projection trees for initialization.

random_state: int, RandomState instance or None, optional (default: None)

If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random.

n_jobs: int or None (optional, default=None)

The maximum number of parallel threads to be run at a time. If none this will default to using all the cores available. Note that there is not perfect parallelism, so at several pints the algorithm will be single threaded.

low_memory: boolean (optional, default=False)

Whether to use a lower memory, but more computationally expensive approach to index construction. This defaults to false as for most cases it speeds index construction, but if you are having issues with excessive memory use for your dataset consider setting this to True.

max_candidates: int (optional, default=20)

Internally each “self-join” keeps a maximum number of candidates ( nearest neighbors and reverse nearest neighbors) to be considered. This value controls this aspect of the algorithm. Larger values will provide more accurate search results later, but potentially at non-negligible computation cost in building the index. Don’t tweak this value unless you know what you’re doing.

n_iters: int (optional, default=None)

The maximum number of NN-descent iterations to perform. The NN-descent algorithm can abort early if limited progress is being made, so this only controls the worst case. Don’t tweak this value unless you know what you’re doing. The default of None means a value will be chosen based on the size of the graph_data.

early_termination_value: float (optional, default=0.001)

Controls the early abort due to limited progress. Larger values will result in earlier aborts, providing less accurate indexes, and less accurate searching. Don’t tweak this value unless you know what you’re doing.

parallel_batch_queries: bool (optional, default=False)

Whether to use parallelism of batched queries. This can be useful for large batches of queries on multicore machines, but results in performance degradation for single queries, so is poor for streaming use.

verbose: bool (optional, default=False)

Whether to print status graph_data during the computation.

Examples

>>> from sklearn.manifold import Isomap
>>> from pynndescent import PyNNDescentTransformer
>>> from sklearn.pipeline import make_pipeline
>>> estimator = make_pipeline(
...     PyNNDescentTransformer(n_neighbors=5),
...     Isomap(neighbors_algorithm='precomputed'))
fit(X, compress_index=True)

Fit the PyNNDescent transformer to build KNN graphs with neighbors given by the dataset X.

Parameters

Xarray-like, shape (n_samples, n_features)

Sample graph_data

Returns

transformerPyNNDescentTransformer

The trained transformer

fit_transform(X, y=None, **fit_params)

Fit to graph_data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters

Xnumpy array of shape (n_samples, n_features)

Training set.

y : ignored

Returns

XtCSR sparse matrix, shape (n_samples, n_samples)

Xt[i, j] is assigned the weight of edge that connects i to j. Only the neighbors have an explicit value. The diagonal is always explicit.

set_fit_request(*, compress_index: bool | None | str = '$UNCHANGED$') PyNNDescentTransformer

Request metadata passed to the fit method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to fit.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

New in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters

compress_indexstr, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED

Metadata routing for compress_index parameter in fit.

Returns

selfobject

The updated object.

transform(X, y=None)

Computes the (weighted) graph of Neighbors for points in X

Parameters

Xarray-like, shape (n_samples_transform, n_features)

Sample graph_data

Returns

XtCSR sparse matrix, shape (n_samples_transform, n_samples_fit)

Xt[i, j] is assigned the weight of edge that connects i to j. Only the neighbors have an explicit value.

A number of internal functions can also be accessed separately for more fine tuned work.

Distance Functions

pynndescent.distances.chebyshev(x, y)

Chebyshev or l-infinity distance.

\[D(x, y) = \max_i |x_i - y_i|\]
pynndescent.distances.euclidean(x, y)

Standard euclidean distance.

\[\begin{split}D(x, y) = \\sqrt{\sum_i (x_i - y_i)^2}\end{split}\]
pynndescent.distances.manhattan(x, y)

Manhattan, taxicab, or l1 distance.

\[D(x, y) = \sum_i |x_i - y_i|\]
pynndescent.distances.minkowski(x, y, p=2)

Minkowski distance.

\[D(x, y) = \left(\sum_i |x_i - y_i|^p\right)^{\frac{1}{p}}\]

This is a general distance. For p=1 it is equivalent to manhattan distance, for p=2 it is Euclidean distance, and for p=infinity it is Chebyshev distance. In general it is better to use the more specialised functions for those distances.

pynndescent.distances.squared_euclidean(x, y)

Squared euclidean distance.

\[D(x, y) = \sum_i (x_i - y_i)^2\]
pynndescent.distances.standardised_euclidean(x, y, sigma=array([1., 1.], dtype=float32))

Euclidean distance standardised against a vector of standard deviations per coordinate.

\[D(x, y) = \sqrt{\sum_i \frac{(x_i - y_i)**2}{v_i}}\]
pynndescent.distances.weighted_minkowski(x, y, w=array([1., 1.], dtype=float32), p=2)

A weighted version of Minkowski distance.

\[D(x, y) = \left(\sum_i w_i |x_i - y_i|^p\right)^{\frac{1}{p}}\]

If weights w_i are inverse standard deviations of graph_data in each dimension then this represented a standardised Minkowski distance (and is equivalent to standardised Euclidean distance for p=1).