PyNNDescent API Guide
PyNNDescent has only two classes NNDescent
and PyNNDescentTransformer
.
PyNNDescent
- class pynndescent.pynndescent_.PyNNDescentTransformer(n_neighbors=30, metric='euclidean', metric_kwds=None, n_trees=None, leaf_size=None, search_epsilon=0.1, pruning_degree_multiplier=1.5, diversify_prob=1.0, n_search_trees=1, tree_init=True, random_state=None, n_jobs=None, low_memory=True, max_candidates=None, n_iters=None, early_termination_value=0.001, parallel_batch_queries=False, verbose=False)
PyNNDescentTransformer for fast approximate nearest neighbor transformer. It uses the NNDescent algorithm, and is thus very flexible and supports a wide variety of distances, including non-metric distances. NNDescent also scales well against high dimensional graph_data in many cases.
Transform X into a (weighted) graph of k nearest neighbors
The transformed graph_data is a sparse graph as returned by kneighbors_graph.
Parameters
- n_neighbors: int (optional, default=5)
The number of neighbors to use in k-neighbor graph graph_data structure used for fast approximate nearest neighbor search. Larger values will result in more accurate search results at the cost of computation time.
- metric: string or callable (optional, default=’euclidean’)
The metric to use for computing nearest neighbors. If a callable is used it must be a numba njit compiled function. Supported metrics include:
euclidean
manhattan
chebyshev
minkowski
canberra
braycurtis
mahalanobis
wminkowski
seuclidean
cosine
correlation
haversine
hamming
jaccard
dice
russelrao
kulsinski
rogerstanimoto
sokalmichener
sokalsneath
yule
hellinger
wasserstein-1d
Metrics that take arguments (such as minkowski, mahalanobis etc.) can have arguments passed via the metric_kwds dictionary. At this time care must be taken and dictionary elements must be ordered appropriately; this will hopefully be fixed in the future.
- metric_kwds: dict (optional, default {})
Arguments to pass on to the metric, such as the
p
value for Minkowski distance.- n_trees: int (optional, default=None)
This implementation uses random projection forests for initialization of searches. This parameter controls the number of trees in that forest. A larger number will result in more accurate neighbor computation at the cost of performance. The default of None means a value will be chosen based on the size of the graph_data.
- leaf_size: int (optional, default=None)
The maximum number of points in a leaf for the random projection trees. The default of None means a value will be chosen based on n_neighbors.
- pruning_degree_multiplier: float (optional, default=1.5)
How aggressively to prune the graph. Since the search graph is undirected (and thus includes nearest neighbors and reverse nearest neighbors) vertices can have very high degree – the graph will be pruned such that no vertex has degree greater than
pruning_degree_multiplier * n_neighbors
.- diversify_prob: float (optional, default=1.0)
The search graph get “diversified” by removing potentially unnecessary edges. This controls the volume of edges removed. A value of 0.0 ensures that no edges get removed, and larger values result in significantly more aggressive edge removal. A value of 1.0 will prune all edges that it can.
- n_search_trees: int (optional, default=1)
The number of random projection trees to use in initializing searching or querying.
Deprecated since version 0.5.5.
- search_epsilon: float (optional, default=0.1)
When searching for nearest neighbors of a query point this values controls the trade-off between accuracy and search cost. Larger values produce more accurate nearest neighbor results at larger computational cost for the search. Values should be in the range 0.0 to 0.5, but should probably not exceed 0.3 without good reason.
- tree_init: bool (optional, default=True)
Whether to use random projection trees for initialization.
- random_state: int, RandomState instance or None, optional (default: None)
If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random.
- n_jobs: int or None (optional, default=None)
The maximum number of parallel threads to be run at a time. If none this will default to using all the cores available. Note that there is not perfect parallelism, so at several pints the algorithm will be single threaded.
- low_memory: boolean (optional, default=False)
Whether to use a lower memory, but more computationally expensive approach to index construction. This defaults to false as for most cases it speeds index construction, but if you are having issues with excessive memory use for your dataset consider setting this to True.
- max_candidates: int (optional, default=20)
Internally each “self-join” keeps a maximum number of candidates ( nearest neighbors and reverse nearest neighbors) to be considered. This value controls this aspect of the algorithm. Larger values will provide more accurate search results later, but potentially at non-negligible computation cost in building the index. Don’t tweak this value unless you know what you’re doing.
- n_iters: int (optional, default=None)
The maximum number of NN-descent iterations to perform. The NN-descent algorithm can abort early if limited progress is being made, so this only controls the worst case. Don’t tweak this value unless you know what you’re doing. The default of None means a value will be chosen based on the size of the graph_data.
- early_termination_value: float (optional, default=0.001)
Controls the early abort due to limited progress. Larger values will result in earlier aborts, providing less accurate indexes, and less accurate searching. Don’t tweak this value unless you know what you’re doing.
- parallel_batch_queries: bool (optional, default=False)
Whether to use parallelism of batched queries. This can be useful for large batches of queries on multicore machines, but results in performance degradation for single queries, so is poor for streaming use.
- verbose: bool (optional, default=False)
Whether to print status graph_data during the computation.
Examples
>>> from sklearn.manifold import Isomap >>> from pynndescent import PyNNDescentTransformer >>> from sklearn.pipeline import make_pipeline >>> estimator = make_pipeline( ... PyNNDescentTransformer(n_neighbors=5), ... Isomap(neighbors_algorithm='precomputed'))
- fit(X, compress_index=True)
Fit the PyNNDescent transformer to build KNN graphs with neighbors given by the dataset X.
Parameters
- Xarray-like, shape (n_samples, n_features)
Sample graph_data
Returns
- transformerPyNNDescentTransformer
The trained transformer
- fit_transform(X, y=None, **fit_params)
Fit to graph_data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
Parameters
- Xnumpy array of shape (n_samples, n_features)
Training set.
y : ignored
Returns
- XtCSR sparse matrix, shape (n_samples, n_samples)
Xt[i, j] is assigned the weight of edge that connects i to j. Only the neighbors have an explicit value. The diagonal is always explicit.
- set_fit_request(*, compress_index: bool | None | str = '$UNCHANGED$') PyNNDescentTransformer
Request metadata passed to the
fit
method.Note that this method is only relevant if
enable_metadata_routing=True
(seesklearn.set_config()
). Please see User Guide on how the routing mechanism works.The options for each parameter are:
True
: metadata is requested, and passed tofit
if provided. The request is ignored if metadata is not provided.False
: metadata is not requested and the meta-estimator will not pass it tofit
.None
: metadata is not requested, and the meta-estimator will raise an error if the user provides it.str
: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (
sklearn.utils.metadata_routing.UNCHANGED
) retains the existing request. This allows you to change the request for some parameters and not others.New in version 1.3.
Note
This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a
Pipeline
. Otherwise it has no effect.Parameters
- compress_indexstr, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED
Metadata routing for
compress_index
parameter infit
.
Returns
- selfobject
The updated object.
- transform(X, y=None)
Computes the (weighted) graph of Neighbors for points in X
Parameters
- Xarray-like, shape (n_samples_transform, n_features)
Sample graph_data
Returns
- XtCSR sparse matrix, shape (n_samples_transform, n_samples_fit)
Xt[i, j] is assigned the weight of edge that connects i to j. Only the neighbors have an explicit value.
A number of internal functions can also be accessed separately for more fine tuned work.
Distance Functions
- pynndescent.distances.chebyshev(x, y)
Chebyshev or l-infinity distance.
\[D(x, y) = \max_i |x_i - y_i|\]
- pynndescent.distances.euclidean(x, y)
Standard euclidean distance.
\[\begin{split}D(x, y) = \\sqrt{\sum_i (x_i - y_i)^2}\end{split}\]
- pynndescent.distances.manhattan(x, y)
Manhattan, taxicab, or l1 distance.
\[D(x, y) = \sum_i |x_i - y_i|\]
- pynndescent.distances.minkowski(x, y, p=2)
Minkowski distance.
\[D(x, y) = \left(\sum_i |x_i - y_i|^p\right)^{\frac{1}{p}}\]This is a general distance. For p=1 it is equivalent to manhattan distance, for p=2 it is Euclidean distance, and for p=infinity it is Chebyshev distance. In general it is better to use the more specialised functions for those distances.
- pynndescent.distances.squared_euclidean(x, y)
Squared euclidean distance.
\[D(x, y) = \sum_i (x_i - y_i)^2\]
- pynndescent.distances.standardised_euclidean(x, y, sigma=array([1., 1.], dtype=float32))
Euclidean distance standardised against a vector of standard deviations per coordinate.
\[D(x, y) = \sqrt{\sum_i \frac{(x_i - y_i)**2}{v_i}}\]
- pynndescent.distances.weighted_minkowski(x, y, w=array([1., 1.], dtype=float32), p=2)
A weighted version of Minkowski distance.
\[D(x, y) = \left(\sum_i w_i |x_i - y_i|^p\right)^{\frac{1}{p}}\]If weights w_i are inverse standard deviations of graph_data in each dimension then this represented a standardised Minkowski distance (and is equivalent to standardised Euclidean distance for p=1).