PyNNDescent API Guide¶
PyNNDescent has only two classes NNDescent
and PyNNDescentTransformer
.
PyNNDescent¶

class
pynndescent.pynndescent_.
NNDescent
(data, metric='euclidean', metric_kwds=None, n_neighbors=30, n_trees=None, leaf_size=None, pruning_degree_multiplier=1.5, diversify_prob=1.0, n_search_trees=1, tree_init=True, init_graph=None, init_dist=None, random_state=None, low_memory=True, max_candidates=None, n_iters=None, delta=0.001, n_jobs=None, compressed=False, parallel_batch_queries=False, verbose=False)¶ NNDescent for fast approximate nearest neighbor queries. NNDescent is very flexible and supports a wide variety of distances, including nonmetric distances. NNDescent also scales well against high dimensional graph_data in many cases. This implementation provides a straightfoward interface, with access to some tuning parameters.
 data: array of shape (n_samples, n_features)
 The training graph_data set to find nearest neighbors in.
 metric: string or callable (optional, default=’euclidean’)
The metric to use for computing nearest neighbors. If a callable is used it must be a numba njit compiled function. Supported metrics include:
 euclidean
 manhattan
 chebyshev
 minkowski
 canberra
 braycurtis
 mahalanobis
 wminkowski
 seuclidean
 cosine
 correlation
 haversine
 hamming
 jaccard
 dice
 russelrao
 kulsinski
 rogerstanimoto
 sokalmichener
 sokalsneath
 yule
 hellinger
 wasserstein1d
Metrics that take arguments (such as minkowski, mahalanobis etc.) can have arguments passed via the metric_kwds dictionary. At this time care must be taken and dictionary elements must be ordered appropriately; this will hopefully be fixed in the future.
 metric_kwds: dict (optional, default {})
 Arguments to pass on to the metric, such as the
p
value for Minkowski distance.  n_neighbors: int (optional, default=30)
 The number of neighbors to use in kneighbor graph graph_data structure used for fast approximate nearest neighbor search. Larger values will result in more accurate search results at the cost of computation time.
 n_trees: int (optional, default=None)
 This implementation uses random projection forests for initializing the index build process. This parameter controls the number of trees in that forest. A larger number will result in more accurate neighbor computation at the cost of performance. The default of None means a value will be chosen based on the size of the graph_data.
 leaf_size: int (optional, default=None)
 The maximum number of points in a leaf for the random projection trees. The default of None means a value will be chosen based on n_neighbors.
 pruning_degree_multiplier: float (optional, default=1.5)
 How aggressively to prune the graph. Since the search graph is undirected
(and thus includes nearest neighbors and reverse nearest neighbors) vertices
can have very high degree – the graph will be pruned such that no
vertex has degree greater than
pruning_degree_multiplier * n_neighbors
.  diversify_prob: float (optional, default=1.0)
 The search graph get “diversified” by removing potentially unnecessary edges. This controls the volume of edges removed. A value of 0.0 ensures that no edges get removed, and larger values result in significantly more aggressive edge removal. A value of 1.0 will prune all edges that it can.
 n_search_trees: int (optional, default=1)
The number of random projection trees to use in initializing searching or querying.
Deprecated since version 0.5.5.
 tree_init: bool (optional, default=True)
 Whether to use random projection trees for initialization.
 init_graph: np.ndarray (optional, default=None)
 2D array of indices of candidate neighbours of the shape (data.shape[0], n_neighbours). If the jth neighbour of the ith instances is unknown, use init_graph[i, j] = 1
 init_dist: np.ndarray (optional, default=None)
 2D array with the same shape as init_graph, such that metric(data[i], data[init_graph[i, j]]) equals init_dist[i, j]
 random_state: int, RandomState instance or None, optional (default: None)
 If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random.
 algorithm: string (optional, default=’standard’)
 This implementation provides an alternative algorithm for
construction of the kneighbors graph used as a search index. The
alternative algorithm can be fast for large
n_neighbors
values. The``’alternative’`` algorithm has been deprecated and is no longer available.  low_memory: boolean (optional, default=True)
 Whether to use a lower memory, but more computationally expensive approach to index construction.
 max_candidates: int (optional, default=None)
 Internally each “selfjoin” keeps a maximum number of candidates ( nearest neighbors and reverse nearest neighbors) to be considered. This value controls this aspect of the algorithm. Larger values will provide more accurate search results later, but potentially at nonnegligible computation cost in building the index. Don’t tweak this value unless you know what you’re doing.
 n_iters: int (optional, default=None)
 The maximum number of NNdescent iterations to perform. The NNdescent algorithm can abort early if limited progress is being made, so this only controls the worst case. Don’t tweak this value unless you know what you’re doing. The default of None means a value will be chosen based on the size of the graph_data.
 delta: float (optional, default=0.001)
 Controls the early abort due to limited progress. Larger values will result in earlier aborts, providing less accurate indexes, and less accurate searching. Don’t tweak this value unless you know what you’re doing.
 n_jobs: int or None, optional (default=None)
 The number of parallel jobs to run for neighbors index construction.
None
means 1 unless in ajoblib.parallel_backend
context.1
means using all processors.  compressed: bool (optional, default=False)
 Whether to prune out data not needed for searching the index. This will result in a significantly smaller index, particularly useful for saving, but will remove information that might otherwise be useful.
 parallel_batch_queries: bool (optional, default=False)
 Whether to use parallelism of batched queries. This can be useful for large batches of queries on multicore machines, but results in performance degradation for single queries, so is poor for streaming use.
 verbose: bool (optional, default=False)
 Whether to print status graph_data during the computation.

query
(query_data, k=10, epsilon=0.1)¶ Query the training graph_data for the k nearest neighbors
 query_data: arraylike, last dimension self.dim
 An array of points to query
 k: integer (default = 10)
 The number of nearest neighbors to return
 epsilon: float (optional, default=0.1)
 When searching for nearest neighbors of a query point this values controls the tradeoff between accuracy and search cost. Larger values produce more accurate nearest neighbor results at larger computational cost for the search. Values should be in the range 0.0 to 0.5, but should probably not exceed 0.3 without good reason.
 indices, distances: array (n_query_points, k), array (n_query_points, k)
The first array,
indices
, provides the indices of the graph_data points in the training set that are the nearest neighbors of each query point. Thusindices[i, j]
is the index into the training graph_data of the jth nearest neighbor of the ith query points.Similarly
distances
provides the distances to the neighbors of the query points such thatdistances[i, j]
is the distance from the ith query point to its jth nearest neighbor in the training graph_data.

update
(xs_fresh=None, xs_updated=None, updated_indices=None)¶ Updates the index with a) fresh data (that is appended to the existing data), and b) data that was only updated (but should not be appended to the existing data).
Not applicable to sparse data yet.
 xs_fresh: np.ndarray (optional, default=None)
 2D array of the shape (n_fresh, dim) where dim is the dimension of the data from which we built self.
 xs_updated: np.ndarray (optional, default=None)
 2D array of the shape (n_updates, dim) where dim is the dimension of the data from which we built self.
 updated_indices: arraylike of size n_updates (optional, default=None)
 Something that is convertable to list of ints. If self is currently built from xs, then xs[update_indices[i]] will be replaced by xs_updated[i].
None

class
pynndescent.pynndescent_.
PyNNDescentTransformer
(n_neighbors=30, metric='euclidean', metric_kwds=None, n_trees=None, leaf_size=None, search_epsilon=0.1, pruning_degree_multiplier=1.5, diversify_prob=1.0, n_search_trees=1, tree_init=True, random_state=None, n_jobs=None, low_memory=True, max_candidates=None, n_iters=None, early_termination_value=0.001, parallel_batch_queries=False, verbose=False)¶ PyNNDescentTransformer for fast approximate nearest neighbor transformer. It uses the NNDescent algorithm, and is thus very flexible and supports a wide variety of distances, including nonmetric distances. NNDescent also scales well against high dimensional graph_data in many cases.
Transform X into a (weighted) graph of k nearest neighbors
The transformed graph_data is a sparse graph as returned by kneighbors_graph.
 n_neighbors: int (optional, default=5)
 The number of neighbors to use in kneighbor graph graph_data structure used for fast approximate nearest neighbor search. Larger values will result in more accurate search results at the cost of computation time.
 metric: string or callable (optional, default=’euclidean’)
The metric to use for computing nearest neighbors. If a callable is used it must be a numba njit compiled function. Supported metrics include:
 euclidean
 manhattan
 chebyshev
 minkowski
 canberra
 braycurtis
 mahalanobis
 wminkowski
 seuclidean
 cosine
 correlation
 haversine
 hamming
 jaccard
 dice
 russelrao
 kulsinski
 rogerstanimoto
 sokalmichener
 sokalsneath
 yule
 hellinger
 wasserstein1d
Metrics that take arguments (such as minkowski, mahalanobis etc.) can have arguments passed via the metric_kwds dictionary. At this time care must be taken and dictionary elements must be ordered appropriately; this will hopefully be fixed in the future.
 metric_kwds: dict (optional, default {})
 Arguments to pass on to the metric, such as the
p
value for Minkowski distance.  n_trees: int (optional, default=None)
 This implementation uses random projection forests for initialization of searches. This parameter controls the number of trees in that forest. A larger number will result in more accurate neighbor computation at the cost of performance. The default of None means a value will be chosen based on the size of the graph_data.
 leaf_size: int (optional, default=None)
 The maximum number of points in a leaf for the random projection trees. The default of None means a value will be chosen based on n_neighbors.
 pruning_degree_multiplier: float (optional, default=1.5)
 How aggressively to prune the graph. Since the search graph is undirected
(and thus includes nearest neighbors and reverse nearest neighbors) vertices
can have very high degree – the graph will be pruned such that no
vertex has degree greater than
pruning_degree_multiplier * n_neighbors
.  diversify_prob: float (optional, default=1.0)
 The search graph get “diversified” by removing potentially unnecessary edges. This controls the volume of edges removed. A value of 0.0 ensures that no edges get removed, and larger values result in significantly more aggressive edge removal. A value of 1.0 will prune all edges that it can.
 n_search_trees: int (optional, default=1)
The number of random projection trees to use in initializing searching or querying.
Deprecated since version 0.5.5.
 search_epsilon: float (optional, default=0.1)
 When searching for nearest neighbors of a query point this values controls the tradeoff between accuracy and search cost. Larger values produce more accurate nearest neighbor results at larger computational cost for the search. Values should be in the range 0.0 to 0.5, but should probably not exceed 0.3 without good reason.
 tree_init: bool (optional, default=True)
 Whether to use random projection trees for initialization.
 random_state: int, RandomState instance or None, optional (default: None)
 If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random.
 n_jobs: int or None (optional, default=None)
 The maximum number of parallel threads to be run at a time. If none this will default to using all the cores available. Note that there is not perfect parallelism, so at several pints the algorithm will be single threaded.
 low_memory: boolean (optional, default=False)
 Whether to use a lower memory, but more computationally expensive approach to index construction. This defaults to false as for most cases it speeds index construction, but if you are having issues with excessive memory use for your dataset consider setting this to True.
 max_candidates: int (optional, default=20)
 Internally each “selfjoin” keeps a maximum number of candidates ( nearest neighbors and reverse nearest neighbors) to be considered. This value controls this aspect of the algorithm. Larger values will provide more accurate search results later, but potentially at nonnegligible computation cost in building the index. Don’t tweak this value unless you know what you’re doing.
 n_iters: int (optional, default=None)
 The maximum number of NNdescent iterations to perform. The NNdescent algorithm can abort early if limited progress is being made, so this only controls the worst case. Don’t tweak this value unless you know what you’re doing. The default of None means a value will be chosen based on the size of the graph_data.
 early_termination_value: float (optional, default=0.001)
 Controls the early abort due to limited progress. Larger values will result in earlier aborts, providing less accurate indexes, and less accurate searching. Don’t tweak this value unless you know what you’re doing.
 parallel_batch_queries: bool (optional, default=False)
 Whether to use parallelism of batched queries. This can be useful for large batches of queries on multicore machines, but results in performance degradation for single queries, so is poor for streaming use.
 verbose: bool (optional, default=False)
 Whether to print status graph_data during the computation.
>>> from sklearn.manifold import Isomap >>> from pynndescent import PyNNDescentTransformer >>> from sklearn.pipeline import make_pipeline >>> estimator = make_pipeline( ... PyNNDescentTransformer(n_neighbors=5), ... Isomap(neighbors_algorithm='precomputed'))

fit
(X, compress_index=True)¶ Fit the PyNNDescent transformer to build KNN graphs with neighbors given by the dataset X.
 X : arraylike, shape (n_samples, n_features)
 Sample graph_data
 transformer : PyNNDescentTransformer
 The trained transformer

fit_transform
(X, y=None, **fit_params)¶ Fit to graph_data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
 X : numpy array of shape (n_samples, n_features)
 Training set.
y : ignored
 Xt : CSR sparse matrix, shape (n_samples, n_samples)
 Xt[i, j] is assigned the weight of edge that connects i to j. Only the neighbors have an explicit value. The diagonal is always explicit.

transform
(X, y=None)¶ Computes the (weighted) graph of Neighbors for points in X
 X : arraylike, shape (n_samples_transform, n_features)
 Sample graph_data
 Xt : CSR sparse matrix, shape (n_samples_transform, n_samples_fit)
 Xt[i, j] is assigned the weight of edge that connects i to j. Only the neighbors have an explicit value.
A number of internal functions can also be accessed separately for more fine tuned work.
Distance Functions¶

pynndescent.distances.
chebyshev
¶ Chebyshev or linfinity distance.
\[D(x, y) = \max_i x_i  y_i\]

pynndescent.distances.
euclidean
¶ Standard euclidean distance.
\[\begin{split}D(x, y) = \\sqrt{\sum_i (x_i  y_i)^2}\end{split}\]

pynndescent.distances.
manhattan
¶ Manhattan, taxicab, or l1 distance.
\[D(x, y) = \sum_i x_i  y_i\]

pynndescent.distances.
minkowski
¶ Minkowski distance.
\[D(x, y) = \left(\sum_i x_i  y_i^p\right)^{\frac{1}{p}}\]This is a general distance. For p=1 it is equivalent to manhattan distance, for p=2 it is Euclidean distance, and for p=infinity it is Chebyshev distance. In general it is better to use the more specialised functions for those distances.

pynndescent.distances.
squared_euclidean
¶ Squared euclidean distance.
\[D(x, y) = \sum_i (x_i  y_i)^2\]

pynndescent.distances.
standardised_euclidean
¶ Euclidean distance standardised against a vector of standard deviations per coordinate.
\[D(x, y) = \sqrt{\sum_i \frac{(x_i  y_i)**2}{v_i}}\]

pynndescent.distances.
weighted_minkowski
¶ A weighted version of Minkowski distance.
\[D(x, y) = \left(\sum_i w_i x_i  y_i^p\right)^{\frac{1}{p}}\]If weights w_i are inverse standard deviations of graph_data in each dimension then this represented a standardised Minkowski distance (and is equivalent to standardised Euclidean distance for p=1).