PyNNDescent API Guide¶

PyNNDescent has only two classes NNDescent and PyNNDescentTransformer.

PyNNDescent¶

class pynndescent.pynndescent_.NNDescent(data, metric='euclidean', metric_kwds=None, n_neighbors=30, n_trees=None, leaf_size=None, pruning_degree_multiplier=1.5, diversify_prob=1.0, n_search_trees=1, tree_init=True, init_graph=None, init_dist=None, random_state=None, low_memory=True, max_candidates=None, n_iters=None, delta=0.001, n_jobs=None, compressed=False, parallel_batch_queries=False, verbose=False)¶

NNDescent for fast approximate nearest neighbor queries. NNDescent is very flexible and supports a wide variety of distances, including non-metric distances. NNDescent also scales well against high dimensional graph_data in many cases. This implementation provides a straightfoward interface, with access to some tuning parameters.

data: array of shape (n_samples, n_features)

The training graph_data set to find nearest neighbors in.

metric: string or callable (optional, default=’euclidean’)

The metric to use for computing nearest neighbors. If a callable is used it must be a numba njit compiled function. Supported metrics include:

euclidean

manhattan

chebyshev

minkowski

canberra

braycurtis

mahalanobis

wminkowski

seuclidean

cosine

correlation

haversine

hamming

jaccard

dice

russelrao

kulsinski

rogerstanimoto

sokalmichener

sokalsneath

yule

hellinger

wasserstein-1d

Metrics that take arguments (such as minkowski, mahalanobis etc.) can have arguments passed via the metric_kwds dictionary. At this time care must be taken and dictionary elements must be ordered appropriately; this will hopefully be fixed in the future.

metric_kwds: dict (optional, default {})

Arguments to pass on to the metric, such as the p value for Minkowski distance.

n_neighbors: int (optional, default=30)

The number of neighbors to use in k-neighbor graph graph_data structure used for fast approximate nearest neighbor search. Larger values will result in more accurate search results at the cost of computation time.

n_trees: int (optional, default=None)

This implementation uses random projection forests for initializing the index build process. This parameter controls the number of trees in that forest. A larger number will result in more accurate neighbor computation at the cost of performance. The default of None means a value will be chosen based on the size of the graph_data.

leaf_size: int (optional, default=None)

The maximum number of points in a leaf for the random projection trees. The default of None means a value will be chosen based on n_neighbors.

pruning_degree_multiplier: float (optional, default=1.5)

How aggressively to prune the graph. Since the search graph is undirected (and thus includes nearest neighbors and reverse nearest neighbors) vertices can have very high degree – the graph will be pruned such that no vertex has degree greater than pruning_degree_multiplier * n_neighbors.

diversify_prob: float (optional, default=1.0)

The search graph get “diversified” by removing potentially unnecessary edges. This controls the volume of edges removed. A value of 0.0 ensures that no edges get removed, and larger values result in significantly more aggressive edge removal. A value of 1.0 will prune all edges that it can.

n_search_trees: int (optional, default=1)

The number of random projection trees to use in initializing searching or querying.

Deprecated since version 0.5.5.

tree_init: bool (optional, default=True)

Whether to use random projection trees for initialization.

init_graph: np.ndarray (optional, default=None)

2D array of indices of candidate neighbours of the shape (data.shape[0], n_neighbours). If the j-th neighbour of the i-th instances is unknown, use init_graph[i, j] = -1

init_dist: np.ndarray (optional, default=None)

2D array with the same shape as init_graph, such that metric(data[i], data[init_graph[i, j]]) equals init_dist[i, j]

random_state: int, RandomState instance or None, optional (default: None)

If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random.

algorithm: string (optional, default=’standard’)

This implementation provides an alternative algorithm for construction of the k-neighbors graph used as a search index. The alternative algorithm can be fast for large n_neighbors values. The``’alternative’`` algorithm has been deprecated and is no longer available.

low_memory: boolean (optional, default=True)

Whether to use a lower memory, but more computationally expensive approach to index construction.

max_candidates: int (optional, default=None)

Internally each “self-join” keeps a maximum number of candidates ( nearest neighbors and reverse nearest neighbors) to be considered. This value controls this aspect of the algorithm. Larger values will provide more accurate search results later, but potentially at non-negligible computation cost in building the index. Don’t tweak this value unless you know what you’re doing.

n_iters: int (optional, default=None)

The maximum number of NN-descent iterations to perform. The NN-descent algorithm can abort early if limited progress is being made, so this only controls the worst case. Don’t tweak this value unless you know what you’re doing. The default of None means a value will be chosen based on the size of the graph_data.

delta: float (optional, default=0.001)

Controls the early abort due to limited progress. Larger values will result in earlier aborts, providing less accurate indexes, and less accurate searching. Don’t tweak this value unless you know what you’re doing.

n_jobs: int or None, optional (default=None)

The number of parallel jobs to run for neighbors index construction. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors.

compressed: bool (optional, default=False)

Whether to prune out data not needed for searching the index. This will result in a significantly smaller index, particularly useful for saving, but will remove information that might otherwise be useful.

parallel_batch_queries: bool (optional, default=False)

Whether to use parallelism of batched queries. This can be useful for large batches of queries on multicore machines, but results in performance degradation for single queries, so is poor for streaming use.

verbose: bool (optional, default=False)

Whether to print status graph_data during the computation.

query(query_data, k=10, epsilon=0.1)¶

Query the training graph_data for the k nearest neighbors

query_data: array-like, last dimension self.dim: An array of points to query
k: integer (default = 10): The number of nearest neighbors to return
epsilon: float (optional, default=0.1): When searching for nearest neighbors of a query point this values controls the trade-off between accuracy and search cost. Larger values produce more accurate nearest neighbor results at larger computational cost for the search. Values should be in the range 0.0 to 0.5, but should probably not exceed 0.3 without good reason.

indices, distances: array (n_query_points, k), array (n_query_points, k)

The first array, indices, provides the indices of the graph_data points in the training set that are the nearest neighbors of each query point. Thus indices[i, j] is the index into the training graph_data of the jth nearest neighbor of the ith query points.

Similarly distances provides the distances to the neighbors of the query points such that distances[i, j] is the distance from the ith query point to its jth nearest neighbor in the training graph_data.

update(xs_fresh=None, xs_updated=None, updated_indices=None)¶

Updates the index with a) fresh data (that is appended to the existing data), and b) data that was only updated (but should not be appended to the existing data).

Not applicable to sparse data yet.

xs_fresh: np.ndarray (optional, default=None): 2D array of the shape (n_fresh, dim) where dim is the dimension of the data from which we built self.
xs_updated: np.ndarray (optional, default=None): 2D array of the shape (n_updates, dim) where dim is the dimension of the data from which we built self.
updated_indices: array-like of size n_updates (optional, default=None): Something that is convertable to list of ints. If self is currently built from xs, then xs[update_indices[i]] will be replaced by xs_updated[i].

None

class pynndescent.pynndescent_.PyNNDescentTransformer(n_neighbors=30, metric='euclidean', metric_kwds=None, n_trees=None, leaf_size=None, search_epsilon=0.1, pruning_degree_multiplier=1.5, diversify_prob=1.0, n_search_trees=1, tree_init=True, random_state=None, n_jobs=None, low_memory=True, max_candidates=None, n_iters=None, early_termination_value=0.001, parallel_batch_queries=False, verbose=False)¶

PyNNDescentTransformer for fast approximate nearest neighbor transformer. It uses the NNDescent algorithm, and is thus very flexible and supports a wide variety of distances, including non-metric distances. NNDescent also scales well against high dimensional graph_data in many cases.

Transform X into a (weighted) graph of k nearest neighbors

The transformed graph_data is a sparse graph as returned by kneighbors_graph.

n_neighbors: int (optional, default=5)

The number of neighbors to use in k-neighbor graph graph_data structure used for fast approximate nearest neighbor search. Larger values will result in more accurate search results at the cost of computation time.

metric: string or callable (optional, default=’euclidean’)

The metric to use for computing nearest neighbors. If a callable is used it must be a numba njit compiled function. Supported metrics include:

euclidean

manhattan

chebyshev

minkowski

canberra

braycurtis

mahalanobis

wminkowski

seuclidean

cosine

correlation

haversine

hamming

jaccard

dice

russelrao

kulsinski

rogerstanimoto

sokalmichener

sokalsneath

yule

hellinger

wasserstein-1d

Metrics that take arguments (such as minkowski, mahalanobis etc.) can have arguments passed via the metric_kwds dictionary. At this time care must be taken and dictionary elements must be ordered appropriately; this will hopefully be fixed in the future.

metric_kwds: dict (optional, default {})

Arguments to pass on to the metric, such as the p value for Minkowski distance.

n_trees: int (optional, default=None)

This implementation uses random projection forests for initialization of searches. This parameter controls the number of trees in that forest. A larger number will result in more accurate neighbor computation at the cost of performance. The default of None means a value will be chosen based on the size of the graph_data.

leaf_size: int (optional, default=None)

The maximum number of points in a leaf for the random projection trees. The default of None means a value will be chosen based on n_neighbors.

pruning_degree_multiplier: float (optional, default=1.5)

How aggressively to prune the graph. Since the search graph is undirected (and thus includes nearest neighbors and reverse nearest neighbors) vertices can have very high degree – the graph will be pruned such that no vertex has degree greater than pruning_degree_multiplier * n_neighbors.

diversify_prob: float (optional, default=1.0)

The search graph get “diversified” by removing potentially unnecessary edges. This controls the volume of edges removed. A value of 0.0 ensures that no edges get removed, and larger values result in significantly more aggressive edge removal. A value of 1.0 will prune all edges that it can.

n_search_trees: int (optional, default=1)

The number of random projection trees to use in initializing searching or querying.

Deprecated since version 0.5.5.

search_epsilon: float (optional, default=0.1)

When searching for nearest neighbors of a query point this values controls the trade-off between accuracy and search cost. Larger values produce more accurate nearest neighbor results at larger computational cost for the search. Values should be in the range 0.0 to 0.5, but should probably not exceed 0.3 without good reason.

tree_init: bool (optional, default=True)

Whether to use random projection trees for initialization.

random_state: int, RandomState instance or None, optional (default: None)

If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random.

n_jobs: int or None (optional, default=None)

The maximum number of parallel threads to be run at a time. If none this will default to using all the cores available. Note that there is not perfect parallelism, so at several pints the algorithm will be single threaded.

low_memory: boolean (optional, default=False)

Whether to use a lower memory, but more computationally expensive approach to index construction. This defaults to false as for most cases it speeds index construction, but if you are having issues with excessive memory use for your dataset consider setting this to True.

max_candidates: int (optional, default=20)

Internally each “self-join” keeps a maximum number of candidates ( nearest neighbors and reverse nearest neighbors) to be considered. This value controls this aspect of the algorithm. Larger values will provide more accurate search results later, but potentially at non-negligible computation cost in building the index. Don’t tweak this value unless you know what you’re doing.

n_iters: int (optional, default=None)

The maximum number of NN-descent iterations to perform. The NN-descent algorithm can abort early if limited progress is being made, so this only controls the worst case. Don’t tweak this value unless you know what you’re doing. The default of None means a value will be chosen based on the size of the graph_data.

early_termination_value: float (optional, default=0.001)

Controls the early abort due to limited progress. Larger values will result in earlier aborts, providing less accurate indexes, and less accurate searching. Don’t tweak this value unless you know what you’re doing.

parallel_batch_queries: bool (optional, default=False)

Whether to use parallelism of batched queries. This can be useful for large batches of queries on multicore machines, but results in performance degradation for single queries, so is poor for streaming use.

verbose: bool (optional, default=False)

Whether to print status graph_data during the computation.

>>> from sklearn.manifold import Isomap
>>> from pynndescent import PyNNDescentTransformer
>>> from sklearn.pipeline import make_pipeline
>>> estimator = make_pipeline(
...     PyNNDescentTransformer(n_neighbors=5),
...     Isomap(neighbors_algorithm='precomputed'))

fit(X, compress_index=True)¶

Fit the PyNNDescent transformer to build KNN graphs with neighbors given by the dataset X.

X : array-like, shape (n_samples, n_features): Sample graph_data

transformer : PyNNDescentTransformer: The trained transformer

fit_transform(X, y=None, **fit_params)¶

Fit to graph_data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

X : numpy array of shape (n_samples, n_features): Training set.

y : ignored

Xt : CSR sparse matrix, shape (n_samples, n_samples): Xt[i, j] is assigned the weight of edge that connects i to j. Only the neighbors have an explicit value. The diagonal is always explicit.

transform(X, y=None)¶

Computes the (weighted) graph of Neighbors for points in X

X : array-like, shape (n_samples_transform, n_features): Sample graph_data

Xt : CSR sparse matrix, shape (n_samples_transform, n_samples_fit): Xt[i, j] is assigned the weight of edge that connects i to j. Only the neighbors have an explicit value.

A number of internal functions can also be accessed separately for more fine tuned work.

Distance Functions¶

pynndescent.distances.chebyshev¶: Chebyshev or l-infinity distance.

\[D(x, y) = \max_i |x_i - y_i|\]

pynndescent.distances.euclidean¶: Standard euclidean distance.

\[\begin{split}D(x, y) = \\sqrt{\sum_i (x_i - y_i)^2}\end{split}\]

pynndescent.distances.manhattan¶: Manhattan, taxicab, or l1 distance.

\[D(x, y) = \sum_i |x_i - y_i|\]

pynndescent.distances.minkowski¶

Minkowski distance.

\[D(x, y) = \left(\sum_i |x_i - y_i|^p\right)^{\frac{1}{p}}\]

This is a general distance. For p=1 it is equivalent to manhattan distance, for p=2 it is Euclidean distance, and for p=infinity it is Chebyshev distance. In general it is better to use the more specialised functions for those distances.

pynndescent.distances.squared_euclidean¶: Squared euclidean distance.

\[D(x, y) = \sum_i (x_i - y_i)^2\]

pynndescent.distances.standardised_euclidean¶: Euclidean distance standardised against a vector of standard deviations per coordinate.

\[D(x, y) = \sqrt{\sum_i \frac{(x_i - y_i)**2}{v_i}}\]

pynndescent.distances.weighted_minkowski¶

A weighted version of Minkowski distance.

\[D(x, y) = \left(\sum_i w_i |x_i - y_i|^p\right)^{\frac{1}{p}}\]

If weights w_i are inverse standard deviations of graph_data in each dimension then this represented a standardised Minkowski distance (and is equivalent to standardised Euclidean distance for p=1).