Working with sparse data

Not all data conveniently fits in numpy arrays; sometimes a lot of the data entries are zero and we want to use a sparse data storage format. This is especially common for extremely high dimensional data (data with thousands, or even hundreds of thousands of dimensions). Such data is a lot harder to work with for many tasks, including nearest neighbor search. Let’s see how we can work with sparse data like this in PyNNDescent.

First we’ll need some data. For that let’s use a standard NLP dataset that we can pull together with scikit-learn.

[9]:
import pynndescent
import sklearn.datasets
import sys

We need to fetch the train and test sets separately, but conveniently the data has already been converted from the text of newsgroup messages into TF-IDF matrices. This means that we have a feature column for each word in the vocabulary – though often the vocabulary is pruned a little.

[2]:
news_train = sklearn.datasets.fetch_20newsgroups_vectorized(subset='train')
news_test = sklearn.datasets.fetch_20newsgroups_vectorized(subset='test')

Now that we have the data let’s see what it looks like:

[3]:
news_train.data
[3]:
<11314x130107 sparse matrix of type '<class 'numpy.float64'>'
        with 1787565 stored elements in Compressed Sparse Row format>

Not a numpy array! Instead it is a SciPy sparse matrix. It has 11314 rows (so not many data samples), but 130107 columns (a lot of features – as noted, one for each word in the vocabulary). Despite that large size there are only 1787565 non-zero entries. The trick with sparse matrices is that they only store information about the entries that aren't zero – they need to keep track of where those entries are and what their values are, but they can ignore all the zero entries. If this were a raw numpy array with all those zeros in place it would have …

[7]:
11314 * 130107
[7]:
1472030598

… a lot of entries. To store all of that in memory would require (at 4 bytes per entry, even assuming we downcast the float64 values to 32-bit floats) …

[8]:
(11314 * 130107 * 4) / 1024**3
[8]:
5.48374130576849

… almost 5.5 GB! That is possible, but likely impractical on a laptop. And this is for a case with a small number of data samples. With more samples the size would grow enormous very quickly indeed. Instead we have an object that uses …

[13]:
(
    news_train.data.data.nbytes
    + news_train.data.indices.nbytes
    + news_train.data.indptr.nbytes
) / 1024**2
[13]:
20.500221252441406

… only around 20 MB. You will also note that extracting that information required poking at some of the internal attributes of the sparse matrix (data, indices, and indptr). This is the downside of sparse formats – they are more complicated to work with. Many tools are simply unable to deal with these sparse structures at all, in which case the data would need to be cast to a numpy array and take up that full amount of memory.
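To make those internal attributes concrete, here is a small illustrative sketch (not one of the notebook cells above) showing how a tiny dense array is laid out in Compressed Sparse Row form; the expected values are noted in the comments:

import numpy as np
import scipy.sparse

# A tiny dense array in which most entries are zero.
dense = np.array([
    [0.0, 2.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 3.0],
    [0.0, 0.0, 0.0, 0.0],
])
csr = scipy.sparse.csr_matrix(dense)

print(csr.data)     # [2. 1. 3.]  -- the non-zero values
print(csr.indices)  # [1 0 3]     -- the column of each non-zero value
print(csr.indptr)   # [0 1 3 3]   -- where each row begins in data/indices

Only the three non-zero values are stored, plus the small bookkeeping arrays, rather than all twelve entries of the dense array.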

Fortunately PyNNDescent is built to work with sparse matrix data. To see that in practice let’s hand the sparse matrix directly to NNDescent and watch it work.

[4]:
%%time
index = pynndescent.NNDescent(news_train.data, metric="cosine")
CPU times: user 7min 3s, sys: 1.77 s, total: 7min 5s
Wall time: 2min 24s

You will note that this is a much longer index construction time than we would normally expect for only around eleven thousand data points – there is overhead in working with sparse matrices that makes it slower. That, combined with the fact that the data has over a hundred and thirty thousand dimensions, means this is a computationally intensive task. Still, it is likely better than working with the full 5.5 GB numpy array, and certainly better when dealing with larger sparse matrices where there is simply no way to instantiate a numpy array large enough to hold the data.

We can query the index – but we have to use the same sparse matrix structure (we can't query with numpy arrays for an index built with sparse data). Fortunately the test data is already in that format, so we can simply perform the query:

[5]:
%%time
neighbors = index.query(news_test.data)
CPU times: user 1min 25s, sys: 1.5 s, total: 1min 26s
Wall time: 1min 26s

And that’s it! Everything works essentially transparently with sparse data – it is just slower. Still, slower is a lot better than not working at all.

[6]:
neighbors
[6]:
(array([[ 1635,  4487,  9220, ..., 10071,  9572,  1793],
        [ 8648,   567,  2123, ...,   783,  6031,  9275],
        [ 7345,  4852,  4674, ...,  1679,  7518,  4228],
        ...,
        [ 6137, 10518,  6469, ...,  6937, 11083,  6164],
        [ 2926,  2011,  1679, ...,  7229,  1635, 11270],
        [ 3215,  3665,  4899, ..., 10810,  9907,  9311]], dtype=int32),
 array([[0.34316891, 0.34799916, 0.35074973, ..., 0.35952544, 0.36456954,
         0.36472946],
        [0.25881797, 0.26937056, 0.28056568, ..., 0.29210907, 0.29795945,
         0.29900843],
        [0.41124642, 0.42213005, 0.4426493 , ..., 0.46922857, 0.4724912 ,
         0.47429329],
        ...,
        [0.21533132, 0.22482073, 0.24046886, ..., 0.26193857, 0.26805884,
         0.26866162],
        [0.19485909, 0.19515198, 0.19891578, ..., 0.20851403, 0.21159202,
         0.21265447],
        [0.43528455, 0.43638128, 0.4378109 , ..., 0.45176154, 0.452402  ,
         0.45243692]]))
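If new query points arrive as a dense numpy array you will need to put them into the same sparse format before querying. A minimal sketch of that conversion (not part of the original notebook; a few test rows are densified here purely for illustration, and k=10 is an arbitrary choice) might look like:

import scipy.sparse

# Pretend we were handed plain numpy query vectors with the same 130107
# feature columns as the training data.
dense_queries = news_test.data[:5].toarray()

# Wrap them back up as a CSR matrix so they match the sparse format
# the index was built on, then query as before.
sparse_queries = scipy.sparse.csr_matrix(dense_queries)
neighbor_indices, neighbor_distances = index.query(sparse_queries, k=10)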

One final caveat is that custom distance metrics for sparse data need to operate directly on the sparse data representation, and thus have a different function signature from metrics on dense arrays. In practice this is really something you only want to attempt if you are familiar with working with sparse data structures. If that's the case then you can look through pynndescent/sparse.py for examples of many common distance functions, and it will quickly become clear what is required.
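To give a flavour of what that looks like, here is a rough sketch of a Manhattan-style distance on sparse vectors. It assumes the convention used by the functions in pynndescent/sparse.py, where each vector arrives as a sorted array of its non-zero column indices plus a matching array of the corresponding values; the function name and the use of numba here are just for the example.

import numba

@numba.njit()
def manhattan_on_sparse(ind1, data1, ind2, data2):
    # ind1/ind2: sorted non-zero column indices of each vector.
    # data1/data2: the corresponding non-zero values.
    result = 0.0
    i1 = 0
    i2 = 0
    # Walk both index arrays in tandem, pairing up shared columns.
    while i1 < ind1.shape[0] and i2 < ind2.shape[0]:
        if ind1[i1] == ind2[i2]:
            result += abs(data1[i1] - data2[i2])
            i1 += 1
            i2 += 1
        elif ind1[i1] < ind2[i2]:
            result += abs(data1[i1])
            i1 += 1
        else:
            result += abs(data2[i2])
            i2 += 1
    # Account for any trailing non-zero entries left in either vector.
    while i1 < ind1.shape[0]:
        result += abs(data1[i1])
        i1 += 1
    while i2 < ind2.shape[0]:
        result += abs(data2[i2])
        i2 += 1
    return result

In principle a function like this could then be passed via metric= when building an index on sparse data, but it is worth checking the actual implementations in pynndescent/sparse.py for your installed version, since the exact calling convention is determined there.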