Note: vaex.ml is under heavy development; consider this document a sneak preview.

Vaex-ml - Machine Learning

The vaex.ml package brings some machine learning algorithms to vaex. Install it by running pip install vaex-ml.

Vaex.ml stays close to the API of the de-facto standard ML package: scikit-learn. We first show two examples, KMeans and PCA, to see how the two libraries compare and differ, and what the gain in performance is.

[4]:
import vaex.ml.cluster
import vaex.ml.datasets
import numpy as np
%matplotlib inline

We use the well-known iris flower dataset, a classic in machine learning.

[6]:
df = vaex.ml.datasets.load_iris()
df.scatter(df.petal_width, df.petal_length, c_expr=df.class_)
[6]:
<matplotlib.collections.PathCollection at 0x11fecbdd8>
_images/ml_5_1.png
[8]:
df
[8]:
#    sepal_width  petal_length  sepal_length  petal_width  class_  random_index
0    3.0          4.2           5.9           1.5          1       114
1    3.0          4.6           6.1           1.4          1        74
2    2.9          4.6           6.6           1.3          1        37
3    3.3          5.7           6.7           2.1          2       116
4    4.2          1.4           5.5           0.2          0        61
...  ...          ...           ...           ...          ...     ...
145  3.4          1.4           5.2           0.2          0       119
146  3.8          1.6           5.1           0.2          0        15
147  2.6          4.0           5.8           1.2          1        22
148  3.8          1.7           5.7           0.3          0       144
149  2.9          4.3           6.2           1.3          1       102

KMeans

We use two features for the KMeans, and roughly put them on the same scale by a simple division. We then construct a KMeans object, quite similar to what you would do in sklearn, and fit it.

[10]:
features = ['petal_width/2', 'petal_length/5']
init = [[0, 1/5], [1.2/2, 4/5], [2.5/2, 6/5]]  # initial cluster centres, in the scaled (divided) feature space
kmeans = vaex.ml.cluster.KMeans(features=features, init=init, verbose=True)
kmeans.fit(df)
Iteration    0, inertia  6.2609999999999975
Iteration    1, inertia  2.5062184444444435
Iteration    2, inertia  2.443455900151798
Iteration    3, inertia  2.418136327962199
Iteration    4, inertia  2.4161501474358995
Iteration    5, inertia  2.4161501474358995

We now transform the original DataFrame, similar to what you would do in sklearn. However, we end up with a new DataFrame that contains an extra column (prediction_kmeans).

[11]:
df_predict = kmeans.transform(df)
df_predict
[11]:
#    sepal_width  petal_length  sepal_length  petal_width  class_  random_index  prediction_kmeans
0    3.0          4.2           5.9           1.5          1       114           1
1    3.0          4.6           6.1           1.4          1        74           1
2    2.9          4.6           6.6           1.3          1        37           1
3    3.3          5.7           6.7           2.1          2       116           2
4    4.2          1.4           5.5           0.2          0        61           0
...  ...          ...           ...           ...          ...     ...           ...
145  3.4          1.4           5.2           0.2          0       119           0
146  3.8          1.6           5.1           0.2          0        15           0
147  2.6          4.0           5.8           1.2          1        22           1
148  3.8          1.7           5.7           0.3          0       144           0
149  2.9          4.3           6.2           1.3          1       102           1

Although this column looks like any other, it is actually a virtual column: it does not use up any memory and is computed on the fly when needed, saving us precious RAM. Note that the other columns reference the original data as well, so this new DataFrame (df_predict) takes up almost no memory at all, which is ideal for very large datasets and quite different from what sklearn does.

[12]:
df_predict.virtual_columns['prediction_kmeans']
[12]:
'kmean_predict_function(petal_width/2, petal_length/5)'
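
To see how virtual columns work in general, here is a minimal sketch that adds an expression-backed column by hand (using add_virtual_column; the column name petal_ratio is made up for illustration and is not part of the KMeans example). Only the expression string is stored; values are computed in a single pass when an aggregation asks for them.

# A hand-made virtual column: only the expression is stored, no new array is allocated.
df.add_virtual_column('petal_ratio', 'petal_width / petal_length')
print(df.virtual_columns['petal_ratio'])  # -> 'petal_width / petal_length'
print(df.mean('petal_ratio'))             # computed on the fly, in one pass over the data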

By making a simple scatter plot we can see that KMeans does a pretty good job.

[16]:
import matplotlib.pylab as plt
fig, ax = plt.subplots(1, 2, figsize=(12,5))

plt.sca(ax[0])
plt.title('original classes')
df.scatter(df.petal_width, df.petal_length, c_expr=df.class_)

plt.sca(ax[1])
plt.title('predicted classes')
df_predict.scatter(df_predict.petal_width, df_predict.petal_length, c_expr=df_predict.prediction_kmeans)
[16]:
<matplotlib.collections.PathCollection at 0x12333eeb8>
_images/ml_14_1.png

KMeans benchmark

To demonstrate the performance and scaling of vaex, we continue with a special version of the iris dataset that has \(\sim10^7\) rows, obtained by repeating the original rows many times.

[28]:
df = vaex.ml.datasets.load_iris_1e7()

We now use random initial conditions, execute 10 runs in parallel (n_init=10) with a maximum of 5 iterations each, and benchmark it.

[29]:
features = ['petal_width/2', 'petal_length/5']
kmeans = vaex.ml.cluster.KMeans(features=features, n_clusters=3, init='random', random_state=1,
                                max_iter=5, verbose=True, n_init=10)
[30]:
%%timeit -n1 -r1 -o
kmeans.fit(df)
Iteration    0, inertia  1784973.7999986452 |  1548329.799999016 |  354711.39999875583 |  434173.39999885217 |  1005871.0000026902 |  1312114.6000003854 |  1989377.3999927905 |  577104.4999989534 |  2747388.6000027955 |  628486.7999971791
Iteration    1, inertia  481645.0225601919 |  233311.807648651 |  214794.26525253727 |  175205.9965848818 |  490218.5413715277 |  816598.0811733825 |  285786.2566865457 |  456305.0601529535 |  1205488.9851008556 |  262443.28449456714
Iteration    2, inertia  458443.873920266 |  162015.13397359708 |  173081.69460305249 |  162580.06671935317 |  488402.97447322187 |  436698.8939923954 |  162626.5498899455 |  394680.5108569788 |  850103.6561417003 |  198213.0961053151
Iteration    3, inertia  394680.5108569788 |  161882.05987810466 |  162580.0667193532 |  161882.05987810466 |  487435.98983613256 |  214098.28159484005 |  161882.05987810466 |  275282.3731570135 |  594451.8937940609 |  169525.19719336918
Iteration    4, inertia  275282.3731570135 |  161882.05987810463 |  161882.05987810463 |  161882.05987810463 |  486000.83124050766 |  169097.2713565477 |  161882.05987810463 |  201144.2611065195 |  512055.1808623869 |  162023.37977993558
3.98 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
[30]:
<TimeitResult : 3.98 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)>
[31]:
time_vaex = _

We now do the same using sklearn.

[32]:
from sklearn.cluster import KMeans
kmeans_sk = KMeans(n_clusters=3, init='random', max_iter=5, verbose=True, algorithm='full', n_jobs=-1,
                   precompute_distances=False, n_init=10)
# Doing an unfortunate memory copy
X = np.array(df[features])
[33]:
%%timeit -n1 -r1 -o
kmeans_sk.fit(X)
Initialization complete
Iteration  0, inertia 538264.600
Iteration  1, inertia 488488.457
Iteration  2, inertia 477825.973
Iteration  3, inertia 458443.874
Iteration  4, inertia 394680.511
Initialization complete
Iteration  0, inertia 1478542.600
Iteration  1, inertia 488488.457
Iteration  2, inertia 477825.973
Iteration  3, inertia 458443.874
Iteration  4, inertia 394680.511
Initialization complete
Iteration  0, inertia 422756.600
Iteration  1, inertia 182182.175
Iteration  2, inertia 164120.408
Iteration  3, inertia 162023.380
Iteration  4, inertia 161882.060
Converged at iteration 4: center shift 0.000000e+00 within tolerance 1.341649e-05
Initialization complete
Iteration  0, inertia 1873065.400
Iteration  1, inertia 260752.951
Iteration  2, inertia 161882.060
Converged at iteration 2: center shift 0.000000e+00 within tolerance 1.341649e-05
Initialization complete
Iteration  0, inertia 808489.000
Iteration  1, inertia 275282.373
Iteration  2, inertia 201144.261
Iteration  3, inertia 171177.750
Iteration  4, inertia 162580.067
Initialization complete
Iteration  0, inertia 3983719.500
Iteration  1, inertia 1112312.157
Iteration  2, inertia 550309.867
Iteration  3, inertia 261374.998
Iteration  4, inertia 178472.171
Initialization complete
Iteration  0, inertia 952003.000
Iteration  1, inertia 367032.453
Iteration  2, inertia 212341.557
Iteration  3, inertia 174578.392
Iteration  4, inertia 165938.240
Initialization complete
Iteration  0, inertia 635682.600
Iteration  1, inertia 448595.010
Iteration  2, inertia 382619.025
Iteration  3, inertia 275282.373
Iteration  4, inertia 201144.261
Initialization complete
Iteration  0, inertia 950998.000
Iteration  1, inertia 490981.793
Iteration  2, inertia 490578.465
Iteration  3, inertia 489369.579
Iteration  4, inertia 488391.841
Initialization complete
Iteration  0, inertia 1329668.600
Iteration  1, inertia 353015.441
Iteration  2, inertia 168858.217
Iteration  3, inertia 162015.134
Iteration  4, inertia 161882.060
Converged at iteration 4: center shift 0.000000e+00 within tolerance 1.341649e-05
46.8 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
[33]:
<TimeitResult : 46.8 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)>
[34]:
time_sklearn = _

We see that vaex is quite fast:

[35]:
print('vaex is approx', time_sklearn.best / time_vaex.best, 'times faster for KMeans')
vaex is approx 11.77461207454833 times faster for KMeans

But also, sklearn needs to copy the data, while vaex is careful not to make unnecessary copies and to do a minimal number of passes over the data (out-of-core). Therefore vaex happily scales to massive datasets, while with sklearn you are limited by the amount of RAM.
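
As a rough back-of-the-envelope illustration of what that copy costs (plain arithmetic, not a vaex or sklearn API):

# Materializing just the two derived features for ~10^7 rows as float64 NumPy arrays:
n_rows, n_features = 10_000_000, 2
print(n_rows * n_features * 8 / 1024**2, 'MiB')  # ~152.6 MiB for the copy alone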

PCA Benchmark

We now continue with benchmarking a PCA on 4 features:

[36]:
features = [k.expression for k in [df.col.petal_width, df.col.petal_length, df.col.sepal_width, df.col.sepal_length]]
pca = df.ml.pca(features=features)
[37]:
%%timeit -n1 -r3 -o
pca = df.ml.pca(features=features)
226 ms ± 30.6 ms per loop (mean ± std. dev. of 3 runs, 1 loop each)
[37]:
<TimeitResult : 226 ms ± 30.6 ms per loop (mean ± std. dev. of 3 runs, 1 loop each)>
[38]:
time_vaex = _

Since sklearn takes too much memory with this dataset, we use only 10% of the rows for sklearn, and correct for that factor later.

[40]:
# on my laptop this takes too much memory with sklearn, use only a subset
factor = 0.1
df.set_active_fraction(factor)
len(df)
[40]:
1005000
[41]:
from sklearn.decomposition import PCA
pca_sk = PCA(n_components=2, random_state=33, svd_solver='full', whiten=False)
X = np.array(df.trim()[features])
[42]:
%%timeit -n1 -r3 -o
pca_sk.fit(X)
130 ms ± 25 ms per loop (mean ± std. dev. of 3 runs, 1 loop each)
[42]:
<TimeitResult : 130 ms ± 25 ms per loop (mean ± std. dev. of 3 runs, 1 loop each)>
[43]:
time_sklearn = _
[44]:
print('vaex is approx', time_sklearn.best / time_vaex.best / factor, 'times faster for a PCA')
vaex is approx 5.4043995278391295 times faster for a PCA

Again we see that vaex not only outperforms sklearn, but, more importantly, scales to much larger datasets.

A billion row PCA

We now run a PCA on a billion rows.

[51]:
df_big = vaex.ml.datasets.load_iris_1e9()
[52]:
%%timeit -n1 -r2 -o
pca = df_big.ml.pca(features=features)
3min 9s ± 20.5 s per loop (mean ± std. dev. of 2 runs, 1 loop each)
[52]:
<TimeitResult : 3min 9s ± 20.5 s per loop (mean ± std. dev. of 2 runs, 1 loop each)>

Note that although this dataset is \(100\times\) larger, it takes more than \(100\times\) longer to execute. This is because the dataset did not fit into memory this time, so we are limited by the hard drive speed. But note that it is possible to actually run it, instead of getting a MemoryError!
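
A quick estimate (plain arithmetic) of why this run no longer fits in memory:

# Four float64 columns of 10^9 rows each:
n_rows, n_features = 1_000_000_000, 4
print(n_rows * n_features * 8 / 1024**3, 'GiB')  # ~29.8 GiB, more than most laptops have in RAM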

XGBoost

This example shows the integration with xgboost; this is work in progress.

[3]:
import vaex.ml.xgboost
[4]:
df = vaex.ml.datasets.load_iris()
[5]:
features = [k.expression for k in [df.col.petal_width, df.col.petal_length, df.col.sepal_width, df.col.sepal_length]]
[6]:
df_train, df_test = df.ml.train_test_split()
[7]:
param = {
    'max_depth': 3,  # the maximum depth of each tree
    'eta': 0.3,  # the training step for each iteration
    'silent': 1,  # logging mode - quiet
    'objective': 'multi:softmax',  # error evaluation for multiclass training
    'num_class': 3}  # the number of classes that exist in this dataset
xgmodel = vaex.ml.xgboost.XGBModel(features=features, num_round=10, param=param)
[8]:
xgmodel.fit(df_train, df_train.class_, copy=True)
[10]:
df_predict = xgmodel.transform(df_test)
df_predict
[10]:
#   sepal_width  petal_length  sepal_length  petal_width  class_  random_index  xgboost_prediction
0   3.0          4.2           5.9           1.5          1       114           1.0
1   3.0          4.6           6.1           1.4          1        74           1.0
2   2.9          4.6           6.6           1.3          1        37           1.0
3   3.3          5.7           6.7           2.1          2       116           2.0
4   4.2          1.4           5.5           0.2          0        61           0.0
... ...          ...           ...           ...          ...     ...           ...
25  2.5          4.0           5.5           1.3          1        83           1.0
26  2.7          3.9           5.8           1.2          1        94           1.0
27  2.9          1.4           4.4           0.2          0        54           0.0
28  2.3          1.3           4.5           0.3          0       145           0.0
29  3.2          5.7           6.9           2.3          2        84           2.0
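
As a quick sanity check, the accuracy on the test set could be computed like this (a sketch that pulls the two columns into NumPy arrays via evaluate, which is fine for a test set this small):

import numpy as np
y_true = df_predict.evaluate('class_')
y_pred = df_predict.evaluate('xgboost_prediction')
print('test accuracy:', np.mean(y_true == y_pred))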
[12]:
import matplotlib.pylab as plt
fig, ax = plt.subplots(1, 2, figsize=(12,5))

plt.sca(ax[0])
plt.title('original classes')
df_predict.scatter(df_predict.petal_width, df_predict.petal_length, c_expr=df_predict.class_)

plt.sca(ax[1])
plt.title('predicted classes')
df_predict.scatter(df_predict.petal_width, df_predict.petal_length, c_expr=df_predict.xgboost_prediction)
[12]:
<matplotlib.collections.PathCollection at 0x1185c9a58>
_images/ml_50_1.png

One hot encoding

A short demonstration of one-hot encoding.

[27]:
encoder = df.ml_one_hot_encoder([df.col.class_])
df_encoded = encoder.transform(df)
[28]:
df_encoded
[28]:
#    sepal_width  petal_length  sepal_length  petal_width  class_  random_index  class__0  class__1  class__2
0    3.0          4.2           5.9           1.5          1       114           0         1         0
1    3.0          4.6           6.1           1.4          1        74           0         1         0
2    2.9          4.6           6.6           1.3          1        37           0         1         0
3    3.3          5.7           6.7           2.1          2       116           0         0         1
4    4.2          1.4           5.5           0.2          0        61           1         0         0
...  ...          ...           ...           ...          ...     ...           ...       ...       ...
145  3.4          1.4           5.2           0.2          0       119           1         0         0
146  3.8          1.6           5.1           0.2          0        15           1         0         0
147  2.6          4.0           5.8           1.2          1        22           0         1         0
148  3.8          1.7           5.7           0.3          0       144           1         0         0
149  2.9          4.3           6.2           1.3          1       102           0         1         0
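
For reference, this is what one-hot encoding does conceptually; a tiny NumPy-only sketch (independent of vaex):

import numpy as np
labels = np.array([1, 1, 2, 0])                           # a few class_ values
one_hot = (labels[:, None] == np.arange(3)).astype(int)   # one indicator vector per row
print(one_hot)
# [[0 1 0]
#  [0 1 0]
#  [0 0 1]
#  [1 0 0]]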