API documentation for vaex library

Quick lists

Opening/reading in your data.

vaex.open(path[, convert, shuffle, copy_index]) Open a DataFrame from file given by path.
vaex.from_arrow_table(table) Creates a vaex DataFrame from an arrow Table.
vaex.from_arrays(**arrays) Create an in memory DataFrame from numpy arrays.
vaex.from_csv(filename_or_buffer[, copy_index]) Shortcut to read a csv file using pandas and convert to a DataFrame directly.
vaex.from_ascii(path[, seperator, names, …]) Create an in memory DataFrame from an ascii file (whitespace separated by default).
vaex.from_pandas(df[, name, copy_index, …]) Create an in memory DataFrame from a pandas DataFrame.
vaex.from_astropy_table(table) Create a vaex DataFrame from an Astropy Table.

Visualization.

vaex.dataframe.DataFrame.plot([x, y, z, …]) Viz data in a 2d histogram/heatmap.
vaex.dataframe.DataFrame.plot1d([x, what, …]) Viz data in 1d (histograms, running means etc)
vaex.dataframe.DataFrame.scatter(x, y[, …]) Viz (small amounts) of data in 2d using a scatter plot
vaex.dataframe.DataFrame.plot_widget(x, y[, …]) Viz 1d, 2d or 3d in a Jupyter notebook
vaex.dataframe.DataFrame.healpix_plot([…]) Viz data in 2d using a healpix column.

Statistics.

vaex.dataframe.DataFrame.count([expression, …]) Count the number of non-NaN values (or all, if expression is None or “*”).
vaex.dataframe.DataFrame.mean(expression[, …]) Calculate the mean for expression, possibly on a grid defined by binby.
vaex.dataframe.DataFrame.std(expression[, …]) Calculate the standard deviation for the given expression, possibly on a grid defined by binby.
vaex.dataframe.DataFrame.var(expression[, …]) Calculate the sample variance for the given expression, possibly on a grid defined by binby.
vaex.dataframe.DataFrame.cov(x[, y, binby, …]) Calculate the covariance matrix for x and y or more expressions, possibly on a grid defined by binby.
vaex.dataframe.DataFrame.correlation(x[, y, …]) Calculate the correlation coefficient cov[x,y]/(std[x]*std[y]) between x and y, possibly on a grid defined by binby.
vaex.dataframe.DataFrame.median_approx(…) Calculate the median, possibly on a grid defined by binby.
vaex.dataframe.DataFrame.mode(expression[, …]) Calculate/estimate the mode.
vaex.dataframe.DataFrame.min(expression[, …]) Calculate the minimum for given expressions, possibly on a grid defined by binby.
vaex.dataframe.DataFrame.max(expression[, …]) Calculate the maximum for given expressions, possibly on a grid defined by binby.
vaex.dataframe.DataFrame.minmax(expression) Calculate the minimum and maximum for expressions, possibly on a grid defined by binby.
vaex.dataframe.DataFrame.mutual_information(x) Estimate the mutual information between x and y on a grid with shape mi_shape and mi_limits, possibly on a grid defined by binby.

vaex-core

Vaex is a library for dealing with larger than memory DataFrames (out of core).

The most important class (data structure) in vaex is the DataFrame. A DataFrame is obtained by either opening the example dataset:

>>> import vaex
>>> df = vaex.example()

Or using open() to open a file.

>>> df1 = vaex.open("somedata.hdf5")
>>> df2 = vaex.open("somedata.fits")
>>> df3 = vaex.open("somedata.arrow")
>>> df4 = vaex.open("somedata.csv")

Or connecting to a remote server:

>>> df_remote = vaex.open("http://try.vaex.io/nyc_taxi_2015")

A few strong features of vaex are:

  • Performance: works with huge tabular data, processing over a billion (> 10^9) rows/second.
  • Expression system / Virtual columns: compute on the fly, without wasting ram.
  • Memory efficient: no memory copies when doing filtering/selections/subsets.
  • Visualization: directly supported, a one-liner is often enough.
  • User friendly API: You will only need to deal with a DataFrame object, and tab completion + docstring will help you out: ds.mean<tab>, feels very similar to Pandas.
  • Very fast statistics on N dimensional grids such as histograms, running mean, heatmaps.

Follow the tutorial at https://docs.vaex.io/en/latest/tutorial.html to learn how to use vaex.

vaex.open(path, convert=False, shuffle=False, copy_index=True, *args, **kwargs)[source]

Open a DataFrame from file given by path.

Example:

>>> ds = vaex.open('sometable.hdf5')
>>> ds = vaex.open('somedata*.csv', convert='bigdata.hdf5')
Parameters:
  • path (str) – local or absolute path to file, or glob string
  • convert – convert files to an hdf5 file for optimization, can also be a path
  • shuffle (bool) – shuffle converted DataFrame or not
  • args – extra arguments for file readers that need it
  • kwargs – extra keyword arguments
  • copy_index (bool) – copy index when source is read via pandas
Returns:

A DataFrame on success, otherwise None

Return type:

DataFrame

vaex.from_arrays(**arrays)[source]

Create an in memory DataFrame from numpy arrays.

Example

>>> import vaex, numpy as np
>>> x = np.arange(5)
>>> y = x ** 2
>>> vaex.from_arrays(x=x, y=y)
  #    x    y
  0    0    0
  1    1    1
  2    2    4
  3    3    9
  4    4   16
>>> some_dict = {'x': x, 'y': y}
>>> vaex.from_arrays(**some_dict)  # in case you have your columns in a dict
  #    x    y
  0    0    0
  1    1    1
  2    2    4
  3    3    9
  4    4   16
Parameters:arrays – keyword arguments with arrays
Return type:DataFrame
vaex.from_items(*items)[source]

Create an in memory DataFrame from numpy arrays, in contrast to from_arrays this keeps the order of columns intact (for Python < 3.6).

Example

>>> import vaex, numpy as np
>>> x = np.arange(5)
>>> y = x ** 2
>>> vaex.from_items(('x', x), ('y', y))
  #    x    y
  0    0    0
  1    1    1
  2    2    4
  3    3    9
  4    4   16
Parameters:items – list of [(name, numpy array), …]
Return type:DataFrame
vaex.from_arrow_table(table)[source]

Creates a vaex DataFrame from an arrow Table.

Return type:DataFrame
vaex.from_csv(filename_or_buffer, copy_index=True, **kwargs)[source]

Shortcut to read a csv file using pandas and convert to a DataFrame directly.

Return type:DataFrame
vaex.from_ascii(path, seperator=None, names=True, skip_lines=0, skip_after=0, **kwargs)[source]

Create an in memory DataFrame from an ascii file (whitespace separated by default).

>>> ds = vaex.from_ascii("table.asc")
>>> ds = vaex.from_ascii("table.csv", seperator=",", names=["x", "y", "z"])
Parameters:
  • path – file path
  • seperator – value separator, by default whitespace, use “,” for comma separated values.
  • names – If True, the first line is used for the column names, otherwise provide a list of strings with names
  • skip_lines – skip lines at the start of the file
  • skip_after – skip lines at the end of the file
  • kwargs
Return type:

DataFrame
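As a rough illustration of what the seperator and names arguments control, the sketch below parses a small whitespace or comma separated text into columns using only the Python standard library. The function read_ascii_columns and its separator argument are hypothetical names for this example; it is not how vaex reads files, and it assumes every field is numeric.

```python
import io


def read_ascii_columns(stream, separator=None, names=True):
    """Parse separated text into a dict of column name -> list of floats.

    separator=None splits on any whitespace (the str.split default).
    names=True takes the column names from the first line; otherwise
    names must be a list of column names and every line is data.
    """
    lines = [line for line in stream.read().splitlines() if line.strip()]
    if names is True:
        header, rows = lines[0].split(separator), lines[1:]
    else:
        header, rows = list(names), lines
    columns = {name: [] for name in header}
    for line in rows:
        for name, field in zip(header, line.split(separator)):
            columns[name].append(float(field))
    return columns


# Whitespace separated, names taken from the first line.
columns = read_ascii_columns(io.StringIO("x y\n1 2\n3 4\n"))

# Comma separated, names given explicitly.
csv_columns = read_ascii_columns(
    io.StringIO("1,2\n3,4\n"), separator=",", names=["a", "b"]
)
```

The two calls mirror the two from_ascii examples above: the first relies on the defaults, the second overrides both the separator and the column names.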

vaex.from_pandas(df, name='pandas', copy_index=True, index_name='index')[source]

Create an in memory DataFrame from a pandas DataFrame.

Param:pandas.DataFrame df: Pandas DataFrame
Param:name: unique name for the DataFrame
>>> import vaex, pandas as pd
>>> df_pandas = pd.read_csv('test.csv')
>>> df = vaex.from_pandas(df_pandas)
Return type:DataFrame
vaex.from_astropy_table(table)[source]

Create a vaex DataFrame from an Astropy Table.

vaex.from_samp(username=None, password=None)[source]

Connect to a SAMP Hub and wait for a single table load event, disconnect, download the table and return the DataFrame.

Useful if you want to send a single table from say TOPCAT to vaex in a python console or notebook.

vaex.open_many(filenames)[source]

Open a list of filenames, and return a DataFrame with all DataFrames concatenated.

Parameters:filenames (list[str]) – list of filenames/paths
Return type:DataFrame
vaex.server(url, **kwargs)[source]

Connect to hostname supporting the vaex web api.

Parameters:hostname (str) – hostname or ip address of server
Return vaex.dataframe.ServerRest:
 returns a server object, note that it does not connect to the server yet, so this will always succeed
Return type:ServerRest
vaex.example(download=True)[source]

Returns an example DataFrame which comes with vaex for testing/learning purposes.

Return type:DataFrame
vaex.app(*args, **kwargs)[source]

Create a vaex app, the QApplication mainloop must be started.

In ipython notebook/jupyter do the following:

>>> import vaex.ui.main # this causes the qt api level to be set properly
>>> import vaex

Next cell:

>>> %gui qt

Next cell:

>>> app = vaex.app()

From now on, you can run the app along with jupyter

vaex.delayed(f)[source]

Decorator to transparently accept delayed computation.

Example:

>>> delayed_sum = ds.sum(ds.E, binby=ds.x, limits=limits,
...                      shape=4, delay=True)
>>> @vaex.delayed
... def total_sum(sums):
...     return sums.sum()
>>> sum_of_sums = total_sum(delayed_sum)
>>> ds.execute()
>>> sum_of_sums.get()
See the tutorial for a more complete example https://docs.vaex.io/en/latest/tutorial.html#Parallel-computations
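The idea behind delayed computation can be sketched without vaex: computations are first registered, then a single call to execute() makes one pass over the data and fulfills all pending results. The Promise and Executor classes below are hypothetical stand-ins written for this illustration only; they are not vaex classes and make no claim about how vaex schedules its work.

```python
class Promise:
    """Placeholder for a result that becomes available after execute()."""

    def __init__(self):
        self._done = False
        self._value = None
        self._callbacks = []

    def then(self, function):
        """Return a new Promise holding function(value) once this one is fulfilled."""
        chained = Promise()

        def forward(value):
            chained.fulfill(function(value))

        if self._done:
            forward(self._value)
        else:
            self._callbacks.append(forward)
        return chained

    def fulfill(self, value):
        self._done, self._value = True, value
        for callback in self._callbacks:
            callback(value)

    def get(self):
        if not self._done:
            raise RuntimeError("result not available yet, call execute() first")
        return self._value


class Executor:
    """Collects reductions and computes all of them in one pass over the rows."""

    def __init__(self, rows):
        self.rows = rows
        self.jobs = []    # list of (reducer, initial state, promise)
        self.passes = 0   # number of passes made over the data

    def schedule(self, reducer, initial):
        promise = Promise()
        self.jobs.append((reducer, initial, promise))
        return promise

    def execute(self):
        states = [initial for _, initial, _ in self.jobs]
        self.passes += 1
        for row in self.rows:
            states = [job[0](state, row) for job, state in zip(self.jobs, states)]
        for job, state in zip(self.jobs, states):
            job[2].fulfill(state)
        self.jobs = []


executor = Executor([1.0, 2.0, 3.0, 4.0])
delayed_sum = executor.schedule(lambda total, value: total + value, 0.0)
delayed_count = executor.schedule(lambda count, value: count + 1, 0)
delayed_mean = delayed_sum.then(lambda total: total / 4)  # runs after execute()
executor.execute()
print(delayed_sum.get(), delayed_count.get(), delayed_mean.get())  # 10.0 4 2.5
```

Both reductions are computed in the same pass, which is the benefit of registering work first and executing later when the data is large.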

DataFrame class

class vaex.dataframe.DataFrame(name, column_names, executor=None)[source]

All local or remote datasets are encapsulated in this class, which provides a pandas like API to your dataset.

Each DataFrame (df) has a number of columns, and a number of rows, the length of the DataFrame.

All DataFrames have multiple ‘selections’, and all calculations are done on the whole DataFrame (default) or for the selection. The following example shows how to use the selection.

>>> df.select("x < 0")
>>> df.sum(df.y, selection=True)
>>> df.sum(df.y, selection=[df.x < 0, df.x > 0])
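Conceptually, a selection is a boolean mask over the rows, and a statistic with a selection only takes the masked rows into account. The following pure-Python sketch shows this for a sum; the helper selected_sum is a hypothetical name, and the code illustrates the semantics only, not vaex's implementation.

```python
x = [-2.0, -1.0, 0.5, 3.0]
y = [10.0, 20.0, 30.0, 40.0]


def selected_sum(values, mask=None):
    """Sum all values, or only those whose mask entry is True."""
    if mask is None:
        return sum(values)
    return sum(value for value, keep in zip(values, mask) if keep)


negative = [value < 0 for value in x]  # plays the role of the selection "x < 0"
positive = [value > 0 for value in x]  # plays the role of the selection "x > 0"

total = selected_sum(y)                                            # whole table
per_selection = [selected_sum(y, m) for m in (negative, positive)]  # one result per selection
```

Passing a list of selections, as in the last line, yields one result per selection, which matches the shape of the third call in the example above.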
__delitem__(item)[source]

Removes a (virtual) column from the DataFrame.

Note: this does not check if the column is used in a virtual expression or in the filter and may lead to issues. It is safer to use drop().

__getitem__(item)[source]

Convenient way to get expressions, (shallow) copies of a few columns, or to apply filtering.

Examples:

>>> df['Lz']  # the expression 'Lz'
>>> df['Lz/2'] # the expression 'Lz/2'
>>> df[["Lz", "E"]] # a shallow copy with just two columns
>>> df[df.Lz < 0]  # a shallow copy with the filter Lz < 0 applied
__init__(name, column_names, executor=None)[source]

Initialize self. See help(type(self)) for accurate signature.

__iter__()[source]

Iterator over the column names.

__len__()[source]

Returns the number of rows in the DataFrame (filtering applied).

__setitem__(name, value)[source]

Convenient way to add a virtual column / expression to this DataFrame.

Example:

>>> import vaex, numpy as np
>>> df = vaex.example()
>>> df['r'] = np.sqrt(df.x**2 + df.y**2 + df.z**2)
>>> df.r
<vaex.expression.Expression(expressions='r')> instance at 0x121687e80 values=[2.9655450396553587, 5.77829281049018, 6.99079603950256, 9.431842752707537, 0.8825613121347967 ... (total 330000 values) ... 7.453831761514681, 15.398412491068198, 8.864250273925633, 17.601047186042507, 14.540181524970293]
__weakref__

list of weak references to the object (if defined)

add_column(name, f_or_array)[source]

Add an in memory array as a column.

add_variable(name, expression, overwrite=True)[source]

Add a variable to a DataFrame.

A variable may refer to other variables, and virtual columns and expressions may refer to variables.

Example

>>> df.add_variable('center', 0)
>>> df.add_virtual_column('x_prime', 'x-center')
>>> df.select('x_prime < 0')
Param:str name: name of virtual variable
Param:expression: expression for the variable
add_virtual_column(name, expression, unique=False)[source]

Add a virtual column to the DataFrame.

Example:

>>> df.add_virtual_column("r", "sqrt(x**2 + y**2 + z**2)")
>>> df.select("r < 10")
Param:str name: name of virtual column
Param:expression: expression for the column
Parameters:unique (str) – if name is already used, make it unique by adding a postfix, e.g. _1, or _2
byte_size(selection=False, virtual=False)[source]

Return the size in bytes the whole DataFrame requires (or the selection), respecting the active_fraction.

cat(i1, i2, format='html')[source]

Display the DataFrame from row i1 till i2

For format, see https://pypi.org/project/tabulate/

Parameters:
  • i1 (int) – Start row
  • i2 (int) – End row.
  • format (str) – Format to use, e.g. ‘html’, ‘plain’, ‘latex’
close_files()[source]

Close any possible open file handles, the DataFrame will not be in a usable state afterwards.

col

Gives direct access to the columns only (useful for tab completion).

Convenient when working with ipython in combination with small DataFrames, since this gives tab-completion.

Columns can be accessed by their names, which are attributes. The attributes are currently expressions, so you can do computations with them.

Example

>>> df = vaex.example()
>>> df.plot(df.col.x, df.col.y)
column_count()[source]

Returns the number of columns (including virtual columns).

combinations(expressions_list=None, dimension=2, exclude=None, **kwargs)[source]

Generate a list of combinations for the possible expressions for the given dimension.

Parameters:
  • expressions_list – list of list of expressions, where the inner list defines the subspace
  • dimension – if given, generates a subspace with all possible combinations for that dimension
  • exclude – list of
correlation(x, y=None, binby=[], limits=None, shape=128, sort=False, sort_key=<ufunc 'absolute'>, selection=False, delay=False, progress=None)[source]

Calculate the correlation coefficient cov[x,y]/(std[x]*std[y]) between x and y, possibly on a grid defined by binby.

Examples:

>>> df.correlation("x**2+y**2+z**2", "-log(-E+1)")
array(0.6366637382215669)
>>> df.correlation("x**2+y**2+z**2", "-log(-E+1)", binby="Lz", shape=4)
array([ 0.40594394,  0.69868851,  0.61394099,  0.65266318])
Parameters:
  • x – expression or list of expressions, e.g. ‘x’, or [‘x’, ‘y’]
  • y – expression or list of expressions, e.g. ‘x’, or [‘x’, ‘y’]
  • binby – List of expressions for constructing a binned grid
  • limits – description for the min and max values for the expressions, e.g. ‘minmax’, ‘99.7%’, [0, 10], or a list of, e.g. [[0, 10], [0, 20], ‘minmax’]
  • shape – shape for the array where the statistic is calculated on, if only an integer is given, it is used for all dimensions, e.g. shape=128, shape=[128, 256]
  • selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections
  • delay – Do not return the result, but a proxy for asynchronous calculations (currently only for internal use)
  • progress – A callable that takes one argument (a floating point value between 0 and 1) indicating the progress, calculations are cancelled when this callable returns False
Returns:

Numpy array with the given shape, or a scalar when no binby argument is given, with the statistic
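The formula cov[x,y]/(std[x]*std[y]) can be written out in a few lines of plain Python. This is a sketch of the definition only, not of vaex's out-of-core implementation; it normalizes by N, an assumption that does not affect the result because the normalization cancels in the ratio.

```python
import math


def mean(values):
    return sum(values) / len(values)


def covariance(x, y):
    """Mean of the products of the deviations from the mean (normalized by N)."""
    mean_x, mean_y = mean(x), mean(y)
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / len(x)


def correlation(x, y):
    """Correlation coefficient cov[x,y] / (std[x] * std[y])."""
    std_x = math.sqrt(covariance(x, x))  # cov[x,x] is the variance of x
    std_y = math.sqrt(covariance(y, y))
    return covariance(x, y) / (std_x * std_y)


x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 6.0, 8.5]  # almost, but not exactly, linear in x
r = correlation(x, y)
```

A perfectly linear relation gives 1 (or -1 for a negative slope); the slightly perturbed y above gives a value just below 1.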

count(expression=None, binby=[], limits=None, shape=128, selection=False, delay=False, edges=False, progress=None)[source]

Count the number of non-NaN values (or all, if expression is None or “*”).

Examples:

>>> df.count()
330000
>>> df.count("*")
330000.0
>>> df.count("*", binby=["x"], shape=4)
array([  10925.,  155427.,  152007.,   10748.])
Parameters:
  • expression – Expression or column for which to count non-missing values, or None or ‘*’ for counting the rows
  • binby – List of expressions for constructing a binned grid
  • limits – description for the min and max values for the expressions, e.g. ‘minmax’, ‘99.7%’, [0, 10], or a list of, e.g. [[0, 10], [0, 20], ‘minmax’]
  • shape – shape for the array where the statistic is calculated on, if only an integer is given, it is used for all dimensions, e.g. shape=128, shape=[128, 256]
  • selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections
  • delay – Do not return the result, but a proxy for asynchronous calculations (currently only for internal use)
  • progress – A callable that takes one argument (a floating point value between 0 and 1) indicating the progress, calculations are cancelled when this callable returns False
  • edges – Currently for internal use only (it includes NaNs and values outside the limits at the borders: NaN at 0, smaller than the limits at 1, and larger at -1)
Returns:

Numpy array with the given shape, or a scalar when no binby argument is given, with the statistic
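To make the binby, limits and shape parameters concrete, the following pure-Python sketch counts values on a regular 1d grid. It does not use vaex; the helper name binned_count and the rule that values outside the limits are simply dropped are assumptions made for illustration, not a description of vaex's internals.

```python
import math


def binned_count(values, limits, shape):
    """Count non-NaN values in `shape` equal-width bins spanning `limits`."""
    vmin, vmax = limits
    counts = [0] * shape
    for value in values:
        if math.isnan(value) or not (vmin <= value < vmax):
            continue  # skip missing values and values outside the grid
        index = int((value - vmin) / (vmax - vmin) * shape)
        counts[min(index, shape - 1)] += 1  # guard against rounding at the upper edge
    return counts


x = [-3.0, -1.0, 0.5, 0.7, 2.5, float("nan"), 10.0]
counts = binned_count(x, limits=(-4.0, 4.0), shape=4)
print(counts)  # [1, 1, 2, 1]
```

With shape=4 and limits (-4, 4), each bin is 2 units wide; the NaN and the value 10.0 fall outside the grid and are not counted in this sketch.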

cov(x, y=None, binby=[], limits=None, shape=128, selection=False, delay=False, progress=None)[source]

Calculate the covariance matrix for x and y or more expressions, possibly on a grid defined by binby.

Either x and y are expressions, e.g:

>>> df.cov("x", "y")

Or only the x argument is given with a list of expressions, e.g.:

>>> df.cov(["x", "y", "z"])

Examples:

>>> df.cov("x", "y")
array([[ 53.54521742,  -3.8123135 ],
[ -3.8123135 ,  60.62257881]])
>>> df.cov(["x", "y", "z"])
array([[ 53.54521742,  -3.8123135 ,  -0.98260511],
[ -3.8123135 ,  60.62257881,   1.21381057],
[ -0.98260511,   1.21381057,  25.55517638]])
>>> df.cov("x", "y", binby="E", shape=2)
array([[[  9.74852878e+00,  -3.02004780e-02],
[ -3.02004780e-02,   9.99288215e+00]],
[[  8.43996546e+01,  -6.51984181e+00],
[ -6.51984181e+00,   9.68938284e+01]]])
Parameters:
  • x – expression or list of expressions, e.g. ‘x’, or [‘x’, ‘y’]
  • y – if previous argument is not a list, this argument should be given
  • binby – List of expressions for constructing a binned grid
  • limits – description for the min and max values for the expressions, e.g. ‘minmax’, ‘99.7%’, [0, 10], or a list of, e.g. [[0, 10], [0, 20], ‘minmax’]
  • shape – shape for the array where the statistic is calculated on, if only an integer is given, it is used for all dimensions, e.g. shape=128, shape=[128, 256]
  • selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections
  • delay – Do not return the result, but a proxy for asynchronous calculations (currently only for internal use)
Returns:

Numpy array with the given shape, or a scalar when no binby argument is given, with the statistic, the last dimensions are of shape (2,2)

covar(x, y, binby=[], limits=None, shape=128, selection=False, delay=False, progress=None)[source]

Calculate the covariance cov[x,y] between x and y, possibly on a grid defined by binby.

Examples:

>>> df.covar("x**2+y**2+z**2", "-log(-E+1)")
array(52.69461456005138)
>>> df.covar("x**2+y**2+z**2", "-log(-E+1)")/(df.std("x**2+y**2+z**2") * df.std("-log(-E+1)"))
0.63666373822156686
>>> df.covar("x**2+y**2+z**2", "-log(-E+1)", binby="Lz", shape=4)
array([ 10.17387143,  51.94954078,  51.24902796,  20.2163929 ])
Parameters:
  • x – expression or list of expressions, e.g. ‘x’, or [‘x’, ‘y’]
  • y – expression or list of expressions, e.g. ‘x’, or [‘x’, ‘y’]
  • binby – List of expressions for constructing a binned grid
  • limits – description for the min and max values for the expressions, e.g. ‘minmax’, ‘99.7%’, [0, 10], or a list of, e.g. [[0, 10], [0, 20], ‘minmax’]
  • shape – shape for the array where the statistic is calculated on, if only an integer is given, it is used for all dimensions, e.g. shape=128, shape=[128, 256]
  • selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections
  • delay – Do not return the result, but a proxy for asynchronous calculations (currently only for internal use)
  • progress – A callable that takes one argument (a floating point value between 0 and 1) indicating the progress, calculations are cancelled when this callable returns False
Returns:

Numpy array with the given shape, or a scalar when no binby argument is given, with the statistic

delete_variable(name)[source]

Deletes a variable from a DataFrame.

delete_virtual_column(name)[source]

Deletes a virtual column from a DataFrame.

describe(strings=True, virtual=True, selection=None)[source]

Give a description of the DataFrame.

>>> import vaex
>>> df = vaex.example()[['x', 'y', 'z']]
>>> df.describe()
                 x          y          z
dtype      float64    float64    float64
count       330000     330000     330000
missing          0          0          0
mean    -0.0671315 -0.0535899  0.0169582
std        7.31746    7.78605    5.05521
min       -128.294   -71.5524   -44.3342
max        271.366    146.466    50.7185
>>> df.describe(selection=df.x > 0)
                   x         y          z
dtype        float64   float64    float64
count         164060    164060     164060
missing       165940    165940     165940
mean         5.13572 -0.486786 -0.0868073
std          5.18701   7.61621    5.02831
min      1.51635e-05  -71.5524   -44.3342
max          271.366   78.0724    40.2191
Parameters:
  • strings (bool) – Describe string columns or not
  • virtual (bool) – Describe virtual columns or not
  • selection – Optional selection to use.
Returns:

Pandas dataframe

drop(columns, inplace=False, check=True)[source]

Drop columns (or a single column).

Parameters:
  • columns – List of columns or a single column name
  • inplace – Make modifications to self or return a new DataFrame
  • check – When true, it will check if the column is used in virtual columns or the filter, and hide it instead.
dropna(drop_nan=True, drop_masked=True, column_names=None)[source]

Create a shallow copy of a DataFrame, with filtering set using select_non_missing.

Parameters:
  • drop_nan – drop rows when there is a NaN in any of the columns (will only affect float values)
  • drop_masked – drop rows when there is a masked value in any of the columns
  • column_names – The columns to consider, default: all (real, non-virtual) columns
Return type:

DataFrame
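Filtering out rows with missing values amounts to building a boolean mask over the rows rather than copying the data. The sketch below shows that idea with plain lists; non_missing_mask is a hypothetical helper for this example, it handles only NaN (not masked values), and it is not vaex code.

```python
import math

columns = {
    "x": [1.0, float("nan"), 3.0, 4.0],
    "y": [10.0, 20.0, float("nan"), 40.0],
}


def non_missing_mask(columns, column_names=None):
    """True for every row that has no NaN in the considered columns."""
    names = list(columns) if column_names is None else column_names
    length = len(columns[names[0]])
    return [
        not any(math.isnan(columns[name][i]) for name in names)
        for i in range(length)
    ]


mask_all = non_missing_mask(columns)                    # consider every column
mask_x = non_missing_mask(columns, column_names=["x"])  # consider only x
kept_rows = [i for i, keep in enumerate(mask_all) if keep]
```

Restricting column_names keeps more rows, because a NaN in a column that is not considered no longer excludes the row.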

dtype(expression)[source]

Return the numpy dtype for the given expression, if not a column, the first row will be evaluated to get the dtype.

dtypes

Gives a Pandas series object containing all numpy dtypes of all columns (except hidden).

evaluate(expression, i1=None, i2=None, out=None, selection=None)[source]

Evaluate an expression, and return a numpy array with the results for the full column or a part of it.

Note that this is not how vaex should be used, since it means a copy of the data needs to fit in memory.

To get partial results, use i1 and i2

Parameters:
  • expression (str) – Name/expression to evaluate
  • i1 (int) – Start row index, default is the start (0)
  • i2 (int) – End row index, default is the length of the DataFrame
  • out (ndarray) – Output array, to which the result may be written (may be used to reuse an array, or write to a memory mapped array)
  • selection – selection to apply
Returns:

evaluate_variable(name)[source]

Evaluates the variable given by name.

execute()[source]

Execute all delayed jobs.

extract()[source]

Return a DataFrame containing only the filtered rows.

Note

Note that no copy of the underlying data is made, only a view/reference is made.

The resulting DataFrame may be more efficient to work with when the original DataFrame is heavily filtered (contains just a small number of rows).

If no filtering is applied, it returns a trimmed view. For the returned df, len(df) == df.length_original() == df.length_unfiltered()

Return type:DataFrame
fillna(value, fill_nan=True, fill_masked=True, column_names=None, prefix='__original_', inplace=False)[source]

Return a DataFrame, where missing values/NaN are filled with ‘value’.

Note

Note that no copy of the underlying data is made, only a view/reference is made.

Note

Note that filtering will be ignored (since filters may change), you may want to consider running extract() first.

Example:

>>> a = np.array(['a', 'b', 'c'])
>>> x = np.arange(1,4)
>>> df = vaex.from_arrays(a=a, x=x)
>>> df.sort('(x-1.8)**2', ascending=False)  # b, c, a will be the order of a
Parameters:
  • by (str or expression) – expression to sort by
  • ascending (bool) – ascending (default, True) or descending (False)
  • kind (str) – kind of algorithm to use (passed to numpy.argsort)
  • inplace – Make modifications to self or return a new DataFrame
first(expression, order_expression, binby=[], limits=None, shape=128, selection=False, delay=False, edges=False, progress=None)[source]

Count the number of non-NaN values (or all, if expression is None or “*”).

Examples:

>>> df.count()
330000.0
>>> df.count("*")
330000.0
>>> df.count("*", binby=["x"], shape=4)
array([  10925.,  155427.,  152007.,   10748.])
Parameters:
  • expression – Expression or column for which to count non-missing values, or None or ‘*’ for counting the rows
  • binby – List of expressions for constructing a binned grid
  • limits – description for the min and max values for the expressions, e.g. ‘minmax’, ‘99.7%’, [0, 10], or a list of, e.g. [[0, 10], [0, 20], ‘minmax’]
  • shape – shape for the array where the statistic is calculated on, if only an integer is given, it is used for all dimensions, e.g. shape=128, shape=[128, 256]
  • selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections
  • delay – Do not return the result, but a proxy for asynchronous calculations (currently only for internal use)
  • progress – A callable that takes one argument (a floating point value between 0 and 1) indicating the progress, calculations are cancelled when this callable returns False
  • edges – Currently for internal use only (it includes NaNs and values outside the limits at the borders: NaN at 0, smaller than the limits at 1, and larger at -1)
Returns:

Numpy array with the given shape, or a scalar when no binby argument is given, with the statistic

get_active_fraction()[source]

Value in the range (0, 1], to work only with a subset of rows.

get_column_names(virtual=True, strings=True, hidden=False, regex=None)[source]

Return a list of column names

Example:

>>> import vaex
>>> df = vaex.from_scalars(x=1, x2=2, y=3, s='string')
>>> df['r'] = (df.x**2 + df.y**2)**2
>>> df.get_column_names()
['x', 'x2', 'y', 's', 'r']
>>> df.get_column_names(virtual=False)
['x', 'x2', 'y', 's']
>>> df.get_column_names(regex='x.*')
['x', 'x2']
Parameters:
  • virtual – If False, skip virtual columns
  • hidden – If False, skip hidden columns
  • strings – If False, skip string columns
  • regex – Only return column names matching the (optional) regular expression
Return type:

list of str

get_current_row()[source]

Individual rows can be ‘picked’; this is the index (integer) of the current row, or None if nothing is picked.

get_private_dir(create=False)[source]

Each DataFrame has a directory where files are stored for metadata etc.

Example

>>> import vaex
>>> df = vaex.example()
>>> df.get_private_dir()
'/Users/users/breddels/.vaex/dfs/_Users_users_breddels_vaex-testing_data_helmi-dezeeuw-2000-10p.hdf5'
Parameters:create (bool) – if True, it will create the directory if it does not exist
get_selection(name='default')[source]

Get the current selection object (mostly for internal use atm).

get_variable(name)[source]

Returns the variable given by name, it will not evaluate it.

For evaluation, see DataFrame.evaluate_variable(), see also DataFrame.set_variable()

has_current_row()[source]

Returns True/False if there currently is a picked row.

has_selection(name='default')[source]

Returns True if there is a selection with the given name.

head(n=10)[source]

Return a shallow copy of a DataFrame with the first n rows.

head_and_tail_print(n=5)[source]

Display the first and last n elements of a DataFrame.

healpix_count(expression=None, healpix_expression=None, healpix_max_level=12, healpix_level=8, binby=None, limits=None, shape=128, delay=False, progress=None, selection=None)[source]

Count non-missing values for expression on an array which represents healpix data.

Parameters:
  • expression – Expression or column for which to count non-missing values, or None or ‘*’ for counting the rows
  • healpix_expression – {healpix_max_level}
  • healpix_max_level – {healpix_max_level}
  • healpix_level – {healpix_level}
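  • binby – List of expressions for constructing a binned grid, these dimensions follow the first healpix dimension.
  • limits – description for the min and max values for the expressions, e.g. ‘minmax’, ‘99.7%’, [0, 10], or a list of, e.g. [[0, 10], [0, 20], ‘minmax’]
  • shape – shape for the array where the statistic is calculated on, if only an integer is given, it is used for all dimensions, e.g. shape=128, shape=[128, 256]
  • selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections
  • delay – Do not return the result, but a proxy for asynchronous calculations (currently only for internal use)
  • progress – A callable that takes one argument (a floating point value between 0 and 1) indicating the progress, calculations are cancelled when this callable returns False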
Returns:

healpix_plot(healpix_expression='source_id/34359738368', healpix_max_level=12, healpix_level=8, what='count(*)', selection=None, grid=None, healpix_input='equatorial', healpix_output='galactic', f=None, colormap='afmhot', grid_limits=None, image_size=800, nest=True, figsize=None, interactive=False, title='', smooth=None, show=False, colorbar=True, rotation=(0, 0, 0), **kwargs)[source]

Viz data in 2d using a healpix column.

Parameters:
  • healpix_expression – {healpix_max_level}
  • healpix_max_level – {healpix_max_level}
  • healpix_level – {healpix_level}
  • what – {what}
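  • selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections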
  • grid – {grid}
  • healpix_input – Specify if the healpix index is in “equatorial”, “galactic” or “ecliptic”.
  • healpix_output – Plot in “equatorial”, “galactic” or “ecliptic”.
  • f – function to apply to the data
  • colormap – matplotlib colormap
  • grid_limits – Optional sequence [minvalue, maxvalue] that determines the min and max values that map to the colormap (values below and above these are clipped to the min/max). Default is [min(f(grid)), max(f(grid))].
  • image_size – size for the image that healpy uses for rendering
  • nest – If the healpix data is in nested (True) or ring (False)
  • figsize – If given, modify the matplotlib figure size. Example (14,9)
  • interactive – (Experimental) uses healpy.mollzoom if True
  • title – Title of figure
  • smooth – apply gaussian smoothing, in degrees
  • show – Call matplotlib’s show (True) or not (False, default)
  • rotation – Rotate the plot, in format (lon, lat, psi) such that (lon, lat) is the center, and rotate on the screen by angle psi. All angles are in degrees.
Returns:

is_category(column)[source]

Returns true if column is a category.

is_local()[source]

Returns True if the DataFrame is local, False when a DataFrame is remote.

is_masked(column)[source]

Return if a column is a masked (numpy.ma) column.

length_original()[source]

The full length of the DataFrame, independent of what active_fraction is and of any filtering. This is the real length of the underlying ndarrays.

length_unfiltered()[source]

The length of the arrays that should be considered (respecting active range), but without filtering.

limits(expression, value=None, square=False, selection=None, delay=False, shape=None)[source]

Calculate the [min, max] range for expression, as described by value, which is ‘99.7%’ by default.

If value is a list of the form [minvalue, maxvalue], it is simply returned, this is for convenience when using mixed forms.

Example:

>>> df.limits("x")
array([-28.86381927,  28.9261226 ])
>>> df.limits(["x", "y"])
(array([-28.86381927,  28.9261226 ]), array([-28.60476934,  28.96535249]))
>>> df.limits(["x", "y"], "minmax")
(array([-128.293991,  271.365997]), array([ -71.5523682,  146.465836 ]))
>>> df.limits(["x", "y"], ["minmax", "90%"])
(array([-128.293991,  271.365997]), array([-13.37438402,  13.4224423 ]))
>>> df.limits(["x", "y"], ["minmax", [0, 10]])
(array([-128.293991,  271.365997]), [0, 10])
Parameters:
  • expression – expression or list of expressions, e.g. ‘x’, or [‘x’, ‘y’]
  • value – description for the min and max values for the expressions, e.g. ‘minmax’, ‘99.7%’, [0, 10], or a list of, e.g. [[0, 10], [0, 20], ‘minmax’]
  • selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections
  • delay – Do not return the result, but a proxy for asynchronous calculations (currently only for internal use)
Returns:

List in the form [[xmin, xmax], [ymin, ymax], …, [zmin, zmax]] or [xmin, xmax] when expression is not a list

limits_percentage(expression, percentage=99.73, square=False, delay=False)[source]

Calculate the [min, max] range for expression, containing approximately a percentage of the data as defined by percentage.

The range is symmetric around the median, i.e., for a percentage of 90, this gives the same results as:

Example:

>>> df.limits_percentage("x", 90)
array([-12.35081376,  12.14858052])
>>> df.percentile_approx("x", 5), df.percentile_approx("x", 95)
(array([-12.36813152]), array([ 12.13275818]))

NOTE: this value is approximated by calculating the cumulative distribution on a grid.

NOTE 2: The values above are not exactly the same, since percentile_approx and limits_percentage do not share the same code.

Parameters:
  • expression – expression or list of expressions, e.g. ‘x’, or [‘x’, ‘y’]
  • percentage (float) – Value between 0 and 100
  • delay – Do not return the result, but a proxy for asynchronous calculations (currently only for internal use)
Returns:

List in the form [[xmin, xmax], [ymin, ymax], …. ,[zmin, zmax]] or [xmin, xmax] when expression is not a list
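
As a rough illustration of the symmetric range described above, the following plain-Python sketch computes the same kind of interval exactly from sorted values. It does not use vaex and is not vaex's implementation (which approximates the result on a grid); the helper names percentile and symmetric_limits are hypothetical.

```python
import math


def percentile(sorted_values, q):
    """Linear-interpolation percentile (q in [0, 100]) of a pre-sorted list."""
    pos = (len(sorted_values) - 1) * q / 100.0
    lo = math.floor(pos)
    hi = math.ceil(pos)
    frac = pos - lo
    return sorted_values[lo] * (1 - frac) + sorted_values[hi] * frac


def symmetric_limits(values, percentage):
    """Return [low, high] enclosing `percentage` percent of the values,
    cutting equal tails on both sides of the median."""
    tail = (100.0 - percentage) / 2.0
    ordered = sorted(values)
    return [percentile(ordered, tail), percentile(ordered, 100.0 - tail)]


data = list(range(101))  # 0, 1, ..., 100
limits_90 = symmetric_limits(data, 90)
print(limits_90)
```

For percentage=90 the interval runs from the 5th to the 95th percentile, which is why the two calls in the example above return nearly the same numbers.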

materialize(virtual_column, inplace=False)[source]

Returns a new DataFrame where the virtual column is turned into an in memory numpy array.

Example:

>>> x = np.arange(1,4)
>>> y = np.arange(2,5)
>>> df = vaex.from_arrays(x=x, y=y)
>>> df['r'] = (df.x**2 + df.y**2)**0.5 # 'r' is a virtual column (computed on the fly)
>>> df = df.materialize('r')  # now 'r' is a 'real' column (i.e. a numpy array)
Parameters:inplace – Make modifications to self or return a new DataFrame
max(expression, binby=[], limits=None, shape=128, selection=False, delay=False, progress=None)[source]

Calculate the maximum for given expressions, possibly on a grid defined by binby.

Example:

>>> df.max("x")
array(271.365997)
>>> df.max(["x", "y"])
array([ 271.365997,  146.465836])
>>> df.max("x", binby="x", shape=5, limits=[-10, 10])
array([-6.00010443, -2.00002384,  1.99998057,  5.99983597,  9.99984646])
Parameters:
  • expression – expression or list of expressions, e.g. ‘x’, or [‘x’, ‘y’]
  • binby – List of expressions for constructing a binned grid
  • limits – description for the min and max values for the expressions, e.g. ‘minmax’, ‘99.7%’, [0, 10], or a list of, e.g. [[0, 10], [0, 20], ‘minmax’]
  • shape – shape for the array where the statistic is calculated on, if only an integer is given, it is used for all dimensions, e.g. shape=128, shape=[128, 256]
  • selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections
  • delay – Do not return the result, but a proxy for asynchronous calculations (currently only for internal use)
  • progress – A callable that takes one argument (a floating point value between 0 and 1) indicating the progress, calculations are cancelled when this callable returns False
Returns:

Numpy array with the given shape, or a scalar when no binby argument is given, with the statistic

mean(expression, binby=[], limits=None, shape=128, selection=False, delay=False, progress=None)[source]

Calculate the mean for expression, possibly on a grid defined by binby.

Examples:

>>> df.mean("x")
-0.067131491264005971
>>> df.mean("(x**2+y**2)**0.5", binby="E", shape=4)
array([  2.43483742,   4.41840721,   8.26742458,  15.53846476])
Parameters:
  • expression – expression or list of expressions, e.g. ‘x’, or [‘x’, ‘y’]
  • binby – List of expressions for constructing a binned grid
  • limits – description for the min and max values for the expressions, e.g. ‘minmax’, ‘99.7%’, [0, 10], or a list of, e.g. [[0, 10], [0, 20], ‘minmax’]
  • shape – shape for the array where the statistic is calculated on, if only an integer is given, it is used for all dimensions, e.g. shape=128, shape=[128, 256]
  • selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections
  • delay – Do not return the result, but a proxy for asynchronous calculations (currently only for internal use)
  • progress – A callable that takes one argument (a floating point value between 0 and 1) indicating the progress, calculations are cancelled when this callable returns False
Returns:

Numpy array with the given shape, or a scalar when no binby argument is given, with the statistic
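
To make the role of binby, limits and shape concrete, here is a minimal plain-Python sketch of a statistic computed on a 1-d grid. It is independent of vaex; the function name binned_mean is hypothetical, and treating the upper limit as exclusive is an assumption of this sketch, not a documented vaex rule.

```python
def binned_mean(values, bin_values, limits, shape):
    """Mean of `values` in `shape` equal-width bins of `bin_values` over `limits`."""
    lo, hi = limits
    sums = [0.0] * shape
    counts = [0] * shape
    for value, position in zip(values, bin_values):
        if not (lo <= position < hi):
            continue  # rows outside the limits do not contribute
        index = int((position - lo) / (hi - lo) * shape)
        sums[index] += value
        counts[index] += 1
    # Empty bins have no defined mean, so report NaN for them.
    return [s / c if c else float("nan") for s, c in zip(sums, counts)]


x = [0.5, 1.5, 2.5, 3.5]
y = [10.0, 20.0, 30.0, 50.0]
means = binned_mean(y, x, limits=[0, 4], shape=2)
print(means)
```

The result has one entry per bin, which mirrors how the returned array takes its shape from the shape argument when binby is given.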

median_approx(expression, percentage=50.0, binby=[], limits=None, shape=128, percentile_shape=256, percentile_limits='minmax', selection=False, delay=False)[source]

Calculate the median, possibly on a grid defined by binby.

NOTE: this value is approximated by calculating the cumulative distribution on a grid defined by percentile_shape and percentile_limits

Parameters:
  • expression – expression or list of expressions, e.g. ‘x’, or [‘x’, ‘y’]
  • binby – List of expressions for constructing a binned grid
  • limits – description for the min and max values for the expressions, e.g. ‘minmax’, ‘99.7%’, [0, 10], or a list of, e.g. [[0, 10], [0, 20], ‘minmax’]
  • shape – shape for the array where the statistic is calculated on, if only an integer is given, it is used for all dimensions, e.g. shape=128, shape=[128, 256]
  • percentile_limits – description for the min and max values to use for the cumulative histogram, should currently only be ‘minmax’
  • percentile_shape – shape for the array where the cumulative histogram is calculated on, integer type
  • selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections
  • delay – Do not return the result, but a proxy for asynchronous calculations (currently only for internal use)
Returns:

Numpy array with the given shape, or a scalar when no binby argument is given, with the statistic

min(expression, binby=[], limits=None, shape=128, selection=False, delay=False, progress=None)[source]

Calculate the minimum for given expressions, possibly on a grid defined by binby.

Example:

>>> df.min("x")
array(-128.293991)
>>> df.min(["x", "y"])
array([-128.293991 ,  -71.5523682])
>>> df.min("x", binby="x", shape=5, limits=[-10, 10])
array([-9.99919128, -5.99972439, -1.99991322,  2.0000093 ,  6.0004878 ])
Parameters:
  • expression – expression or list of expressions, e.g. ‘x’, or [‘x’, ‘y’]
  • binby – List of expressions for constructing a binned grid
  • limits – description for the min and max values for the expressions, e.g. ‘minmax’, ‘99.7%’, [0, 10], or a list of, e.g. [[0, 10], [0, 20], ‘minmax’]
  • shape – shape for the array where the statistic is calculated on, if only an integer is given, it is used for all dimensions, e.g. shape=128, shape=[128, 256]
  • selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections
  • delay – Do not return the result, but a proxy for asynchronous calculations (currently only for internal use)
  • progress – A callable that takes one argument (a floating point value between 0 and 1) indicating the progress, calculations are cancelled when this callable returns False
Returns:

Numpy array with the given shape, or a scalar when no binby argument is given, with the statistic

minmax(expression, binby=[], limits=None, shape=128, selection=False, delay=False, progress=None)[source]

Calculate the minimum and maximum for expressions, possibly on a grid defined by binby.

Example:

>>> df.minmax("x")
array([-128.293991,  271.365997])
>>> df.minmax(["x", "y"])
array([[-128.293991 ,  271.365997 ],
           [ -71.5523682,  146.465836 ]])
>>> df.minmax("x", binby="x", shape=5, limits=[-10, 10])
array([[-9.99919128, -6.00010443],
           [-5.99972439, -2.00002384],
           [-1.99991322,  1.99998057],
           [ 2.0000093 ,  5.99983597],
           [ 6.0004878 ,  9.99984646]])
Parameters:
  • expression – expression or list of expressions, e.g. ‘x’, or [‘x’, ‘y’]
  • binby – List of expressions for constructing a binned grid
  • limits – description for the min and max values for the expressions, e.g. ‘minmax’, ‘99.7%’, [0, 10], or a list of, e.g. [[0, 10], [0, 20], ‘minmax’]
  • shape – shape for the array where the statistic is calculated on, if only an integer is given, it is used for all dimensions, e.g. shape=128, shape=[128, 256]
  • selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections
  • delay – Do not return the result, but a proxy for asynchronous calculations (currently only for internal use)
  • progress – A callable that takes one argument (a floating point value between 0 and 1) indicating the progress, calculations are cancelled when this callable returns False
Returns:

Numpy array with the given shape, or a scalar when no binby argument is given, with the statistic; the last dimension is of shape (2), holding the minimum and maximum

ml_label_encoder(features=None, prefix='label_encoded_')

Requires vaex.ml: Create vaex.ml.transformations.LabelEncoder and fit it.

ml_lightgbm_model(label, num_round, features=None, copy=False, param={}, classifier=False, prediction_name='lightgbm_prediction')

Requires vaex.ml: create a lightgbm model and train/fit it.

Parameters:
  • label – label to train/fit on
  • num_round – number of rounds
  • features – list of features to train on
  • copy (bool) – Copy data or use the modified xgboost library for efficient transfer
  • classifier (bool) – If true, return the classifier (will use argmax on the probabilities)
Return vaex.ml.lightgbm.LightGBMModel or LightGBMClassifier:
 

fitted LightGBM model

ml_minmax_scaler(features=None, feature_range=[0, 1])

Requires vaex.ml: Create vaex.ml.transformations.MinMaxScaler and fit it

ml_one_hot_encoder(features=None, one=1, zero=0, prefix='')

Requires vaex.ml: Create vaex.ml.transformations.OneHotEncoder and fit it.

Parameters:
  • features – list of features to one-hot encode
  • one – what value to use instead of “1”
  • zero – what value to use instead of “0”
Returns one_hot_encoder:
 

vaex.ml.transformations.OneHotEncoder object

ml_pca(n_components=2, features=None, progress=False)

Requires vaex.ml: Create vaex.ml.transformations.PCA and fit it

ml_pygbm_model(label, max_iter, features=None, param={}, classifier=False, prediction_name='pygbm_prediction', **kwargs)

Requires vaex.ml: create a pygbm model and train/fit it.

Parameters:
  • label – label to train/fit on
  • max_iter – max number of iterations/trees
  • features – list of features to train on
  • classifier (bool) – If true, return the classifier (will use argmax on the probabilities)
Return vaex.ml.pygbm.PyGBMModel or vaex.ml.pygbm.PyGBMClassifier:
 

fitted PyGBM model

ml_standard_scaler(features=None, with_mean=True, with_std=True)

Requires vaex.ml: Create vaex.ml.transformations.StandardScaler and fit it

ml_to_xgboost_dmatrix(label, features=None, selection=None, blocksize=1000000)

label: ndarray containing the labels

ml_train_test_split(test_size=0.2, strings=True, virtual=True, verbose=True)

Splits the dataset into a train and a test part, assuming it is shuffled.
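
The shuffling assumption matters because such a split only slices the rows by position. The plain-Python sketch below shows the idea; split_rows is a hypothetical helper, and taking the test part from the start of the data is an assumption of this sketch, not a statement about vaex.ml.

```python
def split_rows(rows, test_size=0.2):
    """Split `rows` by position into (train, test) without shuffling."""
    n_test = int(round(len(rows) * test_size))
    # Assumption: the first n_test rows form the test part.
    return rows[n_test:], rows[:n_test]


ordered = list(range(10))  # rows that are still sorted, i.e. NOT shuffled
train, test = split_rows(ordered, test_size=0.2)
print(train, test)
```

With sorted input the test part contains only the two smallest values, so it is not representative of the whole dataset; shuffling beforehand avoids this.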

ml_xgboost_model(label, num_round, features=None, copy=False, param={}, prediction_name='xgboost_prediction')

Requires vaex.ml: create an XGBoost model and train/fit it.

Parameters:
  • label – label to train/fit on
  • num_round – number of rounds
  • features – list of features to train on
  • copy (bool) – Copy data or use the modified xgboost library for efficient transfer
Return vaex.ml.xgboost.XGBModel:
 

fitted XGBoost model

mode(expression, binby=[], limits=None, shape=256, mode_shape=64, mode_limits=None, progressbar=False, selection=None)[source]

Calculate/estimate the mode.

mutual_information(x, y=None, mi_limits=None, mi_shape=256, binby=[], limits=None, shape=128, sort=False, selection=False, delay=False)[source]

Estimate the mutual information between x and y on a grid with shape mi_shape and mi_limits, possibly on a grid defined by binby.

If sort is True, the mutual information is returned in sorted (descending) order and the list of expressions is returned in the same order.

Examples:

>>> df.mutual_information("x", "y")
array(0.1511814526380327)
>>> df.mutual_information([["x", "y"], ["x", "z"], ["E", "Lz"]])
array([ 0.15118145,  0.18439181,  1.07067379])
>>> df.mutual_information([["x", "y"], ["x", "z"], ["E", "Lz"]], sort=True)
(array([ 1.07067379,  0.18439181,  0.15118145]),
[['E', 'Lz'], ['x', 'z'], ['x', 'y']])
Parameters:
  • x – expression or list of expressions, e.g. ‘x’, or [‘x’, ‘y’]
  • y – expression or list of expressions, e.g. ‘x’, or [‘x’, ‘y’]
  • mi_limits – description for the min and max values of the grid on which the mutual information is estimated, e.g. ‘minmax’, ‘99.7%’, [0, 10], or a list of, e.g. [[0, 10], [0, 20], ‘minmax’]
  • mi_shape – shape of the grid on which the mutual information is estimated, if only an integer is given, it is used for all dimensions, e.g. mi_shape=256
  • binby – List of expressions for constructing a binned grid
  • limits – description for the min and max values for the expressions, e.g. ‘minmax’, ‘99.7%’, [0, 10], or a list of, e.g. [[0, 10], [0, 20], ‘minmax’]
  • shape – shape for the array where the statistic is calculated on, if only an integer is given, it is used for all dimensions, e.g. shape=128, shape=[128, 256]
  • sort – return mutual information in sorted (descending) order, and also return the corresponding list of expressions when sort is True
  • selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections
  • delay – Do not return the result, but a proxy for asynchronous calculations (currently only for internal use)
Returns:

Numpy array with the given shape, or a scalar when no binby argument is given, with the statistic
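
The following plain-Python sketch illustrates what estimating mutual information on a grid means: count pairs in a 2-d histogram, turn the counts into probabilities, and sum p(x,y) * log(p(x,y) / (p(x) * p(y))). It does not use vaex; the function name is hypothetical, and the use of the natural logarithm and of exclusive upper limits are assumptions of this sketch.

```python
import math


def mutual_information_on_grid(xs, ys, limits, shape):
    """Estimate I(x; y) from a shape-by-shape histogram over `limits`."""
    (x_lo, x_hi), (y_lo, y_hi) = limits
    counts = [[0] * shape for _ in range(shape)]
    total = 0
    for x, y in zip(xs, ys):
        if not (x_lo <= x < x_hi and y_lo <= y < y_hi):
            continue  # ignore points outside the grid
        i = int((x - x_lo) / (x_hi - x_lo) * shape)
        j = int((y - y_lo) / (y_hi - y_lo) * shape)
        counts[i][j] += 1
        total += 1
    # Marginal probabilities along each axis.
    px = [sum(row) / total for row in counts]
    py = [sum(counts[i][j] for i in range(shape)) / total for j in range(shape)]
    mi = 0.0
    for i in range(shape):
        for j in range(shape):
            pxy = counts[i][j] / total
            if pxy > 0:
                mi += pxy * math.log(pxy / (px[i] * py[j]))
    return mi


grid_limits = [[0, 1], [0, 1]]
# y is fully determined by x: maximal information for a 2x2 grid.
dependent = mutual_information_on_grid(
    [0.25, 0.75, 0.25, 0.75], [0.25, 0.75, 0.25, 0.75], grid_limits, 2)
# Every (x, y) combination is equally likely: no shared information.
independent = mutual_information_on_grid(
    [0.25, 0.25, 0.75, 0.75], [0.25, 0.75, 0.25, 0.75], grid_limits, 2)
print(dependent, independent)
```

A coarser or finer grid changes the estimate, which is why the grid shape and limits are exposed as arguments.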

nbytes

Alias for df.byte_size(), see DataFrame.byte_size().

percentile_approx(expression, percentage=50.0, binby=[], limits=None, shape=128, percentile_shape=1024, percentile_limits='minmax', selection=False, delay=False)[source]

Calculate the percentile given by percentage, possibly on a grid defined by binby.

NOTE: this value is approximated by calculating the cumulative distribution on a grid defined by percentile_shape and percentile_limits.

Example:

>>> df.percentile_approx("x", 10), df.percentile_approx("x", 90)
(array([-8.3220355]), array([ 7.92080358]))
>>> df.percentile_approx("x", 50, binby="x", shape=5, limits=[-10, 10])
array([[-7.56462982],
           [-3.61036641],
           [-0.01296306],
           [ 3.56697863],
           [ 7.45838367]])
Parameters:
  • expression – expression or list of expressions, e.g. ‘x’, or [‘x’, ‘y’]
  • binby – List of expressions for constructing a binned grid
  • limits – description for the min and max values for the expressions, e.g. ‘minmax’, ‘99.7%’, [0, 10], or a list of, e.g. [[0, 10], [0, 20], ‘minmax’]
  • shape – shape for the array where the statistic is calculated on, if only an integer is given, it is used for all dimensions, e.g. shape=128, shape=[128, 256]
  • percentile_limits – description for the min and max values to use for the cumulative histogram, should currently only be ‘minmax’
  • percentile_shape – shape for the array where the cumulative histogram is calculated on, integer type
  • selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections
  • delay – Do not return the result, but a proxy for asynchronous calculations (currently only for internal use)
Returns:

Numpy array with the given shape, or a scalar when no binby argument is given, with the statistic
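
The approximation mentioned in the note can be pictured with the plain-Python sketch below: build a histogram with percentile_shape bins between the minimum and maximum, accumulate the counts, and interpolate inside the bin where the cumulative count crosses the requested percentage. This is an illustration of the technique only; the function name is hypothetical and the interpolation details are assumptions, not vaex's actual code.

```python
def percentile_on_grid(values, percentage, percentile_shape=256):
    """Approximate a percentile from a cumulative histogram ('minmax' limits)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return lo  # all values identical, nothing to interpolate
    width = (hi - lo) / percentile_shape
    counts = [0] * percentile_shape
    for v in values:
        # The maximum value would fall just past the last bin; clamp it.
        index = min(int((v - lo) / width), percentile_shape - 1)
        counts[index] += 1
    target = percentage / 100.0 * len(values)
    cumulative = 0
    for index, count in enumerate(counts):
        if count and cumulative + count >= target:
            # Linear interpolation inside the bin that crosses the target.
            fraction = (target - cumulative) / count
            return lo + (index + fraction) * width
        cumulative += count
    return hi


values = [i / 1000.0 for i in range(1001)]  # evenly spread over [0, 1]
median = percentile_on_grid(values, 50)
print(median)
```

The error of such an estimate is bounded by roughly one bin width, so a larger percentile_shape gives a more accurate result at the cost of a larger histogram.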

plot(x=None, y=None, z=None, what='count(*)', vwhat=None, reduce=['colormap'], f=None, normalize='normalize', normalize_axis='what', vmin=None, vmax=None, shape=256, vshape=32, limits=None, grid=None, colormap='afmhot', figsize=None, xlabel=None, ylabel=None, aspect='auto', tight_layout=True, interpolation='nearest', show=False, colorbar=True, colorbar_label=None, selection=None, selection_labels=None, title=None, background_color='white', pre_blend=False, background_alpha=1.0, visual={'column': 'what', 'fade': 'selection', 'layer': 'z', 'row': 'subspace', 'x': 'x', 'y': 'y'}, smooth_pre=None, smooth_post=None, wrap=True, wrap_columns=4, return_extra=False, hardcopy=None)

Viz data in a 2d histogram/heatmap.

Declarative plotting of statistical plots using matplotlib, supports subplots, selections, layers.

Instead of passing x and y, pass a list as x argument for multiple panels. Give what a list of options to have multiple panels. When both are present, the panels are organized in a column/row order.

This method creates a 6-dimensional ‘grid’, where each dimension can map to a visual dimension. The grid dimensions are:

  • x: shape determined by shape, content by x argument or the first dimension of each space
  • y: shape determined by shape, content by y argument or the second dimension of each space
  • z: related to the z argument
  • selection: shape equals length of selection argument
  • what: shape equals length of what argument
  • space: shape equals length of x argument if multiple values are given

By default, its shape is (1, 1, 1, 1, shape, shape) (where x is the last dimension)

The visual dimensions are

  • x: x coordinate on a plot / image (default maps to grid’s x)
  • y: y coordinate on a plot / image (default maps to grid’s y)
  • layer: each image in this dimension is blended together to one image (default maps to z)
  • fade: each image is shown faded after the next image (default maps to selection)
  • row: rows of subplots (default maps to space)
  • column: columns of subplots (default maps to what)

All these mappings can be changed by the visual argument, some examples:

>>> df.plot('x', 'y', what=['mean(x)', 'correlation(vx, vy)'])

Will plot each ‘what’ as a column.

>>> df.plot('x', 'y', selection=['FeH < -3', '(FeH >= -3) & (FeH < -2)'], visual=dict(column='selection'))

Will plot each selection as a column, instead of faded on top of each other.

Parameters:
  • x – Expression to bin in the x direction (by default maps to x), or list of pairs, like [[‘x’, ‘y’], [‘x’, ‘z’]], if multiple pairs are given, this dimension maps to rows by default
  • y – y (by default maps to y)
  • z – Expression to bin in the z direction, followed by a :start,end,shape signature, e.g. ‘FeH:-3,1,5’ will produce 5 layers between -3 and 1 (by default maps to layer)
  • what – What to plot, count(*) will show a N-d histogram, mean(‘x’), the mean of the x column, sum(‘x’) the sum, std(‘x’) the standard deviation, correlation(‘vx’, ‘vy’) the correlation coefficient. Can also be a list of values, like [‘count(x)’, ‘std(vx)’], (by default maps to column)
  • reduce
  • f – transform values by: ‘identity’ does nothing ‘log’ or ‘log10’ will show the log of the value
  • normalize – normalization function, currently only ‘normalize’ is supported
  • normalize_axis – which axes to normalize on, None means normalize by the global maximum.
  • vmin – instead of automatic normalization (using normalize and normalize_axis), scale the data between vmin and vmax to [0, 1]
  • vmax – see vmin
  • shape – shape/size of the n-D histogram grid
  • limits – list of [[xmin, xmax], [ymin, ymax]], or a description such as ‘minmax’, ‘99%’
  • grid – if the binning is done before by yourself, you can pass it
  • colormap – matplotlib colormap to use
  • figsize – (x, y) tuple passed to pylab.figure for setting the figure size
  • xlabel
  • ylabel
  • aspect
  • tight_layout – call pylab.tight_layout or not
  • colorbar – plot a colorbar or not
  • interpolation – interpolation for imshow, possible options are: ‘nearest’, ‘bilinear’, ‘bicubic’, see matplotlib for more
  • return_extra
Returns:

plot1d(x=None, what='count(*)', grid=None, shape=64, facet=None, limits=None, figsize=None, f='identity', n=None, normalize_axis=None, xlabel=None, ylabel=None, label=None, selection=None, show=False, tight_layout=True, hardcopy=None, **kwargs)

Viz data in 1d (histograms, running means etc)

Example

>>> df.plot1d(df.x)
>>> df.plot1d(df.x, limits=[0, 100], shape=100)
>>> df.plot1d(df.x, what='mean(y)', limits=[0, 100], shape=100)

If you want to do a computation yourself, pass the grid argument, but you are responsible for passing the same limits arguments:

>>> means = df.mean(df.y, binby=df.x, limits=[0, 100], shape=100)/100.
>>> df.plot1d(df.x, limits=[0, 100], shape=100, grid=means, label='mean(y)/100')
Parameters:
  • x – Expression to bin in the x direction
  • what – What to plot, count(*) will show a N-d histogram, mean(‘x’), the mean of the x column, sum(‘x’) the sum
  • grid – If the binning is done before by yourself, you can pass it
  • facet – Expression to produce facetted plots ( facet=’x:0,1,12’ will produce 12 plots with x in a range between 0 and 1)
  • limits – list of [xmin, xmax], or a description such as ‘minmax’, ‘99%’
  • figsize – (x, y) tuple passed to pylab.figure for setting the figure size
  • f – transform values by: ‘identity’ does nothing ‘log’ or ‘log10’ will show the log of the value
  • n – normalization function, currently only ‘normalize’ is supported, or None for no normalization
  • normalize_axis – which axes to normalize on, None means normalize by the global maximum.
  • xlabel – String for label on x axis (may contain latex)
  • ylabel – Same for y axis
  • kwargs – extra arguments passed to pylab.plot
  • tight_layout – call pylab.tight_layout or not
Returns:

plot2d_contour(x=None, y=None, what='count(*)', limits=None, shape=256, selection=None, f='identity', figsize=None, xlabel=None, ylabel=None, aspect='auto', levels=None, fill=False, colorbar=False, colorbar_label=None, colormap=None, colors=None, linewidths=None, linestyles=None, vmin=None, vmax=None, grid=None, show=None, **kwargs)

Plot counting contours on a 2D grid.

Parameters:
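  • x – expression or list of expressions, e.g. ‘x’, or [‘x’, ‘y’]
  • y – expression or list of expressions, e.g. ‘x’, or [‘x’, ‘y’]
  • what – What to plot, count(*) will show a N-d histogram, mean(‘x’), the mean of the x column, sum(‘x’) the sum, std(‘x’) the standard deviation, correlation(‘vx’, ‘vy’) the correlation coefficient. Can also be a list of values, like [‘count(x)’, ‘std(vx)’], (by default maps to column)
  • limits – description for the min and max values for the expressions, e.g. ‘minmax’, ‘99.7%’, [0, 10], or a list of, e.g. [[0, 10], [0, 20], ‘minmax’]
  • shape – shape for the array where the statistic is calculated on, if only an integer is given, it is used for all dimensions, e.g. shape=128, shape=[128, 256]
  • selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections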
  • f – transform values by: ‘identity’ does nothing ‘log’ or ‘log10’ will show the log of the value
  • figsize – (x, y) tuple passed to pylab.figure for setting the figure size
  • xlabel – label of the x-axis (defaults to param x)
  • ylabel – label of the y-axis (defaults to param y)
  • aspect – the aspect ratio of the figure
  • levels – the contour levels to be passed on pylab.contour or pylab.contourf
  • colorbar – plot a colorbar or not
  • colorbar_label – the label of the colourbar (defaults to param what)
  • colormap – matplotlib colormap to pass on to pylab.contour or pylab.contourf
  • colors – the colours of the contours
  • linewidths – the widths of the contours
  • linestyles – the style of the contour lines
  • vmin – instead of automatic normalization, scale the data between vmin and vmax
  • vmax – see vmin
  • grid – if the binning is done before by yourself, you can pass it
  • show
plot3d(x, y, z, vx=None, vy=None, vz=None, vwhat=None, limits=None, grid=None, what='count(*)', shape=128, selection=[None, True], f=None, vcount_limits=None, smooth_pre=None, smooth_post=None, grid_limits=None, normalize='normalize', colormap='afmhot', figure_key=None, fig=None, lighting=True, level=[0.1, 0.5, 0.9], opacity=[0.01, 0.05, 0.1], level_width=0.1, show=True, **kwargs)[source]

Use at own risk, requires ipyvolume

plot_bq(x, y, grid=None, shape=256, limits=None, what='count(*)', figsize=None, f='identity', figure_key=None, fig=None, axes=None, xlabel=None, ylabel=None, title=None, show=True, selection=[None, True], colormap='afmhot', grid_limits=None, normalize='normalize', grid_before=None, what_kwargs={}, type='default', scales=None, tool_select=False, bq_cleanup=True, **kwargs)[source]

Deprecated: use plot_widget

plot_widget(x, y, z=None, grid=None, shape=256, limits=None, what='count(*)', figsize=None, f='identity', figure_key=None, fig=None, axes=None, xlabel=None, ylabel=None, title=None, show=True, selection=[None, True], colormap='afmhot', grid_limits=None, normalize='normalize', grid_before=None, what_kwargs={}, type='default', scales=None, tool_select=False, bq_cleanup=True, backend='bqplot', **kwargs)[source]

Viz 1d, 2d or 3d in a Jupyter notebook

Note

This API is not fully settled and may change in the future

Examples

>>> df.plot_widget(df.x, df.y, backend='bqplot')
>>> df.plot_widget(df.pickup_longitude, df.pickup_latitude, backend='ipyleaflet')
Parameters:backend – Widget backend to use: ‘bqplot’, ‘ipyleaflet’, ‘ipyvolume’, ‘matplotlib’
propagate_uncertainties(columns, depending_variables=None, cov_matrix='auto', covariance_format='{}_{}_covariance', uncertainty_format='{}_uncertainty')[source]

Propagates uncertainties (full covariance matrix) for a set of virtual columns.

The covariance matrix of the depending variables is guessed by finding columns prefixed by “e” or “e_” or postfixed by “_error”, “_uncertainty”, “e” and “_e”. Off-diagonal terms (covariance or correlation) are found by the postfixes “_correlation” or “_corr” for correlations, or “_covariance” or “_cov” for covariances. (Note that x_y_cov = x_e * y_e * x_y_correlation.)

Example

>>> df = vaex.from_scalars(x=1, y=2, e_x=0.1, e_y=0.2)
>>> df["u"] = df.x + df.y
>>> df["v"] = np.log10(df.x)
>>> df.propagate_uncertainties([df.u, df.v])
>>> df.u_uncertainty, df.v_uncertainty
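
For the values in the example above, standard first-order error propagation (output covariance = J C Jᵀ, with J the matrix of partial derivatives) can be worked out by hand. The plain-Python sketch below assumes x and y are uncorrelated and is only a numerical check of that textbook formula; it does not call vaex, and the variable names merely mirror the default uncertainty_format and covariance_format.

```python
import math

x, y = 1.0, 2.0
e_x, e_y = 0.1, 0.2  # uncertainties of x and y, assumed uncorrelated

# u = x + y, so du/dx = 1 and du/dy = 1.
du_dx, du_dy = 1.0, 1.0
u_uncertainty = math.sqrt((du_dx * e_x) ** 2 + (du_dy * e_y) ** 2)

# v = log10(x), so dv/dx = 1 / (x * ln 10) and dv/dy = 0.
dv_dx, dv_dy = 1.0 / (x * math.log(10.0)), 0.0
v_uncertainty = math.sqrt((dv_dx * e_x) ** 2 + (dv_dy * e_y) ** 2)

# Off-diagonal term: u and v both depend on x, so they are correlated.
u_v_covariance = du_dx * dv_dx * e_x ** 2 + du_dy * dv_dy * e_y ** 2

print(u_uncertainty, v_uncertainty, u_v_covariance)
```

Even though x and y are taken as independent here, u and v end up with a non-zero covariance, which is why the full covariance matrix is propagated rather than only the diagonal.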
Parameters:
  • columns – list of columns for which to calculate the covariance matrix.
  • depending_variables – If not given, it is found out automatically, otherwise a list of columns which have uncertainties.
  • cov_matrix – List of list with expressions giving the covariance matrix, in the same order as depending_variables. If ‘full’ or ‘auto’, the covariance matrix for the depending_variables will be guessed, where ‘full’ gives an error if an entry was not found.
remove_virtual_meta()[source]

Removes the file that stores the virtual columns etc.; it does not change the current virtual columns etc.

rename_column(name, new_name, unique=False, store_in_state=True)[source]

Renames a column. Note that this only changes the in-memory name; the change is not reflected on disk.

sample(n=None, frac=None, replace=False, weights=None, random_state=None)[source]

Returns a DataFrame with a random set of rows

Note

Note that no copy of the underlying data is made, only a view/reference is made.

Provide either n or frac.

Example:

>>> import vaex, numpy as np
>>> df = vaex.from_arrays(s=np.array(['a', 'b', 'c', 'd']), x=np.arange(1,5))
>>> df
  #  s      x
  0  a      1
  1  b      2
  2  c      3
  3  d      4
>>> df.sample(n=2, random_state=42) # 2 random rows, fixed seed
  #  s      x
  0  b      2
  1  d      4
>>> df.sample(frac=1, random_state=42) # 'shuffling'
  #  s      x
  0  c      3
  1  a      1
  2  d      4
  3  b      2
>>> df.sample(frac=1, replace=True, random_state=42) # useful for bootstrap (may contain repeated samples)
  #  s      x
  0  d      4
  1  a      1
  2  a      1
  3  d      4
Parameters:
  • n (int) – number of samples to take (default 1 if frac is None)
  • frac (float) – fraction of rows to take
  • replace (bool) – If true, a row may be drawn multiple times
  • weights (str or expression) – (unnormalized) probability that a row can be drawn
  • random_state (int or RandomState) – seed or RandomState for reproducibility; when None, a random seed is chosen
Returns:

Returns a new DataFrame with a shallow copy/view of the underlying data

Return type:

DataFrame
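
The interplay of n, frac and replace can be sketched in plain Python as drawing row indices and then looking rows up by index, which is also why no data needs to be copied. The helper sample_indices is hypothetical, weights are left out, and the rows drawn for a given seed will not match the vaex output shown above because a different random generator is used.

```python
import random


def sample_indices(n_rows, n=None, frac=None, replace=False, seed=None):
    """Return row indices for a random sample; provide either n or frac."""
    if n is not None and frac is not None:
        raise ValueError("provide either n or frac, not both")
    if frac is not None:
        n = int(round(frac * n_rows))
    elif n is None:
        n = 1  # default when neither n nor frac is given
    rng = random.Random(seed)
    if replace:
        # With replacement: every draw is independent, repeats are possible.
        return [rng.randrange(n_rows) for _ in range(n)]
    # Without replacement: each row index appears at most once.
    return rng.sample(range(n_rows), n)


rows = ["a", "b", "c", "d"]
shuffled = [rows[i] for i in sample_indices(len(rows), frac=1, seed=42)]
bootstrap = [rows[i] for i in sample_indices(len(rows), frac=1, replace=True, seed=42)]
print(shuffled, bootstrap)
```

With frac=1 and replace=False the result is a permutation of all rows (a shuffle), while replace=True gives a bootstrap sample of the same length.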

scatter(x, y, xerr=None, yerr=None, cov=None, corr=None, s_expr=None, c_expr=None, labels=None, selection=None, length_limit=50000, length_check=True, label=None, xlabel=None, ylabel=None, errorbar_kwargs={}, ellipse_kwargs={}, **kwargs)

Viz (small amounts) of data in 2d using a scatter plot

Convenience wrapper around pylab.scatter for working with small DataFrames or selections

Parameters:
  • x – Expression for x axis
  • y – Idem for y
  • s_expr – When given, use it for the s (size) argument of pylab.scatter
  • c_expr – When given, use it for the c (color) argument of pylab.scatter
  • labels – Annotate the points with these text values
  • selection – Single selection expression, or None
  • length_limit – maximum number of rows it will plot
  • length_check – should we do the maximum row check or not?
  • label – label for the legend
  • xlabel – label for x axis, if None .label(x) is used
  • ylabel – label for y axis, if None .label(y) is used
  • errorbar_kwargs – extra dict with arguments passed to plt.errorbar
  • kwargs – extra arguments passed to pylab.scatter
Returns:

select(boolean_expression, mode='replace', name='default', executor=None)[source]

Perform a selection, defined by the boolean expression, and combined with the previous selection using the given mode.

Selections are recorded in a history tree, per name, undo/redo can be done for them separately.
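
How a new selection is combined with the previous one can be pictured as an element-wise boolean operation on two masks. The plain-Python sketch below illustrates the five modes; combine_selection is a hypothetical helper, and reading subtract as "previous and not new" is an assumption of this sketch.

```python
def combine_selection(previous, new, mode="replace"):
    """Combine two boolean row masks according to a selection mode."""
    operations = {
        "replace": lambda p, n: n,          # ignore the previous selection
        "and": lambda p, n: p and n,        # keep rows selected by both
        "or": lambda p, n: p or n,          # keep rows selected by either
        "xor": lambda p, n: p != n,         # keep rows selected by exactly one
        "subtract": lambda p, n: p and not n,  # remove new rows from previous
    }
    op = operations[mode]
    return [op(p, n) for p, n in zip(previous, new)]


previous = [True, True, False, False]
new = [True, False, True, False]
results = {
    mode: combine_selection(previous, new, mode)
    for mode in ("replace", "and", "or", "xor", "subtract")
}
for mode, mask in results.items():
    print(mode, mask)
```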

Parameters:
  • boolean_expression (str) – Any valid column expression, with comparison operators
  • mode (str) – Possible boolean operator: replace/and/or/xor/subtract
  • name (str) – history tree or selection ‘slot’ to use
  • executor
Returns:

select_box(spaces, limits, mode='replace', name='default')[source]

Select an n-dimensional rectangular box bounded by limits.

The following examples are equivalent:

>>> df.select_box(['x', 'y'], [(0, 10), (0, 1)])
>>> df.select_rectangle('x', 'y', [(0, 10), (0, 1)])
Parameters:
  • spaces – list of expressions
  • limits – sequence of shape [(x1, x2), (y1, y2)]
  • mode
  • name
Returns:

select_circle(x, y, xc, yc, r, mode='replace', name='default', inclusive=True)[source]

Select a circular region centred on xc, yc, with a radius of r.
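
Geometrically, a point belongs to the circle when its squared distance to the centre does not exceed r squared. The plain-Python sketch below shows that test; in_circle is a hypothetical helper, and interpreting inclusive as "points exactly on the boundary are selected" is an assumption of this sketch.

```python
def in_circle(x, y, xc, yc, r, inclusive=True):
    """Return True if (x, y) lies in the circle centred on (xc, yc) with radius r."""
    distance_squared = (x - xc) ** 2 + (y - yc) ** 2
    if inclusive:
        return distance_squared <= r ** 2
    return distance_squared < r ** 2


# Same circle as in the select_circle example: centre (2, 3), radius 1.
centre_point = in_circle(2, 3, 2, 3, 1)
boundary_point = in_circle(3, 3, 2, 3, 1)
boundary_point_strict = in_circle(3, 3, 2, 3, 1, inclusive=False)
outside_point = in_circle(4, 3, 2, 3, 1)
print(centre_point, boundary_point, boundary_point_strict, outside_point)
```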

Example:

>>> df.select_circle('x','y',2,3,1)
Parameters:
  • x – expression for the x space
  • y – expression for the y space
  • xc – location of the centre of the circle in x
  • yc – location of the centre of the circle in y
  • r – the radius of the circle
  • name – name of the selection
  • mode
Returns:

select_ellipse(x, y, xc, yc, width, height, angle=0, mode='replace', name='default', radians=False, inclusive=True)[source]

Select an elliptical region centred on xc, yc, with a certain width, height and angle.

Example:

>>> df.select_ellipse('x','y', 2, -1, 5,1, 30, name='my_ellipse')
Parameters:
  • x – expression for the x space
  • y – expression for the y space
  • xc – location of the centre of the ellipse in x
  • yc – location of the centre of the ellipse in y
  • width – the width of the ellipse (diameter)
  • height – the height of the ellipse (diameter)
  • angle – (degrees) orientation of the ellipse, counter-clockwise measured from the y axis
  • name – name of the selection
  • mode
Returns:

select_inverse(name='default', executor=None)[source]

Invert the selection, i.e. what is selected will not be, and vice versa

Parameters:
  • name (str) –
  • executor
Returns:

select_lasso(expression_x, expression_y, xsequence, ysequence, mode='replace', name='default', executor=None)[source]

For performance reasons, a lasso selection is handled differently.

Parameters:
  • expression_x (str) – Name/expression for the x coordinate
  • expression_y (str) – Name/expression for the y coordinate
  • xsequence – list of x numbers defining the lasso, together with y
  • ysequence
  • mode (str) – Possible boolean operator: replace/and/or/xor/subtract
  • name (str) –
  • executor
Returns:

select_non_missing(drop_nan=True, drop_masked=True, column_names=None, mode='replace', name='default')[source]

Create a selection that selects rows having non missing values for all columns in column_names.

The naming follows pandas; no rows are really dropped, but a mask is kept to keep track of the selection.
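
The mask idea can be illustrated with a plain-Python sketch that marks each row as kept or not, without removing anything. The helper non_missing_mask is hypothetical, and None is used here as a stand-in for a masked (numpy.ma) value.

```python
import math


def non_missing_mask(columns, drop_nan=True, drop_masked=True):
    """Return one boolean per row: True when no considered column is missing.

    `columns` maps column names to equally long lists of values.
    """
    n_rows = len(next(iter(columns.values())))
    mask = []
    for i in range(n_rows):
        keep = True
        for values in columns.values():
            value = values[i]
            if drop_masked and value is None:
                keep = False
            elif drop_nan and isinstance(value, float) and math.isnan(value):
                keep = False  # NaN only exists for float values
        mask.append(keep)
    return mask


columns = {
    "x": [1.0, float("nan"), 3.0, 4.0],
    "y": [1, 2, None, 4],
}
mask_all = non_missing_mask(columns)
mask_keep_nan = non_missing_mask(columns, drop_nan=False)
print(mask_all, mask_keep_nan)
```

The data itself is untouched; only the mask decides which rows later computations see.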

Parameters:
  • drop_nan – drop rows when there is a NaN in any of the columns (will only affect float values)
  • drop_masked – drop rows when there is a masked value in any of the columns
  • column_names – The columns to consider, default: all (real, non-virtual) columns
  • mode (str) – Possible boolean operator: replace/and/or/xor/subtract
  • name (str) – history tree or selection ‘slot’ to use
Returns:
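The idea of selecting instead of dropping can be sketched in pure Python as follows. The helper non_missing_mask is hypothetical, and None is used here as a stand-in for a masked value; neither is part of vaex.

```python
import math

columns = {
    "x": [1.0, float("nan"), 3.0, 4.0],
    "y": [10.0, 20.0, None, 40.0],  # None stands in for a masked value here
}

def non_missing_mask(columns, drop_nan=True, drop_masked=True):
    """Hypothetical helper: True for rows with no missing value in any column."""
    n_rows = len(next(iter(columns.values())))
    mask = []
    for i in range(n_rows):
        keep = True
        for values in columns.values():
            v = values[i]
            if drop_masked and v is None:
                keep = False
            elif drop_nan and isinstance(v, float) and math.isnan(v):
                keep = False
        mask.append(keep)
    return mask

mask = non_missing_mask(columns)
mask_keep_nan = non_missing_mask(columns, drop_nan=False)
```

The columns themselves keep all four rows; only the boolean mask records which rows are selected.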

select_nothing(name='default')[source]

Select nothing.

select_rectangle(x, y, limits, mode='replace', name='default')[source]

Select a 2d rectangular box in the space given by x and y, bounds by limits.

Example:

>>> df.select_rectangle('x', 'y', [(0, 10), (0, 1)])
Parameters:
  • x – expression for the x space
  • y – expression for the y space
  • limits – sequence of shape [(x1, x2), (y1, y2)]
  • mode
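The mode argument shared by the select_* methods (replace/and/or/xor/subtract) describes how a new selection is combined with the one already stored under the same name. A minimal sketch on plain boolean lists, assuming that subtract means "previously selected and not newly selected"; the helper combine is hypothetical:

```python
def combine(previous, new, mode="replace"):
    """Hypothetical helper: combine two boolean row masks according to mode."""
    ops = {
        "replace": lambda p, n: n,
        "and": lambda p, n: p and n,
        "or": lambda p, n: p or n,
        "xor": lambda p, n: p != n,
        "subtract": lambda p, n: p and not n,
    }
    op = ops[mode]
    return [op(p, n) for p, n in zip(previous, new)]

previous = [True, True, False, False]
new = [True, False, True, False]
results = {mode: combine(previous, new, mode)
           for mode in ("replace", "and", "or", "xor", "subtract")}
```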
selected_length()[source]

Returns the number of rows that are selected.

selection_can_redo(name='default')[source]

Can selection name be redone?

selection_can_undo(name='default')[source]

Can selection name be undone?

selection_redo(name='default', executor=None)[source]

Redo selection, for the name.

selection_undo(name='default', executor=None)[source]

Undo selection, for the name.

set_active_fraction(value)[source]

Sets the active_fraction, sets the picked row to None, and removes the selection.

TODO: we may be able to keep the selection, if we keep the expression, and also the picked row

set_active_range(i1, i2)[source]

Sets the active range, sets the picked row to None, and removes the selection.

TODO: we may be able to keep the selection, if we keep the expression, and also the picked row

set_current_row(value)[source]

Set the current row, and emit the signal signal_pick.

set_selection(selection, name='default', executor=None)[source]

Sets the selection object

Parameters:
  • selection – Selection object
  • name – selection ‘slot’
  • executor
Returns:

set_variable(name, expression_or_value, write=True)[source]

Set the variable to an expression or value defined by expression_or_value.

Example

>>> df.set_variable("a", 2.)
>>> df.set_variable("b", "a**2")
>>> df.get_variable("b")
'a**2'
>>> df.evaluate_variable("b")
4.0
Parameters:
  • name – Name of the variable
  • write – write variable to meta file
  • expression_or_value – value or expression
sort(by, ascending=True, kind='quicksort')[source]

Return a sorted DataFrame, sorted by the expression ‘by’

Note

Note that no copy of the underlying data is made, only a view/reference is made.

Note

Note that filters will be ignored (since they may change); you may want to consider running extract() first.

Example:

>>> import vaex, numpy as np
>>> df = vaex.from_arrays(s=np.array(['a', 'b', 'c', 'd']), x=np.arange(1,5))
>>> df['y'] = (df.x-1.8)**2
>>> df
  #  s      x     y
  0  a      1  0.64
  1  b      2  0.04
  2  c      3  1.44
  3  d      4  4.84
>>> df.sort('y', ascending=False)  # Note: passing '(x-1.8)**2' gives the same result
  #  s      x     y
  0  d      4  4.84
  1  c      3  1.44
  2  a      1  0.64
  3  b      2  0.04
Parameters:
  • by (str or expression) – expression to sort by
  • ascending (bool) – ascending (default, True) or descending (False)
  • kind (str) – kind of algorithm to use (passed to numpy.argsort)
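The "view, not copy" behaviour can be pictured as sorting a list of row indices and reading the unchanged columns through it. The following pure-Python sketch reproduces the ordering of the example above; it illustrates the idea only and is not how vaex implements sort.

```python
s = ["a", "b", "c", "d"]
x = [1, 2, 3, 4]
y = [(xi - 1.8) ** 2 for xi in x]

# Row order that sorts y in descending order; the columns are left untouched.
order = sorted(range(len(y)), key=lambda i: y[i], reverse=True)

# A "view" is then just the original columns read through the index list.
sorted_s = [s[i] for i in order]
sorted_x = [x[i] for i in order]
```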
state_get()[source]

Return the internal state of the DataFrame in a dictionary

Example:

>>> import vaex
>>> df = vaex.from_scalars(x=1, y=2)
>>> df['r'] = (df.x**2 + df.y**2)**0.5
>>> df.state_get()
{'active_range': [0, 1],
'column_names': ['x', 'y', 'r'],
'description': None,
'descriptions': {},
'functions': {},
'renamed_columns': [],
'selections': {'__filter__': None},
'ucds': {},
'units': {},
'variables': {},
'virtual_columns': {'r': '(((x ** 2) + (y ** 2)) ** 0.5)'}}
state_load(f, use_active_range=False)[source]

Load a state previously stored by DataFrame.state_write(), see also DataFrame.state_set().

state_set(state, use_active_range=False)[source]

Sets the internal state of the df

Example:

>>> import vaex
>>> df = vaex.from_scalars(x=1, y=2)
>>> df['r'] = (df.x**2 + df.y**2)**0.5
>>> df
  #    x    y        r
  0    1    2  2.23607
>>> state = df.state_get()
>>> state
{'active_range': [0, 1],
'column_names': ['x', 'y', 'r'],
'description': None,
'descriptions': {},
'functions': {},
'renamed_columns': [],
'selections': {'__filter__': None},
'ucds': {},
'units': {},
'variables': {},
'virtual_columns': {'r': '(((x ** 2) + (y ** 2)) ** 0.5)'}}
>>> df2 = vaex.from_scalars(x=3, y=4)
>>> df2.state_set(state)  # now the virtual columns are 'copied'
>>> df2
  #    x    y    r
  0    3    4    5
Parameters:
  • state – dict as returned by DataFrame.state_get().
  • use_active_range (bool) – Whether to use the active range or not.
state_write(f)[source]

Write the internal state to a json or yaml file (see DataFrame.state_get())

Example

>>> import vaex
>>> df = vaex.from_scalars(x=1, y=2)
>>> df['r'] = (df.x**2 + df.y**2)**0.5
>>> df.state_write('state.json')
>>> print(open('state.json').read())
{
"virtual_columns": {
    "r": "(((x ** 2) + (y ** 2)) ** 0.5)"
},
"column_names": [
    "x",
    "y",
    "r"
],
"renamed_columns": [],
"variables": {
    "pi": 3.141592653589793,
    "e": 2.718281828459045,
    "km_in_au": 149597870.7,
    "seconds_per_year": 31557600
},
"functions": {},
"selections": {
    "__filter__": null
},
"ucds": {},
"units": {},
"descriptions": {},
"description": null,
"active_range": [
    0,
    1
]
}
>>> df.state_write('state.yaml')
>>> print(open('state.yaml').read())
active_range:
- 0
- 1
column_names:
- x
- y
- r
description: null
descriptions: {}
functions: {}
renamed_columns: []
selections:
  __filter__: null
ucds: {}
units: {}
variables:
  pi: 3.141592653589793
  e: 2.718281828459045
  km_in_au: 149597870.7
  seconds_per_year: 31557600
virtual_columns:
  r: (((x ** 2) + (y ** 2)) ** 0.5)
Parameters:f (str) – filename (ending in .json or .yaml)
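Because the state is a plain dictionary of JSON-compatible values, it survives a round trip through a text file unchanged. A small sketch using only the standard json module, with a reduced state dictionary that reuses a few keys from the examples above (the reduction is for brevity and is not what vaex writes):

```python
import json

state = {
    "column_names": ["x", "y", "r"],
    "virtual_columns": {"r": "(((x ** 2) + (y ** 2)) ** 0.5)"},
    "selections": {"__filter__": None},
    "active_range": [0, 1],
}

text = json.dumps(state, indent=2)   # what would be written to a .json file
restored = json.loads(text)          # what would be read back
```

Python's None becomes null in the JSON text and comes back as None.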
std(expression, binby=[], limits=None, shape=128, selection=False, delay=False, progress=None)[source]

Calculate the standard deviation for the given expression, possibly on a grid defined by binby

>>> df.std("vz")
110.31773397535071
>>> df.std("vz", binby=["(x**2+y**2)**0.5"], shape=4)
array([ 123.57954851,   85.35190177,   61.14345748,   38.0740619 ])
Parameters:
  • expression – expression or list of expressions, e.g. ‘x’, or [‘x’, ‘y’]
  • binby – List of expressions for constructing a binned grid
  • limits – description for the min and max values for the expressions, e.g. ‘minmax’, ‘99.7%’, [0, 10], or a list of, e.g. [[0, 10], [0, 20], ‘minmax’]
  • shape – shape for the array where the statistic is calculated on, if only an integer is given, it is used for all dimensions, e.g. shape=128, shape=[128, 256]
  • selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections
  • delay – Do not return the result, but a proxy for asynchronous calculations (currently only for internal use)
  • progress – A callable that takes one argument (a floating point value between 0 and 1) indicating the progress, calculations are cancelled when this callable returns False
Returns:

Numpy array with the given shape, or a scalar when no binby argument is given, with the statistic
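To make "on a grid defined by binby" concrete, the following pure-Python sketch computes a standard deviation per bin for a single binby expression. The helper binned_std is hypothetical; the use of the population formula and the treatment of the upper limit as exclusive are assumptions of this sketch, not statements about vaex.

```python
import math

def binned_std(values, by, limits, shape):
    """Hypothetical helper: population standard deviation of values per bin of by."""
    lo, hi = limits
    counts = [0] * shape
    sums = [0.0] * shape
    sums_sq = [0.0] * shape
    for v, b in zip(values, by):
        if not (lo <= b < hi):
            continue  # rows outside the limits do not contribute
        i = int((b - lo) / (hi - lo) * shape)
        counts[i] += 1
        sums[i] += v
        sums_sq[i] += v * v
    result = []
    for n, s1, s2 in zip(counts, sums, sums_sq):
        if n == 0:
            result.append(float("nan"))  # empty bin
        else:
            mean = s1 / n
            result.append(math.sqrt(max(s2 / n - mean * mean, 0.0)))
    return result

vz = [1.0, 3.0, 10.0, 14.0, 5.0]
r = [0.1, 0.4, 1.2, 1.8, 2.5]
std_per_bin = binned_std(vz, r, limits=(0.0, 2.0), shape=2)
```

Only running sums per bin are kept, so the data can be processed in a single pass.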

sum(expression, binby=[], limits=None, shape=128, selection=False, delay=False, progress=None)[source]

Calculate the sum for the given expression, possibly on a grid defined by binby

Examples:

>>> df.sum("L")
304054882.49378014
>>> df.sum("L", binby="E", shape=4)
array([  8.83517994e+06,   5.92217598e+07,   9.55218726e+07,
                 1.40008776e+08])
Parameters:
  • expression – expression or list of expressions, e.g. ‘x’, or [‘x’, ‘y’]
  • binby – List of expressions for constructing a binned grid
  • limits – description for the min and max values for the expressions, e.g. ‘minmax’, ‘99.7%’, [0, 10], or a list of, e.g. [[0, 10], [0, 20], ‘minmax’]
  • shape – shape for the array where the statistic is calculated on, if only an integer is given, it is used for all dimensions, e.g. shape=128, shape=[128, 256]
  • selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections
  • delay – Do not return the result, but a proxy for asynchronous calculations (currently only for internal use)
  • progress – A callable that takes one argument (a floating point value between 0 and 1) indicating the progress, calculations are cancelled when this callable returns False
Returns:

Numpy array with the given shape, or a scalar when no binby argument is given, with the statistic

tail(n=10)[source]

Return a shallow copy of a DataFrame with the last n rows.

take(indices)[source]

Returns a DataFrame containing only rows indexed by indices

Note

Note that no copy of the underlying data is made, only a view/reference is made.

Example:

>>> import vaex, numpy as np
>>> df = vaex.from_arrays(s=np.array(['a', 'b', 'c', 'd']), x=np.arange(1,5))
>>> df.take([0,2])
 #  s      x
 0  a      1
 1  c      3
Parameters:indices – sequence (list or numpy array) with row numbers
Returns:DataFrame which is a shallow copy of the original data.
Return type:DataFrame
to_arrow_table(column_names=None, selection=None, strings=True, virtual=False)[source]

Returns an arrow Table object containing the arrays corresponding to the evaluated data

Parameters:
  • column_names – list of column names, to export, when None DataFrame.get_column_names(strings=strings, virtual=virtual) is used
  • selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections
  • strings – argument passed to DataFrame.get_column_names when column_names is None
  • virtual – argument passed to DataFrame.get_column_names when column_names is None
Returns:

pyarrow.Table object

to_astropy_table(column_names=None, selection=None, strings=True, virtual=False, index=None)[source]

Returns an astropy table object containing the ndarrays corresponding to the evaluated data

Parameters:
  • column_names – list of column names, to export, when None DataFrame.get_column_names(strings=strings, virtual=virtual) is used
  • selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections
  • strings – argument passed to DataFrame.get_column_names when column_names is None
  • virtual – argument passed to DataFrame.get_column_names when column_names is None
  • index – if this column is given it is used for the index of the DataFrame
Returns:

astropy.table.Table object

to_copy(column_names=None, selection=None, strings=True, virtual=False, selections=True)[source]

Return a copy of the DataFrame, if selection is None, it does not copy the data, it just has a reference

Parameters:
  • column_names – list of column names, to copy, when None DataFrame.get_column_names(strings=strings, virtual=virtual) is used
  • selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections
  • strings – argument passed to DataFrame.get_column_names when column_names is None
  • virtual – argument passed to DataFrame.get_column_names when column_names is None
  • selections – copy selections to a new DataFrame
Returns:

DataFrame

to_dict(column_names=None, selection=None, strings=True, virtual=False)[source]

Return a dict containing the ndarray corresponding to the evaluated data

Parameters:
  • column_names – list of column names, to export, when None DataFrame.get_column_names(strings=strings, virtual=virtual) is used
  • selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections
  • strings – argument passed to DataFrame.get_column_names when column_names is None
  • virtual – argument passed to DataFrame.get_column_names when column_names is None
Returns:

dict

to_items(column_names=None, selection=None, strings=True, virtual=False)[source]

Return a list of [(column_name, ndarray), …] pairs where the ndarray corresponds to the evaluated data

Parameters:
  • column_names – list of column names, to export, when None DataFrame.get_column_names(strings=strings, virtual=virtual) is used
  • selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections
  • strings – argument passed to DataFrame.get_column_names when column_names is None
  • virtual – argument passed to DataFrame.get_column_names when column_names is None
Returns:

list of (name, ndarray) pairs

to_pandas_df(column_names=None, selection=None, strings=True, virtual=False, index_name=None)[source]

Return a pandas DataFrame containing the ndarray corresponding to the evaluated data

If index_name is given, that column is used for the index of the pandas DataFrame.

Example

>>> df_pandas = df.to_pandas_df(["x", "y", "z"])
>>> df_copy = vaex.from_pandas(df_pandas)
Parameters:
  • column_names – list of column names, to export, when None DataFrame.get_column_names(strings=strings, virtual=virtual) is used
  • selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections
  • strings – argument passed to DataFrame.get_column_names when column_names is None
  • virtual – argument passed to DataFrame.get_column_names when column_names is None
  • index_name – if this column is given it is used for the index of the DataFrame
Returns:

pandas.DataFrame object

trim(inplace=False)[source]

Return a DataFrame, where all columns are ‘trimmed’ by the active range.

For the returned DataFrame, df.get_active_range() returns (0, df.length_original()).

Note

Note that no copy of the underlying data is made, only a view/reference is made.

Parameters:inplace – Make modifications to self or return a new DataFrame
Return type:DataFrame
ucd_find(ucds, exclude=[])[source]

Find a set of columns (names) which have the ucd, or part of the ucd.

Prefixed with a ^, it will only match the first part of the ucd.

Example

>>> df.ucd_find('pos.eq.ra', 'pos.eq.dec')
['RA', 'DEC']
>>> df.ucd_find('pos.eq.ra', 'doesnotexist')
>>> df.ucds[df.ucd_find('pos.eq.ra')]
'pos.eq.ra;meta.main'
>>> df.ucd_find('meta.main')
'dec'
>>> df.ucd_find('^meta.main')
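The difference between a plain pattern and one prefixed with ^ can be sketched on an ordinary dictionary mapping column names to UCD strings. The helper find_by_ucd and the sample data are hypothetical, and returning only the first match is an assumption of this sketch.

```python
def find_by_ucd(ucds, pattern):
    """Hypothetical helper: first column whose UCD matches pattern, else None."""
    if pattern.startswith("^"):
        prefix = pattern[1:]
        # '^' anchors the match to the start of the UCD string.
        matches = [name for name, ucd in ucds.items() if ucd.startswith(prefix)]
    else:
        # Without '^', the pattern may match any part of the UCD string.
        matches = [name for name, ucd in ucds.items() if pattern in ucd]
    return matches[0] if matches else None

ucds = {"RA": "pos.eq.ra;meta.main", "DEC": "pos.eq.dec;meta.main", "id": "meta.id"}
anywhere = find_by_ucd(ucds, "meta.main")
anchored = find_by_ucd(ucds, "^meta.main")
by_prefix = find_by_ucd(ucds, "^pos.eq")
```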
unit(expression, default=None)[source]

Returns the unit (an astropy.units.Unit object) for the expression.

Example

>>> import vaex
>>> df = vaex.example()
>>> df.unit("x")
Unit("kpc")
>>> df.unit("x*L")
Unit("km kpc2 / s")
Parameters:
  • expression – Expression, which can be a column name
  • default – if no unit is known, it will return this
Returns:

The resulting unit of the expression

Return type:

astropy.units.Unit

validate_expression(expression)[source]

Validate an expression (may throw Exceptions)

var(expression, binby=[], limits=None, shape=128, selection=False, delay=False, progress=None)[source]

Calculate the sample variance for the given expression, possibly on a grid defined by binby

Examples:

>>> df.var("vz")
12170.002429456246
>>> df.var("vz", binby=["(x**2+y**2)**0.5"], shape=4)
array([ 15271.90481083,   7284.94713504,   3738.52239232,   1449.63418988])
>>> df.var("vz", binby=["(x**2+y**2)**0.5"], shape=4)**0.5
array([ 123.57954851,   85.35190177,   61.14345748,   38.0740619 ])
>>> df.std("vz", binby=["(x**2+y**2)**0.5"], shape=4)
array([ 123.57954851,   85.35190177,   61.14345748,   38.0740619 ])
Parameters:
  • expression – expression or list of expressions, e.g. ‘x’, or [‘x’, ‘y’]
  • binby – List of expressions for constructing a binned grid
  • limits – description for the min and max values for the expressions, e.g. ‘minmax’, ‘99.7%’, [0, 10], or a list of, e.g. [[0, 10], [0, 20], ‘minmax’]
  • shape – shape for the array where the statistic is calculated on, if only an integer is given, it is used for all dimensions, e.g. shape=128, shape=[128, 256]
  • selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections
  • delay – Do not return the result, but a proxy for asynchronous calculations (currently only for internal use)
  • progress – A callable that takes one argument (a floating point value between 0 and 1) indicating the progress, calculations are cancelled when this callable returns False
Returns:

Numpy array with the given shape, or a scalar when no binby argument is given, with the statistic

DataFrameLocal class

class vaex.dataframe.DataFrameLocal(name, path, column_names)[source]

Base class for DataFrames that work with local file/data

__array__(dtype=None)[source]

Gives a full memory copy of the DataFrame into a 2d numpy array of shape (n_rows, n_columns). Note that the memory order is fortran, so all values of 1 column are contiguous in memory for performance reasons.

Note this returns the same result as:

>>> np.array(ds)

If any of the columns contain masked arrays, the masks are ignored (i.e. the masked elements are returned as well).

__call__(*expressions, **kwargs)[source]

The local implementation of DataFrame.__call__()

__init__(name, path, column_names)[source]

Initialize self. See help(type(self)) for accurate signature.

categorize(column, labels=None, check=True)[source]

Mark column as categorical, with given labels, assuming zero indexing

compare(other, report_missing=True, report_difference=False, show=10, orderby=None, column_names=None)[source]

Compare two DataFrames and report their difference, use with care for large DataFrames

concat(other)[source]

Concatenates two DataFrames, adding the rows of the other DataFrame to the current one, returned in a new DataFrame.

No copy of the data is made.

Parameters:other – The other DataFrame that is concatenated with this DataFrame
Returns:New DataFrame with the rows concatenated
Return type:DataFrameConcatenated
data

Gives direct access to the data as numpy arrays.

Convenient when working with IPython in combination with small DataFrames, since this gives tab-completion. Only real (i.e. non-virtual) columns can be accessed; to get the data from virtual columns, use DataFrame.evaluate(…).

Columns can be accessed by their names, which are attributes. The attributes are of type numpy.ndarray.

Example:

>>> df = vaex.example()
>>> r = np.sqrt(df.data.x**2 + df.data.y**2)
evaluate(expression, i1=None, i2=None, out=None, selection=None, filtered=True)[source]

The local implementation of DataFrame.evaluate()

export(path, column_names=None, byteorder='=', shuffle=False, selection=False, progress=None, virtual=False, sort=None, ascending=True)[source]

Exports the DataFrame to a file written with arrow

Parameters:
  • df (DataFrameLocal) – DataFrame to export
  • path (str) – path for file
  • column_names (list[str]) – list of column names to export or None for all columns
  • byteorder (str) – = for native, < for little endian and > for big endian (not supported for fits)
  • shuffle (bool) – export rows in random order
  • selection (bool) – export selection or not
  • progress – progress callback that gets a progress fraction as argument and should return True to continue, or a default progress bar when progress=True
  • sort (str) – expression used for sorting the output
  • ascending (bool) – sort ascending (True) or descending
  • virtual (bool) – When True, export virtual columns

Returns:

export_arrow(path, column_names=None, byteorder='=', shuffle=False, selection=False, progress=None, virtual=False, sort=None, ascending=True)[source]

Exports the DataFrame to a file written with arrow

Parameters:
  • df (DataFrameLocal) – DataFrame to export
  • path (str) – path for file
  • column_names (list[str]) – list of column names to export or None for all columns
  • byteorder (str) – = for native, < for little endian and > for big endian
  • shuffle (bool) – export rows in random order
  • selection (bool) – export selection or not
  • progress – progress callback that gets a progress fraction as argument and should return True to continue, or a default progress bar when progress=True
  • sort (str) – expression used for sorting the output
  • ascending (bool) – sort ascending (True) or descending
  • virtual (bool) – When True, export virtual columns

Returns:

export_fits(path, column_names=None, shuffle=False, selection=False, progress=None, virtual=False, sort=None, ascending=True)[source]

Exports the DataFrame to a fits file that is compatible with TOPCAT colfits format

Parameters:
  • df (DataFrameLocal) – DataFrame to export
  • path (str) – path for file
  • column_names (list[str]) – list of column names to export or None for all columns
  • shuffle (bool) – export rows in random order
  • selection (bool) – export selection or not
  • progress – progress callback that gets a progress fraction as argument and should return True to continue, or a default progress bar when progress=True
  • sort (str) – expression used for sorting the output
  • ascending (bool) – sort ascending (True) or descending
  • virtual (bool) – When True, export virtual columns

Returns:

export_hdf5(path, column_names=None, byteorder='=', shuffle=False, selection=False, progress=None, virtual=False, sort=None, ascending=True)[source]

Exports the DataFrame to a vaex hdf5 file

Parameters:
  • df (DataFrameLocal) – DataFrame to export
  • path (str) – path for file
  • column_names (list[str]) – list of column names to export or None for all columns
  • byteorder (str) – = for native, < for little endian and > for big endian
  • shuffle (bool) – export rows in random order
  • selection (bool) – export selection or not
  • progress – progress callback that gets a progress fraction as argument and should return True to continue, or a default progress bar when progress=True
  • sort (str) – expression used for sorting the output
  • ascending (bool) – sort ascending (True) or descending
  • virtual (bool) – When True, export virtual columns

Returns:

is_local()[source]

The local implementation of DataFrame.is_local(), always returns True.

join(other, on=None, left_on=None, right_on=None, lsuffix='', rsuffix='', how='left', inplace=False)[source]

Return a DataFrame joined with other DataFrames, matched by columns/expression on/left_on/right_on

If neither on/left_on/right_on is given, the join is done by simply adding the columns (i.e. on the implicit row index).

Note: The filters will be ignored when joining, the full DataFrame will be joined (since filters may change). If either DataFrame is heavily filtered (contains just a small number of rows) consider running DataFrame.extract() first.

Example:

>>> a = np.array(['a', 'b', 'c'])
>>> x = np.arange(1,4)
>>> ds1 = vaex.from_arrays(a=a, x=x)
>>> b = np.array(['a', 'b', 'd'])
>>> y = x**2
>>> ds2 = vaex.from_arrays(b=b, y=y)
>>> ds1.join(ds2, left_on='a', right_on='b')
Parameters:
  • other – Other DataFrame to join with (the right side)
  • on – default key for the left table (self)
  • left_on – key for the left table (self), overrides on
  • right_on – default key for the right table (other), overrides on
  • lsuffix – suffix to add to the left column names in case of a name collision
  • rsuffix – similar for the right
  • how – how to join: ‘left’ keeps all rows on the left, and adds columns (with possible missing values); ‘right’ is similar, with self and other swapped.
  • inplace – Make modifications to self or return a new DataFrame
Returns:
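What a left join on left_on='a', right_on='b' produces for the arrays in the example above can be sketched with plain Python lists. This sketch assumes unique keys on the right side and uses None for missing values; it illustrates the semantics only and is not the vaex implementation.

```python
a = ["a", "b", "c"]
x = [1, 2, 3]
b = ["a", "b", "d"]
y = [xi ** 2 for xi in x]

# Map each right-hand key to its row index (this sketch assumes unique keys).
right_index = {key: i for i, key in enumerate(b)}

# how='left': every left row is kept; unmatched rows get a missing value.
joined = []
for key, xi in zip(a, x):
    j = right_index.get(key)
    joined.append((key, xi,
                   b[j] if j is not None else None,
                   y[j] if j is not None else None))
```

The row with key 'c' has no partner on the right, so its b and y entries are missing; the right-hand row with key 'd' does not appear at all.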

label_encode(column, values=None, inplace=False)[source]

Label encode column and mark it as categorical

The existing column is renamed to a hidden column and replaced by a numerical column with values in the range [0, len(values)-1].
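Label encoding itself is a simple mapping from distinct values to consecutive integers. A minimal sketch, assuming the labels are the sorted unique values (the ordering vaex chooses may differ):

```python
column = ["red", "green", "green", "blue", "red"]

# Assumption for this sketch: labels are the sorted unique values.
values = sorted(set(column))
codes = {value: code for code, value in enumerate(values)}
encoded = [codes[value] for value in column]
```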

length(selection=False)[source]

Get the length of the DataFrame, for the selection or the whole DataFrame.

If selection is False, it returns len(df).

TODO: Implement this in DataFrameRemote, and move the method up in DataFrame.length()

Parameters:selection – When True, will return the number of selected rows
Returns:
selected_length(selection='default')[source]

The local implementation of DataFrame.selected_length()

shallow_copy(virtual=True, variables=True)[source]

Creates a (shallow) copy of the DataFrame.

It will link to the same data, but will have its own state, e.g. virtual columns, variables, selection etc.

vaex.stat module

class vaex.stat.Expression[source]

Describes an expression for a statistic

calculate(ds, binby=[], shape=256, limits=None, selection=None)[source]

Calculate the statistic for a Dataset

vaex.stat.correlation(x, y)[source]

Creates a correlation statistic

vaex.stat.count(expression='*')[source]

Creates a count statistic

vaex.stat.covar(x, y)[source]

Creates a covariance statistic

vaex.stat.mean(expression)[source]

Creates a mean statistic

vaex.stat.std(expression)[source]

Creates a standard deviation statistic

vaex.stat.sum(expression)[source]

Creates a sum statistic

Machine learning with vaex.ml

Note that vaex.ml does not fall under the MIT license, but under the CC BY-NC-ND license, which means it’s ok for personal or academic use. You can install vaex-ml using pip install vaex-ml.

Clustering

class vaex.ml.cluster.KMeans(cluster_centers=traitlets.Undefined, features=traitlets.Undefined, inertia=None, init='random', max_iter=300, n_clusters=2, n_init=1, prediction_label='prediction_kmeans', random_state=None, verbose=False)[source]

The KMeans clustering algorithm.

>>> import vaex.ml
>>> import vaex.ml.cluster
>>> df = vaex.ml.datasets.load_iris()
>>> features = ['sepal_width', 'petal_length', 'sepal_length', 'petal_width']
>>> df_train, df_test = vaex.ml.train_test_split(df)
>>> cls = vaex.ml.cluster.KMeans(n_clusters=3, features=features, init='random', max_iter=10)
>>> df_train = cls.fit_transform(df_train)
>>> df_test = cls.transform(df_test)
Parameters:
  • cluster_centers – Coordinates of cluster centers.
  • features – List of features to cluster.
  • inertia – Sum of squared distances of samples to their closest cluster center.
  • init – Method for initializing the centroids.
  • max_iter – Maximum number of iterations of the KMeans algorithm for a single run.
  • n_clusters – Number of clusters to form.
  • n_init – Number of centroid initializations. The KMeans algorithm will be run for each initialization, and the final results will be the best output of the n_init consecutive runs in terms of inertia.
  • prediction_label – The name of the virtual column that houses the cluster labels for each point.
  • random_state – Random number generation for centroid initialization. If an int is specified, the randomness becomes deterministic.
  • verbose – If True, enable verbosity mode.
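The roles of max_iter, cluster_centers and inertia can be seen in a bare-bones version of the algorithm. The sketch below runs plain Lloyd iterations in pure Python from fixed initial centers; the helper kmeans is hypothetical and leaves out the random initialization and n_init restarts described above.

```python
def kmeans(points, centers, max_iter=10):
    """Hypothetical helper: plain Lloyd iterations from given initial centers."""
    def sq_dist(p, c):
        return sum((pi - ci) ** 2 for pi, ci in zip(p, c))

    labels = []
    for _ in range(max_iter):
        # Assignment step: each point goes to its closest center.
        labels = [min(range(len(centers)), key=lambda k: sq_dist(p, centers[k]))
                  for p in points]
        # Update step: each center moves to the mean of its points.
        new_centers = []
        for k, old in enumerate(centers):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:
                new_centers.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centers.append(old)  # keep an empty cluster's center in place
        if new_centers == centers:
            break  # converged: centers no longer move
        centers = new_centers
    # Inertia: sum of squared distances of points to their closest center.
    inertia = sum(sq_dist(p, centers[label]) for p, label in zip(points, labels))
    return centers, labels, inertia

points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
centers, labels, inertia = kmeans(points, centers=[(1.0, 1.0), (9.0, 1.0)])
```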
fit(dataset)[source]

Fit the KMeans model to the dataset.

Parameters:dataset – A vaex dataset.

Returns:self

transform(dataset)[source]

Label a dataset with a fitted KMeans model.

Parameters:dataset – A vaex dataset.
Return copy:A shallow copy of the dataset that includes the cluster labels.

PCA

class vaex.ml.transformations.PCA(features=traitlets.Undefined, n_components=2, prefix='PCA_')[source]

Transform a set of features using a Principal Component Analysis.

>>> import vaex.ml
>>> df = vaex.ml.datasets.load_iris()
>>> features = ['sepal_width', 'petal_length', 'sepal_length', 'petal_width']
>>> df_train, df_test = vaex.ml.train_test_split(df)
>>> pca = vaex.ml.PCA(features=features, n_components=3)
>>> df_train = pca.fit_transform(df_train)
>>> df_test = pca.transform(df_test)
Parameters:
  • features – List of features to transform.
  • n_components – Number of components to retain.
  • prefix – Prefix for the names of the transformed features.
fit(dataset, column_names=None, progress=False)[source]

Fit the PCA model to the dataset.

Parameters:
  • dataset – A vaex dataset.
  • column_names – Deprecated and should be removed.
  • progress – bool, if True, display a progress bar of the fitting process.

Returns:self

transform(dataset, n_components=None)[source]

Apply the PCA transformation to the dataset.

Parameters:
  • dataset – A vaex dataset.
  • n_components – The number of components to retain.
Return copy:A shallow copy of the dataset that includes the PCA components.

Encoders

class vaex.ml.transformations.LabelEncoder(features=traitlets.Undefined, prefix='Prefix for the names of the transformed features.')[source]

Encode categorical columns with integer values between 0 and num_classes-1.

>>> import vaex.ml
>>> df = vaex.ml.datasets.load_titanic()
>>> df_train, df_test = vaex.ml.train_test_split(df)
>>> encoder = vaex.ml.LabelEncoder(features=['sex', 'embarked'])
>>> df_train = encoder.fit_transform(df_train)
>>> df_test = encoder.transform(df_test)
Parameters:
  • features – List of features to transform.
  • prefix – Prefix for the names of the transformed features.
fit(dataset)[source]

Fit LabelEncoder to the dataset.

Parameters:dataset – A vaex dataset.

Returns:self

transform(dataset)[source]

Transform a dataset with a fitted LabelEncoder.

Parameters:dataset – A vaex dataset.
Return copy:A shallow copy of the dataset that includes the encodings.
class vaex.ml.transformations.OneHotEncoder(features=traitlets.Undefined, one=1, prefix='', zero=0)[source]

Encode categorical columns according to the One-Hot scheme.

>>> import vaex.ml
>>> df = vaex.from_arrays(color=['red', 'green', 'green', 'blue', 'red'])
>>> df
 #  color
 0  red
 1  green
 2  green
 3  blue
 4  red
>>> encoder = vaex.ml.OneHotEncoder(features=['color'])
>>> encoder.fit_transform(df)
 #  color      color_blue    color_green    color_red
 0  red                 0              0            1
 1  green               0              1            0
 2  green               0              1            0
 3  blue                1              0            0
 4  red                 0              0            1
Parameters:
  • features – List of features to transform.
  • one – Value to encode when a category is present.
  • prefix – Prefix for the names of the transformed features.
  • zero – Value to encode when category is absent.
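The table in the example above can be reproduced with a few lines of plain Python, which also shows where the one and zero values end up. This is a sketch of the scheme itself, with the column naming (feature name, underscore, category) taken from the example output; it does not use vaex.

```python
color = ["red", "green", "green", "blue", "red"]
one, zero = 1, 0

categories = sorted(set(color))  # ['blue', 'green', 'red']
encoded = {
    "color_" + category: [one if value == category else zero for value in color]
    for category in categories
}
```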
fit(dataset)[source]

Fit OneHotEncoder to the dataset.

Parameters:dataset – A vaex dataset.
transform(dataset)[source]

Transform a dataset with a fitted OneHotEncoder.

Parameters:dataset – A vaex dataset.
Returns:A shallow copy of the dataset that includes the encodings.
class vaex.ml.transformations.StandardScaler(features=traitlets.Undefined, prefix='standard_scaled_', with_mean=True, with_std=True)[source]

Standardize features by removing their mean and scaling them to unit variance.

>>> import vaex.ml
>>> df = vaex.ml.datasets.load_iris()
>>> features = ['sepal_width', 'petal_length', 'sepal_length', 'petal_width']
>>> df_train, df_test = vaex.ml.train_test_split(df)
>>> scaler = vaex.ml.StandardScaler(features=features, with_mean=True, with_std=True)
>>> df_train = scaler.fit_transform(df_train)
>>> df_test = scaler.transform(df_test)
Parameters:
  • features – List of features to transform.
  • prefix – Prefix for the names of the transformed features.
  • with_mean – If True, remove the mean from each feature.
  • with_std – If True, scale each feature to unit variance.
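For a single feature, standardization amounts to (x - mean) / std. A pure-Python sketch, assuming the population standard deviation is used; the helper standard_scale is hypothetical:

```python
import math

def standard_scale(values, with_mean=True, with_std=True):
    """Hypothetical helper: (x - mean) / std, using the population std."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    shift = mean if with_mean else 0.0
    scale = std if with_std else 1.0
    return [(v - shift) / scale for v in values]

train = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
scaled = standard_scale(train)
```

In a fit/transform workflow the mean and standard deviation would be computed on the training set once and then reused for the test set.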
fit(dataset)[source]

Fit StandardScaler to the dataset.

Parameters:dataset – A vaex dataset.

Returns:self

transform(dataset)[source]

Transform a dataset with a fitted StandardScaler.

Parameters:dataset – A vaex dataset.
Return copy:a shallow copy of the dataset that includes the scaled features.
class vaex.ml.transformations.MinMaxScaler(feature_range=traitlets.Undefined, features=traitlets.Undefined, prefix='minmax_scaled_')[source]

Will scale a set of features to a given range.

>>> import vaex.ml
>>> df = vaex.ml.datasets.load_iris()
>>> features = ['sepal_width', 'petal_length', 'sepal_length', 'petal_width']
>>> df_train, df_test = vaex.ml.train_test_split(df)
>>> scaler = vaex.ml.MinMaxScaler(features=features, feature_range=(0, 1))
>>> df_train = scaler.fit_transform(df_train)
>>> df_test = scaler.transform(df_test)
Parameters:
  • feature_range – The range the features are scaled to.
  • features – List of features to transform.
  • prefix – Prefix for the names of the transformed features.
fit(dataset)[source]

Fit MinMaxScaler to the dataset.

Parameters:dataset – A vaex dataset.

Returns:self

transform(dataset)[source]

Transform a dataset with a fitted MinMaxScaler.

Parameters:dataset – A vaex dataset.
Returns:A shallow copy of the dataset that includes the scaled features.
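A minimal sketch of min-max scaling in plain Python follows. The helpers `fit_minmax` and `transform_minmax` are hypothetical and only illustrate the usual formula; the `(0.0, 1.0)` default in the sketch is a choice made for the example, not a statement about the vaex default.

```python
def fit_minmax(values):
    """Return the (min, max) learned from the training values."""
    return min(values), max(values)

def transform_minmax(values, vmin, vmax, feature_range=(0.0, 1.0)):
    """Map [vmin, vmax] linearly onto feature_range."""
    low, high = feature_range
    span = vmax - vmin
    return [low + (x - vmin) / span * (high - low) for x in values]

train = [1.0, 3.0, 5.0]
test = [2.0, 7.0]

vmin, vmax = fit_minmax(train)
scaled_train = transform_minmax(train, vmin, vmax)
scaled_test = transform_minmax(test, vmin, vmax, feature_range=(-1.0, 1.0))
```

Because the minimum and maximum come from the training data, unseen values such as `7.0` can land outside the requested range.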
class vaex.ml.transformations.MaxAbsScaler(features=traitlets.Undefined, prefix='absmax_scaled_')[source]

Scale features by their maximum absolute value.

>>> import vaex.ml
>>> df = vaex.ml.datasets.load_iris()
>>> features = ['sepal_width', 'petal_length', 'sepal_length', 'petal_width']
>>> df_train, df_test = vaex.ml.train_test_split(df)
>>> scaler = vaex.ml.MaxAbsScaler(features=features)
>>> df_train = scaler.fit_transform(df_train)
>>> df_test = scaler.transform(df_test)
Parameters:
  • features – List of features to transform.
  • prefix – Prefix for the names of the transformed features.
fit(dataset)[source]

Fit MaxAbsScaler to the dataset.

Parameters:dataset – A vaex dataset.

Returns:self

transform(dataset)[source]

Transform a dataset with a fitted MaxAbsScaler.

Parameters:dataset – A vaex dataset.
Returns:A shallow copy of the dataset that includes the scaled features.
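Scaling by the maximum absolute value can be sketched as below. The helpers `fit_maxabs` and `transform_maxabs` are hypothetical names used for illustration only.

```python
def fit_maxabs(values):
    """Return the largest absolute value seen in the training values."""
    return max(abs(x) for x in values)

def transform_maxabs(values, absmax):
    """Divide every value by the fitted maximum absolute value."""
    return [x / absmax for x in values]

train = [-4.0, 1.0, 2.0]
absmax = fit_maxabs(train)
scaled_train = transform_maxabs(train, absmax)
```

The data is only rescaled, never shifted, so zeros stay zero and the signs of the values are preserved.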
class vaex.ml.transformations.RobustScaler(features=traitlets.Undefined, percentile_range=traitlets.Undefined, prefix='robust_scaled_', with_centering=True, with_scaling=True)[source]

The RobustScaler removes the median and scales the data according to a given percentile range. By default, the scaling is done between the 25th and the 75th percentile. Centering and scaling happen independently for each feature (column).

>>> import vaex.ml
>>> df = vaex.ml.datasets.load_iris()
>>> features = ['sepal_width', 'petal_length', 'sepal_length', 'petal_width']
>>> df_train, df_test = vaex.ml.train_test_split(df)
>>> scaler = vaex.ml.RobustScaler(features=features, percentile_range=(25, 75))
>>> df_train = scaler.fit_transform(df_train)
>>> df_test = scaler.transform(df_test)
Parameters:
  • features – List of features to transform.
  • percentile_range – The percentile range used to scale each feature.
  • prefix – Prefix for the names of the transformed features.
  • with_centering – If True, remove the median.
  • with_scaling – If True, scale each feature by the specified percentile range.
fit(dataset)[source]

Fit RobustScaler to the dataset.

Parameters:dataset – A vaex dataset.

Returns:self

transform(dataset)[source]

Transform a dataset with a fitted RobustScaler.

Parameters:dataset – A vaex dataset.
Returns:A shallow copy of the dataset that includes the scaled features.
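The following plain-Python sketch shows the idea of robust scaling: subtract the median and divide by the spread between two percentiles. It is an illustration under assumptions, not the vaex code. The helpers are hypothetical, and the exact, linearly interpolated percentile used here may differ from the percentile estimate the library computes.

```python
from statistics import median

def percentile(values, q):
    """Percentile q (0-100) with linear interpolation between sorted values."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * q / 100.0
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction

def fit_robust(values, percentile_range=(25, 75)):
    """Return the (median, percentile spread) learned from the training values."""
    low, high = percentile_range
    return median(values), percentile(values, high) - percentile(values, low)

def transform_robust(values, center, spread, with_centering=True, with_scaling=True):
    """Apply (x - median) / spread, honouring the two switches."""
    shift = center if with_centering else 0.0
    scale = spread if with_scaling else 1.0
    return [(x - shift) / scale for x in values]

train = [1.0, 2.0, 3.0, 4.0, 100.0]  # 100.0 is an outlier
center, spread = fit_robust(train)
scaled = transform_robust(train, center, spread)
```

The outlier barely influences the fitted median and percentile spread, so the bulk of the data stays on a comparable scale while the outlier remains visibly extreme.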

Boosted trees

class vaex.ml.lightgbm.LightGBMModel(features=traitlets.Undefined, num_round=0, param=traitlets.Undefined, prediction_name='lightgbm_prediction')[source]

The LightGBM algorithm.

This class provides an interface to the LightGBM algorithm, with some optimizations for better memory efficiency when training large datasets. The algorithm itself is not modified at all.

LightGBM is a fast gradient boosting algorithm based on decision trees and is mainly used for classification, regression and ranking tasks. It is under the umbrella of the Distributed Machine Learning Toolkit (DMTK) project of Microsoft. For more information, please visit https://github.com/Microsoft/LightGBM/.

>>> import vaex.ml
>>> import vaex.ml.lightgbm
>>> df = vaex.ml.datasets.load_iris()
>>> features = ['sepal_width', 'petal_length', 'sepal_length', 'petal_width']
>>> df_train, df_test = vaex.ml.train_test_split(df)
>>> params = {
...     'boosting': 'gbdt',
...     'max_depth': 5,
...     'learning_rate': 0.1,
...     'application': 'multiclass',
...     'num_class': 3,
...     'subsample': 0.80,
...     'colsample_bytree': 0.80}
>>> booster = vaex.ml.lightgbm.LightGBMModel(features=features, num_round=100, param=params)
>>> booster.fit(df_train, 'class_')
>>> df_train = booster.transform(df_train)
>>> df_test = booster.transform(df_test)
Parameters:
  • features – List of features to use when fitting the LightGBMModel.
  • num_round – Number of boosting iterations.
  • param – Parameters to be passed on to the LightGBM model.
  • prediction_name – The name of the virtual column housing the predictions.
fit(dataset, label, copy=False)[source]

Fit the LightGBMModel to the dataset.

Parameters:
  • dataset – A vaex dataset.
  • label – The name of the column containing the target variable.
  • copy – bool, if True, make an in memory copy of the data before passing it to the LightGBMModel.

Returns:self

predict(dataset, copy=False)[source]

Get an in-memory numpy array with the predictions of the LightGBMModel on a vaex dataset.

Parameters:
  • dataset – A vaex dataset.
  • copy – bool, if True, make an in memory copy of the data before passing it to the LightGBMModel.

Returns:An in-memory numpy array containing the LightGBMModel predictions.

transform(dataset)[source]

Transform the dataset such that it contains the predictions of the LightGBMModel in the form of virtual columns.

Parameters:dataset – A vaex dataset.
Returns:A shallow copy of the dataset that includes the LightGBMModel predictions as virtual columns.
class vaex.ml.lightgbm.LightGBMClassifier(features=traitlets.Undefined, num_round=0, param=traitlets.Undefined, prediction_name='lightgbm_prediction')[source]
Parameters:
  • features – List of features to use when fitting the LightGBMModel.
  • num_round – Number of boosting iterations.
  • param – Parameters to be passed on to the LightGBM model.
  • prediction_name – The name of the virtual column housing the predictions.
predict(dataset, copy=False)[source]

Get an in-memory numpy array with the predictions of the LightGBMModel on a vaex dataset.

Parameters:
  • dataset – A vaex dataset.
  • copy – bool, if True, make an in memory copy of the data before passing it to the LightGBMModel.

Returns:An in-memory numpy array containing the LightGBMModel predictions.

class vaex.ml.xgboost.XGBModel(features=traitlets.Undefined, num_round=0, param=traitlets.Undefined, prediction_name='xgboost_prediction')[source]

XGBModel for vaex, using a faster, memory-saving copying mechanism.

Nearest neighbour

Annoy support is in the incubator phase, which means support may disappear in future versions.

class vaex.ml.incubator.annoy.ANNOYModel(features=traitlets.Undefined, metric='euclidean', n_neighbours=10, n_trees=10, predcition_name='annoy_prediction', prediction_name='annoy_prediction', search_k=-1)[source]
Parameters:
  • features – List of features to use.
  • metric – Metric to use for distance calculations
  • n_neighbours – How many neighbours to return.
  • n_trees – Number of trees to build.
  • predcition_name – Output column name for the neighbours when transforming a dataset
  • prediction_name – Output column name for the neighbours when transforming a dataset
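  • search_k – Number of nodes Annoy inspects during a search; larger values trade speed for accuracy. With -1, Annoy chooses its own default.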
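To make the roles of metric and n_neighbours concrete, the sketch below performs an exact, brute-force nearest-neighbour lookup in plain Python. This is deliberately not Annoy, which builds trees to find approximate neighbours much faster. The helper `nearest_neighbours` is hypothetical, and the Euclidean metric is hard-coded for the example.

```python
import math

def nearest_neighbours(points, query, n_neighbours=10):
    """Return the indices of the n_neighbours points closest to query (Euclidean)."""
    # Exact search: compute every distance, then keep the smallest ones.
    distances = sorted((math.dist(p, query), i) for i, p in enumerate(points))
    return [i for _, i in distances[:n_neighbours]]

points = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0), (0.0, 2.0)]
neighbours = nearest_neighbours(points, (0.0, 0.0), n_neighbours=2)
```

An approximate index returns the same kind of result (a list of neighbour indices per row) while inspecting only part of the data, which is where n_trees and search_k come in.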