API documentation for vaex library

Quick list for opening/reading in your data.

vaex.open(path[, convert, shuffle, copy_index]) Open a dataset from file given by path
vaex.from_arrays(**arrays) Create an in memory dataset from numpy arrays
vaex.from_csv(filename_or_buffer[, copy_index]) Shortcut to read a csv file using pandas and convert to a dataset directly
vaex.from_ascii(path[, seperator, names, …]) Create an in memory dataset from an ascii file (whitespace separated by default).
vaex.from_pandas(df[, name, copy_index, …]) Create an in memory dataset from a pandas dataframe
vaex.from_astropy_table(table)

Quick list for visualization.

vaex.dataset.Dataset.plot(*args, **kwargs)
vaex.dataset.Dataset.plot1d(*args, **kwargs)
vaex.dataset.Dataset.scatter(*args, **kwargs)
vaex.dataset.Dataset.plot_widget(x, y[, z, …])
vaex.dataset.Dataset.healpix_plot([…])

Quick list for statistics.

vaex.dataset.Dataset.count([expression, …]) Count the number of non-NaN values (or all, if expression is None or “*”)
vaex.dataset.Dataset.mean(expression[, …]) Calculate the mean for expression, possibly on a grid defined by binby.
vaex.dataset.Dataset.std(expression[, …]) Calculate the standard deviation for the given expression, possibly on a grid defined by binby
vaex.dataset.Dataset.var(expression[, …]) Calculate the sample variance for the given expression, possibly on a grid defined by binby
vaex.dataset.Dataset.cov(x[, y, binby, …]) Calculate the covariance matrix for x and y or more expressions, possibly on a grid defined by binby
vaex.dataset.Dataset.correlation(x[, y, …]) Calculate the correlation coefficient cov[x,y]/(std[x]*std[y]) between x and y, possibly on a grid defined by binby
vaex.dataset.Dataset.median_approx(expression) Calculate the median, possibly on a grid defined by binby
vaex.dataset.Dataset.mode(expression[, …])
vaex.dataset.Dataset.min(expression[, …]) Calculate the minimum for given expressions, possibly on a grid defined by binby
vaex.dataset.Dataset.max(expression[, …]) Calculate the maximum for given expressions, possibly on a grid defined by binby
vaex.dataset.Dataset.minmax(expression[, …]) Calculate the minimum and maximum for expressions, possibly on a grid defined by binby
vaex.dataset.Dataset.mutual_information(x[, …]) Estimate the mutual information between x and y on a grid with shape mi_shape and mi_limits, possibly on a grid defined by binby

vaex module

Vaex is a library for dealing with big tabular data.

The most important class (datastructure) in vaex is the Dataset. A dataset is obtained either by opening the example dataset:

>>> import vaex
>>> t = vaex.example()

Or using open() or from_csv(), to open a file:

>>> t1 = vaex.open("somedata.hdf5")
>>> t2 = vaex.open("somedata.fits")
>>> t3 = vaex.from_csv("somedata.csv")

Or connecting to a remote server:

>>> tbig = vaex.open("http://bla.com/bigtable")

The main purpose of vaex is to provide statistics, such as mean, count, sum, standard deviation, per column, possibly with a selection, and on a regular grid.

To count the number of rows:

>>> t = vaex.example()
>>> t.count()
330000.0

Or the number of valid values, which for this dataset is the same:

>>> t.count("x")
330000.0

Count them on a regular grid:

>>> t.count("x", binby=["x", "y"], shape=(4,4))
array([[   902.,   5893.,   5780.,   1193.],
       [  4097.,  71445.,  75916.,   4560.],
       [  4743.,  71131.,  65560.,   4108.],
       [  1115.,   6578.,   4382.,    821.]])
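A binned count like the one above is, conceptually, a multi-dimensional histogram. A minimal numpy sketch of the equivalent computation (plain numpy on synthetic data, not the vaex API or the example dataset):

```python
import numpy as np

# Synthetic stand-in for two columns of a dataset.
rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = rng.normal(size=1000)

# count("x", binby=["x", "y"], shape=(4, 4)) over the full data range is
# conceptually a 2-d histogram with 4 bins per axis.
counts, xedges, yedges = np.histogram2d(x, y, bins=(4, 4))
print(counts.sum())  # 1000.0 -- every row falls inside the data's min/max range
```

The difference in practice is that vaex streams over the (possibly memory-mapped) columns instead of requiring them in memory at once.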

Visualise it using matplotlib:

>>> t.plot("x", "y", show=True)
<matplotlib.image.AxesImage at 0x1165a5090>
vaex.open(path, convert=False, shuffle=False, copy_index=True, *args, **kwargs)[source]

Open a dataset from file given by path

Example:

>>> ds = vaex.open('sometable.hdf5')
>>> ds = vaex.open('somedata*.csv', convert='bigdata.hdf5')
Parameters:
  • path (str) – local or absolute path to file, or glob string
  • convert – convert files to an hdf5 file for optimization, can also be a path
  • shuffle (bool) – shuffle converted dataset or not
  • args – extra arguments for file readers that need it
  • kwargs – extra keyword arguments
  • copy_index (bool) – copy index when source is read via pandas
Returns:

a dataset if the file is supported, otherwise None

Return type:

Dataset

Example:
>>> import vaex as vx
>>> vx.open('myfile.hdf5')
<vaex.dataset.Hdf5MemoryMapped at 0x1136ee3d0>
>>> vx.open('gadget_file.hdf5', 3) # this will read only particle type 3
<vaex.dataset.Hdf5MemoryMappedGadget at 0x1136ef3d0>
vaex.from_arrays(**arrays)[source]

Create an in memory dataset from numpy arrays

Param:arrays: keyword arguments with arrays
Example:
>>> x = np.arange(10)
>>> y = x ** 2
>>> dataset = vx.from_arrays(x=x, y=y)
vaex.from_csv(filename_or_buffer, copy_index=True, **kwargs)[source]

Shortcut to read a csv file using pandas and convert to a dataset directly

vaex.from_ascii(path, seperator=None, names=True, skip_lines=0, skip_after=0, **kwargs)[source]

Create an in memory dataset from an ascii file (whitespace separated by default).

>>> ds = vx.from_ascii("table.asc")
>>> ds = vx.from_ascii("table.csv", seperator=",", names=["x", "y", "z"])
Parameters:
  • path – file path
  • seperator – value separator, by default whitespace, use “,” for comma separated values.
  • names – If True, the first line is used for the column names, otherwise provide a list of strings with names
  • skip_lines – skip lines at the start of the file
  • skip_after – skip lines at the end of the file
  • kwargs
Returns:

vaex.from_pandas(df, name='pandas', copy_index=True, index_name='index')[source]

Create an in memory dataset from a pandas dataframe

Param:pandas.DataFrame df: Pandas dataframe
Param:name: unique name for the dataset
>>> import pandas as pd
>>> df = pd.read_csv("test.csv")
>>> ds = vx.from_pandas(df, name="test")
vaex.from_astropy_table(table)[source]
vaex.from_samp(username=None, password=None)[source]

Connect to a SAMP Hub and wait for a single table load event, disconnect, download the table and return the dataset

Useful if you want to send a single table from say TOPCAT to vaex in a python console or notebook

vaex.open_many(filenames)[source]

Open a list of filenames, and return a dataset with all datasets concatenated

Parameters:filenames (list[str]) – list of filenames/paths
Return type:Dataset
vaex.server(url, **kwargs)[source]

Connect to a server at the given url, which supports the vaex web api

Parameters:url (str) – url (hostname or ip address) of the server
Return vaex.dataset.ServerRest:
 returns a server object, note that it does not connect to the server yet, so this will always succeed
Return type:ServerRest
vaex.example(download=True)[source]

Returns an example dataset which comes with vaex for testing/learning purposes

Return type:vaex.dataset.Dataset
vaex.app(*args, **kwargs)[source]

Create a vaex app, the QApplication mainloop must be started.

In an ipython notebook/jupyter, do the following:

>>> import vaex.ui.main  # this causes the qt api level to be set properly
>>> import vaex as vx

Next cell:

>>> %gui qt

Next cell:

>>> app = vx.app()

From now on, you can run the app along with jupyter

vaex.zeldovich(dim=2, N=256, n=-2.5, t=None, scale=1, seed=None)[source]

Creates a zeldovich dataset

vaex.set_log_level_debug()[source]

set log level to debug

vaex.set_log_level_info()[source]

set log level to info

vaex.set_log_level_warning()[source]

set log level to warning

vaex.set_log_level_exception()[source]

set log level to exception

vaex.set_log_level_off()[source]

Disable logging

vaex.delayed(f)[source]

Dataset class

class vaex.dataset.Dataset(name, column_names, executor=None)[source]

All datasets are encapsulated in this class, local or remote datasets

Each dataset has a number of columns, and a number of rows, the length of the dataset.

The most common operations are Dataset.plot() and the statistics methods listed above (count, mean, etc.).

All Datasets have one ‘selection’, and all calculations by Subspace are done on the whole dataset (default) or for the selection. The following example shows how to use the selection.

>>> some_dataset.select("x < 0")
>>> subspace_xy = some_dataset("x", "y")
>>> subspace_xy_selected = subspace_xy.selected()

TODO: active fraction, length and shuffled

__call__(*expressions, **kwargs)[source]

Alias/shortcut for Dataset.subspace()

__getitem__(item)[source]

Convenient way to get expressions, (shallow) copies of a few columns, or to apply filtering

Examples
>>> ds['Lz']  # the expression 'Lz'
>>> ds['Lz/2']  # the expression 'Lz/2'
>>> ds[["Lz", "E"]]  # a shallow copy with just two columns
>>> ds[ds.Lz < 0]  # a shallow copy with the filter Lz < 0 applied
__iter__()[source]

Iterator over the column names

__len__()[source]

Returns the number of rows in the dataset (filtering applied)

__setitem__(name, value)[source]

Convenient way to add a virtual column / expression to this dataset

Examples:
>>> ds['r'] = np.sqrt(ds.x**2 + ds.y**2 + ds.z**2)
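The virtual column above evaluates, per row, the same expression you would write against plain numpy arrays; a small illustration in numpy only (not the vaex API):

```python
import numpy as np

# Stand-ins for three columns of a dataset.
x = np.array([3.0, 0.0])
y = np.array([4.0, 0.0])
z = np.array([0.0, 2.0])

# What the virtual column 'r' evaluates to, row by row:
r = np.sqrt(x**2 + y**2 + z**2)
print(r)  # [5. 2.]
```

In vaex the expression is stored, not the result, so 'r' costs no memory until it is evaluated.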
__weakref__

list of weak references to the object (if defined)

add_column(name, f_or_array)[source]

Add an in memory array as a column

add_column_healpix(name='healpix', longitude='ra', latitude='dec', degrees=True, healpix_order=12, nest=True)[source]

Add a healpix (in memory) column based on a longitude and latitude

Parameters:
  • name – Name of column
  • longitude – longitude expression
  • latitude – latitude expression (astronomical convention, latitude=90 is the north pole)
  • degrees – If lon/lat are in degrees (default) or radians.
  • healpix_order – healpix order, >= 0
  • nest – Nested healpix (default) or ring.
add_variable(name, expression, overwrite=True)[source]

Add a variable column to the dataset

Param:str name: name of virtual variable
Param:expression: expression for the variable

Variable may refer to other variables, and virtual columns and expression may refer to variables

Example:
>>> dataset.add_variable("center", 0)
>>> dataset.add_virtual_column("x_prime", "x-center")
>>> dataset.select("x_prime < 0")
add_virtual_column(name, expression, unique=False)[source]

Add a virtual column to the dataset

Example:

>>> dataset.add_virtual_column("r", "sqrt(x**2 + y**2 + z**2)")
>>> dataset.select("r < 10")

Param:str name: name of virtual column
Param:expression: expression for the column
Parameters:unique (str) – if name is already used, make it unique by adding a postfix, e.g. _1, or _2
add_virtual_columns_aitoff(alpha, delta, x, y, radians=True)[source]

Add aitoff (https://en.wikipedia.org/wiki/Aitoff_projection) projection

Parameters:
  • alpha – azimuth angle
  • delta – polar angle
  • x – output name for x coordinate
  • y – output name for y coordinate
  • radians – input and output in radians (True), or degrees (False)
Returns:

add_virtual_columns_cartesian_to_polar(x='x', y='y', radius_out='r_polar', azimuth_out='phi_polar', propagate_uncertainties=False, radians=False)[source]

Convert cartesian to polar coordinates

Parameters:
  • x – expression for x
  • y – expression for y
  • radius_out – name for the virtual column for the radius
  • azimuth_out – name for the virtual column for the azimuth angle
  • propagate_uncertainties – {propagate_uncertainties}
  • radians – if True, azimuth is in radians, defaults to degrees
Returns:
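As a sketch of what the two virtual columns compute, under the standard polar convention (degrees by default, since radians=False) — plain numpy, not the vaex API, and the exact expressions vaex generates may differ in branch-cut convention:

```python
import numpy as np

# Stand-ins for the x and y columns.
x = np.array([1.0, 0.0, -1.0])
y = np.array([0.0, 2.0, 0.0])

r_polar = np.sqrt(x**2 + y**2)            # radius
phi_polar = np.degrees(np.arctan2(y, x))  # azimuth, in degrees
print(r_polar)    # [1. 2. 1.]
print(phi_polar)  # ≈ [0. 90. 180.]
```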

add_virtual_columns_cartesian_to_spherical(x='x', y='y', z='z', alpha='l', delta='b', distance='distance', radians=False, center=None, center_name='solar_position')[source]

Convert cartesian to spherical coordinates.

Parameters:
  • x
  • y
  • z
  • alpha
  • delta – name for polar angle, ranges from -90 to 90 (or -pi/2 to pi/2 when radians is True).
  • distance
  • radians
  • center
  • center_name
Returns:

add_virtual_columns_cartesian_velocities_to_polar(x='x', y='y', vx='vx', radius_polar=None, vy='vy', vr_out='vr_polar', vazimuth_out='vphi_polar', propagate_uncertainties=False)[source]

Convert cartesian to polar velocities.

Parameters:
  • x
  • y
  • vx
  • radius_polar – Optional expression for the radius, may lead to a better performance when given.
  • vy
  • vr_out
  • vazimuth_out
  • propagate_uncertainties – {propagate_uncertainties}
Returns:

add_virtual_columns_cartesian_velocities_to_spherical(x='x', y='y', z='z', vx='vx', vy='vy', vz='vz', vr='vr', vlong='vlong', vlat='vlat', distance=None)[source]

Convert velocities from a cartesian to a spherical coordinate system

TODO: errors

Parameters:
  • x – name of x column (input)
  • y – y
  • z – z
  • vx – vx
  • vy – vy
  • vz – vz
  • vr – name of the column for the radial velocity in the r direction (output)
  • vlong – name of the column for the velocity component in the longitude direction (output)
  • vlat – name of the column for the velocity component in the latitude direction, positive points to the north pole (output)
  • distance – Expression for distance, if not given defaults to sqrt(x**2+y**2+z**2), but if this column already exists, passing this expression may lead to a better performance
Returns:

add_virtual_columns_matrix3d(x, y, z, xnew, ynew, znew, matrix, matrix_name='deprecated', matrix_is_expression=False, translation=[0, 0, 0], propagate_uncertainties=False)[source]
Parameters:
  • x (str) – name of x column
  • y (str) –
  • z (str) –
  • xnew (str) – name of transformed x column
  • ynew (str) –
  • znew (str) –
  • matrix (list[list]) – 2d array or list, with [row,column] order
  • matrix_name (str) –
Returns:

add_virtual_columns_rotation(x, y, xnew, ynew, angle_degrees, propagate_uncertainties=False)[source]

Rotation in 2d

Parameters:
  • x (str) – Name/expression of x column
  • y (str) – idem for y
  • xnew (str) – name of transformed x column
  • ynew (str) –
  • angle_degrees (float) – rotation in degrees, anti clockwise
Returns:
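The transformation is the standard 2-d rotation; a numpy sketch of what the new virtual columns evaluate to (plain numpy, not the vaex API):

```python
import numpy as np

# Stand-ins for the x and y columns.
x = np.array([1.0, 0.0])
y = np.array([0.0, 1.0])
theta = np.radians(90.0)  # anti-clockwise, as documented

xnew = x * np.cos(theta) - y * np.sin(theta)
ynew = x * np.sin(theta) + y * np.cos(theta)
# (1, 0) -> (0, 1) and (0, 1) -> (-1, 0)
```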

add_virtual_columns_spherical_to_cartesian(alpha, delta, distance, xname='x', yname='y', zname='z', propagate_uncertainties=False, center=[0, 0, 0], center_name='solar_position', radians=False)[source]

Convert spherical to cartesian coordinates.

Parameters:
  • alpha
  • delta – polar angle, ranging from -90 (south pole) to 90 (north pole)
  • distance – radial distance, determines the units of x, y and z
  • xname
  • yname
  • zname
  • propagate_uncertainties – If True, will propagate errors for the new virtual columns, see :py:`Dataset.propagate_uncertainties` for details
  • center
  • center_name
  • radians
Returns:
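Given the latitude-style delta convention described above, the conversion is the usual one; a numpy sketch (plain numpy, not the vaex API, ignoring the optional center offset):

```python
import numpy as np

alpha = np.radians(45.0)   # azimuthal angle
delta = np.radians(30.0)   # polar angle, -90 (south pole) .. 90 (north pole)
distance = 2.0

x = distance * np.cos(delta) * np.cos(alpha)
y = distance * np.cos(delta) * np.sin(alpha)
z = distance * np.sin(delta)
print(z)  # ≈ 1.0 (distance * sin(30 degrees))
```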

byte_size(selection=False)[source]

Return the size in bytes the whole dataset requires (or the selection), respecting the active_fraction

classmethod can_open(path, *args, **kwargs)[source]

Tests if this class can open the file given by path

close_files()[source]

Close any possible open file handles, the dataset will not be in a usable state afterwards

col

Gives direct access to the data as numpy-like arrays.

Convenient when working with ipython in combination with small datasets, since this gives tab-completion

Columns can be accesed by there names, which are attributes. The attribues are currently strings, so you cannot do computations with them

Example:
>>> ds = vx.example()
>>> ds.plot(ds.col.x, ds.col.y)
column_count()[source]

Returns the number of columns, not counting virtual ones

combinations(expressions_list=None, dimension=2, exclude=None, **kwargs)[source]

Generate a list of combinations for the possible expressions for the given dimension

Parameters:
  • expressions_list – list of list of expressions, where the inner list defines the subspace
  • dimension – if given, generates a subspace with all possible combinations for that dimension
  • exclude – list of
correlation(x, y=None, binby=[], limits=None, shape=128, sort=False, sort_key=<ufunc 'absolute'>, selection=False, delay=False, progress=None)[source]

Calculate the correlation coefficient cov[x,y]/(std[x]*std[y]) between x and y, possibly on a grid defined by binby

Examples:

>>> ds.correlation("x**2+y**2+z**2", "-log(-E+1)")
array(0.6366637382215669)
>>> ds.correlation("x**2+y**2+z**2", "-log(-E+1)", binby="Lz", shape=4)
array([ 0.40594394,  0.69868851,  0.61394099,  0.65266318])
Parameters:
  • x – expression or list of expressions, e.g. 'x', or ['x', 'y']
  • y – expression or list of expressions, e.g. 'x', or ['x', 'y']
  • binby – List of expressions for constructing a binned grid
  • limits – description for the min and max values for the expressions, e.g. 'minmax', '99.7%', [0, 10], or a list of, e.g. [[0, 10], [0, 20], 'minmax']
  • shape – shape for the array where the statistic is calculated on, if only an integer is given, it is used for all dimensions, e.g. shape=128, shape=[128, 256]
  • selection – Name of selection to use (or True for the 'default'), or all the data (when selection is None or False), or a list of selections
  • delay – Do not return the result, but a proxy for asynchronous calculation (currently only for internal use)
  • progress – A callable that takes one argument (a floating point value between 0 and 1) indicating the progress, calculations are cancelled when this callable returns False
Returns:

Numpy array with the given shape, or a scalar when no binby argument is given, with the statistic
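The definition above (cov[x,y]/(std[x]*std[y])) can be spelled out directly in numpy; a sketch on synthetic data (plain numpy, not the vaex API):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=500)
y = x + rng.normal(size=500)  # correlated with x by construction

# Population covariance, then normalize by the standard deviations:
cov_xy = ((x - x.mean()) * (y - y.mean())).mean()
corr = cov_xy / (x.std() * y.std())
```

For this construction the true coefficient is 1/sqrt(2), so corr should come out near 0.71.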

count(expression=None, binby=[], limits=None, shape=128, selection=False, delay=False, edges=False, progress=None)[source]

Count the number of non-NaN values (or all, if expression is None or “*”)

Examples:

>>> ds.count()
330000.0
>>> ds.count("*")
330000.0
>>> ds.count("*", binby=["x"], shape=4)
array([  10925.,  155427.,  152007.,   10748.])
Parameters:
  • expression – Expression or column for which to count non-missing values, or None or ‘*’ for counting the rows
  • binby – List of expressions for constructing a binned grid
  • limits – description for the min and max values for the expressions, e.g. ‘minmax’, ‘99.7%’, [0, 10], or a list of, e.g. [[0, 10], [0, 20], ‘minmax’]
  • shape – shape for the array where the statistic is calculated on, if only an integer is given, it is used for all dimensions, e.g. shape=128, shape=[128, 256]
  • selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections
  • delay – Do not return the result, but a proxy for asynchronous calculation (currently only for internal use)
  • progress – A callable that takes one argument (a floating point value between 0 and 1) indicating the progress, calculations are cancelled when this callable returns False
  • edges – Currently for internal use only (it includes NaN's and values outside the limits at the borders: NaN at border 0, values smaller than the limits at 1, and larger at -1)
Returns:

Numpy array with the given shape, or a scalar when no binby argument is given, with the statistic

cov(x, y=None, binby=[], limits=None, shape=128, selection=False, delay=False, progress=None)[source]

Calculate the covariance matrix for x and y or more expressions, possibly on a grid defined by binby

Either x and y are expressions, e.g:

>>> ds.cov("x", "y")

Or only the x argument is given with a list of expressions, e,g.:

>>> ds.cov(["x", "y", "z"])

Examples:

>>> ds.cov("x", "y")
array([[ 53.54521742,  -3.8123135 ],
       [ -3.8123135 ,  60.62257881]])
>>> ds.cov(["x", "y", "z"])
array([[ 53.54521742,  -3.8123135 ,  -0.98260511],
       [ -3.8123135 ,  60.62257881,   1.21381057],
       [ -0.98260511,   1.21381057,  25.55517638]])
>>> ds.cov("x", "y", binby="E", shape=2)
array([[[  9.74852878e+00,  -3.02004780e-02],
        [ -3.02004780e-02,   9.99288215e+00]],
       [[  8.43996546e+01,  -6.51984181e+00],
        [ -6.51984181e+00,   9.68938284e+01]]])

Parameters:
  • x – expression or list of expressions, e.g. 'x', or ['x', 'y']
  • y – if previous argument is not a list, this argument should be given
  • binby – List of expressions for constructing a binned grid
  • limits – description for the min and max values for the expressions, e.g. 'minmax', '99.7%', [0, 10], or a list of, e.g. [[0, 10], [0, 20], 'minmax']
  • shape – shape for the array where the statistic is calculated on, if only an integer is given, it is used for all dimensions, e.g. shape=128, shape=[128, 256]
  • selection – Name of selection to use (or True for the 'default'), or all the data (when selection is None or False), or a list of selections
  • delay – Do not return the result, but a proxy for asynchronous calculation (currently only for internal use)
Returns:

Numpy array with the given shape, or a scalar when no binby argument is given, with the statistic; the last dimensions are of shape (2,2)
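For a quick sanity check of the (2, 2) shape, the per-cell matrix corresponds to numpy's covariance with population normalization; a sketch (plain numpy, not the vaex API):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=1000)
y = rng.normal(size=1000)

# The (2, 2) covariance matrix; bias=True divides by N (population
# normalization) rather than N-1.
c = np.cov(np.vstack([x, y]), bias=True)
print(c.shape)  # (2, 2)
```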
covar(x, y, binby=[], limits=None, shape=128, selection=False, delay=False, progress=None)[source]

Calculate the covariance cov[x,y] between x and y, possibly on a grid defined by binby

Examples:

>>> ds.covar("x**2+y**2+z**2", "-log(-E+1)")
array(52.69461456005138)
>>> ds.covar("x**2+y**2+z**2", "-log(-E+1)")/(ds.std("x**2+y**2+z**2") * ds.std("-log(-E+1)"))
0.63666373822156686
>>> ds.covar("x**2+y**2+z**2", "-log(-E+1)", binby="Lz", shape=4)
array([ 10.17387143,  51.94954078,  51.24902796,  20.2163929 ])
Parameters:
  • x – expression or list of expressions, e.g. 'x', or ['x', 'y']
  • y – expression or list of expressions, e.g. 'x', or ['x', 'y']
  • binby – List of expressions for constructing a binned grid
  • limits – description for the min and max values for the expressions, e.g. 'minmax', '99.7%', [0, 10], or a list of, e.g. [[0, 10], [0, 20], 'minmax']
  • shape – shape for the array where the statistic is calculated on, if only an integer is given, it is used for all dimensions, e.g. shape=128, shape=[128, 256]
  • selection – Name of selection to use (or True for the 'default'), or all the data (when selection is None or False), or a list of selections
  • delay – Do not return the result, but a proxy for asynchronous calculation (currently only for internal use)
  • progress – A callable that takes one argument (a floating point value between 0 and 1) indicating the progress, calculations are cancelled when this callable returns False
Returns:

Numpy array with the given shape, or a scalar when no binby argument is given, with the statistic

delete_variable(name)[source]

Deletes a variable from a dataset

delete_virtual_column(name)[source]

Deletes a virtual column from a dataset

dropna(drop_nan=True, drop_masked=True, column_names=None)[source]

Create a shallow copy dataset, with filtering set using select_non_missing

Parameters:
  • drop_nan – drop rows when there is a NaN in any of the columns (will only affect float values)
  • drop_masked – drop rows when there is a masked value in any of the columns
  • column_names – The columns to consider, default: all (real, non-virtual) columns
Returns:

Dataset

evaluate(expression, i1=None, i2=None, out=None, selection=None)[source]

Evaluate an expression, and return a numpy array with the results for the full column or a part of it.

Note that this is not how vaex should be used, since it means a copy of the data needs to fit in memory.

To get partial results, use i1 and i2.

Parameters:
  • expression (str) – Name/expression to evaluate
  • i1 (int) – Start row index, default is the start (0)
  • i2 (int) – End row index, default is the length of the dataset
  • out (ndarray) – Output array, to which the result may be written (may be used to reuse an array, or to write to a memory mapped array)
  • selection – selection to apply
Returns:

evaluate_variable(name)[source]

Evaluates the variable given by name

execute()[source]

Execute all delayed jobs

extract()[source]

Return a dataset containing only the filtered rows.

Note that no copy of the underlying data is made; only a view/reference is created.

The resulting dataset may be more efficient to work with when the original dataset is heavily filtered (contains just a small number of rows).

If no filtering is applied, it returns a trimmed view. For returned datasets, len(ds) == ds.length_original() == ds.length_unfiltered()

fillna(value, fill_nan=True, fill_masked=True, column_names=None, prefix='__original_', inplace=False)[source]

Return a dataset, where missing values/NaN are filled with ‘value’

Note that no copy of the underlying data is made; only a view/reference is created.

Note that filtering will be ignored (since they may change), you may want to consider running :py:`Dataset.extract` first.

sort(by, ascending=True, kind='quicksort', inplace=False)[source]

Return a dataset sorted by the given expression

Example:
>>> a = np.array(['a', 'b', 'c'])
>>> x = np.arange(1,4)
>>> ds = vaex.from_arrays(a=a, x=x)
>>> ds.sort('(x-1.8)**2', ascending=False)  # b, c, a will be the order of a
Parameters:
  • by (str) – expression to sort by
  • ascending (bool) – ascending (default, True) or descending (False)
  • kind (str) – kind of algorithm to use (passed to numpy.argsort)
  • inplace – Make modifications to self or return a new dataset
get_active_fraction()[source]

Value in the range (0, 1], to work only with a subset of rows

get_column_names(virtual=False, hidden=False, strings=False)[source]

Return a list of column names

Parameters:
  • virtual – If True, also return virtual columns
  • hidden – If True, also return hidden columns
Return type:

list of str

get_current_row()[source]

Individual rows can be ‘picked’; this is the index (integer) of the current row, or None if nothing is picked

get_private_dir(create=False)[source]

Each datasets has a directory where files are stored for metadata etc

Example:
>>> import vaex as vx
>>> ds = vx.example()
>>> ds.get_private_dir()
'/Users/users/breddels/.vaex/datasets/_Users_users_breddels_vaex-testing_data_helmi-dezeeuw-2000-10p.hdf5'
Parameters:create (bool) – if True, it will create the directory if it does not exist
get_selection(name='default')[source]

Get the current selection object (mostly for internal use atm)

get_variable(name)[source]

Returns the variable given by name, it will not evaluate it.

For evaluation, see Dataset.evaluate_variable(), see also Dataset.set_variable()

has_current_row()[source]

Returns True/False whether there currently is a picked row

has_selection(name='default')[source]

Returns True if there is a selection

healpix_count(expression=None, healpix_expression=None, healpix_max_level=12, healpix_level=8, binby=None, limits=None, shape=128, delay=False, progress=None, selection=None)[source]

Count non-missing values for expression on an array which represents healpix data.

Parameters:
  • expression – Expression or column for which to count non-missing values, or None or ‘*’ for counting the rows
  • healpix_expression – {healpix_max_level}
  • healpix_max_level – {healpix_max_level}
  • healpix_level – {healpix_level}
  • binby – {binby}, these dimensions follow the first healpix dimension.
  • limits – {limits}
  • shape – {shape}
  • selection – {selection}
  • delay – {delay}
  • progress – {progress}
Returns:

healpix_plot(healpix_expression='source_id/34359738368', healpix_max_level=12, healpix_level=8, what='count(*)', selection=None, grid=None, healpix_input='equatorial', healpix_output='galactic', f=None, colormap='afmhot', grid_limits=None, image_size=800, nest=True, figsize=None, interactive=False, title='', smooth=None, show=False, colorbar=True, rotation=(0, 0, 0))[source]
Parameters:
  • healpix_expression – {healpix_max_level}
  • healpix_max_level – {healpix_max_level}
  • healpix_level – {healpix_level}
  • what – {what}
  • selection – {selection}
  • grid – {grid}
  • healpix_input – Specify if the healpix index is in “equatorial”, “galactic” or “ecliptic”.
  • healpix_output – Plot in “equatorial”, “galactic” or “ecliptic”.
  • f – function to apply to the data
  • colormap – matplotlib colormap
  • grid_limits – Optional sequence [minvalue, maxvalue] that determines the min and max value that map to the colormap (values below and above these are clipped to the min/max); default is [min(f(grid)), max(f(grid))]
  • image_size – size for the image that healpy uses for rendering
  • nest – If the healpix data is in nested (True) or ring (False) order
  • figsize – If given, modify the matplotlib figure size. Example: (14, 9)
  • interactive – (Experimental, uses healpy.mollzoom if True)
  • title – Title of figure
  • smooth – apply gaussian smoothing, in degrees
  • show – Call matplotlib’s show (True) or not (False, default)
  • rotation – Rotate the plot, in format (lon, lat, psi) such that (lon, lat) is the center, and rotate on the screen by angle psi. All angles are in degrees.
Returns:

is_local()[source]

Returns True if the dataset is a local dataset, False when it is a remote dataset

length_original()[source]

The full length of the dataset, independent of the active_fraction or any filtering. This is the real length of the underlying ndarrays

length_unfiltered()[source]

The length of the arrays that should be considered (respecting active range), but without filtering

limits(expression, value=None, square=False, selection=None, delay=False, shape=None)[source]

Calculate the [min, max] range for expression, as described by value, which is ‘99.7%’ by default.

If value is a list of the form [minvalue, maxvalue], it is simply returned, this is for convenience when using mixed forms.

Example:

>>> ds.limits("x")
array([-28.86381927,  28.9261226 ])
>>> ds.limits(["x", "y"])
(array([-28.86381927,  28.9261226 ]), array([-28.60476934,  28.96535249]))
>>> ds.limits(["x", "y"], "minmax")
(array([-128.293991,  271.365997]), array([ -71.5523682,  146.465836 ]))
>>> ds.limits(["x", "y"], ["minmax", "90%"])
(array([-128.293991,  271.365997]), array([-13.37438402,  13.4224423 ]))
>>> ds.limits(["x", "y"], ["minmax", [0, 10]])
(array([-128.293991,  271.365997]), [0, 10])
Parameters:
  • expression – expression or list of expressions, e.g. 'x', or ['x', 'y']
  • value – description for the min and max values for the expressions, e.g. 'minmax', '99.7%', [0, 10], or a list of, e.g. [[0, 10], [0, 20], 'minmax']
  • selection – Name of selection to use (or True for the 'default'), or all the data (when selection is None or False), or a list of selections
  • delay – Do not return the result, but a proxy for asynchronous calculation (currently only for internal use)
Returns:

List in the form [[xmin, xmax], [ymin, ymax], …. ,[zmin, zmax]] or [xmin, xmax] when expression is not a list

limits_percentage(expression, percentage=99.73, square=False, delay=False)[source]

Calculate the [min, max] range for expression, containing approximately a percentage of the data as defined by percentage.

The range is symmetric around the median, i.e., for a percentage of 90, this gives the same results as:

>>> ds.limits_percentage("x", 90)
array([-12.35081376,  12.14858052])
>>> ds.percentile_approx("x", 5), ds.percentile_approx("x", 95)
(array([-12.36813152]), array([ 12.13275818]))

NOTE: this value is approximated by calculating the cumulative distribution on a grid.
NOTE 2: The values above are not exactly the same, since percentile and limits_percentage do not share the same code.

Parameters:
  • expression – expression or list of expressions, e.g. 'x', or ['x', 'y']
  • percentage (float) – Value between 0 and 100
  • delay – Do not return the result, but a proxy for asynchronous calculation (currently only for internal use)
Returns:

List in the form [[xmin, xmax], [ymin, ymax], …. ,[zmin, zmax]] or [xmin, xmax] when expression is not a list
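The idea of a symmetric-around-the-median percentage range can be sketched with numpy's percentile on synthetic data (plain numpy, not vaex's grid-based approximation):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=10_000)

# A 90% range with equal tail mass on both sides, as described above:
percentage = 90.0
tail = (100.0 - percentage) / 2.0          # 5% in each tail
lo, hi = np.percentile(x, [tail, 100.0 - tail])
inside = ((x >= lo) & (x <= hi)).mean()    # fraction of rows inside the range
```

vaex approximates the same quantity from a cumulative distribution on a grid, so its result will differ slightly from the exact percentiles.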

materialize(virtual_column, inplace=False)[source]

Returns a new dataset where the virtual column is turned into an in memory numpy array

Example:
>>> x = np.arange(1,4)
>>> y = np.arange(2,5)
>>> ds = vaex.from_arrays(x=x, y=y)
>>> ds['r'] = (ds.x**2 + ds.y**2)**0.5 # 'r' is a virtual column (computed on the fly)
>>> ds = ds.materialize('r')  # now 'r' is a 'real' column (i.e. a numpy array)
Parameters:inplace – {inplace}
max(expression, binby=[], limits=None, shape=128, selection=False, delay=False, progress=None)[source]

Calculate the maximum for given expressions, possibly on a grid defined by binby

Example:

>>> ds.max("x")
array(271.365997)
>>> ds.max(["x", "y"])
array([ 271.365997,  146.465836])
>>> ds.max("x", binby="x", shape=5, limits=[-10, 10])
array([-6.00010443, -2.00002384,  1.99998057,  5.99983597,  9.99984646])
Parameters:
  • expression – expression or list of expressions, e.g. ‘x’, or [‘x’, ‘y’]
  • binby – List of expressions for constructing a binned grid
  • limits – description for the min and max values for the expressions, e.g. ‘minmax’, ‘99.7%’, [0, 10], or a list of, e.g. [[0, 10], [0, 20], ‘minmax’]
  • shape – shape for the array where the statistic is calculated on, if only an integer is given, it is used for all dimensions, e.g. shape=128, shape=[128, 256]
  • selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections
  • delay – Do not return the result, but a proxy for asynchronous calculations (currently only for internal use)
  • progress – A callable that takes one argument (a floating point value between 0 and 1) indicating the progress, calculations are cancelled when this callable returns False
Returns:

Numpy array with the given shape, or a scalar when no binby argument is given, with the statistic
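What the binby/limits/shape parameters do can be sketched with numpy: bin one expression into equal-width bins over the given limits, then take the statistic inside each bin. A minimal illustration of the concept (not vaex's streaming implementation):

```python
import numpy as np

x = np.array([-9.5, -7.0, -3.0, 1.0, 4.0, 9.0])
limits, shape = (-10, 10), 5

# Assign each value to one of `shape` equal-width bins over `limits`,
# then take the maximum inside each bin (NaN where a bin is empty).
edges = np.linspace(limits[0], limits[1], shape + 1)
bins = np.digitize(x, edges) - 1
result = np.full(shape, np.nan)
for b in range(shape):
    in_bin = x[bins == b]
    if len(in_bin):
        result[b] = in_bin.max()
print(result)
```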

mean(expression, binby=[], limits=None, shape=128, selection=False, delay=False, progress=None)[source]

Calculate the mean for expression, possibly on a grid defined by binby.

Examples:

>>> ds.mean("x")
-0.067131491264005971
>>> ds.mean("(x**2+y**2)**0.5", binby="E", shape=4)
array([  2.43483742,   4.41840721,   8.26742458,  15.53846476])
Parameters:
  • expression – expression or list of expressions, e.g. ‘x’, or [‘x’, ‘y’]
  • binby – List of expressions for constructing a binned grid
  • limits – description for the min and max values for the expressions, e.g. ‘minmax’, ‘99.7%’, [0, 10], or a list of, e.g. [[0, 10], [0, 20], ‘minmax’]
  • shape – shape for the array where the statistic is calculated on, if only an integer is given, it is used for all dimensions, e.g. shape=128, shape=[128, 256]
  • selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections
  • delay – Do not return the result, but a proxy for asynchronous calculations (currently only for internal use)
  • progress – A callable that takes one argument (a floating point value between 0 and 1) indicating the progress, calculations are cancelled when this callable returns False
Returns:

Numpy array with the given shape, or a scalar when no binby argument is given, with the statistic

median_approx(expression, percentage=50.0, binby=[], limits=None, shape=128, percentile_shape=256, percentile_limits='minmax', selection=False, delay=False)[source]

Calculate the median, possibly on a grid defined by binby

NOTE: this value is approximated by calculating the cumulative distribution on a grid defined by percentile_shape and percentile_limits

Parameters:
  • expression – expression or list of expressions, e.g. ‘x’, or [‘x’, ‘y’]
  • binby – List of expressions for constructing a binned grid
  • limits – description for the min and max values for the expressions, e.g. ‘minmax’, ‘99.7%’, [0, 10], or a list of, e.g. [[0, 10], [0, 20], ‘minmax’]
  • shape – shape for the array where the statistic is calculated on, if only an integer is given, it is used for all dimensions, e.g. shape=128, shape=[128, 256]
  • percentile_limits – description for the min and max values to use for the cumulative histogram, should currently only be ‘minmax’
  • percentile_shape – shape for the array where the cumulative histogram is calculated on, integer type
  • selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections
  • delay – Do not return the result, but a proxy for asynchronous calculations (currently only for internal use)
Returns:

Numpy array with the given shape, or a scalar when no binby argument is given, with the statistic
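The "cumulative distribution on a grid" approximation mentioned above can be sketched with numpy: build a histogram with percentile_shape bins, take its cumulative sum, and read off where it crosses 0.5. A conceptual illustration (the helper name is ours):

```python
import numpy as np

def median_from_histogram(values, percentile_shape=256):
    """Approximate the median from a cumulative histogram: the accuracy
    is limited by the bin width of the grid."""
    counts, edges = np.histogram(values, bins=percentile_shape)
    cumulative = np.cumsum(counts) / counts.sum()
    # first bin where the cumulative fraction reaches 0.5
    i = np.searchsorted(cumulative, 0.5)
    return 0.5 * (edges[i] + edges[i + 1])  # bin centre

rng = np.random.default_rng(0)
values = rng.normal(loc=3.0, scale=1.0, size=100_000)
print(median_from_histogram(values))  # close to the true median of 3.0
```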

min(expression, binby=[], limits=None, shape=128, selection=False, delay=False, progress=None)[source]

Calculate the minimum for given expressions, possibly on a grid defined by binby

Example:

>>> ds.min("x")
array(-128.293991)
>>> ds.min(["x", "y"])
array([-128.293991 ,  -71.5523682])
>>> ds.min("x", binby="x", shape=5, limits=[-10, 10])
array([-9.99919128, -5.99972439, -1.99991322,  2.0000093 ,  6.0004878 ])
Parameters:
  • expression – expression or list of expressions, e.g. ‘x’, or [‘x’, ‘y’]
  • binby – List of expressions for constructing a binned grid
  • limits – description for the min and max values for the expressions, e.g. ‘minmax’, ‘99.7%’, [0, 10], or a list of, e.g. [[0, 10], [0, 20], ‘minmax’]
  • shape – shape for the array where the statistic is calculated on, if only an integer is given, it is used for all dimensions, e.g. shape=128, shape=[128, 256]
  • selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections
  • delay – Do not return the result, but a proxy for asynchronous calculations (currently only for internal use)
  • progress – A callable that takes one argument (a floating point value between 0 and 1) indicating the progress, calculations are cancelled when this callable returns False
Returns:

Numpy array with the given shape, or a scalar when no binby argument is given, with the statistic

minmax(expression, binby=[], limits=None, shape=128, selection=False, delay=False, progress=None)[source]

Calculate the minimum and maximum for expressions, possibly on a grid defined by binby

Example:

>>> ds.minmax("x")
array([-128.293991,  271.365997])
>>> ds.minmax(["x", "y"])
array([[-128.293991 ,  271.365997 ],
           [ -71.5523682,  146.465836 ]])
>>> ds.minmax("x", binby="x", shape=5, limits=[-10, 10])
array([[-9.99919128, -6.00010443],
           [-5.99972439, -2.00002384],
           [-1.99991322,  1.99998057],
           [ 2.0000093 ,  5.99983597],
           [ 6.0004878 ,  9.99984646]])
Parameters:
  • expression – expression or list of expressions, e.g. ‘x’, or [‘x’, ‘y’]
  • binby – List of expressions for constructing a binned grid
  • limits – description for the min and max values for the expressions, e.g. ‘minmax’, ‘99.7%’, [0, 10], or a list of, e.g. [[0, 10], [0, 20], ‘minmax’]
  • shape – shape for the array where the statistic is calculated on, if only an integer is given, it is used for all dimensions, e.g. shape=128, shape=[128, 256]
  • selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections
  • delay – Do not return the result, but a proxy for asynchronous calculations (currently only for internal use)
  • progress – A callable that takes one argument (a floating point value between 0 and 1) indicating the progress, calculations are cancelled when this callable returns False
Returns:

Numpy array with the given shape, or a scalar when no binby argument is given, with the statistic, the last dimension is of shape (2)

mutual_information(x, y=None, mi_limits=None, mi_shape=256, binby=[], limits=None, shape=128, sort=False, selection=False, delay=False)[source]

Estimate the mutual information between x and y on a grid with shape mi_shape and mi_limits, possibly on a grid defined by binby

If sort is True, the mutual information is returned in sorted (descending) order and the list of expressions is returned in the same order

Examples:

>>> ds.mutual_information("x", "y")
array(0.1511814526380327)
>>> ds.mutual_information([["x", "y"], ["x", "z"], ["E", "Lz"]])
array([ 0.15118145,  0.18439181,  1.07067379])
>>> ds.mutual_information([["x", "y"], ["x", "z"], ["E", "Lz"]], sort=True)
(array([ 1.07067379,  0.18439181,  0.15118145]),
[['E', 'Lz'], ['x', 'z'], ['x', 'y']])
Parameters:
  • x – expression or list of expressions, e.g. ‘x’, or [‘x’, ‘y’]
  • y – expression or list of expressions, e.g. ‘x’, or [‘x’, ‘y’]
  • binby – List of expressions for constructing a binned grid
  • limits – description for the min and max values for the expressions, e.g. ‘minmax’, ‘99.7%’, [0, 10], or a list of, e.g. [[0, 10], [0, 20], ‘minmax’]
  • shape – shape for the array where the statistic is calculated on, if only an integer is given, it is used for all dimensions, e.g. shape=128, shape=[128, 256]
  • sort – return mutual information in sorted (descending) order, together with the corresponding list of expressions, when sort is True
  • selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections
  • delay – Do not return the result, but a proxy for asynchronous calculations (currently only for internal use)
Returns:

Numpy array with the given shape, or a scalar when no binby argument is given, with the statistic
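A grid-based mutual information estimate like the one above can be sketched with numpy: histogram the joint distribution, normalize it, and apply the plug-in formula MI = Σ p(x,y) log(p(x,y) / (p(x)p(y))). An illustration of the estimator (in nats), not vaex's code:

```python
import numpy as np

def mutual_information(x, y, shape=64):
    """Plug-in mutual information estimate from a 2D histogram."""
    joint, _, _ = np.histogram2d(x, y, bins=shape)
    p_xy = joint / joint.sum()
    p_x = p_xy.sum(axis=1, keepdims=True)   # marginal of x
    p_y = p_xy.sum(axis=0, keepdims=True)   # marginal of y
    mask = p_xy > 0                          # avoid log(0)
    return np.sum(p_xy[mask] * np.log(p_xy[mask] / (p_x @ p_y)[mask]))

rng = np.random.default_rng(1)
a = rng.normal(size=50_000)
b = rng.normal(size=50_000)                 # independent of a
print(mutual_information(a, b))             # near 0 for independent variables
print(mutual_information(a, a + 0.1 * b))   # strongly dependent: much larger
```

Note that the estimate for independent variables is slightly above zero; the plug-in estimator has a positive bias of order (number of bins)/(number of samples).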

percentile_approx(expression, percentage=50.0, binby=[], limits=None, shape=128, percentile_shape=1024, percentile_limits='minmax', selection=False, delay=False)[source]

Calculate the percentile given by percentage, possibly on a grid defined by binby

NOTE: this value is approximated by calculating the cumulative distribution on a grid defined by percentile_shape and percentile_limits

>>> ds.percentile_approx("x", 10), ds.percentile_approx("x", 90)
(array([-8.3220355]), array([ 7.92080358]))
>>> ds.percentile_approx("x", 50, binby="x", shape=5, limits=[-10, 10])
array([[-7.56462982],
           [-3.61036641],
           [-0.01296306],
           [ 3.56697863],
           [ 7.45838367]])


Parameters:
  • expression – expression or list of expressions, e.g. ‘x’, or [‘x’, ‘y’]
  • binby – List of expressions for constructing a binned grid
  • limits – description for the min and max values for the expressions, e.g. ‘minmax’, ‘99.7%’, [0, 10], or a list of, e.g. [[0, 10], [0, 20], ‘minmax’]
  • shape – shape for the array where the statistic is calculated on, if only an integer is given, it is used for all dimensions, e.g. shape=128, shape=[128, 256]
  • percentile_limits – description for the min and max values to use for the cumulative histogram, should currently only be ‘minmax’
  • percentile_shape – shape for the array where the cumulative histogram is calculated on, integer type
  • selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections
  • delay – Do not return the result, but a proxy for asynchronous calculations (currently only for internal use)
Returns:

Numpy array with the given shape, or a scalar when no binby argument is given, with the statistic

plot3d(x, y, z, vx=None, vy=None, vz=None, vwhat=None, limits=None, grid=None, what='count(*)', shape=128, selection=[None, True], f=None, vcount_limits=None, smooth_pre=None, smooth_post=None, grid_limits=None, normalize='normalize', colormap='afmhot', figure_key=None, fig=None, lighting=True, level=[0.1, 0.5, 0.9], opacity=[0.01, 0.05, 0.1], level_width=0.1, show=True, **kwargs)[source]

Use at your own risk; requires ipyvolume

propagate_uncertainties(columns, depending_variables=None, cov_matrix='auto', covariance_format='{}_{}_covariance', uncertainty_format='{}_uncertainty')[source]

Propagates uncertainties (full covariance matrix) for a set of virtual columns.

The covariance matrix of the depending variables is guessed by finding columns prefixed by ‘e’ or ‘e_’, or postfixed by ‘_error’, ‘_uncertainty’, ‘e’ or ‘_e’. Off-diagonals (covariance or correlation) are found by the postfixes ‘_correlation’ or ‘_corr’ for correlation, or ‘_covariance’ or ‘_cov’ for covariance. (Note that x_y_cov = x_e * y_e * x_y_correlation.)

Example:

>>> ds = vaex.from_scalars(x=1, y=2, e_x=0.1, e_y=0.2)
>>> ds['u'] = ds.x + ds.y
>>> ds['v'] = np.log10(ds.x)
>>> ds.propagate_uncertainties([ds.u, ds.v])
>>> ds.u_uncertainty, ds.v_uncertainty
Parameters:
  • columns – list of columns for which to calculate the covariance matrix.
  • depending_variables – If not given, it is found out automatically, otherwise a list of columns which have uncertainties.
  • cov_matrix – List of list with expressions giving the covariance matrix, in the same order as depending_variables. If ‘full’ or ‘auto’, the covariance matrix for the depending_variables will be guessed, where ‘full’ gives an error if an entry was not found.
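First-order (linear) error propagation, which this method automates symbolically, can be worked out by hand for the docstring example above. A sketch with numpy, assuming the uncertainties e_x and e_y are independent (diagonal covariance matrix):

```python
import numpy as np

# u = x + y  and  v = log10(x), with independent uncertainties e_x, e_y.
x, y, e_x, e_y = 1.0, 2.0, 0.1, 0.2

# du/dx = 1, du/dy = 1  ->  e_u = sqrt(e_x**2 + e_y**2)
e_u = np.sqrt(e_x**2 + e_y**2)

# dv/dx = 1 / (x * ln 10)  ->  e_v = e_x / (x * ln 10)
e_v = e_x / (x * np.log(10))

print(e_u, e_v)
```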
remove_virtual_meta()[source]

Removes the file with the virtual columns etc.; it does not change the current virtual columns

rename_column(name, new_name, unique=False)[source]

Renames a column; note this is only the in-memory name, it will not be reflected on disk

sample(n=None, frac=None, replace=False, weights=None, random_state=None)[source]

Returns a dataset with a random set of rows

Note that no copy of the underlying data is made, only a view/reference is made.

Provide either n or frac.

Parameters:
  • n (int) – number of samples to take (default 1 if frac is None)
  • frac (float) – fraction of rows to take
  • replace (bool) – If true, a row may be drawn multiple times
  • weights (str or expression) – (unnormalized) probability that a row can be drawn
  • random_state (int or RandomState) – seed or RandomState for reproducibility; when None a random seed is chosen
Example:
>>> a = np.array(['a', 'b', 'c'])
>>> x = np.arange(1,4)
>>> ds = vaex.from_arrays(a=a, x=x)
>>> ds.sample(n=2, random_state=42) # 2 random rows, fixed seed
>>> ds.sample(frac=1) # 'shuffling'
>>> ds.sample(frac=1, replace=True) # useful for bootstrap (may contain repeated samples)
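Conceptually, sampling draws row indices and takes those rows (vaex returns a view, no data is copied). A sketch of the three variants above with plain numpy:

```python
import numpy as np

rng = np.random.RandomState(42)
a = np.array(['a', 'b', 'c'])
x = np.arange(1, 4)

indices = rng.choice(len(x), size=2, replace=False)        # like sample(n=2)
print(a[indices], x[indices])

shuffled = rng.choice(len(x), size=len(x), replace=False)  # like sample(frac=1)
bootstrap = rng.choice(len(x), size=len(x), replace=True)  # like sample(frac=1, replace=True)
```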
select(boolean_expression, mode='replace', name='default', executor=None)[source]

Perform a selection, defined by the boolean expression, and combined with the previous selection using the given mode

Selections are recorded in a history tree, per name; undo/redo can be done for them separately

Parameters:
  • boolean_expression (str) – Any valid column expression, with comparison operators
  • mode (str) – Possible boolean operator: replace/and/or/xor/subtract
  • name (str) – history tree or selection ‘slot’ to use
  • executor
Returns:

select_box(spaces, limits, mode='replace')[source]

Select a n-dimensional rectangular box bounded by limits

The following examples are equivalent:

>>> ds.select_box(['x', 'y'], [(0, 10), (0, 1)])
>>> ds.select_rectangle('x', 'y', [(0, 10), (0, 1)])

Parameters:
  • spaces – list of expressions
  • limits – sequence of shape [(x1, x2), (y1, y2)]
  • mode
Returns:

select_circle(x, y, xc, yc, r, mode='replace', name='default', inclusive=True)[source]

Select a circular region centred on xc, yc, with a radius of r.

Parameters:
  • x – expression for the x space
  • y – expression for the y space
  • xc – location of the centre of the circle in x
  • yc – location of the centre of the circle in y
  • r – the radius of the circle
  • name – name of the selection
  • mode
Returns:

Example:
>>> ds.select_circle('x', 'y', 2, 3, 1)
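A circular selection is conceptually a boolean mask over the two expressions: rows where (x - xc)**2 + (y - yc)**2 <= r**2 are selected. A sketch with numpy (vaex keeps this as a named selection rather than materializing the mask):

```python
import numpy as np

x = np.array([2.0, 3.0, 10.0])
y = np.array([3.0, 3.5, 10.0])
xc, yc, r = 2.0, 3.0, 1.0

# inclusive=True corresponds to <= rather than <
selected = (x - xc)**2 + (y - yc)**2 <= r**2
print(selected)
```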

select_ellipse(x, y, xc, yc, width, height, angle=0, mode='replace', name='default', radians=False, inclusive=True)[source]

Select an elliptical region centred on xc, yc, with a certain width, height and angle.

Parameters:
  • x – expression for the x space
  • y – expression for the y space
  • xc – location of the centre of the ellipse in x
  • yc – location of the centre of the ellipse in y
  • width – the width of the ellipse (diameter)
  • height – the width of the ellipse (diameter)
  • angle – (degrees) orientation of the ellipse, counter-clockwise measured from the y axis
  • name – name of the selection
  • mode
Returns:

Example:
>>> ds.select_ellipse('x', 'y', 2, -1, 5, 1, 30, name='my_ellipse')

select_inverse(name='default', executor=None)[source]

Invert the selection, i.e. what is selected will not be, and vice versa

Parameters:
  • name (str) –
  • executor
Returns:

select_lasso(expression_x, expression_y, xsequence, ysequence, mode='replace', name='default', executor=None)[source]

For performance reasons, a lasso selection is handled differently.

Parameters:
  • expression_x (str) – Name/expression for the x coordinate
  • expression_y (str) – Name/expression for the y coordinate
  • xsequence – list of x numbers defining the lasso, together with y
  • ysequence
  • mode (str) – Possible boolean operator: replace/and/or/xor/subtract
  • name (str) –
  • executor
Returns:

select_non_missing(drop_nan=True, drop_masked=True, column_names=None, mode='replace', name='default')[source]

Create a selection that selects rows having non missing values for all columns in column_names

The name reflects pandas’; no rows are actually dropped, a mask is kept to track the selection

Parameters:
  • drop_nan – drop rows when there is a NaN in any of the columns (will only affect float values)
  • drop_masked – drop rows when there is a masked value in any of the columns
  • column_names – The columns to consider, default: all (real, non-virtual) columns
  • mode (str) – Possible boolean operator: replace/and/or/xor/subtract
  • name (str) – history tree or selection ‘slot’ to use
Returns:

select_nothing(name='default')[source]

Select nothing

select_rectangle(x, y, limits, mode='replace')[source]

Select a 2d rectangular box in the space given by x and y, bounded by limits

Example:
>>> ds.select_rectangle('x', 'y', [(0, 10), (0, 1)])

Parameters:
  • x – expression for the x space
  • y – expression for the y space
  • limits – sequence of shape [(x1, x2), (y1, y2)]
  • mode
Returns:

selected_length()[source]

Returns the number of rows that are selected

selection_can_redo(name='default')[source]

Can selection name be redone?

selection_can_undo(name='default')[source]

Can selection name be undone?

selection_redo(name='default', executor=None)[source]

Redo selection, for the name

selection_undo(name='default', executor=None)[source]

Undo selection, for the name

set_active_fraction(value)[source]

Sets the active_fraction, sets the picked row to None, and removes the selection

TODO: we may be able to keep the selection, if we keep the expression, and also the picked row

set_active_range(i1, i2)[source]

Sets the active range (i1, i2), sets the picked row to None, and removes the selection

TODO: we may be able to keep the selection, if we keep the expression, and also the picked row

set_current_row(value)[source]

Set the current row, and emit the signal signal_pick

set_selection(selection, name='default', executor=None)[source]

Sets the selection object

Parameters:
  • selection – Selection object
  • name – selection ‘slot’
  • executor
Returns:

set_variable(name, expression_or_value, write=True)[source]

Set the variable to an expression or value defined by expression_or_value

Example:
>>> ds.set_variable("a", 2.)
>>> ds.set_variable("b", "a**2")
>>> ds.get_variable("b")
'a**2'
>>> ds.evaluate_variable("b")
4.0
Parameters:
  • name – Name of the variable
  • write – write variable to meta file
  • expression_or_value – value or expression
sort(by, ascending=True, kind='quicksort')[source]

Return a sorted dataset, sorted by the expression ‘by’

Note that no copy of the underlying data is made, only a view/reference is made.

Note that filtering will be ignored (since filters may change); you may want to consider running Dataset.extract first.

Example:
>>> a = np.array(['a', 'b', 'c'])
>>> x = np.arange(1,4)
>>> ds = vaex.from_arrays(a=a, x=x)
>>> ds.sort('(x-1.8)**2', ascending=False)  # c, a, b will be the order of a
Parameters:
  • by (str or expression) – expression to sort by
  • ascending (bool) – ascending (default, True) or descending (False)
  • kind (str) – kind of algorithm to use (passed to numpy.argsort)
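Sorting by an expression amounts to an argsort on the evaluated expression, applied as an index. A sketch of the example above with numpy, where the key for x = 1, 2, 3 evaluates to [0.64, 0.04, 1.44]:

```python
import numpy as np

a = np.array(['a', 'b', 'c'])
x = np.arange(1, 4)

key = (x - 1.8)**2                               # evaluated expression
order = np.argsort(key, kind='quicksort')[::-1]  # reversed = descending
print(a[order])                                  # rows reordered by the key
```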
std(expression, binby=[], limits=None, shape=128, selection=False, delay=False, progress=None)[source]

Calculate the standard deviation for the given expression, possibly on a grid defined by binby

>>> ds.std("vz")
110.31773397535071
>>> ds.std("vz", binby=["(x**2+y**2)**0.5"], shape=4)
array([ 123.57954851,   85.35190177,   61.14345748,   38.0740619 ])
Parameters:
  • expression – expression or list of expressions, e.g. ‘x’, or [‘x’, ‘y’]
  • binby – List of expressions for constructing a binned grid
  • limits – description for the min and max values for the expressions, e.g. ‘minmax’, ‘99.7%’, [0, 10], or a list of, e.g. [[0, 10], [0, 20], ‘minmax’]
  • shape – shape for the array where the statistic is calculated on, if only an integer is given, it is used for all dimensions, e.g. shape=128, shape=[128, 256]
  • selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections
  • delay – Do not return the result, but a proxy for asynchronous calculations (currently only for internal use)
  • progress – A callable that takes one argument (a floating point value between 0 and 1) indicating the progress, calculations are cancelled when this callable returns False
Returns:

Numpy array with the given shape, or a scalar when no binby argument is given, with the statistic

subspace(*expressions, **kwargs)[source]

Return a Subspace for this dataset with the given expressions:

Example:

>>> subspace_xy = some_dataset("x", "y")
Return type:

Subspace

Parameters:
  • expressions (list[str]) – list of expressions
  • kwargs
Returns:

subspaces(expressions_list=None, dimensions=None, exclude=None, **kwargs)[source]

Generate a Subspaces object, based on a custom list of expressions or all possible combinations based on dimension

Parameters:
  • expressions_list – list of list of expressions, where the inner list defines the subspace
  • dimensions – if given, generates a subspace with all possible combinations for that dimension
  • exclude – list of
sum(expression, binby=[], limits=None, shape=128, selection=False, delay=False, progress=None)[source]

Calculate the sum for the given expression, possibly on a grid defined by binby

Examples:

>>> ds.sum("L")
304054882.49378014
>>> ds.sum("L", binby="E", shape=4)
array([  8.83517994e+06,   5.92217598e+07,   9.55218726e+07,
                 1.40008776e+08])
Parameters:
  • expression – expression or list of expressions, e.g. ‘x’, or [‘x’, ‘y’]
  • binby – List of expressions for constructing a binned grid
  • limits – description for the min and max values for the expressions, e.g. ‘minmax’, ‘99.7%’, [0, 10], or a list of, e.g. [[0, 10], [0, 20], ‘minmax’]
  • shape – shape for the array where the statistic is calculated on, if only an integer is given, it is used for all dimensions, e.g. shape=128, shape=[128, 256]
  • selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections
  • delay – Do not return the result, but a proxy for asynchronous calculations (currently only for internal use)
  • progress – A callable that takes one argument (a floating point value between 0 and 1) indicating the progress, calculations are cancelled when this callable returns False
Returns:

Numpy array with the given shape, or a scalar when no binby argument is given, with the statistic
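A binned sum like sum("L", binby="E", shape=4) can be sketched with numpy's bincount: map the binby expression to bin indices over the limits, then accumulate the summed expression per bin. An illustration of the concept with made-up data:

```python
import numpy as np

E = np.array([0.5, 1.5, 1.6, 3.2])      # binby expression
L = np.array([10.0, 20.0, 30.0, 40.0])  # summed expression
limits, shape = (0, 4), 4

# Map each E value to a bin index in [0, shape), then sum L per bin.
bins = ((E - limits[0]) / (limits[1] - limits[0]) * shape).astype(int)
result = np.bincount(bins, weights=L, minlength=shape)
print(result)
```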

take(indices)[source]

Returns a dataset containing only rows indexed by indices

Note that no copy of the underlying data is made, only a view/reference is made.

Example:
>>> a = np.array(['a', 'b', 'c'])
>>> x = np.arange(1,4)
>>> ds = vaex.from_arrays(a=a, x=x)
>>> ds.take([0,2])
to_astropy_table(column_names=None, selection=None, strings=True, virtual=False, index=None)[source]

Returns an astropy Table object containing the ndarrays corresponding to the evaluated data

Parameters:
  • column_names – list of column names, to export, when None Dataset.get_column_names(strings=strings, virtual=virtual) is used
  • selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections
  • strings – argument passed to Dataset.get_column_names when column_names is None
  • virtual – argument passed to Dataset.get_column_names when column_names is None
  • index – if this column is given it is used for the index of the table
Returns:

astropy.table.Table object

to_copy(column_names=None, selection=None, strings=True, virtual=False, selections=True)[source]

Return a copy of the Dataset; if selection is None the data is not copied, the copy just holds a reference

Parameters:
  • column_names – list of column names, to copy, when None Dataset.get_column_names(strings=strings, virtual=virtual) is used
  • selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections
  • strings – argument passed to Dataset.get_column_names when column_names is None
  • virtual – argument passed to Dataset.get_column_names when column_names is None
  • selections – copy selections to new dataset
Returns:

Dataset

to_dict(column_names=None, selection=None, strings=True, virtual=False)[source]

Return a dict containing the ndarray corresponding to the evaluated data

Parameters:
  • column_names – list of column names, to export, when None Dataset.get_column_names(strings=strings, virtual=virtual) is used
  • selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections
  • strings – argument passed to Dataset.get_column_names when column_names is None
  • virtual – argument passed to Dataset.get_column_names when column_names is None
Returns:

dict

to_items(column_names=None, selection=None, strings=True, virtual=False)[source]

Return a list of [(column_name, ndarray), …] pairs where the ndarray corresponds to the evaluated data

Parameters:
  • column_names – list of column names, to export, when None Dataset.get_column_names(strings=strings, virtual=virtual) is used
  • selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections
  • strings – argument passed to Dataset.get_column_names when column_names is None
  • virtual – argument passed to Dataset.get_column_names when column_names is None
Returns:

list of (name, ndarray) pairs

to_pandas_df(column_names=None, selection=None, strings=True, virtual=False, index_name=None)[source]

Return a pandas DataFrame containing the ndarray corresponding to the evaluated data

If index is given, that column is used for the index of the dataframe.

Example:
>>> df = ds.to_pandas_df(["x", "y", "z"])
>>> ds_copy = vx.from_pandas(df)
Parameters:
  • column_names – list of column names, to export, when None Dataset.get_column_names(strings=strings, virtual=virtual) is used
  • selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections
  • strings – argument passed to Dataset.get_column_names when column_names is None
  • virtual – argument passed to Dataset.get_column_names when column_names is None
  • index_name – if this column is given it is used for the index of the DataFrame
Returns:

pandas.DataFrame object

trim(inplace=False)[source]

Return a dataset, where all columns are ‘trimmed’ by the active range.

For returned datasets, ds.get_active_range() returns (0, ds.length_original()).

Note that no copy of the underlying data is made, only a view/reference is made.

Parameters:inplace – Make modifications to self or return a new dataset
ucd_find(ucds, exclude=[])[source]

Find a set of columns (names) which have the ucd, or part of the ucd

Prefixed with a ^, it will only match the first part of the ucd

Example:
>>> dataset.ucd_find('pos.eq.ra', 'pos.eq.dec')
['RA', 'DEC']
>>> dataset.ucd_find('pos.eq.ra', 'doesnotexist')
>>> dataset.ucds[dataset.ucd_find('pos.eq.ra')]
'pos.eq.ra;meta.main'
>>> dataset.ucd_find('meta.main')
'dec'
>>> dataset.ucd_find('^meta.main')
>>>
unit(expression, default=None)[source]

Returns the unit (an astropy.unit.Units object) for the expression

Example:
>>> import vaex as vx
>>> ds = vx.example()
>>> ds.unit("x")
Unit("kpc")
>>> ds.unit("x*L")
Unit("km kpc2 / s")
Parameters:
  • expression – Expression, which can be a column name
  • default – if no unit is known, it will return this
Returns:

The resulting unit of the expression

Return type:

astropy.units.Unit

update_meta()[source]

Will read back the ucd, descriptions, units etc, written by Dataset.write_meta(). This will be done when opening a dataset.

update_virtual_meta()[source]

Will read back the virtual column etc, written by Dataset.write_virtual_meta(). This will be done when opening a dataset.

validate_expression(expression)[source]

Validate an expression (may throw Exceptions)

var(expression, binby=[], limits=None, shape=128, selection=False, delay=False, progress=None)[source]

Calculate the sample variance for the given expression, possibly on a grid defined by binby

Examples:

>>> ds.var("vz")
12170.002429456246
>>> ds.var("vz", binby=["(x**2+y**2)**0.5"], shape=4)
array([ 15271.90481083,   7284.94713504,   3738.52239232,   1449.63418988])
>>> ds.var("vz", binby=["(x**2+y**2)**0.5"], shape=4)**0.5
array([ 123.57954851,   85.35190177,   61.14345748,   38.0740619 ])
>>> ds.std("vz", binby=["(x**2+y**2)**0.5"], shape=4)
array([ 123.57954851,   85.35190177,   61.14345748,   38.0740619 ])
Parameters:
  • expression – expression or list of expressions, e.g. ‘x’, or [‘x’, ‘y’]
  • binby – List of expressions for constructing a binned grid
  • limits – description for the min and max values for the expressions, e.g. ‘minmax’, ‘99.7%’, [0, 10], or a list of, e.g. [[0, 10], [0, 20], ‘minmax’]
  • shape – shape for the array where the statistic is calculated on, if only an integer is given, it is used for all dimensions, e.g. shape=128, shape=[128, 256]
  • selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections
  • delay – Do not return the result, but a proxy for asynchronous calculations (currently only for internal use)
  • progress – A callable that takes one argument (a floating point value between 0 and 1) indicating the progress, calculations are cancelled when this callable returns False
Returns:

Numpy array with the given shape, or a scalar when no binby argument is given, with the statistic

write_meta()[source]

Writes all meta data: ucd, description and units

The default implementation is to write this to a file called meta.yaml in the directory defined by Dataset.get_private_dir(). Other implementations may store this in the dataset file itself. (For instance the vaex hdf5 implementation does this.)

This method is called after virtual columns or variables are added. Upon opening a file, Dataset.update_meta() is called, so that the information is not lost between sessions.

Note: opening a dataset twice may result in corruption of this file.

write_virtual_meta()[source]

Writes virtual columns, variables and their ucd, description and units

The default implementation is to write this to a file called virtual_meta.yaml in the directory defined by Dataset.get_private_dir(). Other implementations may store this in the dataset file itself.

This method is called after virtual columns or variables are added. Upon opening a file, Dataset.update_virtual_meta() is called, so that the information is not lost between sessions.

Note: opening a dataset twice may result in corruption of this file.

vaex.stat module

class vaex.stat.Expression[source]

Describes an expression for a statistic

calculate(ds, binby=[], shape=256, limits=None, selection=None)[source]

Calculate the statistic for a Dataset

vaex.stat.correlation(x, y)[source]

Creates a correlation statistic

vaex.stat.count(expression='*')[source]

Creates a count statistic

vaex.stat.covar(x, y)[source]

Creates a covariance statistic

vaex.stat.mean(expression)[source]

Creates a mean statistic

vaex.stat.std(expression)[source]

Creates a standard deviation statistic

vaex.stat.sum(expression)[source]

Creates a sum statistic
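These factory functions return Expression objects that describe a statistic without computing it; evaluation happens only when calculate() is called. The deferred pattern can be sketched without vaex (the class and helper names here are ours, and the "dataset" is just a dict of arrays):

```python
import numpy as np

class Stat:
    """Describes a statistic; evaluation is deferred until calculate()."""
    def __init__(self, func, expression):
        self.func, self.expression = func, expression

    def calculate(self, dataset):
        # look up the column and apply the statistic only now
        return self.func(dataset[self.expression])

def mean(expression):
    return Stat(np.mean, expression)

def std(expression):
    return Stat(np.std, expression)

dataset = {'x': np.array([1.0, 2.0, 3.0, 4.0])}
m = mean('x')                # nothing computed yet
print(m.calculate(dataset))  # evaluated here
```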

Machine learning with vaex.ml

Note that vaex.ml does not fall under the MIT license, but under the CC BY-NC-ND license, which means it is ok for personal or academic use. You can install vaex-ml using pip install vaex-ml.