# API documentation for vaex library

- vaex.open(path[, convert, shuffle, copy_index]) – Open a dataset from a file given by path
- vaex.from_arrays(**arrays) – Create an in memory dataset from numpy arrays
- vaex.from_csv(filename_or_buffer[, copy_index]) – Shortcut to read a csv file using pandas and convert to a dataset directly
- vaex.from_ascii(path[, seperator, names, …]) – Create an in memory dataset from an ascii file (whitespace separated by default)
- vaex.from_pandas(df[, name, copy_index, …]) – Create an in memory dataset from a pandas dataframe
- vaex.from_astropy_table(table)

## Quick list for visualization.

vaex.dataset.Dataset.plot(**kwargs)
vaex.dataset.Dataset.plot1d(**kwargs)
vaex.dataset.Dataset.scatter(**kwargs)
vaex.dataset.Dataset.plot_widget(x, y[, z, …])
vaex.dataset.Dataset.healpix_plot([…])

## Quick list for statistics.

- vaex.dataset.Dataset.count([expression, …]) – Count the number of non-NaN values (or all, if expression is None or "*")
- vaex.dataset.Dataset.mean(expression[, …]) – Calculate the mean for expression, possibly on a grid defined by binby
- vaex.dataset.Dataset.std(expression[, …]) – Calculate the standard deviation for the given expression, possibly on a grid defined by binby
- vaex.dataset.Dataset.var(expression[, …]) – Calculate the sample variance for the given expression, possibly on a grid defined by binby
- vaex.dataset.Dataset.cov(x[, y, binby, …]) – Calculate the covariance matrix for x and y or more expressions, possibly on a grid defined by binby
- vaex.dataset.Dataset.correlation(x[, y, …]) – Calculate the correlation coefficient cov[x,y]/(std[x]*std[y]) between x and y, possibly on a grid defined by binby
- vaex.dataset.Dataset.median_approx(expression) – Calculate the median, possibly on a grid defined by binby
- vaex.dataset.Dataset.mode(expression[, …])
- vaex.dataset.Dataset.min(expression[, …]) – Calculate the minimum for given expressions, possibly on a grid defined by binby
- vaex.dataset.Dataset.max(expression[, …]) – Calculate the maximum for given expressions, possibly on a grid defined by binby
- vaex.dataset.Dataset.minmax(expression[, …]) – Calculate the minimum and maximum for expressions, possibly on a grid defined by binby
- vaex.dataset.Dataset.mutual_information(x[, …]) – Estimate the mutual information between x and y on a grid with shape mi_shape and mi_limits, possibly on a grid defined by binby

## vaex module

Vaex is a library for dealing with big tabular data.

The most important class (data structure) in vaex is the Dataset. A dataset is obtained either by opening the example dataset:

>>> import vaex
>>> t = vaex.example()


Or by using open() or from_csv() to open a file:

>>> t1 = vaex.open("somedata.hdf5")
>>> t2 = vaex.open("somedata.fits")
>>> t3 = vaex.from_csv("somedata.csv")


Or by connecting to a remote server:

>>> tbig = vaex.open("http://bla.com/bigtable")


The main purpose of vaex is to provide statistics, such as mean, count, sum and standard deviation, per column, possibly with a selection, and on a regular grid.

To count the number of rows:

>>> t = vaex.example()
>>> t.count()
330000.0


Or the number of valid values, which for this dataset is the same:

>>> t.count("x")
330000.0


Count them on a regular grid:

>>> t.count("x", binby=["x", "y"], shape=(4,4))
array([[   902.,   5893.,   5780.,   1193.],
       [  4097.,  71445.,  75916.,   4560.],
       [  4743.,  71131.,  65560.,   4108.],
       [  1115.,   6578.,   4382.,    821.]])
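The `binby` grid above is conceptually a 2d histogram of row counts. vaex itself is not needed to see what is being computed; a minimal numpy-only sketch on synthetic data (all names here are illustrative, not part of the vaex API):

```python
import numpy as np

# Synthetic stand-in for two columns of a dataset
rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = rng.normal(size=1000)

# count("x", binby=["x", "y"], shape=(4,4)) is, in spirit, a 4x4 2d histogram
grid, _, _ = np.histogram2d(x, y, bins=(4, 4))

# Every row falls into exactly one cell, so the cells sum to the row count
assert grid.sum() == 1000
assert grid.shape == (4, 4)
```

The difference in vaex is that the grid is computed out-of-core and can be restricted to a selection.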


Visualise it using matplotlib:

>>> t.plot("x", "y", show=True)
<matplotlib.image.AxesImage at 0x1165a5090>

vaex.open(path, convert=False, shuffle=False, copy_index=True, *args, **kwargs)[source]

Open a dataset from file given by path

Example:

>>> ds = vaex.open('sometable.hdf5')
>>> ds = vaex.open('somedata*.csv', convert='bigdata.hdf5')

Parameters:
- path (str) – local or absolute path to file, or glob string
- convert – convert files to an hdf5 file for optimization, can also be a path
- shuffle (bool) – shuffle converted dataset or not
- copy_index (bool) – copy index when source is read via pandas
- args – extra arguments for file readers that need it
- kwargs – extra keyword arguments

Returns: dataset if file is supported, otherwise None

Return type: Dataset
>>> import vaex as vx
>>> vx.open('myfile.hdf5')
<vaex.dataset.Hdf5MemoryMapped at 0x1136ee3d0>

vaex.from_arrays(**arrays)[source]

Create an in memory dataset from numpy arrays

Parameters:
- arrays – keyword arguments with arrays

>>> import numpy as np
>>> import vaex as vx
>>> x = np.arange(10)
>>> y = x ** 2
>>> dataset = vx.from_arrays(x=x, y=y)

vaex.from_csv(filename_or_buffer, copy_index=True, **kwargs)[source]

Shortcut to read a csv file using pandas and convert to a dataset directly

vaex.from_ascii(path, seperator=None, names=True, skip_lines=0, skip_after=0, **kwargs)[source]

Create an in memory dataset from an ascii file (whitespace separated by default).

>>> ds = vx.from_ascii("table.asc")
>>> ds = vx.from_ascii("table.csv", seperator=",", names=["x", "y", "z"])

Parameters:
- path – file path
- seperator – value separator, by default whitespace; use "," for comma separated values
- names – if True, the first line is used for the column names, otherwise provide a list of strings with names
- skip_lines – skip lines at the start of the file
- skip_after – skip lines at the end of the file
- kwargs –
vaex.from_pandas(df, name='pandas', copy_index=True, index_name='index')[source]

Create an in memory dataset from a pandas dataframe

Parameters:
- df (pandas.DataFrame) – pandas dataframe
- name – unique name for the dataset

>>> import pandas as pd
>>> import vaex as vx
>>> df = pd.read_csv("test.csv")
>>> ds = vx.from_pandas(df, name="test")

vaex.from_astropy_table(table)[source]
vaex.from_samp(username=None, password=None)[source]

Connect to a SAMP Hub and wait for a single table load event, disconnect, download the table and return the dataset

Useful if you want to send a single table from say TOPCAT to vaex in a python console or notebook

vaex.open_many(filenames)[source]

Open a list of filenames, and return a dataset with all datasets concatenated

Parameters:
- filenames (list[str]) – list of filenames/paths

Return type: Dataset
vaex.server(url, **kwargs)[source]

Connect to a hostname supporting the vaex web api

Parameters:
- url (str) – hostname or ip address of the server

Returns: a server object; note that it does not connect to the server yet, so this will always succeed

Return type: vaex.dataset.ServerRest
vaex.example(download=True)[source]

Returns an example dataset which comes with vaex for testing/learning purposes

Return type: vaex.dataset.Dataset
vaex.app(*args, **kwargs)[source]

Create a vaex app; the QApplication mainloop must be started.

In an ipython notebook/jupyter, do the following:

>>> import vaex.ui.main  # this causes the qt api level to be set properly
>>> import vaex as vx

Next cell:

%gui qt

Next cell:

>>> app = vx.app()

From now on, you can run the app along with jupyter.

vaex.zeldovich(dim=2, N=256, n=-2.5, t=None, scale=1, seed=None)[source]

Creates a zeldovich dataset

vaex.set_log_level_debug()[source]

set log level to debug

vaex.set_log_level_info()[source]

set log level to info

vaex.set_log_level_warning()[source]

set log level to warning

vaex.set_log_level_exception()[source]

set log level to exception

vaex.set_log_level_off()[source]

Disable logging

vaex.delayed(f)[source]

## Dataset class

class vaex.dataset.Dataset(name, column_names, executor=None)[source]

All datasets are encapsulated in this class, local or remote datasets

Each dataset has a number of columns and a number of rows, the length of the dataset.

The most common operation is Dataset.plot.

All Datasets have one ‘selection’, and all calculations by Subspace are done on the whole dataset (default) or for the selection. The following example shows how to use the selection.

>>> some_dataset.select("x < 0")
>>> subspace_xy = some_dataset("x", "y")
>>> subspace_xy_selected = subspace_xy.selected()


TODO: active fraction, length and shuffled

__call__(*expressions, **kwargs)[source]

Alias/shortcut for Dataset.subspace()

__getitem__(item)[source]

Convenient way to get expressions, (shallow) copies of a few columns, or to apply filtering

Examples
>>> ds['Lz']  # the expression 'Lz'
>>> ds['Lz/2']  # the expression 'Lz/2'
>>> ds[["Lz", "E"]]  # a shallow copy with just two columns
>>> ds[ds.Lz < 0]  # a shallow copy with the filter Lz < 0 applied
__init__(name, column_names, executor=None)[source]

x.__init__(…) initializes x; see help(type(x)) for signature

__iter__()[source]

Iterator over the column names

__len__()[source]

Returns the number of rows in the dataset (filtering applied)

__setitem__(name, value)[source]

Convenient way to add a virtual column / expression to this dataset

Examples:
>>> ds['r'] = np.sqrt(ds.x**2 + ds.y**2 + ds.z**2)
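The expression assigned above is a virtual column: vaex stores the formula and evaluates it lazily. As a point of comparison, here is the same `r` computed eagerly with plain numpy (this sketch does not use vaex at all; the data is made up):

```python
import numpy as np

# Eager equivalent of the virtual column r = sqrt(x**2 + y**2 + z**2)
x = np.array([3.0, 0.0])
y = np.array([4.0, 0.0])
z = np.array([0.0, 5.0])

r = np.sqrt(x**2 + y**2 + z**2)  # both rows have r == 5.0
```

With a virtual column, vaex avoids materializing `r` in memory until it is actually needed.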

__weakref__

list of weak references to the object (if defined)

add_column(name, f_or_array)[source]

Add an in memory array as a column

add_column_healpix(name='healpix', longitude='ra', latitude='dec', degrees=True, healpix_order=12, nest=True)[source]

Add a healpix (in memory) column based on a longitude and latitude

Parameters:
- name – name of the column
- longitude – longitude expression
- latitude – latitude expression (astronomical convention: latitude=90 is the north pole)
- degrees – if lon/lat are in degrees (default) or radians
- healpix_order – healpix order, >= 0
- nest – nested healpix (default) or ring
add_variable(name, expression, overwrite=True)[source]

Add a variable column to the dataset

Parameters:
- name (str) – name of the variable
- expression – expression for the variable

Variable may refer to other variables, and virtual columns and expression may refer to variables

>>> dataset.add_variable("center")
>>> dataset.select("x_prime < 0")

add_virtual_column(name, expression, unique=False)[source]

Add a virtual column to the dataset

Example:

>>> dataset.add_virtual_column("r", "sqrt(x**2 + y**2 + z**2)")
>>> dataset.select("r < 10")

Parameters:
- name (str) – name of the virtual column
- expression – expression for the column
- unique – if name is already used, make it unique by adding a postfix, e.g. _1, or _2
add_virtual_columns_aitoff(alpha, delta, x, y, radians=True)[source]

Parameters:
- alpha – azimuth angle
- delta – polar angle
- x – output name for the x coordinate
- y – output name for the y coordinate
- radians – input and output in radians (True), or degrees (False)
add_virtual_columns_cartesian_to_polar(x='x', y='y', radius_out='r_polar', azimuth_out='phi_polar', propagate_uncertainties=False, radians=False)[source]

Convert cartesian to polar coordinates

Parameters:
- x – expression for x
- y – expression for y
- radius_out – name for the virtual column for the radius
- azimuth_out – name for the virtual column for the azimuth angle
- propagate_uncertainties – if True, propagate errors for the new virtual columns
- radians – if True, azimuth is in radians, defaults to degrees
add_virtual_columns_cartesian_to_spherical(x='x', y='y', z='z', alpha='l', delta='b', distance='distance', radians=False, center=None, center_name='solar_position')[source]

Convert cartesian to spherical coordinates.

Parameters:
- x –
- y –
- z –
- alpha –
- delta – name for the polar angle, ranges from -90 to 90 (or -pi to pi when radians is True)
- distance –
- radians –
- center –
- center_name –
add_virtual_columns_cartesian_velocities_to_polar(x='x', y='y', vx='vx', radius_polar=None, vy='vy', vr_out='vr_polar', vazimuth_out='vphi_polar', propagate_uncertainties=False)[source]

Convert cartesian to polar velocities.

Parameters:
- x –
- y –
- vx –
- radius_polar – optional expression for the radius, may lead to better performance when given
- vy –
- vr_out –
- vazimuth_out –
- propagate_uncertainties – if True, propagate errors for the new virtual columns
add_virtual_columns_cartesian_velocities_to_spherical(x='x', y='y', z='z', vx='vx', vy='vy', vz='vz', vr='vr', vlong='vlong', vlat='vlat', distance=None)[source]

Convert velocities from a cartesian to a spherical coordinate system

TODO: errors

Parameters:
- x – name of the x column (input)
- y – y
- z – z
- vx – vx
- vy – vy
- vz – vz
- vr – name of the column for the radial velocity in the r direction (output)
- vlong – name of the column for the velocity component in the longitude direction (output)
- vlat – name of the column for the velocity component in the latitude direction, positive points to the north pole (output)
- distance – expression for the distance; if not given, defaults to sqrt(x**2+y**2+z**2), but if this column already exists, passing this expression may lead to better performance
add_virtual_columns_matrix3d(x, y, z, xnew, ynew, znew, matrix, matrix_name='deprecated', matrix_is_expression=False, translation=[0, 0, 0], propagate_uncertainties=False)[source]
Parameters:
- x (str) – name of the x column
- y (str) –
- z (str) –
- xnew (str) – name of the transformed x column
- ynew (str) –
- znew (str) –
- matrix (list[list]) – 2d array or list, with [row, column] order
- matrix_name (str) –
add_virtual_columns_polar_velocities_to_cartesian(x='x', y='y', azimuth=None, vr='vr_polar', vazimuth='vphi_polar', vx_out='vx', vy_out='vy', propagate_uncertainties=False)[source]

Convert cylindrical polar velocities to Cartesian.

Parameters:
- x –
- y –
- azimuth – optional expression for the azimuth in degrees, may lead to better performance when given
- vr –
- vazimuth –
- vx_out –
- vy_out –
- propagate_uncertainties – if True, propagate errors for the new virtual columns
add_virtual_columns_rotation(x, y, xnew, ynew, angle_degrees, propagate_uncertainties=False)[source]

Rotation in 2d

Parameters:
- x (str) – name/expression of the x column
- y (str) – idem for y
- xnew (str) – name of the transformed x column
- ynew (str) –
- angle_degrees (float) – rotation in degrees, anti-clockwise
add_virtual_columns_spherical_to_cartesian(alpha, delta, distance, xname='x', yname='y', zname='z', propagate_uncertainties=False, center=[0, 0, 0], center_name='solar_position', radians=False)[source]

Convert spherical to cartesian coordinates.

Parameters:
- alpha –
- delta – polar angle, ranging from -90 (south pole) to 90 (north pole)
- distance – radial distance, determines the units of x, y and z
- xname –
- yname –
- zname –
- propagate_uncertainties – if True, will propagate errors for the new virtual columns, see :py:Dataset.propagate_uncertainties for details
- center –
- center_name –
- radians –
byte_size(selection=False)[source]

Return the size in bytes the whole dataset requires (or the selection), respecting the active_fraction

classmethod can_open(path, *args, **kwargs)[source]

Tests if this class can open the file given by path

close_files()[source]

Close any possible open file handles, the dataset will not be in a usable state afterwards

col

Convenient when working with ipython in combination with small datasets, since this gives tab-completion

Columns can be accessed by their names, which are attributes. The attributes are currently strings, so you cannot do computations with them

>>> ds = vx.example()
>>> ds.plot(ds.col.x, ds.col.y)

column_count()[source]

Returns the number of columns, not counting virtual ones

combinations(expressions_list=None, dimension=2, exclude=None, **kwargs)[source]

Generate a list of combinations for the possible expressions for the given dimension

Parameters:
- expressions_list – list of lists of expressions, where the inner list defines the subspace
- dimensions – if given, generates a subspace with all possible combinations for that dimension
- exclude – list of
correlation(x, y=None, binby=[], limits=None, shape=128, sort=False, sort_key=<ufunc 'absolute'>, selection=False, delay=False, progress=None)[source]

Calculate the correlation coefficient cov[x,y]/(std[x]*std[y]) between x and y, possibly on a grid defined by binby

Examples:

>>> ds.correlation("x**2+y**2+z**2", "-log(-E+1)")
array(0.6366637382215669)
>>> ds.correlation("x**2+y**2+z**2", "-log(-E+1)", binby="Lz", shape=4)
array([ 0.40594394,  0.69868851,  0.61394099,  0.65266318])

Parameters:
- x – expression or list of expressions, e.g. 'x', or ['x', 'y']
- y – expression or list of expressions, e.g. 'x', or ['x', 'y']
- binby – list of expressions for constructing a binned grid
- limits – description of the min and max values for the expressions, e.g. 'minmax', '99.7%', [0, 10], or a list of these, e.g. [[0, 10], [0, 20], 'minmax']
- shape – shape of the array on which the statistic is calculated; if only an integer is given, it is used for all dimensions, e.g. shape=128, shape=[128, 256]
- selection – name of the selection to use (or True for the 'default'), or all the data (when selection is None or False), or a list of selections
- delay – do not return the result, but a proxy for asynchronous calculations (currently only for internal use)
- progress – a callable that takes one argument (a floating point value between 0 and 1) indicating the progress; calculations are cancelled when this callable returns False

Returns: numpy array with the given shape, or a scalar when no binby argument is given, with the statistic
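The formula cov[x,y]/(std[x]*std[y]) can be checked directly with numpy; this sketch uses synthetic arrays (not the vaex example dataset) and population statistics (ddof=0):

```python
import numpy as np

rng = np.random.default_rng(1)
a = rng.normal(size=500)
b = 0.5 * a + rng.normal(size=500)

# correlation = cov[a,b] / (std[a] * std[b])
cov_ab = np.mean((a - a.mean()) * (b - b.mean()))
corr = cov_ab / (a.std() * b.std())

# The result agrees with numpy's own correlation coefficient
expected = np.corrcoef(a, b)[0, 1]
```

The correlation coefficient is invariant to the ddof choice, so the manual computation matches `np.corrcoef` up to floating point.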
count(expression=None, binby=[], limits=None, shape=128, selection=False, delay=False, edges=False, progress=None)[source]

Count the number of non-NaN values (or all, if expression is None or “*”)

Examples:

>>> ds.count()
330000.0
>>> ds.count("*")
330000.0
>>> ds.count("*", binby=["x"], shape=4)
array([  10925.,  155427.,  152007.,   10748.])

Parameters:
- expression – expression or column for which to count non-missing values, or None or '*' for counting the rows
- binby – list of expressions for constructing a binned grid
- limits – description of the min and max values for the expressions, e.g. 'minmax', '99.7%', [0, 10], or a list of these, e.g. [[0, 10], [0, 20], 'minmax']
- shape – shape of the array on which the statistic is calculated; if only an integer is given, it is used for all dimensions, e.g. shape=128, shape=[128, 256]
- selection – name of the selection to use (or True for the 'default'), or all the data (when selection is None or False), or a list of selections
- delay – do not return the result, but a proxy for asynchronous calculations (currently only for internal use)
- progress – a callable that takes one argument (a floating point value between 0 and 1) indicating the progress; calculations are cancelled when this callable returns False
- edges – currently for internal use only (it includes NaNs and values outside the limits at the borders: NaN and 0, smaller than the limits at 1, and larger at -1)

Returns: numpy array with the given shape, or a scalar when no binby argument is given, with the statistic
cov(x, y=None, binby=[], limits=None, shape=128, selection=False, delay=False, progress=None)[source]

Calculate the covariance matrix for x and y or more expressions, possibly on a grid defined by binby

Either x and y are expressions, e.g.:

>>> ds.cov("x", "y")

Or only the x argument is given with a list of expressions, e.g.:

>>> ds.cov(["x", "y", "z"])

Examples:

>>> ds.cov("x", "y")
array([[ 53.54521742,  -3.8123135 ],
       [ -3.8123135 ,  60.62257881]])
>>> ds.cov(["x", "y", "z"])
array([[ 53.54521742,  -3.8123135 ,  -0.98260511],
       [ -3.8123135 ,  60.62257881,   1.21381057],
       [ -0.98260511,   1.21381057,  25.55517638]])

>>> ds.cov("x", "y", binby="E", shape=2)
array([[[  9.74852878e+00,  -3.02004780e-02],
        [ -3.02004780e-02,   9.99288215e+00]],
       [[  8.43996546e+01,  -6.51984181e+00],
        [ -6.51984181e+00,   9.68938284e+01]]])

Parameters:
- x – expression or list of expressions, e.g. 'x', or ['x', 'y']
- y – if the previous argument is not a list, this argument should be given
- binby – list of expressions for constructing a binned grid
- limits – description of the min and max values for the expressions, e.g. 'minmax', '99.7%', [0, 10], or a list of these, e.g. [[0, 10], [0, 20], 'minmax']
- shape – shape of the array on which the statistic is calculated; if only an integer is given, it is used for all dimensions, e.g. shape=128, shape=[128, 256]
- selection – name of the selection to use (or True for the 'default'), or all the data (when selection is None or False), or a list of selections
- delay – do not return the result, but a proxy for asynchronous calculations (currently only for internal use)

Returns: numpy array with the given shape, or a scalar when no binby argument is given, with the statistic; the last dimensions are of shape (2,2)
covar(x, y, binby=[], limits=None, shape=128, selection=False, delay=False, progress=None)[source]

Calculate the covariance cov[x,y] between x and y, possibly on a grid defined by binby

Examples:

>>> ds.covar("x**2+y**2+z**2", "-log(-E+1)")
array(52.69461456005138)
>>> ds.covar("x**2+y**2+z**2", "-log(-E+1)")/(ds.std("x**2+y**2+z**2") * ds.std("-log(-E+1)"))
0.63666373822156686
>>> ds.covar("x**2+y**2+z**2", "-log(-E+1)", binby="Lz", shape=4)
array([ 10.17387143,  51.94954078,  51.24902796,  20.2163929 ])

Parameters:
- x – expression or list of expressions, e.g. 'x', or ['x', 'y']
- y – expression or list of expressions, e.g. 'x', or ['x', 'y']
- binby – list of expressions for constructing a binned grid
- limits – description of the min and max values for the expressions, e.g. 'minmax', '99.7%', [0, 10], or a list of these, e.g. [[0, 10], [0, 20], 'minmax']
- shape – shape of the array on which the statistic is calculated; if only an integer is given, it is used for all dimensions, e.g. shape=128, shape=[128, 256]
- selection – name of the selection to use (or True for the 'default'), or all the data (when selection is None or False), or a list of selections
- delay – do not return the result, but a proxy for asynchronous calculations (currently only for internal use)
- progress – a callable that takes one argument (a floating point value between 0 and 1) indicating the progress; calculations are cancelled when this callable returns False

Returns: numpy array with the given shape, or a scalar when no binby argument is given, with the statistic
delete_variable(name)[source]

Deletes a variable from a dataset

delete_virtual_column(name)[source]

Deletes a virtual column from a dataset

dropna(drop_nan=True, drop_masked=True, column_names=None)[source]

Create a shallow copy dataset, with filtering set using select_non_missing

Parameters:
- drop_nan – drop rows when there is a NaN in any of the columns (will only affect float values)
- drop_masked – drop rows when there is a masked value in any of the columns
- column_names – the columns to consider, default: all (real, non-virtual) columns

Return type: Dataset
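The row filter that dropna sets up is equivalent to a boolean mask over the columns. A numpy-only sketch of that mask logic (synthetic data, no vaex involved):

```python
import numpy as np

x = np.array([1.0, np.nan, 3.0])
y = np.array([4.0, 5.0, np.nan])

# Keep only the rows where no considered column contains a NaN
keep = ~(np.isnan(x) | np.isnan(y))

x_kept = x[keep]  # only the first row survives
```

vaex applies the same idea as a lazy filter (a shallow copy), rather than eagerly slicing the arrays as done here.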
evaluate(expression, i1=None, i2=None, out=None, selection=None)[source]

Evaluate an expression, and return a numpy array with the results for the full column or a part of it.

Note that this is not how vaex should be used, since it means a copy of the data needs to fit in memory.

To get partial results, use i1 and i2.

Parameters:
- expression (str) – name/expression to evaluate
- i1 (int) – start row index, default is the start (0)
- i2 (int) – end row index, default is the length of the dataset
- out (ndarray) – output array, to which the result may be written (may be used to reuse an array, or write to a memory mapped array)
- selection – selection to apply

evaluate_variable(name)[source]

Evaluates the variable given by name

execute()[source]

Execute all delayed jobs

extract()[source]

Return a dataset containing only the filtered rows.

Note that no copy of the underlying data is made, only a view/reference is made.

The resulting dataset may be more efficient to work with when the original dataset is heavily filtered (contains just a small number of rows).

If no filtering is applied, it returns a trimmed view. For returned datasets, len(ds) == ds.length_original() == ds.length_unfiltered()

fillna(value, fill_nan=True, fill_masked=True, column_names=None, prefix='__original_', inplace=False)[source]

Return a dataset where missing values/NaN are filled with 'value'

Note that no copy of the underlying data is made, only a view/reference is made.

Note that filtering will be ignored (since filters may change), you may want to consider running :py:Dataset.extract first.

Parameters:
- value – the value to fill in for missing values/NaN
- fill_nan – fill NaN values
- fill_masked – fill masked values
- column_names – the columns to consider, default: all columns
- prefix – prefix used to keep a copy of the original column
- inplace – make modifications to self or return a new dataset

sort(by, ascending=True, kind='quicksort', inplace=False)[source]

Sort the dataset by the given expression.

Example:
>>> a = np.array(['a', 'b', 'c'])
>>> x = np.arange(1,4)
>>> ds = vaex.from_arrays(a=a, x=x)
>>> ds.sort('(x-1.8)**2', ascending=False)  # b, c, a will be the order of a

Parameters:
- by (str) – expression to sort by
- ascending (bool) – ascending (default, True) or descending (False)
- kind (str) – kind of algorithm to use (passed to numpy.argsort)
- inplace – make modifications to self or return a new dataset
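What fillna does per column can be sketched with plain numpy (synthetic data; vaex does this lazily and keeps the original column under a prefixed name):

```python
import numpy as np

x = np.array([1.0, np.nan, 3.0])

# Replace NaN entries with a fill value, leaving other values untouched
filled = np.where(np.isnan(x), 0.0, x)
```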
get_active_fraction()[source]

Value in the range (0, 1], to work only with a subset of rows

get_column_names(virtual=False, hidden=False, strings=False)[source]

Return a list of column names

Parameters:
- virtual – if True, also return virtual columns
- hidden – if True, also return hidden columns

Return type: list of str
get_current_row()[source]

Individual rows can be 'picked'; this is the index (integer) of the current row, or None if nothing is picked

get_private_dir(create=False)[source]

Each dataset has a directory where files are stored for metadata etc.

>>> import vaex as vx
>>> ds = vx.example()
>>> ds.get_private_dir()
'/Users/users/breddels/.vaex/datasets/_Users_users_breddels_vaex-testing_data_helmi-dezeeuw-2000-10p.hdf5'

Parameters:
- create (bool) – if True, it will create the directory if it does not exist
get_selection(name='default')[source]

Get the current selection object (mostly for internal use atm)

get_variable(name)[source]

Returns the variable given by name, it will not evaluate it.

For evaluation, see Dataset.evaluate_variable(), see also Dataset.set_variable()

has_current_row()[source]

Returns True/False if there currently is a picked row

has_selection(name='default')[source]

Returns True if there is a selection

healpix_count(expression=None, healpix_expression=None, healpix_max_level=12, healpix_level=8, binby=None, limits=None, shape=128, delay=False, progress=None, selection=None)[source]

Count non-missing values for expression on an array which represents healpix data.

Parameters:
- expression – expression or column for which to count non-missing values, or None or '*' for counting the rows
- healpix_expression –
- healpix_max_level –
- healpix_level –
- binby – list of expressions for constructing a binned grid; these dimensions follow the first healpix dimension
- limits –
- shape –
- selection –
- delay –
- progress –
healpix_plot(healpix_expression='source_id/34359738368', healpix_max_level=12, healpix_level=8, what='count(*)', selection=None, grid=None, healpix_input='equatorial', healpix_output='galactic', f=None, colormap='afmhot', grid_limits=None, image_size=800, nest=True, figsize=None, interactive=False, title='', smooth=None, show=False, colorbar=True, rotation=(0, 0, 0), **kwargs)[source]
Parameters:
- healpix_expression –
- healpix_max_level –
- healpix_level –
- what –
- selection –
- grid –
- healpix_input – specify if the healpix index is in "equatorial", "galactic" or "ecliptic"
- healpix_output – plot in "equatorial", "galactic" or "ecliptic"
- f – function to apply to the data
- colormap – matplotlib colormap
- grid_limits – optional sequence [minvalue, maxvalue] that determines the min and max value that map to the colormap (values below and above these are clipped to the min/max); default is [min(f(grid)), max(f(grid))]
- image_size – size of the image that healpy uses for rendering
- nest – if the healpix data is in nested (True) or ring (False) order
- figsize – if given, modify the matplotlib figure size, e.g. (14, 9)
- interactive – experimental, uses healpy.mollzoom if True
- title – title of the figure
- smooth – apply gaussian smoothing, in degrees
- show – call matplotlib's show (True) or not (False, default)
- colorbar – show a colorbar (True) or not (False)
- rotation – rotate the plot, in format (lon, lat, psi) such that (lon, lat) is the center, and rotate on the screen by angle psi; all angles are in degrees
is_local()[source]

Returns True if the dataset is a local dataset, False when a remote dataset

iscategory(column)[source]

Returns true if column is a category

length_original()[source]

The full length of the dataset, independent of the active_fraction or filtering. This is the real length of the underlying ndarrays

length_unfiltered()[source]

The length of the arrays that should be considered (respecting active range), but without filtering

limits(expression, value=None, square=False, selection=None, delay=False, shape=None)[source]

Calculate the [min, max] range for expression, as described by value, which is ‘99.7%’ by default.

If value is a list of the form [minvalue, maxvalue], it is simply returned, this is for convenience when using mixed forms.

Example:

>>> ds.limits("x")
array([-28.86381927,  28.9261226 ])
>>> ds.limits(["x", "y"])
(array([-28.86381927,  28.9261226 ]), array([-28.60476934,  28.96535249]))
>>> ds.limits(["x", "y"], "minmax")
(array([-128.293991,  271.365997]), array([ -71.5523682,  146.465836 ]))
>>> ds.limits(["x", "y"], ["minmax", "90%"])
(array([-128.293991,  271.365997]), array([-13.37438402,  13.4224423 ]))
>>> ds.limits(["x", "y"], ["minmax", [0, 10]])
(array([-128.293991,  271.365997]), [0, 10])

Parameters:
- expression – expression or list of expressions, e.g. 'x', or ['x', 'y']
- value – description of the min and max values for the expressions, e.g. 'minmax', '99.7%', [0, 10], or a list of these, e.g. [[0, 10], [0, 20], 'minmax']
- selection – name of the selection to use (or True for the 'default'), or all the data (when selection is None or False), or a list of selections
- delay – do not return the result, but a proxy for asynchronous calculations (currently only for internal use)

Returns: list in the form [[xmin, xmax], [ymin, ymax], …, [zmin, zmax]] or [xmin, xmax] when expression is not a list
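The difference between 'minmax' limits and a percentage limit such as '99.7%' can be illustrated with numpy percentiles on synthetic data (vaex approximates this with a binned cumulative distribution, so exact values differ slightly):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=10_000)

# 'minmax': the full data range
limits_minmax = [x.min(), x.max()]

# '99.7%': a central interval holding ~99.7% of the data,
# i.e. cutting 0.15% off each tail
limits_997 = np.percentile(x, [0.15, 99.85])
```

The percentage form is useful for plots, since a few outliers no longer dominate the axis range.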
limits_percentage(expression, percentage=99.73, square=False, delay=False)[source]

Calculate the [min, max] range for expression, containing approximately a percentage of the data as defined by percentage.

The range is symmetric around the median, i.e., for a percentage of 90, this gives the same results as:

>>> ds.limits_percentage("x", 90)
array([-12.35081376,  12.14858052])
>>> ds.percentile_approx("x", 5), ds.percentile_approx("x", 95)
(array([-12.36813152]), array([ 12.13275818]))


NOTE: this value is approximated by calculating the cumulative distribution on a grid. NOTE 2: The values above are not exactly the same, since percentile and limits_percentage do not share the same code

Parameters:
- expression – expression or list of expressions, e.g. 'x', or ['x', 'y']
- percentage (float) – value between 0 and 100
- delay – do not return the result, but a proxy for asynchronous calculations (currently only for internal use)

Returns: list in the form [[xmin, xmax], [ymin, ymax], …, [zmin, zmax]] or [xmin, xmax] when expression is not a list
materialize(virtual_column, inplace=False)[source]

Returns a new dataset where the virtual column is turned into an in memory numpy array

Example:
>>> x = np.arange(1,4)
>>> y = np.arange(2,5)
>>> ds = vaex.from_arrays(x=x, y=y)
>>> ds['r'] = (ds.x**2 + ds.y**2)**0.5 # 'r' is a virtual column (computed on the fly)
>>> ds = ds.materialize('r')  # now 'r' is a 'real' column (i.e. a numpy array)

Parameters:
- inplace – make modifications to self or return a new dataset
max(expression, binby=[], limits=None, shape=128, selection=False, delay=False, progress=None)[source]

Calculate the maximum for given expressions, possibly on a grid defined by binby

Example:

>>> ds.max("x")
array(271.365997)
>>> ds.max(["x", "y"])
array([ 271.365997,  146.465836])
>>> ds.max("x", binby="x", shape=5, limits=[-10, 10])
array([-6.00010443, -2.00002384,  1.99998057,  5.99983597,  9.99984646])

Parameters:
- expression – expression or list of expressions, e.g. 'x', or ['x', 'y']
- binby – list of expressions for constructing a binned grid
- limits – description of the min and max values for the expressions, e.g. 'minmax', '99.7%', [0, 10], or a list of these, e.g. [[0, 10], [0, 20], 'minmax']
- shape – shape of the array on which the statistic is calculated; if only an integer is given, it is used for all dimensions, e.g. shape=128, shape=[128, 256]
- selection – name of the selection to use (or True for the 'default'), or all the data (when selection is None or False), or a list of selections
- delay – do not return the result, but a proxy for asynchronous calculations (currently only for internal use)
- progress – a callable that takes one argument (a floating point value between 0 and 1) indicating the progress; calculations are cancelled when this callable returns False

Returns: numpy array with the given shape, or a scalar when no binby argument is given, with the statistic
mean(expression, binby=[], limits=None, shape=128, selection=False, delay=False, progress=None)[source]

Calculate the mean for expression, possibly on a grid defined by binby.

Examples:

>>> ds.mean("x")
-0.067131491264005971
>>> ds.mean("(x**2+y**2)**0.5", binby="E", shape=4)
array([  2.43483742,   4.41840721,   8.26742458,  15.53846476])

Parameters:
- expression – expression or list of expressions, e.g. 'x' or ['x', 'y']
- binby – list of expressions for constructing a binned grid
- limits – description of the min and max values for the expressions, e.g. 'minmax', '99.7%', [0, 10], or a list, e.g. [[0, 10], [0, 20], 'minmax']
- shape – shape of the array on which the statistic is calculated; if a single integer is given it is used for all dimensions, e.g. shape=128 or shape=[128, 256]
- selection – name of the selection to use (True for the 'default' selection; None or False for all data), or a list of selections
- delay – do not return the result, but a proxy for asynchronous calculation (currently only for internal use)
- progress – a callable that takes one argument (a float between 0 and 1) indicating the progress; the calculation is cancelled when it returns False

Returns: Numpy array with the given shape (or a scalar when no binby argument is given) containing the statistic
median_approx(expression, percentage=50.0, binby=[], limits=None, shape=128, percentile_shape=256, percentile_limits='minmax', selection=False, delay=False)[source]

Calculate the median, possibly on a grid defined by binby

NOTE: this value is approximated by calculating the cumulative distribution on a grid defined by percentile_shape and percentile_limits

Parameters:
- expression – expression or list of expressions, e.g. 'x' or ['x', 'y']
- binby – list of expressions for constructing a binned grid
- limits – description of the min and max values for the expressions, e.g. 'minmax', '99.7%', [0, 10], or a list, e.g. [[0, 10], [0, 20], 'minmax']
- shape – shape of the array on which the statistic is calculated; if a single integer is given it is used for all dimensions, e.g. shape=128 or shape=[128, 256]
- percentile_limits – description of the min and max values to use for the cumulative histogram; should currently only be 'minmax'
- percentile_shape – shape of the array on which the cumulative histogram is calculated; integer type
- selection – name of the selection to use (True for the 'default' selection; None or False for all data), or a list of selections
- delay – do not return the result, but a proxy for asynchronous calculation (currently only for internal use)

Returns: Numpy array with the given shape (or a scalar when no binby argument is given) containing the statistic
min(expression, binby=[], limits=None, shape=128, selection=False, delay=False, progress=None)[source]

Calculate the minimum for given expressions, possibly on a grid defined by binby

Example:

>>> ds.min("x")
array(-128.293991)
>>> ds.min(["x", "y"])
array([-128.293991 ,  -71.5523682])
>>> ds.min("x", binby="x", shape=5, limits=[-10, 10])
array([-9.99919128, -5.99972439, -1.99991322,  2.0000093 ,  6.0004878 ])

Parameters:
- expression – expression or list of expressions, e.g. 'x' or ['x', 'y']
- binby – list of expressions for constructing a binned grid
- limits – description of the min and max values for the expressions, e.g. 'minmax', '99.7%', [0, 10], or a list, e.g. [[0, 10], [0, 20], 'minmax']
- shape – shape of the array on which the statistic is calculated; if a single integer is given it is used for all dimensions, e.g. shape=128 or shape=[128, 256]
- selection – name of the selection to use (True for the 'default' selection; None or False for all data), or a list of selections
- delay – do not return the result, but a proxy for asynchronous calculation (currently only for internal use)
- progress – a callable that takes one argument (a float between 0 and 1) indicating the progress; the calculation is cancelled when it returns False

Returns: Numpy array with the given shape (or a scalar when no binby argument is given) containing the statistic
minmax(expression, binby=[], limits=None, shape=128, selection=False, delay=False, progress=None)[source]

Calculate the minimum and maximum for expressions, possibly on a grid defined by binby

Example:

>>> ds.minmax("x")
array([-128.293991,  271.365997])
>>> ds.minmax(["x", "y"])
array([[-128.293991 ,  271.365997 ],
[ -71.5523682,  146.465836 ]])
>>> ds.minmax("x", binby="x", shape=5, limits=[-10, 10])
array([[-9.99919128, -6.00010443],
[-5.99972439, -2.00002384],
[-1.99991322,  1.99998057],
[ 2.0000093 ,  5.99983597],
[ 6.0004878 ,  9.99984646]])

Parameters:
- expression – expression or list of expressions, e.g. 'x' or ['x', 'y']
- binby – list of expressions for constructing a binned grid
- limits – description of the min and max values for the expressions, e.g. 'minmax', '99.7%', [0, 10], or a list, e.g. [[0, 10], [0, 20], 'minmax']
- shape – shape of the array on which the statistic is calculated; if a single integer is given it is used for all dimensions, e.g. shape=128 or shape=[128, 256]
- selection – name of the selection to use (True for the 'default' selection; None or False for all data), or a list of selections
- delay – do not return the result, but a proxy for asynchronous calculation (currently only for internal use)
- progress – a callable that takes one argument (a float between 0 and 1) indicating the progress; the calculation is cancelled when it returns False

Returns: Numpy array with the given shape (or a scalar when no binby argument is given) containing the statistic; the last dimension has shape (2)
mutual_information(x, y=None, mi_limits=None, mi_shape=256, binby=[], limits=None, shape=128, sort=False, selection=False, delay=False)[source]

Estimate the mutual information between x and y on a grid with shape mi_shape and mi_limits, possibly on a grid defined by binby

If sort is True, the mutual information is returned in sorted (descending) order and the list of expressions is returned in the same order

Examples:

>>> ds.mutual_information("x", "y")
array(0.1511814526380327)
>>> ds.mutual_information([["x", "y"], ["x", "z"], ["E", "Lz"]])
array([ 0.15118145,  0.18439181,  1.07067379])
>>> ds.mutual_information([["x", "y"], ["x", "z"], ["E", "Lz"]], sort=True)
(array([ 1.07067379,  0.18439181,  0.15118145]),
[['E', 'Lz'], ['x', 'z'], ['x', 'y']])

Parameters:
- x – expression or list of expressions, e.g. 'x' or ['x', 'y']
- y – expression or list of expressions, e.g. 'x' or ['x', 'y']
- mi_limits – description of the min and max values for the expressions, e.g. 'minmax', '99.7%', [0, 10], or a list, e.g. [[0, 10], [0, 20], 'minmax']
- mi_shape – shape of the grid on which the mutual information is estimated
- binby – list of expressions for constructing a binned grid
- limits – description of the min and max values for the expressions, e.g. 'minmax', '99.7%', [0, 10], or a list, e.g. [[0, 10], [0, 20], 'minmax']
- shape – shape of the array on which the statistic is calculated; if a single integer is given it is used for all dimensions, e.g. shape=128 or shape=[128, 256]
- sort – when True, return the mutual information in sorted (descending) order, together with the corresponding list of expressions
- selection – name of the selection to use (True for the 'default' selection; None or False for all data), or a list of selections
- delay – do not return the result, but a proxy for asynchronous calculation (currently only for internal use)

Returns: Numpy array with the given shape (or a scalar when no binby argument is given) containing the statistic
percentile_approx(expression, percentage=50.0, binby=[], limits=None, shape=128, percentile_shape=1024, percentile_limits='minmax', selection=False, delay=False)[source]

Calculate the percentile given by percentage, possibly on a grid defined by binby

NOTE: this value is approximated by calculating the cumulative distribution on a grid defined by percentile_shape and percentile_limits

>>> ds.percentile_approx("x", 10), ds.percentile_approx("x", 90)
(array([-8.3220355]), array([ 7.92080358]))
>>> ds.percentile_approx("x", 50, binby="x", shape=5, limits=[-10, 10])
array([[-7.56462982],
[-3.61036641],
[-0.01296306],
[ 3.56697863],
[ 7.45838367]])



Parameters:
- expression – expression or list of expressions, e.g. 'x' or ['x', 'y']
- binby – list of expressions for constructing a binned grid
- limits – description of the min and max values for the expressions, e.g. 'minmax', '99.7%', [0, 10], or a list, e.g. [[0, 10], [0, 20], 'minmax']
- shape – shape of the array on which the statistic is calculated; if a single integer is given it is used for all dimensions, e.g. shape=128 or shape=[128, 256]
- percentile_limits – description of the min and max values to use for the cumulative histogram; should currently only be 'minmax'
- percentile_shape – shape of the array on which the cumulative histogram is calculated; integer type
- selection – name of the selection to use (True for the 'default' selection; None or False for all data), or a list of selections
- delay – do not return the result, but a proxy for asynchronous calculation (currently only for internal use)

Returns: Numpy array with the given shape (or a scalar when no binby argument is given) containing the statistic
plot3d(x, y, z, vx=None, vy=None, vz=None, vwhat=None, limits=None, grid=None, what='count(*)', shape=128, selection=[None, True], f=None, vcount_limits=None, smooth_pre=None, smooth_post=None, grid_limits=None, normalize='normalize', colormap='afmhot', figure_key=None, fig=None, lighting=True, level=[0.1, 0.5, 0.9], opacity=[0.01, 0.05, 0.1], level_width=0.1, show=True, **kwargs)[source]

Use at your own risk; requires ipyvolume

propagate_uncertainties(columns, depending_variables=None, cov_matrix='auto', covariance_format='{}_{}_covariance', uncertainty_format='{}_uncertainty')[source]

Propagates uncertainties (full covariance matrix) for a set of virtual columns.

The covariance matrix of the depending variables is guessed by finding columns prefixed by 'e' or 'e_', or postfixed by '_error', '_uncertainty', 'e' or '_e'. Off-diagonal entries (covariance or correlation) are found by the postfixes '_correlation' or '_corr' for correlation, and '_covariance' or '_cov' for covariance. (Note that x_y_cov = x_e * y_e * x_y_correlation.)

Example:

>>> ds = vaex.from_scalars(x=1, y=2, e_x=0.1, e_y=0.2)
>>> ds['u'] = ds.x + ds.y
>>> ds['v'] = np.log10(ds.x)
>>> ds.propagate_uncertainties([ds.u, ds.v])
>>> ds.u_uncertainty, ds.v_uncertainty

Parameters:
- columns – list of columns for which to calculate the covariance matrix
- depending_variables – if not given, it is determined automatically; otherwise a list of columns which have uncertainties
- cov_matrix – list of lists with expressions giving the covariance matrix, in the same order as depending_variables; if 'full' or 'auto', the covariance matrix for the depending_variables will be guessed, where 'full' gives an error if an entry is not found
remove_virtual_meta()[source]

Removes the file with the virtual columns and associated metadata; it does not change the current virtual columns

rename_column(name, new_name, unique=False, store_in_state=True)[source]

Renames a column. Note that this only changes the in-memory name; the change is not reflected on disk

sample(n=None, frac=None, replace=False, weights=None, random_state=None)[source]

Returns a dataset with a random set of rows

Note that no copy of the underlying data is made; only a view/reference is created.

Provide either n or frac.

Parameters:
- n (int) – number of samples to take (default 1 if frac is None)
- frac (float) – fraction of rows to take
- replace (bool) – if True, a row may be drawn multiple times
- weights (str or expression) – (unnormalized) probability that a row will be drawn
- random_state (int or RandomState) – seed or RandomState for reproducibility; when None a random seed is chosen
Example:
>>> a = np.array(['a', 'b', 'c'])
>>> x = np.arange(1,4)
>>> ds = vaex.from_arrays(a=a, x=x)
>>> ds.sample(n=2, random_state=42) # 2 random rows, fixed seed
>>> ds.sample(frac=1) # 'shuffling'
>>> ds.sample(frac=1, replace=True) # useful for bootstrap (may contain repeated samples)

select(boolean_expression, mode='replace', name='default', executor=None)[source]

Perform a selection, defined by the boolean expression, and combined with the previous selection using the given mode

Selections are recorded in a history tree, per name; undo/redo can be done for them separately

Parameters:
- boolean_expression (str) – any valid column expression, with comparison operators
- mode (str) – boolean operator: replace/and/or/xor/subtract
- name (str) – history tree or selection 'slot' to use
- executor –
select_box(spaces, limits, mode='replace', name='default')[source]

Select a n-dimensional rectangular box bounded by limits

The following examples are equivalent:

>>> ds.select_box(['x', 'y'], [(0, 10), (0, 1)])
>>> ds.select_rectangle('x', 'y', [(0, 10), (0, 1)])

Parameters:
- spaces – list of expressions
- limits – sequence of shape [(x1, x2), (y1, y2)]
- mode – boolean operator: replace/and/or/xor/subtract
- name – name of the selection

select_circle(x, y, xc, yc, r, mode='replace', name='default', inclusive=True)[source]

Select a circular region centred on xc, yc, with a radius of r.

Parameters:
- x – expression for the x space
- y – expression for the y space
- xc – location of the centre of the circle in x
- yc – location of the centre of the circle in y
- r – the radius of the circle
- mode – boolean operator: replace/and/or/xor/subtract
- name – name of the selection

Example:

>>> ds.select_circle('x', 'y', 2, 3, 1)

select_ellipse(x, y, xc, yc, width, height, angle=0, mode='replace', name='default', radians=False, inclusive=True)[source]

Select an elliptical region centred on xc, yc, with a certain width, height and angle.

Parameters:
- x – expression for the x space
- y – expression for the y space
- xc – location of the centre of the ellipse in x
- yc – location of the centre of the ellipse in y
- width – the width of the ellipse (diameter)
- height – the height of the ellipse (diameter)
- angle – orientation of the ellipse in degrees, measured counter-clockwise from the y axis
- mode – boolean operator: replace/and/or/xor/subtract
- name – name of the selection

Example:

>>> ds.select_ellipse('x', 'y', 2, -1, 5, 1, 30, name='my_ellipse')

select_inverse(name='default', executor=None)[source]

Invert the selection, i.e. what is selected will not be, and vice versa

Parameters: name (str) – executor –
select_lasso(expression_x, expression_y, xsequence, ysequence, mode='replace', name='default', executor=None)[source]

Select points inside a polygon (lasso); for performance reasons this is handled differently from other selections.

Parameters:
- expression_x (str) – name/expression for the x coordinate
- expression_y (str) – name/expression for the y coordinate
- xsequence – list of x values defining the lasso, together with ysequence
- ysequence – list of y values defining the lasso, together with xsequence
- mode (str) – boolean operator: replace/and/or/xor/subtract
- name (str) – history tree or selection 'slot' to use
- executor –
select_non_missing(drop_nan=True, drop_masked=True, column_names=None, mode='replace', name='default')[source]

Create a selection that selects rows having non-missing values for all columns in column_names

The name reflects Pandas' dropna; no rows are actually dropped, a mask keeps track of the selection

Parameters:
- drop_nan – drop rows when there is a NaN in any of the columns (only affects float values)
- drop_masked – drop rows when there is a masked value in any of the columns
- column_names – the columns to consider; default: all (real, non-virtual) columns
- mode (str) – boolean operator: replace/and/or/xor/subtract
- name (str) – history tree or selection 'slot' to use
select_nothing(name='default')[source]

Select nothing

select_rectangle(x, y, limits, mode='replace', name='default')[source]

Select a 2d rectangular box in the space given by x and y, bounded by limits

Example:

>>> ds.select_rectangle('x', 'y', [(0, 10), (0, 1)])

Parameters:
- x – expression for the x space
- y – expression for the y space
- limits – sequence of shape [(x1, x2), (y1, y2)]
- mode – boolean operator: replace/and/or/xor/subtract
selected_length()[source]

Returns the number of rows that are selected

selection_can_redo(name='default')[source]

Can selection name be redone?

selection_can_undo(name='default')[source]

Can selection name be undone?

selection_redo(name='default', executor=None)[source]

Redo selection, for the name

selection_undo(name='default', executor=None)[source]

Undo selection, for the name

set_active_fraction(value)[source]

Sets the active_fraction, sets the picked row to None, and removes the selection

TODO: we may be able to keep the selection, if we keep the expression, and also the picked row

set_active_range(i1, i2)[source]

Sets the active range, sets the picked row to None, and removes the selection

TODO: we may be able to keep the selection, if we keep the expression, and also the picked row

set_current_row(value)[source]

Set the current row, and emit the signal signal_pick

set_selection(selection, name='default', executor=None)[source]

Sets the selection object

Parameters: selection – Selection object name – selection ‘slot’ executor –
set_variable(name, expression_or_value, write=True)[source]

Set the variable to an expression or value defined by expression_or_value

>>> ds.set_variable("a", 2.)
>>> ds.set_variable("b", "a**2")
>>> ds.get_variable("b")
'a**2'
>>> ds.evaluate_variable("b")
4.0

Parameters:
- name – name of the variable
- expression_or_value – value or expression
- write – write the variable to the meta file
sort(by, ascending=True, kind='quicksort')[source]

Return a sorted dataset, sorted by the expression ‘by’

Note that no copy of the underlying data is made; only a view/reference is created.

Note that filters are ignored (since they may change); you may want to run Dataset.extract first.

Example:
>>> a = np.array(['a', 'b', 'c'])
>>> x = np.arange(1,4)
>>> ds = vaex.from_arrays(a=a, x=x)
>>> ds.sort('(x-1.8)**2', ascending=False)  # the order of column 'a' becomes 'c', 'a', 'b'

Parameters:
- by (str or expression) – expression to sort by
- ascending (bool) – ascending (default, True) or descending (False)
- kind (str) – kind of algorithm to use (passed to numpy.argsort)
std(expression, binby=[], limits=None, shape=128, selection=False, delay=False, progress=None)[source]

Calculate the standard deviation for the given expression, possibly on a grid defined by binby

>>> ds.std("vz")
110.31773397535071
>>> ds.std("vz", binby=["(x**2+y**2)**0.5"], shape=4)
array([ 123.57954851,   85.35190177,   61.14345748,   38.0740619 ])

Parameters:
- expression – expression or list of expressions, e.g. 'x' or ['x', 'y']
- binby – list of expressions for constructing a binned grid
- limits – description of the min and max values for the expressions, e.g. 'minmax', '99.7%', [0, 10], or a list, e.g. [[0, 10], [0, 20], 'minmax']
- shape – shape of the array on which the statistic is calculated; if a single integer is given it is used for all dimensions, e.g. shape=128 or shape=[128, 256]
- selection – name of the selection to use (True for the 'default' selection; None or False for all data), or a list of selections
- delay – do not return the result, but a proxy for asynchronous calculation (currently only for internal use)
- progress – a callable that takes one argument (a float between 0 and 1) indicating the progress; the calculation is cancelled when it returns False

Returns: Numpy array with the given shape (or a scalar when no binby argument is given) containing the statistic
subspace(*expressions, **kwargs)[source]

Return a Subspace for this dataset with the given expressions:

Example:

>>> subspace_xy = some_dataset("x", "y")

Parameters:
- expressions (list[str]) – list of expressions
- kwargs –

Return type: Subspace
subspaces(expressions_list=None, dimensions=None, exclude=None, **kwargs)[source]

Generate a Subspaces object, based on a custom list of expressions or all possible combinations based on dimension

Parameters:
- expressions_list – list of lists of expressions, where the inner list defines the subspace
- dimensions – if given, generates a subspace with all possible combinations for that dimension
- exclude – list of
sum(expression, binby=[], limits=None, shape=128, selection=False, delay=False, progress=None)[source]

Calculate the sum for the given expression, possibly on a grid defined by binby

Examples:

>>> ds.sum("L")
304054882.49378014
>>> ds.sum("L", binby="E", shape=4)
array([  8.83517994e+06,   5.92217598e+07,   9.55218726e+07,
1.40008776e+08])

Parameters:
- expression – expression or list of expressions, e.g. 'x' or ['x', 'y']
- binby – list of expressions for constructing a binned grid
- limits – description of the min and max values for the expressions, e.g. 'minmax', '99.7%', [0, 10], or a list, e.g. [[0, 10], [0, 20], 'minmax']
- shape – shape of the array on which the statistic is calculated; if a single integer is given it is used for all dimensions, e.g. shape=128 or shape=[128, 256]
- selection – name of the selection to use (True for the 'default' selection; None or False for all data), or a list of selections
- delay – do not return the result, but a proxy for asynchronous calculation (currently only for internal use)
- progress – a callable that takes one argument (a float between 0 and 1) indicating the progress; the calculation is cancelled when it returns False

Returns: Numpy array with the given shape (or a scalar when no binby argument is given) containing the statistic
take(indices)[source]

Returns a dataset containing only rows indexed by indices

Note that no copy of the underlying data is made; only a view/reference is created.

Example:
>>> a = np.array(['a', 'b', 'c'])
>>> x = np.arange(1,4)
>>> ds = vaex.from_arrays(a=a, x=x)
>>> ds.take([0,2])

to_astropy_table(column_names=None, selection=None, strings=True, virtual=False, index=None)[source]

Returns an astropy Table object containing the ndarrays corresponding to the evaluated data

Parameters:
- column_names – list of column names to export; when None, Dataset.get_column_names(strings=strings, virtual=virtual) is used
- selection – name of the selection to use (True for the 'default' selection; None or False for all data), or a list of selections
- strings – argument passed to Dataset.get_column_names when column_names is None
- virtual – argument passed to Dataset.get_column_names when column_names is None
- index – if given, this column is used for the index of the table

Returns: astropy.table.Table object
to_copy(column_names=None, selection=None, strings=True, virtual=False, selections=True)[source]

Return a copy of the Dataset; if selection is None the data is not copied, it is just referenced

Parameters:
- column_names – list of column names to copy; when None, Dataset.get_column_names(strings=strings, virtual=virtual) is used
- selection – name of the selection to use (True for the 'default' selection; None or False for all data), or a list of selections
- strings – argument passed to Dataset.get_column_names when column_names is None
- virtual – argument passed to Dataset.get_column_names when column_names is None
- selections – copy selections to the new dataset

Returns: Dataset
to_dict(column_names=None, selection=None, strings=True, virtual=False)[source]

Return a dict containing the ndarray corresponding to the evaluated data

Parameters:
- column_names – list of column names to export; when None, Dataset.get_column_names(strings=strings, virtual=virtual) is used
- selection – name of the selection to use (True for the 'default' selection; None or False for all data), or a list of selections
- strings – argument passed to Dataset.get_column_names when column_names is None
- virtual – argument passed to Dataset.get_column_names when column_names is None

Returns: dict
to_items(column_names=None, selection=None, strings=True, virtual=False)[source]

Return a list of [(column_name, ndarray), …)] pairs where the ndarray corresponds to the evaluated data

Parameters:
- column_names – list of column names to export; when None, Dataset.get_column_names(strings=strings, virtual=virtual) is used
- selection – name of the selection to use (True for the 'default' selection; None or False for all data), or a list of selections
- strings – argument passed to Dataset.get_column_names when column_names is None
- virtual – argument passed to Dataset.get_column_names when column_names is None

Returns: list of (name, ndarray) pairs
to_pandas_df(column_names=None, selection=None, strings=True, virtual=False, index_name=None)[source]

Return a pandas DataFrame containing the ndarray corresponding to the evaluated data

If index_name is given, that column is used for the index of the dataframe.

>>> df = ds.to_pandas_df(["x", "y", "z"])
>>> ds_copy = vx.from_pandas(df)

Parameters:
- column_names – list of column names to export; when None, Dataset.get_column_names(strings=strings, virtual=virtual) is used
- selection – name of the selection to use (True for the 'default' selection; None or False for all data), or a list of selections
- strings – argument passed to Dataset.get_column_names when column_names is None
- virtual – argument passed to Dataset.get_column_names when column_names is None
- index_name – if given, this column is used for the index of the DataFrame

Returns: pandas.DataFrame object
trim(inplace=False)[source]

Return a dataset, where all columns are ‘trimmed’ by the active range.

For returned datasets, ds.get_active_range() returns (0, ds.length_original()).

Note that no copy of the underlying data is made; only a view/reference is created.

Parameters: inplace – Make modifications to self or return a new dataset
ucd_find(ucds, exclude=[])[source]

Find a set of columns (names) which have the ucd, or part of the ucd

Prefixed with a ^, it will only match the first part of the ucd

>>> dataset.ucd_find('pos.eq.ra', 'pos.eq.dec')
['RA', 'DEC']
>>> dataset.ucd_find('pos.eq.ra', 'doesnotexist')
>>> dataset.ucds[dataset.ucd_find('pos.eq.ra')]
'pos.eq.ra;meta.main'
>>> dataset.ucd_find('meta.main')
'dec'
>>> dataset.ucd_find('^meta.main')

unit(expression, default=None)[source]

Returns the unit (an astropy.unit.Units object) for the expression

>>> import vaex as vx
>>> ds = vx.example()
>>> ds.unit("x")
Unit("kpc")
>>> ds.unit("x*L")
Unit("km kpc2 / s")

Parameters:
- expression – expression, which can be a column name
- default – if no unit is known, this value is returned

Returns: the unit of the expression, an astropy.units.Unit object
update_meta()[source]

Will read back the ucd, descriptions, units etc, written by Dataset.write_meta(). This will be done when opening a dataset.

update_virtual_meta()[source]

Will read back the virtual column etc, written by Dataset.write_virtual_meta(). This will be done when opening a dataset.

validate_expression(expression)[source]

Validate an expression (may throw Exceptions)

var(expression, binby=[], limits=None, shape=128, selection=False, delay=False, progress=None)[source]

Calculate the sample variance for the given expression, possibly on a grid defined by binby

Examples:

>>> ds.var("vz")
12170.002429456246
>>> ds.var("vz", binby=["(x**2+y**2)**0.5"], shape=4)
array([ 15271.90481083,   7284.94713504,   3738.52239232,   1449.63418988])
>>> ds.var("vz", binby=["(x**2+y**2)**0.5"], shape=4)**0.5
array([ 123.57954851,   85.35190177,   61.14345748,   38.0740619 ])
>>> ds.std("vz", binby=["(x**2+y**2)**0.5"], shape=4)
array([ 123.57954851,   85.35190177,   61.14345748,   38.0740619 ])

Parameters:
- expression – expression or list of expressions, e.g. 'x' or ['x', 'y']
- binby – list of expressions for constructing a binned grid
- limits – description of the min and max values for the expressions, e.g. 'minmax', '99.7%', [0, 10], or a list, e.g. [[0, 10], [0, 20], 'minmax']
- shape – shape of the array on which the statistic is calculated; if a single integer is given it is used for all dimensions, e.g. shape=128 or shape=[128, 256]
- selection – name of the selection to use (True for the 'default' selection; None or False for all data), or a list of selections
- delay – do not return the result, but a proxy for asynchronous calculation (currently only for internal use)
- progress – a callable that takes one argument (a float between 0 and 1) indicating the progress; the calculation is cancelled when it returns False

Returns: Numpy array with the given shape (or a scalar when no binby argument is given) containing the statistic
write_meta()[source]

Writes all meta data: ucd, description and units

The default implementation is to write this to a file called meta.yaml in the directory defined by Dataset.get_private_dir(). Other implementation may store this in the dataset file itself. (For instance the vaex hdf5 implementation does this)

This method is called after virtual columns or variables are added. Upon opening a file, Dataset.update_meta() is called, so that the information is not lost between sessions.

Note: opening a dataset twice may result in corruption of this file.

write_virtual_meta()[source]

Writes virtual columns, variables and their ucd, description and units

The default implementation is to write this to a file called virtual_meta.yaml in the directory defined by Dataset.get_private_dir(). Other implementation may store this in the dataset file itself.

This method is called after virtual columns or variables are added. Upon opening a file, Dataset.update_virtual_meta() is called, so that the information is not lost between sessions.

Note: opening a dataset twice may result in corruption of this file.

## vaex.stat module¶

class vaex.stat.Expression[source]

Describes an expression for a statistic

calculate(ds, binby=[], shape=256, limits=None, selection=None)[source]

Calculate the statistic for a Dataset

vaex.stat.correlation(x, y)[source]

Creates a correlation statistic

vaex.stat.count(expression='*')[source]

Creates a count statistic

vaex.stat.covar(x, y)[source]

Creates a covariance statistic

vaex.stat.mean(expression)[source]

Creates a mean statistic

vaex.stat.std(expression)[source]

Creates a standard deviation statistic

vaex.stat.sum(expression)[source]

Creates a sum statistic

## Machine learning with vaex.ml¶

Note that vaex.ml does not fall under the MIT license, but under the CC BY-NC-ND license, which means it is OK for personal or academic use. You can install vaex-ml using pip install vaex-ml.