API documentation for vaex library¶
Quick lists¶
Opening/reading in your data.¶
Open a DataFrame from file given by path.
Open a list of filenames, and return a DataFrame with all DataFrames concatenated.
Create an in memory DataFrame from numpy arrays.
Create a DataFrame from an Apache Arrow dataset.
Creates a vaex DataFrame from an arrow Table.
Create an in memory DataFrame from an ascii file (whitespace separated by default).
Create a vaex DataFrame from an Astropy Table.
Load a CSV file as a DataFrame, and optionally convert to an HDF5 file.
Fast CSV reader using Apache Arrow.
Create a Vaex DataFrame from a Vaex Dataset.
Create an in memory dataset from a dict with column names as keys and list/numpy-arrays as values.
Create an in memory DataFrame from numpy arrays, in contrast to from_arrays this keeps the order of columns intact (for Python < 3.6).
A method to read a JSON file using pandas, and convert to a DataFrame directly.
Create an in memory DataFrame from a pandas DataFrame.
Create a dataframe from a list of dict.
Visualizations.¶
Visualize data in a 2d histogram/heatmap.
Plot a histogram.
Visualize (small amounts of) data in 2d using a scatter plot.
Statistics.¶
Calculate the correlation coefficient cov[x,y]/(std[x]*std[y]) between x and y, possibly on a grid defined by binby.
Count the number of non-NaN values (or all, if expression is None or "*").
Calculate the covariance matrix for x and y or more expressions, possibly on a grid defined by binby.
Calculate the maximum for given expressions, possibly on a grid defined by binby.
Calculate the mean for expression, possibly on a grid defined by binby.
Calculate the median, possibly on a grid defined by binby.
Calculate the minimum for given expressions, possibly on a grid defined by binby.
Calculate the minimum and maximum for expressions, possibly on a grid defined by binby.
Calculate/estimate the mode.
Estimate the mutual information between x and y on a grid with shape mi_shape and mi_limits, possibly on a grid defined by binby.
Calculate the standard deviation for the given expression, possibly on a grid defined by binby.
Returns all unique values.
Calculate the sample variance for the given expression, possibly on a grid defined by binby.
vaex-core¶
Vaex is a library for dealing with larger than memory DataFrames (out of core).
The most important class (data structure) in vaex is the DataFrame. A DataFrame is obtained by either opening the example dataset:
>>> import vaex
>>> df = vaex.example()
Or using open() to open a file:
>>> df1 = vaex.open("somedata.hdf5")
>>> df2 = vaex.open("somedata.fits")
>>> df3 = vaex.open("somedata.arrow")
>>> df4 = vaex.open("somedata.csv")
Or connecting to a remote server:
>>> df_remote = vaex.open("http://try.vaex.io/nyc_taxi_2015")
A few strong features of vaex are:
Performance: works with huge tabular data, processes over a billion (> 10^9) rows/second.
Expression system / Virtual columns: compute on the fly, without wasting ram.
Memory efficient: no memory copies when doing filtering/selections/subsets.
Visualization: directly supported, a one-liner is often enough.
User friendly API: you will only need to deal with a DataFrame object, and tab completion + docstring will help you out: ds.mean<tab>, feels very similar to Pandas.
Very fast statistics on N dimensional grids such as histograms, running mean, heatmaps.
Follow the tutorial at https://docs.vaex.io/en/latest/tutorial.html to learn how to use vaex.
- vaex.concat(dfs, resolver='flexible') vaex.dataframe.DataFrame [source]¶
Concatenate a list of DataFrames.
- Parameters
resolver – How to resolve schema conflicts, see DataFrame.concat().
- vaex.delayed(f)[source]¶
Decorator to transparently accept delayed computation.
Example:
>>> delayed_sum = ds.sum(ds.E, binby=ds.x, limits=limits,
...                      shape=4, delay=True)
>>> @vaex.delayed
... def total_sum(sums):
...     return sums.sum()
>>> sum_of_sums = total_sum(delayed_sum)
>>> ds.execute()
>>> sum_of_sums.get()
See the tutorial for a more complete example https://docs.vaex.io/en/latest/tutorial.html#Parallel-computations
- vaex.example()[source]¶
Result of an N-body simulation of the accretion of 33 satellite galaxies into a Milky Way dark matter halo.
Data was created by Helmi & de Zeeuw 2000. The data contains the position (x, y, z), velocities (vx, vy, vz), the energy (E), the angular momentum (L, Lz) and iron content (FeH) of the particles.
- Return type
- vaex.from_arrays(**arrays) vaex.dataframe.DataFrameLocal [source]¶
Create an in memory DataFrame from numpy arrays.
Example
>>> import vaex, numpy as np
>>> x = np.arange(5)
>>> y = x ** 2
>>> vaex.from_arrays(x=x, y=y)
  #    x    y
  0    0    0
  1    1    1
  2    2    4
  3    3    9
  4    4   16
>>> some_dict = {'x': x, 'y': y}
>>> vaex.from_arrays(**some_dict)  # in case you have your columns in a dict
  #    x    y
  0    0    0
  1    1    1
  2    2    4
  3    3    9
  4    4   16
- Parameters
arrays – keyword arguments with arrays
- Return type
- vaex.from_arrow_dataset(arrow_dataset) vaex.dataframe.DataFrame [source]¶
Create a DataFrame from an Apache Arrow dataset.
- vaex.from_arrow_table(table) vaex.dataframe.DataFrame [source]¶
Creates a vaex DataFrame from an arrow Table.
- Parameters
as_numpy – Will lazily cast columns to a NumPy ndarray.
- Return type
- vaex.from_ascii(path, seperator=None, names=True, skip_lines=0, skip_after=0, **kwargs)[source]¶
Create an in memory DataFrame from an ascii file (whitespace separated by default).
>>> import vaex as vx
>>> ds = vx.from_ascii("table.asc")
>>> ds = vx.from_ascii("table.csv", seperator=",", names=["x", "y", "z"])
- Parameters
path – file path
seperator – value separator, by default whitespace, use “,” for comma separated values.
names – If True, the first line is used for the column names, otherwise provide a list of strings with names
skip_lines – skip lines at the start of the file
skip_after – skip lines at the end of the file
kwargs –
- Return type
- vaex.from_csv(filename_or_buffer, copy_index=False, chunk_size=None, convert=False, fs_options={}, progress=None, fs=None, **kwargs)[source]¶
Load a CSV file as a DataFrame, and optionally convert to an HDF5 file.
- Parameters
filename_or_buffer (str or file) – CSV file path or file-like
copy_index (bool) – copy index when source is read via Pandas
chunk_size (int) –
if the CSV file is too big to fit in memory, this parameter can be used to read the CSV file in chunks. For example:
>>> import vaex
>>> for i, df in enumerate(vaex.read_csv('taxi.csv', chunk_size=100_000)):
...     df = df[df.passenger_count < 6]
...     df.export_hdf5(f'taxi_{i:02}.hdf5')
convert (bool or str) – convert files to an hdf5 file for optimization, can also be a path. The CSV file will be read in chunks: either using the provided chunk_size argument, or a default size. Each chunk will be saved as a separate hdf5 file, then all of them will be combined into one hdf5 file. So for a big CSV file you will need at least double the extra space on disk. The default chunk_size for converting is 5 million rows, which corresponds to around 1 GB of memory for the NYC Taxi example dataset.
progress – (Only applies when convert is not False) True to display a progress bar, or a callable that takes one argument (a floating point value between 0 and 1) indicating the progress, calculations are cancelled when this callable returns False
kwargs – extra keyword arguments, currently passed to Pandas read_csv function, but the implementation might change in future versions.
- Returns
DataFrame
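The chunked reading described for chunk_size can be pictured with the standard library alone. The sketch below is not how vaex reads CSV files: the helper name iter_csv_chunks and the sample data are hypothetical, and it only illustrates processing a fixed number of rows at a time so that the whole file never has to be held in memory.

```python
import csv
import io
from itertools import islice


def iter_csv_chunks(file_obj, chunk_size):
    """Yield lists of at most chunk_size row dicts from a CSV file object."""
    reader = csv.DictReader(file_obj)
    while True:
        chunk = list(islice(reader, chunk_size))
        if not chunk:
            break
        yield chunk


# Hypothetical stand-in for a large CSV file on disk.
sample = io.StringIO("passenger_count,fare\n1,5.0\n7,9.5\n2,3.0\n4,8.0\n6,1.0\n")

# Filter each chunk independently, as in the chunk_size example above.
chunks = []
for rows in iter_csv_chunks(sample, chunk_size=2):
    chunks.append([row for row in rows if int(row["passenger_count"]) < 6])

chunk_sizes = [len(chunk) for chunk in chunks]
```

Each chunk is filtered and could be written out before the next one is read, which is the same pattern as exporting one hdf5 file per chunk.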
- vaex.from_dataset(dataset: vaex.dataset.Dataset) vaex.dataframe.DataFrame [source]¶
Create a Vaex DataFrame from a Vaex Dataset
- vaex.from_dict(data)[source]¶
Create an in memory dataset from a dict with column names as keys and list/numpy-arrays as values
Example
>>> data = {'A':[1,2,3],'B':['a','b','c']}
>>> vaex.from_dict(data)
  #    A    B
  0    1    'a'
  1    2    'b'
  2    3    'c'
- Parameters
data – A dict of {column:[value, value,…]}
- Return type
- vaex.from_items(*items)[source]¶
Create an in memory DataFrame from numpy arrays, in contrast to from_arrays this keeps the order of columns intact (for Python < 3.6).
Example
>>> import vaex, numpy as np
>>> x = np.arange(5)
>>> y = x ** 2
>>> vaex.from_items(('x', x), ('y', y))
  #    x    y
  0    0    0
  1    1    1
  2    2    4
  3    3    9
  4    4   16
- Parameters
items – list of [(name, numpy array), …]
- Return type
- vaex.from_json(path_or_buffer, orient=None, precise_float=False, lines=False, copy_index=False, **kwargs)[source]¶
A method to read a JSON file using pandas, and convert to a DataFrame directly.
- Parameters
path_or_buffer (str) – a valid JSON string or file-like, default: None. The string could be a URL. Valid URL schemes include http, ftp, s3, gcs, and file. For file URLs, a host is expected. For instance, a local file could be file://localhost/path/to/table.json
orient (str) – Indication of expected JSON string format. Allowed values are split, records, index, columns, and values.
precise_float (bool) – Set to enable usage of higher precision (strtod) function when decoding string to double values. Default (False) is to use fast but less precise builtin functionality.
lines (bool) – Read the file as a json object per line.
- Return type
- vaex.from_pandas(df, name='pandas', copy_index=False, index_name='index')[source]¶
Create an in memory DataFrame from a pandas DataFrame.
- Param
pandas.DataFrame df: Pandas DataFrame
- Param
name: unique name for the DataFrame
>>> import vaex, pandas as pd
>>> df_pandas = pd.read_csv('test.csv')
>>> df = vaex.from_pandas(df_pandas)
- Return type
- vaex.from_records(records: List[Dict], array_type='arrow', defaults={}) vaex.dataframe.DataFrame [source]¶
Create a dataframe from a list of dict.
Warning
This is for convenience only, for performance pass arrays to from_arrays() for instance.
- vaex.open(path, convert=False, progress=None, shuffle=False, fs_options={}, fs=None, *args, **kwargs)[source]¶
Open a DataFrame from file given by path.
Example:
>>> df = vaex.open('sometable.hdf5')
>>> df = vaex.open('somedata*.csv', convert='bigdata.hdf5')
- Parameters
path (str or list) – local or absolute path to file, or glob string, or list of paths
convert – Uses dataframe.export when convert is a path. If True, convert=path+'.hdf5'. The conversion is skipped if the input file or conversion argument did not change.
progress – (Only applies when convert is not False) True to display a progress bar, or a callable that takes one argument (a floating point value between 0 and 1) indicating the progress, calculations are cancelled when this callable returns False
shuffle (bool) – shuffle converted DataFrame or not
fs_options (dict) – Extra arguments passed to an optional file system if needed. See below
group – (optional) Specify the group to be read from an HDF5 file. By default this is set to “/table”.
fs – Apache Arrow FileSystem object, or FSSpec FileSystem object, if specified, fs_options should be empty.
args – extra arguments for file readers that need it
kwargs – extra keyword arguments
- Returns
return a DataFrame on success, otherwise None
- Return type
Note: From version 4.14.0 vaex.open() will lazily read CSV files. If you prefer to read the entire CSV file into memory, use vaex.from_csv() or vaex.from_csv_arrow() instead.
Cloud storage support:
Vaex supports streaming of HDF5 files from Amazon AWS S3 and Google Cloud Storage. Files are by default cached in $HOME/.vaex/file-cache/(s3|gs) such that successive access is as fast as native disk access.
Amazon AWS S3 options:
The following common fs_options are used for S3 access:
anon: Use anonymous access or not (false by default). (Allowed values are: true,True,1,false,False,0)
anonymous - Alias for anon
cache: Use the disk cache or not, only set to false if the data should be accessed once. (Allowed values are: true,True,1,false,False,0)
access_key - AWS access key, if not provided will use the standard env vars, or the ~/.aws/credentials file
secret_key - AWS secret key, similar to access_key
profile - If multiple profiles are present in ~/.aws/credentials, pick this one instead of ‘default’, see https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-files.html
region - AWS Region, e.g. ‘us-east-1’, will be determined automatically if not provided.
endpoint_override - URL/ip to connect to, instead of AWS, e.g. ‘localhost:9000’ for minio
All fs_options can also be encoded in the file path as a query string.
Examples:
>>> df = vaex.open('s3://vaex/taxi/yellow_taxi_2015_f32s.hdf5', fs_options={'anonymous': True})
>>> df = vaex.open('s3://vaex/taxi/yellow_taxi_2015_f32s.hdf5?anon=true')
>>> df = vaex.open('s3://mybucket/path/to/file.hdf5', fs_options={'access_key': my_key, 'secret_key': my_secret_key})
>>> df = vaex.open(f's3://mybucket/path/to/file.hdf5?access_key={my_key}&secret_key={my_secret_key}')
>>> df = vaex.open('s3://mybucket/path/to/file.hdf5?profile=myproject')
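To make the query-string form of fs_options concrete, the following sketch separates a path into its bare location and an options dict using only the standard library. It is an illustration of the idea, not vaex's own parsing code; the helper names split_fs_options and to_bool are hypothetical, and to_bool accepts just the boolean spellings listed above.

```python
from urllib.parse import parse_qsl, urlsplit


def split_fs_options(path):
    """Split 'scheme://bucket/key?a=b' into the bare path and an options dict."""
    parts = urlsplit(path)
    options = dict(parse_qsl(parts.query))
    bare_path = f"{parts.scheme}://{parts.netloc}{parts.path}"
    return bare_path, options


def to_bool(text):
    """Interpret the allowed boolean spellings: true, True, 1, false, False, 0."""
    if text in ("true", "True", "1"):
        return True
    if text in ("false", "False", "0"):
        return False
    raise ValueError(f"not a boolean option value: {text!r}")


bare_path, options = split_fs_options(
    "s3://mybucket/path/to/file.hdf5?profile=myproject&cache=false"
)
use_cache = to_bool(options["cache"])
```

Note that all values arrive as strings when encoded in the path, which is why the boolean options list several accepted spellings.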
Google Cloud Storage options:
The following fs_options are used for GCP access:
token: Authentication method for GCP. Use ‘anon’ for anonymous access. See https://gcsfs.readthedocs.io/en/latest/index.html#credentials for more details.
cache: Use the disk cache or not, only set to false if the data should be accessed once. (Allowed values are: true,True,1,false,False,0).
project and other arguments are passed to gcsfs.core.GCSFileSystem
Examples:
>>> df = vaex.open('gs://vaex-data/airlines/us_airline_data_1988_2019.hdf5', fs_options={'token': None})
>>> df = vaex.open('gs://vaex-data/airlines/us_airline_data_1988_2019.hdf5?token=anon')
>>> df = vaex.open('gs://vaex-data/testing/xys.hdf5?token=anon&cache=False')
- vaex.open_many(filenames)[source]¶
Open a list of filenames, and return a DataFrame with all DataFrames concatenated.
The filenames can be of any format that is supported by vaex.open(), namely hdf5, arrow, parquet, csv, etc.
- vaex.register_function(scope=None, as_property=False, name=None, on_expression=True, df_accessor=None, multiprocessing=False)[source]¶
Decorator to register a new function with vaex.
If on_expression is True, the function will be available as a method on an Expression, where the first argument will be the expression itself.
If df_accessor is given, it is added as a method to that dataframe accessor (see e.g. vaex/geo.py)
Example:
>>> import vaex
>>> df = vaex.example()
>>> @vaex.register_function()
... def invert(x):
...     return 1/x
>>> df.x.invert()
>>> import numpy as np
>>> df = vaex.from_arrays(departure=np.arange('2015-01-01', '2015-12-05', dtype='datetime64'))
>>> @vaex.register_function(as_property=True, scope='dt')
... def dt_relative_day(x):
...     return vaex.functions.dt_dayofyear(x)/365.
>>> df.departure.dt.relative_day
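The general pattern behind a registering decorator can be sketched without vaex. The code below is a simplified illustration under stated assumptions: the registry dict, this register_function, and the toy Expression class are all hypothetical stand-ins, and vaex's real implementation differs. It shows how a decorated function can become available as a method whose first argument is the expression's data.

```python
registry = {}  # hypothetical global table of registered functions


def register_function(name=None):
    """Decorator that stores a function in the registry under its name."""
    def decorator(f):
        registry[name or f.__name__] = f
        return f
    return decorator


class Expression:
    """Toy expression that wraps a list of values."""

    def __init__(self, values):
        self.values = values

    def __getattr__(self, attr):
        # Only called when normal attribute lookup fails.
        try:
            f = registry[attr]
        except KeyError:
            raise AttributeError(attr) from None
        # The expression's values are passed as the first argument.
        return lambda *args: Expression(f(self.values, *args))


@register_function()
def invert(x):
    return [1 / value for value in x]


result = Expression([1.0, 2.0, 4.0]).invert().values
```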
- vaex.vconstant(value, length, dtype=None, chunk_size=1024)[source]¶
Creates a virtual column with constant values, which uses 0 memory.
- Parameters
value – The value with which to fill the column
length – The length of the column, i.e. the number of rows it should contain.
dtype – The preferred dtype for the column.
chunk_size – Could be used to optimize the performance (evaluation) of this column.
- vaex.vrange(start, stop, step=1, dtype='f8')[source]¶
Creates a virtual column which is the equivalent of numpy.arange, but uses 0 memory
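The idea of a column that occupies no memory can be illustrated in plain Python: instead of storing values, the object stores only the recipe and computes a slice when asked. This is a conceptual sketch, not vaex's implementation; the class names and the evaluate method are hypothetical, and VirtualRange assumes a positive step.

```python
class VirtualConstant:
    """Behaves like a column of `length` identical values without storing them."""

    def __init__(self, value, length):
        self.value = value
        self.length = length

    def __len__(self):
        return self.length

    def evaluate(self, i1, i2):
        # Materialize only the requested rows [i1, i2).
        i2 = min(i2, self.length)
        return [self.value] * max(0, i2 - i1)


class VirtualRange:
    """Behaves like an arange(start, stop, step) column (positive step assumed)."""

    def __init__(self, start, stop, step=1):
        self.start = start
        self.step = step
        self.length = max(0, -(-(stop - start) // step))  # ceiling division

    def __len__(self):
        return self.length

    def evaluate(self, i1, i2):
        i2 = min(i2, self.length)
        return [self.start + i * self.step for i in range(i1, i2)]


big = VirtualRange(0, 10**12)          # a trillion rows, nothing allocated
window = big.evaluate(5, 8)            # compute three rows on demand
stepped = VirtualRange(0, 10, 3).evaluate(0, 100)
ones = VirtualConstant(7, 10**9).evaluate(0, 3)
```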
Aggregation and statistics¶
- class vaex.agg.AggregatorDescriptorKurtosis(name, expression, short_name='kurtosis', selection=None, edges=False)[source]¶
- class vaex.agg.AggregatorDescriptorMean(name, expressions, short_name='mean', selection=None, edges=False)[source]¶
- class vaex.agg.AggregatorDescriptorMulti(name, expressions, short_name, selection=None, edges=False)[source]¶
Bases: vaex.agg.AggregatorDescriptor
Uses multiple operations/aggregation to calculate the final aggregation
- class vaex.agg.AggregatorDescriptorSkew(name, expression, short_name='skew', selection=None, edges=False)[source]¶
- class vaex.agg.AggregatorDescriptorStd(name, expression, short_name='var', ddof=0, selection=None, edges=False)[source]¶
- class vaex.agg.AggregatorDescriptorVar(name, expression, short_name='var', ddof=0, selection=None, edges=False)[source]¶
- vaex.agg.all(expression=None, selection=None)[source]¶
Aggregator that returns True when all of the values in the group are True, or when all of the data in the group is valid (i.e. not missing values or np.nan). The aggregator returns False if there is no data in the group when the selection argument is used.
- Parameters
expression – expression in the form of a string, e.g. ‘x’ or ‘x+y’ or vaex expression object, e.g. df.x or df.x+df.y
selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False)
- vaex.agg.any(expression=None, selection=None)[source]¶
Aggregator that returns True when any of the values in the group are True, or when there is any data in the group that is valid (i.e. not missing values or np.nan). The aggregator returns False if there is no data in the group when the selection argument is used.
- Parameters
expression – expression in the form of a string, e.g. ‘x’ or ‘x+y’ or vaex expression object, e.g. df.x or df.x+df.y
selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False)
- vaex.agg.first(expression, order_expression=None, selection=None, edges=False)[source]¶
Creates a first aggregation.
- Parameters
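expression – expression in the form of a string, e.g. ‘x’ or ‘x+y’ or vaex expression object, e.g. df.x or df.x+df.y
order_expression – Order the values in the bins by this expression.
selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False)
edges – Currently for internal use only (it includes nan’s and values outside the limits at borders, nan at 0, smaller than at 1, and larger at -1)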
- vaex.agg.last(expression, order_expression=None, selection=None, edges=False)[source]¶
Creates a last aggregation.
- Parameters
expression – expression in the form of a string, e.g. ‘x’ or ‘x+y’ or vaex expression object, e.g. df.x or df.x+df.y
order_expression – Order the values in the bins by this expression.
selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False)
edges – Currently for internal use only (it includes nan’s and values outside the limits at borders, nan at 0, smaller than at 1, and larger at -1)
- class vaex.agg.list(expression, selection=None, dropna=False, dropnan=False, dropmissing=False, edges=False)[source]¶
Bases: vaex.agg.AggregatorDescriptorBasic
Aggregator that returns a list of values belonging to the specified expression.
- Parameters
expression – expression in the form of a string, e.g. ‘x’ or ‘x+y’ or vaex expression object, e.g. df.x or df.x+df.y
selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False)
dropmissing – Drop rows with missing values
dropnan – Drop rows with NaN values
dropna – Drop rows with Not Available (NA) values (NaN or missing values).
edges – Currently for internal use only (it includes nan’s and values outside the limits at borders, nan at 0, smaller than at 1, and larger at -1)
- vaex.agg.nunique(expression, dropna=False, dropnan=False, dropmissing=False, selection=None, edges=False)[source]¶
Aggregator that calculates the number of unique items per bin.
- Parameters
expression – expression in the form of a string, e.g. ‘x’ or ‘x+y’ or vaex expression object, e.g. df.x or df.x+df.y
dropmissing – Drop rows with missing values
dropnan – Drop rows with NaN values
dropna – Drop rows with Not Available (NA) values (NaN or missing values).
selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False)
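The effect of the dropna, dropnan and dropmissing flags on a per-bin unique count can be sketched in plain Python. This is only an illustration of the semantics as described in the parameter list: the function name nunique_per_group is hypothetical, None stands in for a missing value, and counting NaN and missing values as items when nothing is dropped is an assumption of this sketch.

```python
import math

_NAN = object()  # single stand-in so that all NaN values count as one item


def nunique_per_group(keys, values, dropna=False, dropnan=False, dropmissing=False):
    """Count unique values per key, optionally ignoring NaN and/or missing values."""
    dropnan = dropnan or dropna          # dropna implies both other flags
    dropmissing = dropmissing or dropna
    groups = {}
    for key, value in zip(keys, values):
        seen = groups.setdefault(key, set())
        is_nan = isinstance(value, float) and math.isnan(value)
        if (value is None and dropmissing) or (is_nan and dropnan):
            continue
        seen.add(_NAN if is_nan else value)
    return {key: len(seen) for key, seen in groups.items()}


keys = ["a", "a", "a", "b", "b"]
values = [1.0, 1.0, float("nan"), None, 2.0]

counts_all = nunique_per_group(keys, values)
counts_clean = nunique_per_group(keys, values, dropna=True)
counts_no_nan = nunique_per_group(keys, values, dropnan=True)
```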
Caching¶
(Currently experimental, use at own risk) Vaex can cache task results, such as aggregations, or the internal hashmaps used for groupby to make recurring calculations much faster, at the cost of calculating cache keys and storing/retrieving the cached values.
Internally, Vaex calculates fingerprints (such as hashes of data, or file paths and mtimes) to create cache keys that are similar across processes, such that a restart of a process will most likely result in similar hash keys.
Caches can be turned on globally, or used as a context manager:
>>> import vaex
>>> df = vaex.example()
>>> vaex.cache.memory_infinite() # cache on globally
<cache restore context manager>
>>> vaex.cache.is_on()
True
>>> vaex.cache.off() # cache off globally
<cache restore context manager>
>>> vaex.cache.is_on()
False
>>> with vaex.cache.memory_infinite():
...     df.x.sum()  # calculated with cache
array(-20884.64307324)
>>> vaex.cache.is_on()
False
The functions vaex.cache.set() and vaex.cache.get() simply look up the values in a global dict (vaex.cache.cache), but can be set for more complex behaviour.
A good library to use for in-memory caching is cachetools (https://pypi.org/project/cachetools/)
>>> import vaex
>>> import cachetools
>>> df = vaex.example()
>>> vaex.cache.cache = cachetools.LRUCache(1_000_000_000) # 1gb cache
Configure using environment variables¶
See Configuration for more configuration options.
Especially when using the vaex server it can be useful to turn on caching externally using environment variables.
$ VAEX_CACHE=disk VAEX_CACHE_DISK_SIZE_LIMIT="10GB" python -m vaex.server
Will enable caching using vaex.cache.disk() and configure it to use at most 10 GB of disk space.
When using Vaex in combination with Flask or Plotly Dash, and using gunicorn for scaling, it can be useful to use a multilevel cache, where the first cache is small but low latency (and private to each process), and the second is a higher latency disk cache that is shared among all processes.
$ VAEX_CACHE="memory,disk" VAEX_CACHE_DISK_SIZE_LIMIT="10GB" VAEX_CACHE_MEMORY_SIZE_LIMIT="1GB" gunicorn -w 16 app:server
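The lookup order in a multilevel cache can be sketched with two plain dicts. This is a conceptual model under stated assumptions, not vaex's cache code: the MultiLevelCache class is hypothetical, the first dict plays the role of the private in-memory cache, and the second dict stands in for the shared disk cache.

```python
class MultiLevelCache:
    """Check fast levels first and copy hits from slower levels upward."""

    def __init__(self, *levels):
        self.levels = list(levels)  # ordered from fastest to slowest

    def get(self, key, default=None):
        for index, level in enumerate(self.levels):
            if key in level:
                value = level[key]
                # Promote the value so the next lookup hits a faster level.
                for faster in self.levels[:index]:
                    faster[key] = value
                return value
        return default

    def set(self, key, value):
        for level in self.levels:
            level[key] = value


memory_level = {}                      # private to this process
disk_level = {"sum(x)": -20884.64}     # pretend another process stored this
cache = MultiLevelCache(memory_level, disk_level)

first = cache.get("sum(x)")            # found in the slow level, then promoted
promoted = "sum(x)" in memory_level
missing = cache.get("mean(x)")
```

The first process to compute a result pays the full cost, while other processes find it in the shared level and keep a private copy for later lookups.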
- vaex.cache.disk(clear=False, size_limit='10GB', eviction_policy='least-recently-stored')[source]¶
Stores cached values using the diskcache library.
See configuration details at configuration of cache and configuration of paths.
- Parameters
size_limit (int or str) – Max size of cache in bytes (or use a string like ‘128MB’) See http://www.grantjenks.com/docs/diskcache/tutorial.html?highlight=eviction#tutorial-settings for more details.
eviction_policy (str) – Eviction policy, See http://www.grantjenks.com/docs/diskcache/tutorial.html?highlight=eviction#tutorial-eviction-policies
clear (bool) – Remove all disk space used for caching before turning on cache.
- vaex.cache.get(key, default=None, type=None)[source]¶
Looks up the cache value for the key, or returns the default
Will return None if the cache is turned off.
- Parameters
key (str) – Cache key.
default – Return when cache is on, but key not in cache
type – Currently unused.
- vaex.cache.memory(maxsize='1GB', classname='LRUCache', clear=False)[source]¶
Sets a memory cache using cachetools (https://cachetools.readthedocs.io/).
Calling multiple times with clear=False will keep the current cache (useful in notebook usage).
- Parameters
- vaex.cache.memory_infinite(clear=False)[source]¶
Sets a dict as cache, creating an infinite cache.
Calling multiple times with clear=False will keep the current cache (useful in notebook usage)
- vaex.cache.off()[source]¶
Turns off caching, or turns it off temporarily when used as a context manager.
>>> import vaex
>>> df = vaex.example()
>>> vaex.cache.memory_infinite()  # cache on
<cache restore context manager>
>>> with vaex.cache.off():
...     df.x.sum()  # calculated without cache
array(-20884.64307324)
>>> df.x.sum()  # calculated with cache
array(-20884.64307324)
>>> vaex.cache.off()  # cache off
<cache restore context manager>
>>> df.x.sum()  # calculated without cache
array(-20884.64307324)
- vaex.cache.redis(client=None)[source]¶
Uses Redis for caching.
- Parameters
client – Redis client, if None, will call redis.Redis()
- vaex.cache.set(key, value, type=None, duration_wallclock=None)[source]¶
Set a cache value
Useful for more advanced strategies, where we want to have different behaviour based on the type and costs. Implementations can set this function to override the default behaviour:
>>> import vaex
>>> vaex.cache.memory_infinite()
>>> def my_smart_cache_setter(key, value, type=None, duration_wallclock=None):
...     if duration_wallclock >= 0.1:  # skip fast calculations
...         vaex.cache.cache[key] = value
...
>>> vaex.cache.set = my_smart_cache_setter
DataFrame class¶
- class vaex.dataframe.DataFrame(name=None, executor=None)[source]¶
Bases: object
All local or remote datasets are encapsulated in this class, which provides a pandas like API to your dataset.
Each DataFrame (df) has a number of columns, and a number of rows, the length of the DataFrame.
All DataFrames have multiple ‘selections’, and all calculations are done on the whole DataFrame (default) or for the selection. The following example shows how to use the selection.
>>> df.select("x < 0")
>>> df.sum(df.y, selection=True)
>>> df.sum(df.y, selection=[df.x < 0, df.x > 0])
- __getitem__(item)[source]¶
Convenient way to get expressions, (shallow) copies of a few columns, or to apply filtering.
Example:
>>> df['Lz']  # the expression 'Lz'
>>> df['Lz/2']  # the expression 'Lz/2'
>>> df[["Lz", "E"]]  # a shallow copy with just two columns
>>> df[df.Lz < 0]  # a shallow copy with the filter Lz < 0 applied
- __setitem__(name, value)[source]¶
Convenient way to add a virtual column / expression to this DataFrame.
Example:
>>> import vaex, numpy as np
>>> df = vaex.example()
>>> df['r'] = np.sqrt(df.x**2 + df.y**2 + df.z**2)
>>> df.r
<vaex.expression.Expression(expressions='r')> instance at 0x121687e80 values=[2.9655450396553587, 5.77829281049018, 6.99079603950256, 9.431842752707537, 0.8825613121347967 ... (total 330000 values) ... 7.453831761514681, 15.398412491068198, 8.864250273925633, 17.601047186042507, 14.540181524970293]
- __weakref__¶
list of weak references to the object (if defined)
- add_variable(name, expression, overwrite=True, unique=True)[source]¶
Add a variable to a DataFrame.
A variable may refer to other variables, and virtual columns and expression may refer to variables.
Example
>>> df.add_variable('center', 0)
>>> df.add_virtual_column('x_prime', 'x-center')
>>> df.select('x_prime < 0')
- Param
str name: name of virtual variable
- Param
expression: expression for the variable
- add_virtual_column(name, expression, unique=False)[source]¶
Add a virtual column to the DataFrame.
Example:
>>> df.add_virtual_column("r", "sqrt(x**2 + y**2 + z**2)")
>>> df.select("r < 10")
- Param
str name: name of virtual column
- Param
expression: expression for the column
- Parameters
unique (str) – if name is already used, make it unique by adding a postfix, e.g. _1, or _2
- apply(f, arguments=None, vectorize=False, multiprocessing=True)[source]¶
Apply a function on a per row basis across the entire DataFrame.
Example:
>>> import vaex
>>> df = vaex.example()
>>> def func(x, y):
...     return (x+y)/(x-y)
...
>>> df.apply(func, arguments=[df.x, df.y])
Expression = lambda_function(x, y)
Length: 330,000 dtype: float64 (expression)
-------------------------------------------
     0  -0.460789
     1   3.90038
     2  -0.642851
     3   0.685768
     4  -0.543357
- Parameters
f – The function to be applied
arguments – List of arguments to be passed on to the function f.
vectorize – Call f with arrays instead of scalars (for better performance).
multiprocessing (bool) – Use multiple processes to avoid the GIL (Global interpreter lock).
- Returns
A function that is lazily evaluated.
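What "per row" and "lazily evaluated" mean here can be shown with a generator in plain Python. The helper apply_rowwise is hypothetical and much simpler than the real method: it calls f once per row with scalar arguments, and nothing is computed until the result is consumed.

```python
def apply_rowwise(f, *columns):
    """Lazily call f once per row, passing one scalar from each column."""
    return (f(*row) for row in zip(*columns))


def func(x, y):
    return (x + y) / (x - y)


x = [3.0, 1.0, 5.0]
y = [1.0, 3.0, 3.0]

lazy_result = apply_rowwise(func, x, y)  # no calls to func have happened yet
values = list(lazy_result)               # evaluation happens here
```

Calling f once per row is what makes the scalar mode slow for large data, which is the motivation for the vectorize option.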
- byte_size(selection=False, virtual=False)[source]¶
Return the size in bytes the whole DataFrame requires (or the selection), respecting the active_fraction.
- cat(i1, i2, format='html')[source]¶
Display the DataFrame from row i1 till i2
For format, see https://pypi.org/project/tabulate/
- close()[source]¶
Close any possible open file handles or other resources, the DataFrame will not be in a usable state afterwards.
- property col¶
Gives direct access to the columns only (useful for tab completion).
Convenient when working with ipython in combination with small DataFrames, since this gives tab-completion.
Columns can be accessed by their names, which are attributes. The attributes are currently expressions, so you can do computations with them.
Example
>>> df = vaex.example()
>>> df.plot(df.col.x, df.col.y)
- column_count(hidden=False)[source]¶
Returns the number of columns (including virtual columns).
- Parameters
hidden (bool) – If True, include hidden columns in the tally
- Returns
Number of columns in the DataFrame
- combinations(expressions_list=None, dimension=2, exclude=None, **kwargs)[source]¶
Generate a list of combinations for the possible expressions for the given dimension.
- Parameters
expressions_list – list of list of expressions, where the inner list defines the subspace
dimension – if given, generates a subspace with all possible combinations for that dimension
exclude – list of
- correlation(x, y=None, binby=[], limits=None, shape=128, sort=False, sort_key=<ufunc 'absolute'>, selection=False, delay=False, progress=None, array_type=None)[source]¶
Calculate the correlation coefficient cov[x,y]/(std[x]*std[y]) between x and y, possibly on a grid defined by binby.
The x and y arguments can be single expressions or lists of expressions:
- If x and y are single expressions, it computes the correlation between x and y;
- If x is a list of expressions and y is a single expression, it computes the correlation between each expression in x and the expression in y;
- If x is a list of expressions and y is None, it computes the correlation matrix amongst all expressions in x;
- If x is a list of tuples of length 2, it computes the correlation for the specified dimension pairs;
- If x and y are lists of expressions, it computes the correlation matrix defined by the two expression lists.
Example:
>>> import vaex
>>> df = vaex.example()
>>> df.correlation("x**2+y**2+z**2", "-log(-E+1)")
array(0.6366637382215669)
>>> df.correlation("x**2+y**2+z**2", "-log(-E+1)", binby="Lz", shape=4)
array([ 0.40594394,  0.69868851,  0.61394099,  0.65266318])
>>> df.correlation(x=['x', 'y', 'z'])
array([[ 1.        , -0.06668907, -0.02709719],
       [-0.06668907,  1.        ,  0.03450365],
       [-0.02709719,  0.03450365,  1.        ]])
>>> df.correlation(x=['x', 'y', 'z'], y=['E', 'Lz'])
array([[-0.01116315, -0.00369268],
       [-0.0059848 ,  0.02472491],
       [ 0.01428211, -0.05900035]])
- Parameters
x – expression or list of expressions, e.g. df.x, ‘x’, or [‘x’, ‘y’]
y – expression or list of expressions, e.g. df.x, ‘x’, or [‘x’, ‘y’]
binby – List of expressions for constructing a binned grid
limits – description for the min and max values for the expressions, e.g. ‘minmax’ (default), ‘99.7%’, [0, 10], or a list of, e.g. [[0, 10], [0, 20], ‘minmax’]
shape – shape for the array where the statistic is calculated on, if only an integer is given, it is used for all dimensions, e.g. shape=128, shape=[128, 256]
selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections
delay – Do not return the result, but a proxy for asynchronous calculations (currently only for internal use)
progress – True to display a progress bar, or a callable that takes one argument (a floating point value between 0 and 1) indicating the progress, calculations are cancelled when this callable returns False
- Returns
Numpy array with the given shape, or a scalar when no binby argument is given, with the statistic
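The formula cov[x,y]/(std[x]*std[y]) can be checked by hand on a small example. The sketch below is a direct, unbinned implementation in plain Python using population (divide by n) statistics; it illustrates the formula only and says nothing about how vaex computes it internally. The function name correlation_coefficient and the sample data are made up for illustration.

```python
import math


def correlation_coefficient(x, y):
    """Return cov[x, y] / (std[x] * std[y]) for two equally long sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x) / n)
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y) / n)
    return cov / (std_x * std_y)


x = [1.0, 2.0, 3.0, 4.0]
y_up = [2.0, 4.0, 6.0, 8.0]      # perfectly correlated with x
y_down = [8.0, 6.0, 4.0, 2.0]    # perfectly anti-correlated with x

r_up = correlation_coefficient(x, y_up)
r_down = correlation_coefficient(x, y_down)
```

Because the same normalization appears in the numerator and the denominator, the choice between population and sample statistics does not change the coefficient.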
- count(expression=None, binby=[], limits=None, shape=128, selection=False, delay=False, edges=False, progress=None, array_type=None)[source]¶
Count the number of non-NaN values (or all, if expression is None or “*”).
Example:
>>> df.count()
330000
>>> df.count("*")
330000.0
>>> df.count("*", binby=["x"], shape=4)
array([  10925.,  155427.,  152007.,   10748.])
- Parameters
expression – Expression or column for which to count non-missing values, or None or ‘*’ for counting the rows
binby – List of expressions for constructing a binned grid
limits – description for the min and max values for the expressions, e.g. ‘minmax’ (default), ‘99.7%’, [0, 10], or a list of, e.g. [[0, 10], [0, 20], ‘minmax’]
shape – shape for the array where the statistic is calculated on, if only an integer is given, it is used for all dimensions, e.g. shape=128, shape=[128, 256]
selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections
delay – Do not return the result, but a proxy for asynchronous calculations (currently only for internal use)
progress – True to display a progress bar, or a callable that takes one argument (a floating point value between 0 and 1) indicating the progress, calculations are cancelled when this callable returns False
edges – Currently for internal use only (it includes NaNs and values outside the limits at the borders: NaN at 0, smaller than the limits at 1, and larger at -1)
array_type – Type of output array, possible values are None (keep as is), “numpy” (ndarray), “xarray” for a xarray.DataArray, or “list”/”python” for a Python list
- Returns
Numpy array with the given shape, or a scalar when no binby argument is given, with the statistic
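To make the roles of binby, limits and shape concrete, here is a small pure-Python sketch of counting on a regular one-dimensional grid. It is only an illustration: `binned_count` is a hypothetical helper, and the half-open bins (upper limit excluded) are an assumption rather than a statement about vaex internals.

```python
import math

def binned_count(values, limits, shape):
    """Count non-NaN values per bin on a regular 1-d grid."""
    vmin, vmax = limits
    counts = [0] * shape
    for v in values:
        if math.isnan(v) or not (vmin <= v < vmax):
            continue  # NaN and out-of-range values are not counted
        index = int((v - vmin) / (vmax - vmin) * shape)
        counts[index] += 1
    return counts

values = [0.5, 1.5, 1.7, 3.2, float("nan"), 9.0]
counts = binned_count(values, limits=(0.0, 4.0), shape=4)  # [1, 2, 0, 1]
```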
- cov(x, y=None, binby=[], limits=None, shape=128, selection=False, delay=False, progress=None)[source]¶
Calculate the covariance matrix for x and y or more expressions, possibly on a grid defined by binby.
Either x and y are expressions, e.g.:
>>> df.cov("x", "y")
Or only the x argument is given with a list of expressions, e.g.:
>>> df.cov(["x", "y", "z"])
Example:
>>> df.cov("x", "y")
array([[ 53.54521742,  -3.8123135 ],
       [ -3.8123135 ,  60.62257881]])
>>> df.cov(["x", "y", "z"])
array([[ 53.54521742,  -3.8123135 ,  -0.98260511],
       [ -3.8123135 ,  60.62257881,   1.21381057],
       [ -0.98260511,   1.21381057,  25.55517638]])
>>> df.cov("x", "y", binby="E", shape=2)
array([[[  9.74852878e+00,  -3.02004780e-02],
        [ -3.02004780e-02,   9.99288215e+00]],

       [[  8.43996546e+01,  -6.51984181e+00],
        [ -6.51984181e+00,   9.68938284e+01]]])
- Parameters
x – expression or list of expressions, e.g. df.x, ‘x’, or [‘x’, ‘y’]
y – if previous argument is not a list, this argument should be given
binby – List of expressions for constructing a binned grid
limits – description for the min and max values for the expressions, e.g. ‘minmax’ (default), ‘99.7%’, [0, 10], or a list of, e.g. [[0, 10], [0, 20], ‘minmax’]
shape – shape for the array where the statistic is calculated on, if only an integer is given, it is used for all dimensions, e.g. shape=128, shape=[128, 256]
selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections
delay – Do not return the result, but a proxy for asynchronous calculations (currently only for internal use)
- Returns
Numpy array with the given shape, or a scalar when no binby argument is given, with the statistic, the last dimensions are of shape (2,2)
- covar(x, y, binby=[], limits=None, shape=128, selection=False, delay=False, progress=None)[source]¶
Calculate the covariance cov[x,y] between x and y, possibly on a grid defined by binby.
Example:
>>> df.covar("x**2+y**2+z**2", "-log(-E+1)")
array(52.69461456005138)
>>> df.covar("x**2+y**2+z**2", "-log(-E+1)")/(df.std("x**2+y**2+z**2") * df.std("-log(-E+1)"))
0.63666373822156686
>>> df.covar("x**2+y**2+z**2", "-log(-E+1)", binby="Lz", shape=4)
array([ 10.17387143,  51.94954078,  51.24902796,  20.2163929 ])
- Parameters
x – expression or list of expressions, e.g. df.x, ‘x’, or [‘x’, ‘y’]
y – expression or list of expressions, e.g. df.x, ‘x’, or [‘x’, ‘y’]
binby – List of expressions for constructing a binned grid
limits – description for the min and max values for the expressions, e.g. ‘minmax’ (default), ‘99.7%’, [0, 10], or a list of, e.g. [[0, 10], [0, 20], ‘minmax’]
shape – shape for the array where the statistic is calculated on, if only an integer is given, it is used for all dimensions, e.g. shape=128, shape=[128, 256]
selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections
delay – Do not return the result, but a proxy for asynchronous calculations (currently only for internal use)
progress – True to display a progress bar, or a callable that takes one argument (a floating point value between 0 and 1) indicating the progress, calculations are cancelled when this callable returns False
- Returns
Numpy array with the given shape, or a scalar when no binby argument is given, with the statistic
- data_type(expression, array_type=None, internal=False, axis=0)[source]¶
Return the data type for the given expression; if it is not a column, the first row will be evaluated to get the data type.
Example:
>>> df = vaex.from_scalars(x=1, s='Hi')
- describe(strings=True, virtual=True, selection=None)[source]¶
Give a description of the DataFrame.
>>> import vaex
>>> df = vaex.example()[['x', 'y', 'z']]
>>> df.describe()
                 x          y          z
dtype      float64    float64    float64
count       330000     330000     330000
missing          0          0          0
mean    -0.0671315 -0.0535899  0.0169582
std        7.31746    7.78605    5.05521
min       -128.294   -71.5524   -44.3342
max        271.366    146.466    50.7185
>>> df.describe(selection=df.x > 0)
                   x         y          z
dtype        float64   float64    float64
count         164060    164060     164060
missing       165940    165940     165940
mean         5.13572 -0.486786 -0.0868073
std          5.18701   7.61621    5.02831
min      1.51635e-05  -71.5524   -44.3342
max          271.366   78.0724    40.2191
- diff(periods=1, column=None, fill_value=None, trim=False, inplace=False, reverse=False)[source]¶
Calculate the difference between the current row and the row offset by periods
- Parameters
periods (int) – Which row to take the difference with
column (str or list[str]) – Column or list of columns to use (default is all).
fill_value – Value to use instead of missing values.
trim (bool) – Do not include rows that would otherwise have missing values
reverse (bool) – When true, calculate row[periods] - row[current]
inplace – If True, make modifications to self, otherwise return a new DataFrame
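The interplay of periods and fill_value can be sketched in plain Python. This is an illustration only: the helper `diff` below is hypothetical, it assumes the difference is taken with the row `periods` positions earlier, and it does not model trim, reverse, or multiple columns.

```python
def diff(values, periods=1, fill_value=None):
    """Difference between each row and the row `periods` positions earlier."""
    result = []
    for i, current in enumerate(values):
        j = i - periods
        if 0 <= j < len(values):
            result.append(current - values[j])
        else:
            result.append(fill_value)  # no partner row exists for this position
    return result

d1 = diff([1, 4, 9, 16])                            # [None, 3, 5, 7]
d2 = diff([1, 4, 9, 16], periods=2, fill_value=0)   # [0, 0, 8, 12]
```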
- drop(columns, inplace=False, check=True)[source]¶
Drop columns (or a single column).
- Parameters
columns – List of columns or a single column name
inplace – If True, make modifications to self, otherwise return a new DataFrame
check – When true, it will check if the column is used in virtual columns or the filter, and hide it instead.
- dropinf(column_names=None, how='any')[source]¶
Create a shallow copy of a DataFrame, with filtering set using isinf.
- dropmissing(column_names=None, how='any')[source]¶
Create a shallow copy of a DataFrame, with filtering set using ismissing.
- dropna(column_names=None, how='any')[source]¶
Create a shallow copy of a DataFrame, with filtering set using isna.
- dropnan(column_names=None, how='any')[source]¶
Create a shallow copy of a DataFrame, with filtering set using isnan.
- property dtypes¶
Gives a Pandas series object containing all numpy dtypes of all columns (except hidden).
- evaluate(expression, i1=None, i2=None, out=None, selection=None, filtered=True, array_type=None, parallel=True, chunk_size=None, progress=None)[source]¶
Evaluate an expression, and return a numpy array with the results for the full column or a part of it.
Note that this is not how vaex should be used, since it means a copy of the data needs to fit in memory.
To get partial results, use i1 and i2
- Parameters
expression (str) – Name/expression to evaluate
i1 (int) – Start row index, default is the start (0)
i2 (int) – End row index, default is the length of the DataFrame
out (ndarray) – Output array, to which the result may be written (may be used to reuse an array, or write to a memory mapped array)
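progress – True to display a progress bar, or a callable that takes one argument (a floating point value between 0 and 1) indicating the progress, calculations are cancelled when this callable returns False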
selection – selection to apply
- Returns
- evaluate_iterator(expression, s1=None, s2=None, out=None, selection=None, filtered=True, array_type=None, parallel=True, chunk_size=None, prefetch=True, progress=None)[source]¶
Generator to efficiently evaluate expressions in chunks (number of rows).
See DataFrame.evaluate() for other arguments.
Example:
>>> import vaex
>>> df = vaex.example()
>>> for i1, i2, chunk in df.evaluate_iterator(df.x, chunk_size=100_000):
...     print(f"Total of {i1} to {i2} = {chunk.sum()}")
...
Total of 0 to 100000 = -7460.610158279056
Total of 100000 to 200000 = -4964.85827154921
Total of 200000 to 300000 = -7303.271340043915
Total of 300000 to 330000 = -2424.65234724951
- Parameters
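progress – True to display a progress bar, or a callable that takes one argument (a floating point value between 0 and 1) indicating the progress, calculations are cancelled when this callable returns False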
prefetch – Prefetch/compute the next chunk in parallel while the current value is yielded/returned.
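The (i1, i2, chunk) triples yielded by the iterator follow a simple slicing pattern. A rough stand-alone sketch of that pattern on a plain list is shown here; `iterate_chunks` is a hypothetical helper, and prefetching and parallel evaluation are left out.

```python
def iterate_chunks(values, chunk_size):
    """Yield (i1, i2, chunk) triples that cover `values` in order."""
    for i1 in range(0, len(values), chunk_size):
        i2 = min(i1 + chunk_size, len(values))
        yield i1, i2, values[i1:i2]

data = list(range(10))
total = 0
bounds = []
for i1, i2, chunk in iterate_chunks(data, chunk_size=4):
    bounds.append((i1, i2))   # the last chunk may be shorter than chunk_size
    total += sum(chunk)
```

Because only one chunk is held at a time, a running aggregate such as `total` never needs the full column in memory.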
- extract()[source]¶
Return a DataFrame containing only the filtered rows.
Note
Note that no copy of the underlying data is made, only a view/reference is made.
The resulting DataFrame may be more efficient to work with when the original DataFrame is heavily filtered (contains just a small number of rows).
If no filtering is applied, it returns a trimmed view. For the returned df, len(df) == df.length_original() == df.length_unfiltered()
- Return type
- fillna(value, column_names=None, prefix='__original_', inplace=False)[source]¶
Return a DataFrame, where missing values/NaN are filled with ‘value’.
The original columns will be renamed, and by default they will be hidden columns. No data is lost.
Note
Note that no copy of the underlying data is made, only a view/reference is made.
Note
Note that filtering will be ignored (since it may change); you may want to consider running extract() first.
Example:
>>> import vaex
>>> import numpy as np
>>> x = np.array([3, 1, np.nan, 10, np.nan])
>>> df = vaex.from_arrays(x=x)
>>> df_filled = df.fillna(value=-1, column_names=['x'])
>>> df_filled
  #    x
  0    3
  1    1
  2   -1
  3   10
  4   -1
- Parameters
value (float) – The value to use for filling nan or masked values.
fill_na (bool) – If True, fill np.nan values with value.
fill_masked (bool) – If True, fill masked values with value.
column_names (list) – List of column names in which to fill missing values.
prefix (str) – The prefix to give the original columns.
inplace – If True, make modifications to self, otherwise return a new DataFrame
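The effect of filling can be mimicked on a plain Python list. This sketch is an assumption-laden illustration (the helper `fill_missing` is hypothetical and treats both None and NaN as missing); like the method described above, it leaves the original data untouched.

```python
import math

def fill_missing(values, value):
    """Return a new list with NaN/None replaced; the input is left untouched."""
    return [
        value if v is None or (isinstance(v, float) and math.isnan(v)) else v
        for v in values
    ]

x = [3.0, 1.0, float("nan"), 10.0, float("nan")]
filled = fill_missing(x, -1)   # [3.0, 1.0, -1, 10.0, -1]
```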
- filter(expression, mode='and')[source]¶
General version of df[<boolean expression>] to modify the filter applied to the DataFrame.
See DataFrame.select() for usage of selection.
Note that using df = df[<boolean expression>], one can only narrow the filter (i.e. only fewer rows can be selected). Using the filter method with a different boolean mode (e.g. “or”), one can actually cause more rows to be selected. This differs greatly from, for instance, numpy and pandas, which can only narrow the filter.
Example:
>>> import vaex
>>> import numpy as np
>>> x = np.arange(10)
>>> df = vaex.from_arrays(x=x, y=x**2)
>>> df
  #    x    y
  0    0    0
  1    1    1
  2    2    4
  3    3    9
  4    4   16
  5    5   25
  6    6   36
  7    7   49
  8    8   64
  9    9   81
>>> dff = df[df.x<=2]
>>> dff
  #    x    y
  0    0    0
  1    1    1
  2    2    4
>>> dff = dff.filter(dff.x >=7, mode="or")
>>> dff
  #    x    y
  0    0    0
  1    1    1
  2    2    4
  3    7   49
  4    8   64
  5    9   81
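Conceptually, the mode decides how the new boolean expression is combined with the filter already in place. A minimal sketch with plain boolean lists follows; `combine` is a hypothetical helper, and only the "and" and "or" modes are modeled.

```python
def combine(current_mask, new_mask, mode="and"):
    """Combine an existing filter mask with a new boolean mask."""
    if mode == "and":
        return [a and b for a, b in zip(current_mask, new_mask)]
    if mode == "or":
        return [a or b for a, b in zip(current_mask, new_mask)]
    raise ValueError(f"unsupported mode: {mode!r}")

x = list(range(10))
mask = [v <= 2 for v in x]                             # like df[df.x <= 2]
mask = combine(mask, [v >= 7 for v in x], mode="or")   # like .filter(..., mode="or")
selected = [v for v, keep in zip(x, mask) if keep]     # [0, 1, 2, 7, 8, 9]
```

With mode "and" the mask can only lose True entries, which is why indexing with a boolean expression can only narrow the filter.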
- fingerprint(dependencies=None, treeshake=False)[source]¶
Id that uniquely identifies a dataframe (cross runtime).
- first(expression, order_expression=None, binby=[], limits=None, shape=128, selection=False, delay=False, edges=False, progress=None, array_type=None)[source]¶
Return the first element of a binned expression, where the values in each bin are sorted by order_expression.
Example:
>>> import vaex
>>> df = vaex.example()
>>> df.first(df.x, df.y, shape=8)
>>> df.first(df.x, df.y, shape=8, binby=[df.y])
>>> df.first(df.x, df.y, shape=8, binby=[df.y])
array([-4.81883764, 11.65378   ,  9.70084476, -7.3025589 ,  4.84954977,
        8.47446537, -5.73602629, 10.18783   ])
- Parameters
expression – expression or list of expressions, e.g. df.x, ‘x’, or [‘x’, ‘y’]
order_expression – Order the values in the bins by this expression.
binby – List of expressions for constructing a binned grid
limits – description for the min and max values for the expressions, e.g. ‘minmax’ (default), ‘99.7%’, [0, 10], or a list of, e.g. [[0, 10], [0, 20], ‘minmax’]
shape – shape for the array where the statistic is calculated on, if only an integer is given, it is used for all dimensions, e.g. shape=128, shape=[128, 256]
selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections
delay – Do not return the result, but a proxy for asynchronous calculations (currently only for internal use)
progress – True to display a progress bar, or a callable that takes one argument (a floating point value between 0 and 1) indicating the progress, calculations are cancelled when this callable returns False
edges – Currently for internal use only (it includes NaNs and values outside the limits at the borders: NaN at 0, smaller than the limits at 1, and larger at -1)
array_type – Type of output array, possible values are None (keep as is), “numpy” (ndarray), “xarray” for a xarray.DataArray, or “list”/”python” for a Python list
- Returns
Ndarray containing the first elements.
- Return type
numpy.array
- get_column_names(virtual=True, strings=True, hidden=False, regex=None, dtype=None)[source]¶
Return a list of column names
Example:
>>> import vaex
>>> df = vaex.from_scalars(x=1, x2=2, y=3, s='string')
>>> df['r'] = (df.x**2 + df.y**2)**2
>>> df.get_column_names()
['x', 'x2', 'y', 's', 'r']
>>> df.get_column_names(virtual=False)
['x', 'x2', 'y', 's']
>>> df.get_column_names(regex='x.*')
['x', 'x2']
>>> df.get_column_names(dtype='string')
['s']
- Parameters
virtual – If False, skip virtual columns
hidden – If False, skip hidden columns
strings – If False, skip string columns
regex – Only return column names matching the (optional) regular expression
dtype – Only return column names with the given dtype. Can be a single or a list of dtypes.
- Return type
list of str
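The regex argument acts as a name filter. The sketch below reproduces the regex example using the standard re module; `filter_names` is a hypothetical helper, and matching from the start of the name with `re.match` is an assumption about the matching semantics.

```python
import re

def filter_names(names, regex=None):
    """Keep names that match `regex` from the start of the string."""
    if regex is None:
        return list(names)
    pattern = re.compile(regex)
    return [name for name in names if pattern.match(name)]

names = ["x", "x2", "y", "s", "r"]
matched = filter_names(names, regex="x.*")   # ['x', 'x2']
```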
- get_current_row()[source]¶
Individual rows can be ‘picked’; this is the index (integer) of the current row, or None if nothing is picked.
- get_private_dir(create=False)[source]¶
Each DataFrame has a directory where files are stored for metadata etc.
Example
>>> import vaex
>>> ds = vaex.example()
>>> ds.get_private_dir()
'/Users/users/breddels/.vaex/dfs/_Users_users_breddels_vaex-testing_data_helmi-dezeeuw-2000-10p.hdf5'
- Parameters
create (bool) – if True, it will create the directory if it does not exist
- get_selection(name='default')[source]¶
Get the current selection object (mostly for internal use at the moment).
- get_variable(name)[source]¶
Returns the variable given by name, it will not evaluate it.
For evaluation, see DataFrame.evaluate_variable(); see also DataFrame.set_variable().
- healpix_count(expression=None, healpix_expression=None, healpix_max_level=12, healpix_level=8, binby=None, limits=None, shape=128, delay=False, progress=None, selection=None)[source]¶
Count non-missing values for expression on an array which represents healpix data.
- Parameters
expression – Expression or column for which to count non-missing values, or None or ‘*’ for counting the rows
healpix_expression – {healpix_max_level}
healpix_max_level – {healpix_max_level}
healpix_level – {healpix_level}
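binby – List of expressions for constructing a binned grid; these dimensions follow the first healpix dimension.
limits – description for the min and max values for the expressions, e.g. ‘minmax’ (default), ‘99.7%’, [0, 10], or a list of, e.g. [[0, 10], [0, 20], ‘minmax’]
shape – shape for the array where the statistic is calculated on, if only an integer is given, it is used for all dimensions, e.g. shape=128, shape=[128, 256]
selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections
delay – Do not return the result, but a proxy for asynchronous calculations (currently only for internal use)
progress – True to display a progress bar, or a callable that takes one argument (a floating point value between 0 and 1) indicating the progress, calculations are cancelled when this callable returns False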
- Returns
- kurtosis(expression, binby=[], limits=None, shape=128, selection=False, delay=False, progress=None, edges=False, array_type=None)[source]¶
Calculate the kurtosis for the given expression, possibly on a grid defined by binby.
Example:
>>> df.kurtosis('vz')
0.33414303
>>> df.kurtosis("vz", binby=["E"], shape=4)
array([0.35286113, 0.14455428, 0.52955107, 5.06716345])
- Parameters
expression – expression or list of expressions, e.g. df.x, ‘x’, or [‘x’, ‘y’]
binby – List of expressions for constructing a binned grid
limits – description for the min and max values for the expressions, e.g. ‘minmax’ (default), ‘99.7%’, [0, 10], or a list of, e.g. [[0, 10], [0, 20], ‘minmax’]
shape – shape for the array where the statistic is calculated on, if only an integer is given, it is used for all dimensions, e.g. shape=128, shape=[128, 256]
selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections
delay – Do not return the result, but a proxy for asynchronous calculations (currently only for internal use)
progress – True to display a progress bar, or a callable that takes one argument (a floating point value between 0 and 1) indicating the progress, calculations are cancelled when this callable returns False
array_type – Type of output array, possible values are None (keep as is), “numpy” (ndarray), “xarray” for a xarray.DataArray, or “list”/”python” for a Python list
- Returns
Numpy array with the given shape, or a scalar when no binby argument is given, with the statistic
- last(expression, order_expression=None, binby=[], limits=None, shape=128, selection=False, delay=False, edges=False, progress=None, array_type=None)[source]¶
Return the last element of a binned expression, where the values in each bin are sorted by order_expression.
- Parameters
expression – The value to be placed in the bin.
order_expression – Order the values in the bins by this expression.
binby – List of expressions for constructing a binned grid
limits – description for the min and max values for the expressions, e.g. ‘minmax’ (default), ‘99.7%’, [0, 10], or a list of, e.g. [[0, 10], [0, 20], ‘minmax’]
shape – shape for the array where the statistic is calculated on, if only an integer is given, it is used for all dimensions, e.g. shape=128, shape=[128, 256]
selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections
delay – Do not return the result, but a proxy for asynchronous calculations (currently only for internal use)
progress – True to display a progress bar, or a callable that takes one argument (a floating point value between 0 and 1) indicating the progress, calculations are cancelled when this callable returns False
edges – Currently for internal use only (it includes NaNs and values outside the limits at the borders: NaN at 0, smaller than the limits at 1, and larger at -1)
array_type – Type of output array, possible values are None (keep as is), “numpy” (ndarray), “xarray” for a xarray.DataArray, or “list”/”python” for a Python list
- Returns
Ndarray containing the last elements.
- Return type
numpy.array
- length_original()[source]¶
The full length of the DataFrame, independent of what active_fraction is, or of filtering. This is the real length of the underlying ndarrays.
- length_unfiltered()[source]¶
The length of the arrays that should be considered (respecting active range), but without filtering.
- limits(expression, value=None, square=False, selection=None, delay=False, progress=None, shape=None)[source]¶
Calculate the [min, max] range for expression, as described by value, which is ‘minmax’ by default.
If value is a list of the form [minvalue, maxvalue], it is simply returned; this is for convenience when using mixed forms.
Example:
>>> import vaex
>>> df = vaex.example()
>>> df.limits("x")
array([-128.293991,  271.365997])
>>> df.limits("x", "99.7%")
array([-28.86381927,  28.9261226 ])
>>> df.limits(["x", "y"])
(array([-128.293991,  271.365997]), array([ -71.5523682,  146.465836 ]))
>>> df.limits(["x", "y"], "99.7%")
(array([-28.86381927,  28.9261226 ]), array([-28.60476934,  28.96535249]))
>>> df.limits(["x", "y"], ["minmax", "90%"])
(array([-128.293991,  271.365997]), array([-13.37438402,  13.4224423 ]))
>>> df.limits(["x", "y"], ["minmax", [0, 10]])
(array([-128.293991,  271.365997]), [0, 10])
- Parameters
expression – expression or list of expressions, e.g. df.x, ‘x’, or [‘x’, ‘y’]
value – description for the min and max values for the expressions, e.g. ‘minmax’ (default), ‘99.7%’, [0, 10], or a list of, e.g. [[0, 10], [0, 20], ‘minmax’]
selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections
delay – Do not return the result, but a proxy for asynchronous calculations (currently only for internal use)
progress – True to display a progress bar, or a callable that takes one argument (a floating point value between 0 and 1) indicating the progress, calculations are cancelled when this callable returns False
- Returns
List in the form [[xmin, xmax], [ymin, ymax], ..., [zmin, zmax]] or [xmin, xmax] when expression is not a list
- limits_percentage(expression, percentage=99.73, square=False, selection=False, progress=None, delay=False)[source]¶
Calculate the [min, max] range for expression, containing approximately a percentage of the data as defined by percentage.
The range is symmetric around the median, i.e., for a percentage of 90, this gives the same results as:
Example:
>>> df.limits_percentage("x", 90)
array([-12.35081376,  12.14858052])
>>> df.percentile_approx("x", 5), df.percentile_approx("x", 95)
(array([-12.36813152]), array([ 12.13275818]))
NOTE: this value is approximated by calculating the cumulative distribution on a grid.
NOTE 2: The values above are not exactly the same, since percentile and limits_percentage do not share the same code.
- Parameters
expression – expression or list of expressions, e.g. df.x, ‘x’, or [‘x’, ‘y’]
percentage (float) – Value between 0 and 100
progress – True to display a progress bar, or a callable that takes one argument (a floating point value between 0 and 1) indicating the progress, calculations are cancelled when this callable returns False
delay – Do not return the result, but a proxy for asynchronous calculations (currently only for internal use)
- Returns
List in the form [[xmin, xmax], [ymin, ymax], ..., [zmin, zmax]] or [xmin, xmax] when expression is not a list
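The symmetric range can be expressed through two percentiles: for a given percentage p, the lower bound sits at (100 - p)/2 and the upper bound at 100 - (100 - p)/2. The sketch below computes this exactly on sorted data, whereas the method described above approximates it on a grid; both helper functions are hypothetical and use linear interpolation as an assumption.

```python
def percentile(sorted_values, q):
    """Linear-interpolation percentile of pre-sorted data, q in [0, 100]."""
    position = (len(sorted_values) - 1) * q / 100.0
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = position - lower
    return sorted_values[lower] * (1 - fraction) + sorted_values[upper] * fraction

def limits_percentage(values, percentage):
    """[low, high] range holding the central `percentage` of the data."""
    tail = (100.0 - percentage) / 2.0   # share of data cut off on each side
    ordered = sorted(values)
    return [percentile(ordered, tail), percentile(ordered, 100.0 - tail)]

data = list(range(101))                  # 0, 1, ..., 100
low, high = limits_percentage(data, 90)  # roughly the 5th and 95th percentile
```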
- materialize(column=None, inplace=False, virtual_column=None)[source]¶
Turn columns into native CPU format for optimal performance at cost of memory.
Warning
This may use a lot of memory, so be mindful.
Virtual columns will be evaluated immediately, and all real columns will be cached in memory when used for the first time.
Example for virtual column:
>>> import numpy as np
>>> import vaex
>>> x = np.arange(1,4)
>>> y = np.arange(2,5)
>>> df = vaex.from_arrays(x=x, y=y)
>>> df['r'] = (df.x**2 + df.y**2)**0.5  # 'r' is a virtual column (computed on the fly)
>>> df = df.materialize('r')  # now 'r' is a 'real' column (i.e. a numpy array)
Example with parquet file:
>>> df = vaex.open('somewhatslow.parquet')
>>> df.x.sum()  # slow
>>> df = df.materialize()
>>> df.x.sum()  # slow, but will fill the cache
>>> df.x.sum()  # as fast as possible, will use memory
- Parameters
column – string or list of strings with column names to materialize, all columns when None
inplace – If True, make modifications to self, otherwise return a new DataFrame
virtual_column – for backward compatibility
- max(expression, binby=[], limits=None, shape=128, selection=False, delay=False, progress=None, edges=False, array_type=None)[source]¶
Calculate the maximum for given expressions, possibly on a grid defined by binby.
Example:
>>> df.max("x")
array(271.365997)
>>> df.max(["x", "y"])
array([ 271.365997,  146.465836])
>>> df.max("x", binby="x", shape=5, limits=[-10, 10])
array([-6.00010443, -2.00002384,  1.99998057,  5.99983597,  9.99984646])
- Parameters
expression – expression or list of expressions, e.g. df.x, ‘x’, or [‘x’, ‘y’]
binby – List of expressions for constructing a binned grid
limits – description for the min and max values for the expressions, e.g. ‘minmax’ (default), ‘99.7%’, [0, 10], or a list of, e.g. [[0, 10], [0, 20], ‘minmax’]
shape – shape for the array where the statistic is calculated on, if only an integer is given, it is used for all dimensions, e.g. shape=128, shape=[128, 256]
selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections
delay – Do not return the result, but a proxy for asynchronous calculations (currently only for internal use)
progress – True to display a progress bar, or a callable that takes one argument (a floating point value between 0 and 1) indicating the progress, calculations are cancelled when this callable returns False
array_type – Type of output array, possible values are None (keep as is), “numpy” (ndarray), “xarray” for a xarray.DataArray, or “list”/”python” for a Python list
- Returns
Numpy array with the given shape, or a scalar when no binby argument is given, with the statistic
- mean(expression, binby=[], limits=None, shape=128, selection=False, delay=False, progress=None, edges=False, array_type=None)[source]¶
Calculate the mean for expression, possibly on a grid defined by binby.
Example:
>>> df.mean("x")
-0.067131491264005971
>>> df.mean("(x**2+y**2)**0.5", binby="E", shape=4)
array([  2.43483742,   4.41840721,   8.26742458,  15.53846476])
- Parameters
expression – expression or list of expressions, e.g. df.x, ‘x’, or [‘x’, ‘y’]
binby – List of expressions for constructing a binned grid
limits – description for the min and max values for the expressions, e.g. ‘minmax’ (default), ‘99.7%’, [0, 10], or a list of, e.g. [[0, 10], [0, 20], ‘minmax’]
shape – shape for the array where the statistic is calculated on, if only an integer is given, it is used for all dimensions, e.g. shape=128, shape=[128, 256]
selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections
delay – Do not return the result, but a proxy for asynchronous calculations (currently only for internal use)
progress – True to display a progress bar, or a callable that takes one argument (a floating point value between 0 and 1) indicating the progress, calculations are cancelled when this callable returns False
array_type – Type of output array, possible values are None (keep as is), “numpy” (ndarray), “xarray” for a xarray.DataArray, or “list”/”python” for a Python list
- Returns
Numpy array with the given shape, or a scalar when no binby argument is given, with the statistic
- median_approx(expression, percentage=50.0, binby=[], limits=None, shape=128, percentile_shape=256, percentile_limits='minmax', selection=False, delay=False, progress=None)[source]¶
Calculate the median, possibly on a grid defined by binby.
NOTE: this value is approximated by calculating the cumulative distribution on a grid defined by percentile_shape and percentile_limits
- Parameters
expression – expression or list of expressions, e.g. df.x, ‘x’, or [‘x’, ‘y’]
binby – List of expressions for constructing a binned grid
limits – description for the min and max values for the expressions, e.g. ‘minmax’ (default), ‘99.7%’, [0, 10], or a list of, e.g. [[0, 10], [0, 20], ‘minmax’]
shape – shape for the array where the statistic is calculated on, if only an integer is given, it is used for all dimensions, e.g. shape=128, shape=[128, 256]
percentile_limits – description for the min and max values to use for the cumulative histogram, should currently only be ‘minmax’
percentile_shape – shape for the array where the cumulative histogram is calculated on, integer type
selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections
delay – Do not return the result, but a proxy for asynchronous calculations (currently only for internal use)
progress – True to display a progress bar, or a callable that takes one argument (a floating point value between 0 and 1) indicating the progress, calculations are cancelled when this callable returns False
- Returns
Numpy array with the given shape, or a scalar when no binby argument is given, with the statistic
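The note about the cumulative distribution can be made concrete with a small pure-Python sketch: build a histogram with percentile_shape bins between the minimum and maximum, accumulate it, and interpolate inside the bin where the running total crosses 50%. This is only a sketch of the general technique under those assumptions, not vaex's code; the function below is a hypothetical stand-in.

```python
def median_approx(values, percentile_shape=256):
    """Approximate the median from a cumulative histogram over [min, max]."""
    vmin, vmax = min(values), max(values)
    if vmin == vmax:
        return vmin
    width = (vmax - vmin) / percentile_shape
    counts = [0] * percentile_shape
    for v in values:
        # clamp so that v == vmax lands in the last bin
        index = min(int((v - vmin) / width), percentile_shape - 1)
        counts[index] += 1
    target = len(values) / 2.0
    cumulative = 0
    for index, count in enumerate(counts):
        if cumulative + count >= target:
            # interpolate linearly inside the bin that crosses 50%
            fraction = (target - cumulative) / count
            return vmin + (index + fraction) * width
        cumulative += count
    return vmax

data = [float(v) for v in range(1001)]   # true median is 500.0
estimate = median_approx(data, percentile_shape=256)
```

The error of such an estimate is bounded by roughly one bin width, so a larger percentile_shape trades memory for accuracy.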
- min(expression, binby=[], limits=None, shape=128, selection=False, delay=False, progress=None, edges=False, array_type=None)[source]¶
Calculate the minimum for given expressions, possibly on a grid defined by binby.
Example:
>>> df.min("x")
array(-128.293991)
>>> df.min(["x", "y"])
array([-128.293991 ,  -71.5523682])
>>> df.min("x", binby="x", shape=5, limits=[-10, 10])
array([-9.99919128, -5.99972439, -1.99991322,  2.0000093 ,  6.0004878 ])
- Parameters
expression – expression or list of expressions, e.g. df.x, ‘x’, or [‘x’, ‘y’]
binby – List of expressions for constructing a binned grid
limits – description for the min and max values for the expressions, e.g. ‘minmax’ (default), ‘99.7%’, [0, 10], or a list of, e.g. [[0, 10], [0, 20], ‘minmax’]
shape – shape for the array where the statistic is calculated on, if only an integer is given, it is used for all dimensions, e.g. shape=128, shape=[128, 256]
selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections
delay – Do not return the result, but a proxy for asynchronous calculations (currently only for internal use)
progress – True to display a progress bar, or a callable that takes one argument (a floating point value between 0 and 1) indicating the progress, calculations are cancelled when this callable returns False
array_type – Type of output array, possible values are None (keep as is), “numpy” (ndarray), “xarray” for a xarray.DataArray, or “list”/”python” for a Python list
- Returns
Numpy array with the given shape, or a scalar when no binby argument is given, with the statistic
- minmax(expression, binby=[], limits=None, shape=128, selection=False, delay=False, progress=None)[source]¶
Calculate the minimum and maximum for expressions, possibly on a grid defined by binby.
Example:
>>> df.minmax("x")
array([-128.293991,  271.365997])
>>> df.minmax(["x", "y"])
array([[-128.293991 ,  271.365997 ],
       [ -71.5523682,  146.465836 ]])
>>> df.minmax("x", binby="x", shape=5, limits=[-10, 10])
array([[-9.99919128, -6.00010443],
       [-5.99972439, -2.00002384],
       [-1.99991322,  1.99998057],
       [ 2.0000093 ,  5.99983597],
       [ 6.0004878 ,  9.99984646]])
- Parameters
expression – expression or list of expressions, e.g. df.x, ‘x’, or [‘x’, ‘y’]
binby – List of expressions for constructing a binned grid
limits – description for the min and max values for the expressions, e.g. ‘minmax’ (default), ‘99.7%’, [0, 10], or a list of, e.g. [[0, 10], [0, 20], ‘minmax’]
shape – shape for the array where the statistic is calculated on, if only an integer is given, it is used for all dimensions, e.g. shape=128, shape=[128, 256]
selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections
delay – Do not return the result, but a proxy for asynchronous calculations (currently only for internal use)
progress – True to display a progress bar, or a callable that takes one argument (a floating point value between 0 and 1) indicating the progress, calculations are cancelled when this callable returns False
- Returns
Numpy array with the given shape, or a scalar when no binby argument is given, with the statistic, the last dimension is of shape (2)
- mode(expression, binby=[], limits=None, shape=256, mode_shape=64, mode_limits=None, progressbar=False, selection=None)[source]¶
Calculate/estimate the mode.
- mutual_information(x, y=None, dimension=2, mi_limits=None, mi_shape=256, binby=[], limits=None, shape=128, sort=False, selection=False, delay=False)[source]¶
Estimate the mutual information between x and y on a grid with shape mi_shape and mi_limits, possibly on a grid defined by binby.
The x and y arguments can be single expressions or lists of expressions:
- If x and y are single expressions, it computes the mutual information between x and y;
- If x is a list of expressions and y is a single expression, it computes the mutual information between each expression in x and the expression in y;
- If x is a list of expressions and y is None, it computes the mutual information matrix amongst all expressions in x;
- If x is a list of tuples of length 2, it computes the mutual information for the specified dimension pairs;
- If x and y are lists of expressions, it computes the mutual information matrix defined by the two expression lists.
If sort is True, the mutual information is returned in sorted (descending) order and the list of expressions is returned in the same order.
Example:
>>> import vaex
>>> df = vaex.example()
>>> df.mutual_information("x", "y")
array(0.1511814526380327)
>>> df.mutual_information([["x", "y"], ["x", "z"], ["E", "Lz"]])
array([ 0.15118145,  0.18439181,  1.07067379])
>>> df.mutual_information([["x", "y"], ["x", "z"], ["E", "Lz"]], sort=True)
(array([ 1.07067379,  0.18439181,  0.15118145]),
[['E', 'Lz'], ['x', 'z'], ['x', 'y']])
>>> df.mutual_information(x=['x', 'y', 'z'])
array([[3.53535106, 0.06893436, 0.11656418],
       [0.06893436, 3.49414866, 0.14089177],
       [0.11656418, 0.14089177, 3.96144906]])
>>> df.mutual_information(x=['x', 'y', 'z'], y=['E', 'Lz'])
array([[0.32316291, 0.16110026],
       [0.36573065, 0.17802792],
       [0.35239151, 0.21677695]])
- Parameters
x – expression or list of expressions, e.g. df.x, ‘x’, or [‘x’, ‘y’]
y – expression or list of expressions, e.g. df.x, ‘x’, or [‘x’, ‘y’]
limits – description for the min and max values for the expressions, e.g. ‘minmax’ (default), ‘99.7%’, [0, 10], or a list of, e.g. [[0, 10], [0, 20], ‘minmax’]
shape – shape for the array where the statistic is calculated on, if only an integer is given, it is used for all dimensions, e.g. shape=128, shape=[128, 256]
binby – List of expressions for constructing a binned grid
sort – return mutual information in sorted (descending) order, and also return the corresponding list of expressions when sort is True
selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections
delay – Do not return the result, but a proxy for asynchronous calculations (currently only for internal use)
- Returns
Numpy array with the given shape, or a scalar when no binby argument is given, with the statistic
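The grid-based estimate described above can be illustrated with a small pure-Python sketch. It is not vaex's implementation: the helper name grid_mutual_information is hypothetical, the result is expressed in nats (an assumption), and values outside the limits are clamped into the edge bins for simplicity.

```python
import math

def grid_mutual_information(xs, ys, shape=4, limits=None):
    """Estimate I(X;Y) from a shape x shape 2D histogram (hypothetical helper)."""
    if limits is None:
        limits = [(min(xs), max(xs)), (min(ys), max(ys))]
    (xmin, xmax), (ymin, ymax) = limits

    def bin_index(value, lo, hi):
        # Map a value to a bin in [0, shape); the upper edge lands in the last bin.
        if hi == lo:
            return 0
        i = int((value - lo) / (hi - lo) * shape)
        return min(max(i, 0), shape - 1)  # clamp (a simplification)

    # Joint histogram of (x, y) pairs.
    counts = [[0] * shape for _ in range(shape)]
    for x, y in zip(xs, ys):
        counts[bin_index(x, xmin, xmax)][bin_index(y, ymin, ymax)] += 1

    total = sum(map(sum, counts))
    px = [sum(row) / total for row in counts]
    py = [sum(counts[i][j] for i in range(shape)) / total for j in range(shape)]

    # Sum p(x,y) * log(p(x,y) / (p(x) p(y))) over the non-empty cells.
    mi = 0.0
    for i in range(shape):
        for j in range(shape):
            pxy = counts[i][j] / total
            if pxy > 0:
                mi += pxy * math.log(pxy / (px[i] * py[j]))
    return mi

dependent = grid_mutual_information([0, 1, 2, 3], [0, 1, 2, 3], shape=4)
independent = grid_mutual_information([0, 0, 1, 1], [0, 1, 0, 1], shape=2)
```

In this sketch a perfectly dependent pair spread over four bins gives log(4), while a pair that fills the grid uniformly gives zero.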
- property nbytes¶
Alias for df.byte_size(), see DataFrame.byte_size().
- nop(expression=None, progress=False, delay=False)[source]¶
Evaluates expression or a list of expressions, and drops the result. Useful for benchmarking, since vaex is usually lazy.
- Parameters
expression – expression or list of expressions, e.g. df.x, ‘x’, or [‘x’, ‘y’]
progress – True to display a progress bar, or a callable that takes one argument (a floating point value between 0 and 1) indicating the progress, calculations are cancelled when this callable returns False
delay – Do not return the result, but a proxy for asynchronous calculations (currently only for internal use)
- Returns
None
- percentile_approx(expression, percentage=50.0, binby=[], limits=None, shape=128, percentile_shape=1024, percentile_limits='minmax', selection=False, delay=False, progress=None)[source]¶
Calculate the percentile given by percentage, possibly on a grid defined by binby.
NOTE: this value is approximated by calculating the cumulative distribution on a grid defined by percentile_shape and percentile_limits.
Example:
>>> df.percentile_approx("x", 10), df.percentile_approx("x", 90)
(array([-8.3220355]), array([ 7.92080358]))
>>> df.percentile_approx("x", 50, binby="x", shape=5, limits=[-10, 10])
array([[-7.56462982],
       [-3.61036641],
       [-0.01296306],
       [ 3.56697863],
       [ 7.45838367]])
- Parameters
expression – expression or list of expressions, e.g. df.x, ‘x’, or [‘x’, ‘y’]
binby – List of expressions for constructing a binned grid
limits – description for the min and max values for the expressions, e.g. ‘minmax’ (default), ‘99.7%’, [0, 10], or a list of, e.g. [[0, 10], [0, 20], ‘minmax’]
shape – shape for the array where the statistic is calculated on, if only an integer is given, it is used for all dimensions, e.g. shape=128, shape=[128, 256]
percentile_limits – description for the min and max values to use for the cumulative histogram, should currently only be ‘minmax’
percentile_shape – shape for the array where the cumulative histogram is calculated on, integer type
selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections
delay – Do not return the result, but a proxy for asynchronous calculations (currently only for internal use)
progress – True to display a progress bar, or a callable that takes one argument (a floating point value between 0 and 1) indicating the progress, calculations are cancelled when this callable returns False
- Returns
Numpy array with the given shape, or a scalar when no binby argument is given, with the statistic
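The NOTE above says the percentile is approximated from a cumulative distribution on a grid. A minimal pure-Python sketch of that idea follows. The helper name percentile_on_grid and the linear interpolation inside the matching bin are assumptions made for illustration, not a description of vaex's internals.

```python
def percentile_on_grid(values, percentage=50.0, percentile_shape=1024):
    """Approximate a percentile from a cumulative histogram (hypothetical helper)."""
    lo, hi = min(values), max(values)  # corresponds to percentile_limits='minmax'
    if hi == lo:
        return float(lo)

    # Histogram of the values on a regular grid between lo and hi.
    counts = [0] * percentile_shape
    for v in values:
        i = min(int((v - lo) / (hi - lo) * percentile_shape), percentile_shape - 1)
        counts[i] += 1

    # Walk the cumulative counts until the requested fraction is reached.
    target = percentage / 100.0 * len(values)
    width = (hi - lo) / percentile_shape
    cumulative = 0
    for i, c in enumerate(counts):
        if c and cumulative + c >= target:
            fraction = (target - cumulative) / c  # position inside this bin
            return lo + (i + fraction) * width
        cumulative += c
    return float(hi)

data = list(range(1001))
median_estimate = percentile_on_grid(data, 50)
```

The error of such an estimate is bounded by the bin width, which is why a larger percentile_shape gives a more accurate result at the cost of a larger grid.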
- plot2d_contour(x=None, y=None, what='count(*)', limits=None, shape=256, selection=None, f='identity', figsize=None, xlabel=None, ylabel=None, aspect='auto', levels=None, fill=False, colorbar=False, colorbar_label=None, colormap=None, colors=None, linewidths=None, linestyles=None, vmin=None, vmax=None, grid=None, show=None, **kwargs)¶
Plot counting contours on a 2D grid.
- Parameters
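x – expression or list of expressions, e.g. df.x, ‘x’, or [‘x’, ‘y’]
y – expression or list of expressions, e.g. df.x, ‘x’, or [‘x’, ‘y’]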
what – What to plot, count(*) will show a N-d histogram, mean(‘x’), the mean of the x column, sum(‘x’) the sum, std(‘x’) the standard deviation, correlation(‘vx’, ‘vy’) the correlation coefficient. Can also be a list of values, like [‘count(x)’, std(‘vx’)], (by default maps to column)
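limits – description for the min and max values for the expressions, e.g. ‘minmax’ (default), ‘99.7%’, [0, 10], or a list of, e.g. [[0, 10], [0, 20], ‘minmax’]
shape – shape for the array where the statistic is calculated on, if only an integer is given, it is used for all dimensions, e.g. shape=128, shape=[128, 256]
selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections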
f – transform values by: ‘identity’ does nothing, ‘log’ or ‘log10’ will show the log of the value
figsize – (x, y) tuple passed to plt.figure for setting the figure size
xlabel – label of the x-axis (defaults to param x)
ylabel – label of the y-axis (defaults to param y)
aspect – the aspect ratio of the figure
levels – the contour levels to be passed on to plt.contour or plt.contourf
colorbar – plot a colorbar or not
colorbar_label – the label of the colourbar (defaults to param what)
colormap – matplotlib colormap to pass on to plt.contour or plt.contourf
colors – the colours of the contours
linewidths – the widths of the contours
linestyles – the style of the contour lines
vmin – instead of automatic normalization, scale the data between vmin and vmax
vmax – see vmin
grid – {grid}
show –
- plot3d(x, y, z, vx=None, vy=None, vz=None, vwhat=None, limits=None, grid=None, what='count(*)', shape=128, selection=[None, True], f=None, vcount_limits=None, smooth_pre=None, smooth_post=None, grid_limits=None, normalize='normalize', colormap='afmhot', figure_key=None, fig=None, lighting=True, level=[0.1, 0.5, 0.9], opacity=[0.01, 0.05, 0.1], level_width=0.1, show=True, **kwargs)[source]¶
Use at own risk, requires ipyvolume
- plot_bq(x, y, grid=None, shape=256, limits=None, what='count(*)', figsize=None, f='identity', figure_key=None, fig=None, axes=None, xlabel=None, ylabel=None, title=None, show=True, selection=[None, True], colormap='afmhot', grid_limits=None, normalize='normalize', grid_before=None, what_kwargs={}, type='default', scales=None, tool_select=False, bq_cleanup=True, **kwargs)[source]¶
Deprecated: use plot_widget
- propagate_uncertainties(columns, depending_variables=None, cov_matrix='auto', covariance_format='{}_{}_covariance', uncertainty_format='{}_uncertainty')[source]¶
Propagates uncertainties (full covariance matrix) for a set of virtual columns.
Covariance matrix of the depending variables is guessed by finding columns prefixed by “e” or “e_” or postfixed by “_error”, “_uncertainty”, “e” and “_e”. Off diagonals (covariance or correlation) by postfixes with “_correlation” or “_corr” for correlation or “_covariance” or “_cov” for covariances. (Note that x_y_cov = x_e * y_e * x_y_correlation.)
Example
>>> import vaex
>>> import numpy as np
>>> df = vaex.from_scalars(x=1, y=2, e_x=0.1, e_y=0.2)
>>> df["u"] = df.x + df.y
>>> df["v"] = np.log10(df.x)
>>> df.propagate_uncertainties([df.u, df.v])
>>> df.u_uncertainty, df.v_uncertainty
- Parameters
columns – list of columns for which to calculate the covariance matrix.
depending_variables – If not given, it is found out automatically, otherwise a list of columns which have uncertainties.
cov_matrix – List of list with expressions giving the covariance matrix, in the same order as depending_variables. If ‘full’ or ‘auto’, the covariance matrix for the depending_variables will be guessed, where ‘full’ gives an error if an entry was not found.
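As a rough illustration of what propagating a full covariance matrix means, the sketch below applies standard first-order error propagation (J C J^T) to the u = x + y and v = log10(x) columns from the example. It is plain Python, not vaex code. The helper propagate_covariance, the hand-written Jacobian, and the zero correlation are assumptions made for the illustration.

```python
import math

def propagate_covariance(jacobian, covariance):
    """First-order propagation J * C * J^T on nested lists (hypothetical helper)."""
    n_out, n_in = len(jacobian), len(covariance)
    return [
        [
            sum(jacobian[a][i] * covariance[i][j] * jacobian[b][j]
                for i in range(n_in) for j in range(n_in))
            for b in range(n_out)
        ]
        for a in range(n_out)
    ]

x, y = 1.0, 2.0
x_e, y_e = 0.1, 0.2        # uncertainties of x and y
x_y_correlation = 0.0      # assumed: x and y are uncorrelated
x_y_cov = x_e * y_e * x_y_correlation  # relation quoted in the note above

covariance = [[x_e ** 2, x_y_cov],
              [x_y_cov, y_e ** 2]]

# Rows: u = x + y and v = log10(x); columns: derivatives w.r.t. x and y.
jacobian = [[1.0, 1.0],
            [1.0 / (x * math.log(10)), 0.0]]

propagated = propagate_covariance(jacobian, covariance)
u_uncertainty = math.sqrt(propagated[0][0])
v_uncertainty = math.sqrt(propagated[1][1])
```

The off-diagonal entries of the propagated matrix hold the covariance between u and v, which is non-zero here because both depend on x.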
- remove_virtual_meta()[source]¶
Removes the file with the virtual column etc, it does not change the current virtual columns etc.
- rename(name, new_name, unique=False)[source]¶
Renames a column or variable, and rewrites expressions such that they refer to the new name.
- rolling(window, trim=False, column=None, fill_value=None, edge='right')[source]¶
Create a vaex.rolling.Rolling rolling window object.
- Parameters
window (int) – Size of the rolling window.
trim (bool) – Trim off begin or end of dataframe to avoid missing values
column (str or list[str]) – Column name or column names of columns affected (None for all)
fill_value (any) – Scalar value to use for data outside of existing rows.
edge (str) – Where the edge of the rolling window is for the current row.
- sample(n=None, frac=None, replace=False, weights=None, random_state=None)[source]¶
Returns a DataFrame with a random set of rows
Note
Note that no copy of the underlying data is made, only a view/reference is made.
Provide either n or frac.
Example:
>>> import vaex, numpy as np
>>> df = vaex.from_arrays(s=np.array(['a', 'b', 'c', 'd']), x=np.arange(1,5))
>>> df
  #  s      x
  0  a      1
  1  b      2
  2  c      3
  3  d      4
>>> df.sample(n=2, random_state=42) # 2 random rows, fixed seed
  #  s      x
  0  b      2
  1  d      4
>>> df.sample(frac=1, random_state=42) # 'shuffling'
  #  s      x
  0  c      3
  1  a      1
  2  d      4
  3  b      2
>>> df.sample(frac=1, replace=True, random_state=42) # useful for bootstrap (may contain repeated samples)
  #  s      x
  0  d      4
  1  a      1
  2  a      1
  3  d      4
- Parameters
n (int) – number of samples to take (default 1 if frac is None)
frac (float) – fraction of rows to take
replace (bool) – If true, a row may be drawn multiple times
weights (str or expression) – (unnormalized) probability that a row can be drawn
random_state (int or RandomState) – Sets a seed or RandomState for reproducibility. When None, a random seed is chosen.
- Returns
Returns a new DataFrame with a shallow copy/view of the underlying data
- Return type
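The interplay of n, frac, replace and random_state can be sketched with the standard library alone. This is an illustration of the sampling semantics described above, not vaex's implementation: sample_rows is a hypothetical helper that returns row indices, and it leaves out the weights argument.

```python
import random

def sample_rows(length, n=None, frac=None, replace=False, random_state=None):
    """Return row indices for a random sample (hypothetical helper, not vaex code)."""
    if n is not None and frac is not None:
        raise ValueError("provide either n or frac, not both")
    if n is None:
        n = 1 if frac is None else round(frac * length)
    rng = random.Random(random_state)  # a fixed seed makes the draw repeatable
    if replace:
        # With replacement: a row may be drawn several times (bootstrap).
        return rng.choices(range(length), k=n)
    # Without replacement: every row appears at most once.
    return rng.sample(range(length), k=n)

shuffled = sample_rows(4, frac=1, random_state=42)
bootstrap = sample_rows(4, frac=1, replace=True, random_state=42)
```

With frac=1 and no replacement the result is a permutation of all rows, which is why the same call doubles as a shuffle.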
- schema_arrow(reduce_large=False)[source]¶
Similar to schema(), but returns an arrow schema.
- Parameters
reduce_large (bool) – change large_string to normal string
- select(boolean_expression, mode='replace', name='default', executor=None)[source]¶
Perform a selection, defined by the boolean expression, and combined with the previous selection using the given mode.
Selections are recorded in a history tree, per name, undo/redo can be done for them separately.
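How a new boolean expression combines with the previous selection can be pictured as an element-wise operation on two boolean masks. The sketch below uses the mode names listed for select_non_missing (replace/and/or/xor/subtract). The combine_selection helper is hypothetical, and reading 'subtract' as "previously selected and not newly selected" is an assumption.

```python
# Each mode maps (previous, current) to the combined value for one row.
MODES = {
    "replace": lambda old, new: new,
    "and": lambda old, new: old and new,
    "or": lambda old, new: old or new,
    "xor": lambda old, new: old != new,
    "subtract": lambda old, new: old and not new,  # assumed meaning
}

def combine_selection(previous, current, mode="replace"):
    """Combine two boolean masks row by row (hypothetical helper)."""
    operation = MODES[mode]
    return [operation(old, new) for old, new in zip(previous, current)]

previous = [True, True, False, False]
current = [True, False, True, False]
combined = {mode: combine_selection(previous, current, mode) for mode in MODES}
```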
- select_box(spaces, limits, mode='replace', name='default')[source]¶
Select an n-dimensional rectangular box bounded by limits.
The following examples are equivalent:
>>> df.select_box(['x', 'y'], [(0, 10), (0, 1)])
>>> df.select_rectangle('x', 'y', [(0, 10), (0, 1)])
- Parameters
spaces – list of expressions
limits – sequence of shape [(x1, x2), (y1, y2)]
mode –
name –
- Returns
- select_circle(x, y, xc, yc, r, mode='replace', name='default', inclusive=True)[source]¶
Select a circular region centred on xc, yc, with a radius of r.
Example:
>>> df.select_circle('x','y',2,3,1)
- Parameters
x – expression for the x space
y – expression for the y space
xc – location of the centre of the circle in x
yc – location of the centre of the circle in y
r – the radius of the circle
name – name of the selection
mode –
- Returns
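Geometrically, a circular selection keeps the rows whose (x, y) point lies within distance r of the centre. A minimal sketch of that test, assuming that inclusive decides whether points exactly on the circle count as selected (circle_mask is a hypothetical helper, not part of vaex):

```python
def circle_mask(xs, ys, xc, yc, r, inclusive=True):
    """Boolean mask of points inside a circle (hypothetical helper)."""
    mask = []
    for x, y in zip(xs, ys):
        distance_squared = (x - xc) ** 2 + (y - yc) ** 2
        if inclusive:
            mask.append(distance_squared <= r ** 2)  # boundary is selected
        else:
            mask.append(distance_squared < r ** 2)   # boundary is excluded
    return mask

# Centre, a point on the boundary, and a point outside a circle at (2, 3) with r=1.
xs, ys = [2, 3, 4], [3, 3, 4]
with_boundary = circle_mask(xs, ys, 2, 3, 1)
without_boundary = circle_mask(xs, ys, 2, 3, 1, inclusive=False)
```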
- select_ellipse(x, y, xc, yc, width, height, angle=0, mode='replace', name='default', radians=False, inclusive=True)[source]¶
Select an elliptical region centred on xc, yc, with a certain width, height and angle.
Example:
>>> df.select_ellipse('x','y', 2, -1, 5,1, 30, name='my_ellipse')
- Parameters
x – expression for the x space
y – expression for the y space
xc – location of the centre of the ellipse in x
yc – location of the centre of the ellipse in y
width – the width of the ellipse (diameter)
height – the height of the ellipse (diameter)
angle – (degrees) orientation of the ellipse, counter-clockwise measured from the y axis
name – name of the selection
mode –
- Returns
- select_inverse(name='default', executor=None)[source]¶
Invert the selection, i.e. what is selected will not be, and vice versa
- Parameters
name (str) –
executor –
- Returns
- select_lasso(expression_x, expression_y, xsequence, ysequence, mode='replace', name='default', executor=None)[source]¶
For performance reasons, a lasso selection is handled differently.
- Parameters
- Returns
- select_non_missing(drop_nan=True, drop_masked=True, column_names=None, mode='replace', name='default')[source]¶
Create a selection that selects rows having non missing values for all columns in column_names.
The name reflects Pandas, no rows are really dropped, but a mask is kept to keep track of the selection
- Parameters
drop_nan – drop rows when there is a NaN in any of the columns (will only affect float values)
drop_masked – drop rows when there is a masked value in any of the columns
column_names – The columns to consider, default: all (real, non-virtual) columns
mode (str) – Possible boolean operator: replace/and/or/xor/subtract
name (str) – history tree or selection ‘slot’ to use
- Returns
- select_rectangle(x, y, limits, mode='replace', name='default')[source]¶
Select a 2d rectangular box in the space given by x and y, bounded by limits.
Example:
>>> df.select_rectangle('x', 'y', [(0, 10), (0, 1)])
- Parameters
x – expression for the x space
y – expression for the y space
limits – sequence of shape [(x1, x2), (y1, y2)]
mode –
- set_active_fraction(value)[source]¶
Sets the active_fraction, sets the picked row to None, and removes the selection.
TODO: we may be able to keep the selection, if we keep the expression, and also the picked row
- set_active_range(i1, i2)[source]¶
Sets the active_range, sets the picked row to None, and removes the selection.
TODO: we may be able to keep the selection, if we keep the expression, and also the picked row
- set_selection(selection, name='default', executor=None)[source]¶
Sets the selection object
- Parameters
selection – Selection object
name – selection ‘slot’
executor –
- Returns
- set_variable(name, expression_or_value, write=True)[source]¶
Set the variable to an expression or value defined by expression_or_value.
Example
>>> df.set_variable("a", 2.)
>>> df.set_variable("b", "a**2")
>>> df.get_variable("b")
'a**2'
>>> df.evaluate_variable("b")
4.0
- Parameters
name – Name of the variable
write – write variable to meta file
expression_or_value – value or expression
- shift(periods, column=None, fill_value=None, trim=False, inplace=False)[source]¶
Shift a column or multiple columns by the number of rows given by periods.
- Parameters
periods (int) – Shift column forward (when positive) or backwards (when negative)
column (str or list[str]) – Column or list of columns to shift (default is all).
fill_value – Value to use instead of missing values.
trim (bool) – Do not include rows that would otherwise have missing values
inplace – If True, make modifications to self, otherwise return a new DataFrame
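The effect of periods, fill_value and trim on a single column can be sketched on a plain list. This is a simplified illustration, not vaex's implementation: shift_column is a hypothetical helper, and None stands in for a missing value.

```python
def shift_column(values, periods, fill_value=None, trim=False):
    """Shift a list forward (periods > 0) or backward (periods < 0)."""
    n = len(values)
    shifted = []
    for i in range(n):
        source = i - periods  # row that supplies the value for row i
        shifted.append(values[source] if 0 <= source < n else fill_value)
    if trim:
        # Keep only the rows whose source row actually existed.
        start = max(periods, 0)
        stop = max(n + min(periods, 0), 0)
        shifted = shifted[start:stop]
    return shifted

forward = shift_column([1, 2, 3, 4], 1)
backward = shift_column([1, 2, 3, 4], -1, fill_value=0)
trimmed = shift_column([1, 2, 3, 4], 2, trim=True)
```

Trimming shortens the result by abs(periods) rows, which avoids introducing missing values at the cost of losing rows at one end.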
- shuffle(random_state=None)[source]¶
Shuffle order of rows (equivalent to df.sample(frac=1))
Note
Note that no copy of the underlying data is made, only a view/reference is made.
Example:
>>> import vaex, numpy as np
>>> df = vaex.from_arrays(s=np.array(['a', 'b', 'c']), x=np.arange(1,4))
>>> df
  #  s      x
  0  a      1
  1  b      2
  2  c      3
>>> df.shuffle(random_state=42)
  #  s      x
  0  a      1
  1  b      2
  2  c      3
- Parameters
random_state (int or RandomState) – Sets a seed or RandomState for reproducibility. When None, a random seed is chosen.
- Returns
Returns a new DataFrame with a shallow copy/view of the underlying data
- Return type
- skew(expression, binby=[], limits=None, shape=128, selection=False, delay=False, progress=None, edges=False, array_type=None)[source]¶
Calculate the skew for the given expression, possibly on a grid defined by binby.
Example:
>>> df.skew("vz")
0.02116528
>>> df.skew("vz", binby=["E"], shape=4)
array([-0.069976  , -0.01003445,  0.05624177, -2.2444322 ])
- Parameters
expression – expression or list of expressions, e.g. df.x, ‘x’, or [‘x’, ‘y’]
binby – List of expressions for constructing a binned grid
limits – description for the min and max values for the expressions, e.g. ‘minmax’ (default), ‘99.7%’, [0, 10], or a list of, e.g. [[0, 10], [0, 20], ‘minmax’]
shape – shape for the array where the statistic is calculated on, if only an integer is given, it is used for all dimensions, e.g. shape=128, shape=[128, 256]
selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections
delay – Do not return the result, but a proxy for asynchronous calculations (currently only for internal use)
progress – True to display a progress bar, or a callable that takes one argument (a floating point value between 0 and 1) indicating the progress, calculations are cancelled when this callable returns False
array_type – Type of output array, possible values are None (keep as is), “numpy” (ndarray), “xarray” for a xarray.DataArray, or “list”/”python” for a Python list
- Returns
Numpy array with the given shape, or a scalar when no binby argument is given, with the statistic
- sort(by, ascending=True)[source]¶
Return a sorted DataFrame, sorted by the expression ‘by’.
Both ‘by’ and ‘ascending’ arguments can be lists. Note that missing/nan/NA values will always be pushed to the end, no matter the sorting order.
Note
Note that no copy of the underlying data is made, only a view/reference is made.
Note
Note that filters will be ignored (since they may change); you may want to consider running extract() first.
Example:
>>> import vaex, numpy as np
>>> df = vaex.from_arrays(s=np.array(['a', 'b', 'c', 'd']), x=np.arange(1,5))
>>> df['y'] = (df.x-1.8)**2
>>> df
  #  s      x     y
  0  a      1  0.64
  1  b      2  0.04
  2  c      3  1.44
  3  d      4  4.84
>>> df.sort('y', ascending=False)  # Note: passing '(x-1.8)**2' gives the same result
  #  s      x     y
  0  d      4  4.84
  1  c      3  1.44
  2  a      1  0.64
  3  b      2  0.04
- split(into=None)[source]¶
Returns a list containing ordered subsets of the DataFrame.
Note
Note that no copy of the underlying data is made, only a view/reference is made.
Example:
>>> import vaex
>>> df = vaex.from_arrays(x = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
>>> for dfs in df.split(into=0.3):
...     print(dfs.x.values)
...
[0 1 2]
[3 4 5 6 7 8 9]
>>> for split in df.split(into=[0.2, 0.3, 0.5]):
...     print(split.x.values)
[0 1]
[2 3 4]
[5 6 7 8 9]
- Parameters
into (int/float/list) – If float will split the DataFrame in two, the first of which will have a relative length as specified by this parameter. When a list, will split into as many portions as elements in the list, where each element defines the relative length of that portion. Note that such a list of fractions will always be re-normalized to 1. When an int, split DataFrame into n dataframes of equal length (last one may deviate), if len(df) < n, it will return len(df) DataFrames.
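The float and list forms of into can be pictured as cumulative fractions turned into row ranges. The sketch below covers only those two forms. The helper name split_ranges and the rounding of the boundaries are assumptions, so vaex may place a boundary one row differently when the fractions do not divide the length evenly.

```python
def split_ranges(length, into):
    """Turn a fraction or a list of relative lengths into (start, stop) row ranges."""
    fractions = [into, 1 - into] if isinstance(into, float) else list(into)
    total = sum(fractions)  # re-normalize so the portions always sum to 1
    bounds = [0]
    cumulative = 0.0
    for fraction in fractions:
        cumulative += fraction / total
        bounds.append(round(cumulative * length))
    bounds[-1] = length  # guard against floating point round-off on the last edge
    return list(zip(bounds[:-1], bounds[1:]))

two_parts = split_ranges(10, 0.3)
three_parts = split_ranges(10, [0.2, 0.3, 0.5])
unnormalized = split_ranges(10, [2, 3, 5])
```

Because of the re-normalization, [2, 3, 5] and [0.2, 0.3, 0.5] describe the same split.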
- split_random(into, random_state=None)[source]¶
Returns a list containing random portions of the DataFrame.
Note
Note that no copy of the underlying data is made, only a view/reference is made.
Example:
>>> import vaex
>>> import numpy as np
>>> np.random.seed(111)
>>> df = vaex.from_arrays(x = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
>>> for dfs in df.split_random(into=0.3, random_state=42):
...     print(dfs.x.values)
...
[8 1 5]
[0 7 2 9 4 3 6]
>>> for split in df.split_random(into=[0.2, 0.3, 0.5], random_state=42):
...     print(split.x.values)
[8 1]
[5 0 7]
[2 9 4 3 6]
- Parameters
into (int/float/list) – If float will split the DataFrame in two, the first of which will have a relative length as specified by this parameter. When a list, will split into as many portions as elements in the list, where each element defines the relative length of that portion. Note that such a list of fractions will always be re-normalized to 1. When an int, split DataFrame into n dataframes of equal length (last one may deviate), if len(df) < n, it will return len(df) DataFrames.
random_state (int or RandomState) – Sets a seed or RandomState for reproducibility. When None, a random seed is chosen.
- Returns
A list of DataFrames.
- Return type
- state_load(file, use_active_range=False, keep_columns=None, set_filter=True, trusted=True, fs_options=None, fs=None)[source]¶
Load a state previously stored by DataFrame.state_write(), see also DataFrame.state_set().
- Parameters
file (str) – filename (ending in .json or .yaml)
use_active_range (bool) – Whether to use the active range or not.
keep_columns (list) – List of columns that should be kept if the state to be set contains less columns.
set_filter (bool) – Set the filter from the state (default), or leave the filter as it is.
fs_options (dict) – arguments to pass to the file system handler (s3fs or gcsfs)
fs – Pass a file system object directly, see vaex.open()
- state_write(file, fs_options=None, fs=None)[source]¶
Write the internal state to a json or yaml file (see DataFrame.state_get()).
Example
>>> import vaex
>>> df = vaex.from_scalars(x=1, y=2)
>>> df['r'] = (df.x**2 + df.y**2)**0.5
>>> df.state_write('state.json')
>>> print(open('state.json').read())
{
    "virtual_columns": {
        "r": "(((x ** 2) + (y ** 2)) ** 0.5)"
    },
    "column_names": [
        "x",
        "y",
        "r"
    ],
    "renamed_columns": [],
    "variables": {
        "pi": 3.141592653589793,
        "e": 2.718281828459045,
        "km_in_au": 149597870.7,
        "seconds_per_year": 31557600
    },
    "functions": {},
    "selections": {
        "__filter__": null
    },
    "ucds": {},
    "units": {},
    "descriptions": {},
    "description": null,
    "active_range": [
        0,
        1
    ]
}
>>> df.state_write('state.yaml')
>>> print(open('state.yaml').read())
active_range:
- 0
- 1
column_names:
- x
- y
- r
description: null
descriptions: {}
functions: {}
renamed_columns: []
selections:
  __filter__: null
ucds: {}
units: {}
variables:
  pi: 3.141592653589793
  e: 2.718281828459045
  km_in_au: 149597870.7
  seconds_per_year: 31557600
virtual_columns:
  r: (((x ** 2) + (y ** 2)) ** 0.5)
- Parameters
file (str) – filename (ending in .json or .yaml)
fs_options (dict) – arguments to pass to the file system handler (s3fs or gcsfs)
fs – Pass a file system object directly, see vaex.open()
- std(expression, binby=[], limits=None, shape=128, selection=False, delay=False, progress=None, array_type=None)[source]¶
Calculate the standard deviation for the given expression, possibly on a grid defined by binby.
Example:
>>> df.std("vz")
110.31773397535071
>>> df.std("vz", binby=["(x**2+y**2)**0.5"], shape=4)
array([ 123.57954851,   85.35190177,   61.14345748,   38.0740619 ])
- Parameters
expression – expression or list of expressions, e.g. df.x, ‘x’, or [‘x’, ‘y’]
binby – List of expressions for constructing a binned grid
limits – description for the min and max values for the expressions, e.g. ‘minmax’ (default), ‘99.7%’, [0, 10], or a list of, e.g. [[0, 10], [0, 20], ‘minmax’]
shape – shape for the array where the statistic is calculated on, if only an integer is given, it is used for all dimensions, e.g. shape=128, shape=[128, 256]
selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections
delay – Do not return the result, but a proxy for asynchronous calculations (currently only for internal use)
progress – True to display a progress bar, or a callable that takes one argument (a floating point value between 0 and 1) indicating the progress, calculations are cancelled when this callable returns False
array_type – Type of output array, possible values are None (keep as is), “numpy” (ndarray), “xarray” for a xarray.DataArray, or “list”/”python” for a Python list
- Returns
Numpy array with the given shape, or a scalar when no binby argument is given, with the statistic
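What "on a grid defined by binby" means for a statistic such as std can be sketched in one dimension: every row is assigned to a bin according to the binby value, and the statistic is computed per bin. The pure-Python sketch below is an illustration only. The helper binned_std is hypothetical, and two choices in it are assumptions: it uses the population standard deviation, and it ignores rows outside the limits.

```python
import math

def binned_std(values, by, shape, limits):
    """Standard deviation of values in each of `shape` bins of `by`."""
    lo, hi = limits
    counts = [0] * shape
    sums = [0.0] * shape
    squares = [0.0] * shape
    for value, position in zip(values, by):
        if not (lo <= position < hi):
            continue  # rows outside the limits are ignored (assumption)
        i = int((position - lo) / (hi - lo) * shape)
        counts[i] += 1
        sums[i] += value
        squares[i] += value * value

    result = []
    for n, s, q in zip(counts, sums, squares):
        if n == 0:
            result.append(float("nan"))  # empty bin: no statistic available
        else:
            mean = s / n
            # Variance as E[v^2] - E[v]^2, clipped at zero against round-off.
            result.append(math.sqrt(max(q / n - mean * mean, 0.0)))
    return result

per_bin = binned_std([1, 3, 10, 10], by=[0.1, 0.2, 0.6, 0.7], shape=2, limits=(0, 1))
```

Only running sums per bin are kept, so the data can be processed in a single pass without holding it all in memory.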
- sum(expression, binby=[], limits=None, shape=128, selection=False, delay=False, progress=None, edges=False, array_type=None)[source]¶
Calculate the sum for the given expression, possibly on a grid defined by binby.
Example:
>>> df.sum("L")
304054882.49378014
>>> df.sum("L", binby="E", shape=4)
array([  8.83517994e+06,   5.92217598e+07,   9.55218726e+07,
         1.40008776e+08])
- Parameters
expression – expression or list of expressions, e.g. df.x, ‘x’, or [‘x’, ‘y’]
binby – List of expressions for constructing a binned grid
limits – description for the min and max values for the expressions, e.g. ‘minmax’ (default), ‘99.7%’, [0, 10], or a list of, e.g. [[0, 10], [0, 20], ‘minmax’]
shape – shape for the array where the statistic is calculated on, if only an integer is given, it is used for all dimensions, e.g. shape=128, shape=[128, 256]
selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections
delay – Do not return the result, but a proxy for asynchronous calculations (currently only for internal use)
progress – True to display a progress bar, or a callable that takes one argument (a floating point value between 0 and 1) indicating the progress, calculations are cancelled when this callable returns False
array_type – Type of output array, possible values are None (keep as is), “numpy” (ndarray), “xarray” for a xarray.DataArray, or “list”/”python” for a Python list
- Returns
Numpy array with the given shape, or a scalar when no binby argument is given, with the statistic
- take(indices, filtered=True, dropfilter=True)[source]¶
Returns a DataFrame containing only rows indexed by indices
Note
Note that no copy of the underlying data is made, only a view/reference is made.
Example:
>>> import vaex, numpy as np
>>> df = vaex.from_arrays(s=np.array(['a', 'b', 'c', 'd']), x=np.arange(1,5))
>>> df.take([0,2])
  #  s      x
  0  a      1
  1  c      3
- Parameters
indices – sequence (list or numpy array) with row numbers
filtered – (for internal use) The indices refer to the filtered data.
dropfilter – (for internal use) Drop the filter, set to False when indices refer to unfiltered, but may contain rows that still need to be filtered out.
- Returns
DataFrame which is a shallow copy of the original data.
- Return type
- to_arrays(column_names=None, selection=None, strings=True, virtual=True, parallel=True, chunk_size=None, array_type=None)[source]¶
Return a list of ndarrays
- Parameters
column_names – list of column names, to export, when None DataFrame.get_column_names(strings=strings, virtual=virtual) is used
selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections
strings – argument passed to DataFrame.get_column_names when column_names is None
virtual – argument passed to DataFrame.get_column_names when column_names is None
parallel – Evaluate the (virtual) columns in parallel
chunk_size – Return an iterator with cuts of the object in length of this size
array_type – Type of output array, possible values are None (keep as is), “numpy” (ndarray), “xarray” for a xarray.DataArray, or “list”/”python” for a Python list
- Returns
list of arrays
- to_arrow_table(column_names=None, selection=None, strings=True, virtual=True, parallel=True, chunk_size=None, reduce_large=False)[source]¶
Returns an arrow Table object containing the arrays corresponding to the evaluated data
- Parameters
column_names – list of column names, to export, when None DataFrame.get_column_names(strings=strings, virtual=virtual) is used
selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections
strings – argument passed to DataFrame.get_column_names when column_names is None
virtual – argument passed to DataFrame.get_column_names when column_names is None
parallel – Evaluate the (virtual) columns in parallel
chunk_size – Return an iterator with cuts of the object in length of this size
reduce_large (bool) – If possible, cast large_string to normal string
- Returns
pyarrow.Table object or iterator of
- to_astropy_table(column_names=None, selection=None, strings=True, virtual=True, index=None, parallel=True)[source]¶
Returns an astropy table object containing the ndarrays corresponding to the evaluated data
- Parameters
column_names – list of column names, to export, when None DataFrame.get_column_names(strings=strings, virtual=virtual) is used
selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections
strings – argument passed to DataFrame.get_column_names when column_names is None
virtual – argument passed to DataFrame.get_column_names when column_names is None
index – if this column is given it is used for the index of the DataFrame
- Returns
astropy.table.Table object
- to_dask_array(chunks='auto')[source]¶
Lazily expose the DataFrame as a dask.array
Example
>>> df = vaex.example()
>>> A = df[['x', 'y', 'z']].to_dask_array()
>>> A
dask.array<vaex-df-1f048b40-10ec-11ea-9553, shape=(330000, 3), dtype=float64, chunksize=(330000, 3), chunktype=numpy.ndarray>
>>> A+1
dask.array<add, shape=(330000, 3), dtype=float64, chunksize=(330000, 3), chunktype=numpy.ndarray>
- Parameters
chunks – How to chunk the array, similar to dask.array.from_array().
- Returns
dask.array.Array object.
- to_dict(column_names=None, selection=None, strings=True, virtual=True, parallel=True, chunk_size=None, array_type=None)[source]¶
Return a dict containing the ndarray corresponding to the evaluated data
- Parameters
column_names – list of column names, to export, when None DataFrame.get_column_names(strings=strings, virtual=virtual) is used
selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections
strings – argument passed to DataFrame.get_column_names when column_names is None
virtual – argument passed to DataFrame.get_column_names when column_names is None
parallel – Evaluate the (virtual) columns in parallel
chunk_size – Return an iterator with cuts of the object in length of this size
array_type – Type of output array, possible values are None (keep as is), “numpy” (ndarray), “xarray” for a xarray.DataArray, or “list”/”python” for a Python list
- Returns
dict
- to_items(column_names=None, selection=None, strings=True, virtual=True, parallel=True, chunk_size=None, array_type=None)[source]¶
Return a list of [(column_name, ndarray), …] pairs where the ndarray corresponds to the evaluated data
- Parameters
column_names – list of column names, to export, when None DataFrame.get_column_names(strings=strings, virtual=virtual) is used
selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections
strings – argument passed to DataFrame.get_column_names when column_names is None
virtual – argument passed to DataFrame.get_column_names when column_names is None
parallel – Evaluate the (virtual) columns in parallel
chunk_size – Return an iterator with cuts of the object in length of this size
array_type – Type of output array, possible values are None (keep as is), “numpy” (ndarray), “xarray” for a xarray.DataArray, or “list”/”python” for a Python list
- Returns
list of (name, ndarray) pairs or iterator of
- to_pandas_df(column_names=None, selection=None, strings=True, virtual=True, index_name=None, parallel=True, chunk_size=None, array_type=None)[source]¶
Return a pandas DataFrame containing the ndarray corresponding to the evaluated data
If index_name is given, that column is used for the index of the dataframe.
Example
>>> df_pandas = df.to_pandas_df(["x", "y", "z"])
>>> df_copy = vaex.from_pandas(df_pandas)
- Parameters
column_names – list of column names, to export, when None DataFrame.get_column_names(strings=strings, virtual=virtual) is used
selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections
strings – argument passed to DataFrame.get_column_names when column_names is None
virtual – argument passed to DataFrame.get_column_names when column_names is None
index_name – if this column is given it is used for the index of the DataFrame
parallel – Evaluate the (virtual) columns in parallel
chunk_size – Return an iterator with cuts of the object in length of this size
array_type – Type of output array, possible values are None (keep as is), “numpy” (ndarray), “xarray” for a xarray.DataArray, or “list”/”python” for a Python list
- Returns
pandas.DataFrame object or iterator of
- to_records(index=None, selection=None, column_names=None, strings=True, virtual=True, parallel=True, chunk_size=None, array_type='python')[source]¶
Return a list of [{column_name: value}, …] “records” where each dict is an evaluated row.
- Parameters
index – an index to use to get the record of a specific row when provided
column_names – list of column names, to export, when None DataFrame.get_column_names(strings=strings, virtual=virtual) is used
selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections
strings – argument passed to DataFrame.get_column_names when column_names is None
virtual – argument passed to DataFrame.get_column_names when column_names is None
parallel – Evaluate the (virtual) columns in parallel
chunk_size – Return an iterator over chunks of the object, each with this many rows
array_type – Type of output array, possible values are None (keep as is), “numpy” (ndarray), “xarray” for a xarray.DataArray, or “list”/”python” for a Python list
- Returns
list of [{column_name:value}, …] records
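The record layout can be illustrated without vaex. The following is a plain-Python sketch, not the library's implementation; the helper name `to_records_sketch` and the `columns` dictionary are assumptions made for illustration. It converts column-oriented data into the list-of-dicts shape described above.

```python
def to_records_sketch(columns):
    """Turn {column_name: [values...]} into a list of {column_name: value} rows."""
    names = list(columns)
    # zip(*...) walks all columns in lockstep and yields one tuple per row
    return [dict(zip(names, row)) for row in zip(*(columns[n] for n in names))]


columns = {"x": [1, 2, 3], "y": [1.5, 2.5, 3.5]}
records = to_records_sketch(columns)
```

Each element of `records` is one row, so `records[1]` holds the values of every column for the second row.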
- trim(inplace=False)[source]¶
Return a DataFrame, where all columns are ‘trimmed’ by the active range.
For the returned DataFrame, df.get_active_range() returns (0, df.length_original()).
Note
Note that no copy of the underlying data is made, only a view/reference is made.
- Parameters
inplace – If True, make modifications to self, otherwise return a new DataFrame
- Return type
- ucd_find(ucds, exclude=[])[source]¶
Find a set of columns (names) which have the ucd, or part of the ucd.
Prefixed with a ^, it will only match the first part of the ucd.
Example
>>> df.ucd_find('pos.eq.ra', 'pos.eq.dec')
['RA', 'DEC']
>>> df.ucd_find('pos.eq.ra', 'doesnotexist')
>>> df.ucds[df.ucd_find('pos.eq.ra')]
'pos.eq.ra;meta.main'
>>> df.ucd_find('meta.main')
'dec'
>>> df.ucd_find('^meta.main')
- unique(expression, return_inverse=False, dropna=False, dropnan=False, dropmissing=False, progress=False, selection=None, axis=None, delay=False, limit=None, limit_raise=True, array_type='python')[source]¶
Returns all unique values.
- Parameters
expression – expression or list of expressions, e.g. df.x, ‘x’, or [‘x’, ‘y’]
return_inverse – Return the inverse mapping from unique values to original values.
dropna – Drop rows with Not Available (NA) values (NaN or missing values).
dropnan – Drop rows with NaN values
dropmissing – Drop rows with missing values
progress – True to display a progress bar, or a callable that takes one argument (a floating point value between 0 and 1) indicating the progress, calculations are cancelled when this callable returns False
selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections
axis (int) – Axis over which to determine the unique elements (None will flatten arrays or lists)
delay – Do not return the result, but a proxy for asynchronous calculations (currently only for internal use)
limit (int) – Limit the amount of results
limit_raise (bool) – Raise vaex.RowLimitException when limit is exceeded, or return at most ‘limit’ results.
array_type (str) – Type of output array, possible values are None (keep as is), “numpy” (ndarray), “xarray” for a xarray.DataArray, or “list”/”python” for a Python list
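The relationship between the unique values and the inverse mapping requested by return_inverse can be sketched in plain Python. This is an illustration only, not vaex's implementation; the helper name is hypothetical, and returning the unique values in order of first appearance is an assumption (the ordering vaex produces is not stated here).

```python
def unique_with_inverse(values):
    """Return (uniques, inverse) such that uniques[inverse[i]] == values[i]."""
    index_of = {}  # value -> position in `uniques`
    uniques = []
    inverse = []
    for v in values:
        if v not in index_of:
            index_of[v] = len(uniques)
            uniques.append(v)
        inverse.append(index_of[v])
    return uniques, inverse


values = ["a", "b", "a", "c", "b"]
uniques, inverse = unique_with_inverse(values)
# The inverse mapping reconstructs the original sequence
reconstructed = [uniques[i] for i in inverse]
```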
- unit(expression, default=None)[source]¶
Returns the unit (an astropy.units.Unit object) for the expression.
Example
>>> import vaex
>>> df = vaex.example()
>>> df.unit("x")
Unit("kpc")
>>> df.unit("x*L")
Unit("km kpc2 / s")
- Parameters
expression – Expression, which can be a column name
default – if no unit is known, it will return this
- Returns
The resulting unit of the expression
- Return type
astropy.units.Unit
- var(expression, binby=[], limits=None, shape=128, selection=False, delay=False, progress=None, array_type=None)[source]¶
Calculate the sample variance for the given expression, possibly on a grid defined by binby.
Example:
>>> df.var("vz")
12170.002429456246
>>> df.var("vz", binby=["(x**2+y**2)**0.5"], shape=4)
array([ 15271.90481083, 7284.94713504, 3738.52239232, 1449.63418988])
>>> df.var("vz", binby=["(x**2+y**2)**0.5"], shape=4)**0.5
array([ 123.57954851, 85.35190177, 61.14345748, 38.0740619 ])
- Parameters
expression – expression or list of expressions, e.g. df.x, ‘x’, or [‘x’, ‘y’]
binby – List of expressions for constructing a binned grid
limits – description for the min and max values for the expressions, e.g. ‘minmax’ (default), ‘99.7%’, [0, 10], or a list of, e.g. [[0, 10], [0, 20], ‘minmax’]
shape – shape for the array where the statistic is calculated on, if only an integer is given, it is used for all dimensions, e.g. shape=128, shape=[128, 256]
selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections
delay – Do not return the result, but a proxy for asynchronous calculations (currently only for internal use)
progress – True to display a progress bar, or a callable that takes one argument (a floating point value between 0 and 1) indicating the progress, calculations are cancelled when this callable returns False
array_type – Type of output array, possible values are None (keep as is), “numpy” (ndarray), “xarray” for a xarray.DataArray, or “list”/”python” for a Python list
- Returns
Numpy array with the given shape, or a scalar when no binby argument is given, with the statistic
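What "a statistic on a grid defined by binby" means can be sketched in plain Python for the one-dimensional case. This is not vaex's implementation: the function name, the equal-width binning, and the choice to divide by the bin count N (rather than N - 1) are all assumptions made for illustration.

```python
def binned_variance(values, bin_values, limits, shape):
    """Variance of `values` per bin, where bins split `limits` into `shape` equal parts."""
    lo, hi = limits
    counts = [0] * shape
    sums = [0.0] * shape
    sums_sq = [0.0] * shape
    for v, b in zip(values, bin_values):
        if not (lo <= b < hi):
            continue  # rows outside the grid are ignored
        i = int((b - lo) / (hi - lo) * shape)
        counts[i] += 1
        sums[i] += v
        sums_sq[i] += v * v
    result = []
    for n, s, s2 in zip(counts, sums, sums_sq):
        if n == 0:
            result.append(float("nan"))  # empty bin: no statistic
        else:
            mean = s / n
            result.append(s2 / n - mean * mean)  # E[v^2] - E[v]^2
    return result


values = [1.0, 3.0, 10.0, 14.0]
radius = [0.5, 1.5, 2.5, 3.5]
variances = binned_variance(values, radius, limits=(0.0, 4.0), shape=2)
```

Taking the square root of each entry gives the per-bin standard deviation, which mirrors the `**0.5` step in the example above.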
DataFrameLocal class¶
- class vaex.dataframe.DataFrameLocal(dataset=None, name=None)[source]¶
Bases:
vaex.dataframe.DataFrame
Base class for DataFrames that work with local file/data
- __array__(dtype=None, parallel=True)[source]¶
Gives a full memory copy of the DataFrame into a 2d numpy array of shape (n_rows, n_columns). Note that the memory order is fortran, so all values of 1 column are contiguous in memory for performance reasons.
Note this returns the same result as:
>>> np.array(ds)
If any of the columns contain masked arrays, the masks are ignored (i.e. the masked elements are returned as well).
- as_numpy(strict=False)[source]¶
Lazily cast all numerical columns to numpy.
If strict is True, it will also cast non-numerical types.
- binby(by=None, agg=None, sort=False, copy=True, delay=False, progress=None)[source]¶
Return a BinBy or DataArray object when agg is not None.
The binby operation does not return a ‘flat’ DataFrame, instead it returns an N-d grid in the form of an xarray.
- Parameters
agg (dict, list or agg) – Aggregate operation in the form of a string, vaex.agg object, a dictionary where the keys indicate the target column names and the values the operations, or a list of aggregates. When not given, it will return the binby object.
copy (bool) – Copy the dataframe (shallow, does not cost memory) so that the fingerprint of the original dataframe is not modified.
delay (bool) – Do not return the result, but a proxy for asynchronous calculations (currently only for internal use)
progress – True to display a progress bar, or a callable that takes one argument (a floating point value between 0 and 1) indicating the progress, calculations are cancelled when this callable returns False
- Returns
DataArray or BinBy object.
- categorize(column, min_value=0, max_value=None, labels=None, inplace=False)[source]¶
Mark column as categorical.
This may help speed up calculations using integer columns between a range of [min_value, max_value].
If max_value is not given, min_value and max_value are calculated from the data.
Example:
>>> import vaex
>>> df = vaex.from_arrays(year=[2012, 2015, 2019], weekday=[0, 4, 6])
>>> df = df.categorize('year', min_value=2012, max_value=2019)
>>> df = df.categorize('weekday', labels=['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun'])
>>> df
  #    year    weekday
  0    2012          0
  1    2015          4
  2    2019          6
>>> df.is_category('year')
True
- Parameters
column – column to assume is categorical.
min_value – minimum integer value (if max_value is not given, this is calculated)
max_value – maximum integer value (if max_value is not given, this is calculated)
labels – Labels to associate with each value between min_value and max_value, list(range(min_value, max_value+1)) by default
inplace – If True, make modifications to self, otherwise return a new DataFrame
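How labels line up with an integer range can be sketched in plain Python. The helper names below are hypothetical and this is not how vaex stores categories internally; the sketch only shows the default stated above, list(range(min_value, max_value+1)), and the offset from a value to its label position.

```python
def default_labels(min_value, max_value):
    """The documented default: one label per integer in [min_value, max_value]."""
    return list(range(min_value, max_value + 1))


def label_position(value, min_value):
    """Position of `value` in the label list (assumes labels start at min_value)."""
    return value - min_value


year_labels = default_labels(2012, 2019)
weekday_labels = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']

year_2015 = year_labels[label_position(2015, 2012)]
weekday_4 = weekday_labels[label_position(4, 0)]
```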
- compare(other, report_missing=True, report_difference=False, show=10, orderby=None, column_names=None)[source]¶
Compare two DataFrames and report their difference, use with care for large DataFrames
- concat(*others, resolver='flexible') vaex.dataframe.DataFrame [source]¶
Concatenates multiple DataFrames, adding the rows of the other DataFrame to the current, returned in a new DataFrame.
In the case of resolver=’flexible’, when not all DataFrames have the same column names, the missing data is filled with missing values.
In the case of resolver=’strict’ all datasets need to have matching column names.
- Parameters
others – The other DataFrames that are concatenated with this DataFrame
resolver (str) – How to resolve schema conflicts, ‘flexible’ or ‘strict’.
- Returns
New DataFrame with the rows concatenated
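The difference between the two resolvers can be sketched with plain dictionaries of lists standing in for DataFrames. This is an illustration under assumptions, not vaex code: the function name is hypothetical, `None` stands in for a missing value, and raising `ValueError` in strict mode is an assumed choice of exception.

```python
def concat_sketch(frames, resolver="flexible"):
    """Concatenate {column: [values]} frames row-wise."""
    # Collect column names in order of first appearance
    names = []
    for frame in frames:
        for name in frame:
            if name not in names:
                names.append(name)
    if resolver == "strict":
        for frame in frames:
            if set(frame) != set(names):
                raise ValueError("column names do not match")
    elif resolver != "flexible":
        raise ValueError("resolver must be 'flexible' or 'strict'")
    result = {name: [] for name in names}
    for frame in frames:
        n_rows = len(next(iter(frame.values())))
        for name in names:
            # A column absent from this frame is filled with missing values
            result[name].extend(frame.get(name, [None] * n_rows))
    return result


a = {"x": [1, 2], "y": [10, 20]}
b = {"x": [3]}
combined = concat_sketch([a, b], resolver="flexible")
```

Calling `concat_sketch([a, b], resolver="strict")` raises instead, because `b` lacks the `y` column.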
- copy(column_names=None, treeshake=False)[source]¶
Make a shallow copy of a DataFrame. One can also specify a subset of columns.
This is a fairly cheap operation, since no memory copies of the underlying data are made.
- property data¶
Gives direct access to the data as numpy arrays.
Convenient when working with IPython in combination with small DataFrames, since this gives tab-completion. Only real (i.e. non-virtual) columns can be accessed; to get the data from virtual columns, use DataFrame.evaluate(…).
Columns can be accessed by their names, which are attributes. The attributes are of type numpy.ndarray.
Example:
>>> import numpy as np
>>> import vaex
>>> df = vaex.example()
>>> r = np.sqrt(df.data.x**2 + df.data.y**2)
- export(path, progress=None, chunk_size=1048576, parallel=True, fs_options=None, fs=None, **kwargs)[source]¶
Exports the DataFrame to a file depending on the file extension.
E.g., if the filename ends in .hdf5, df.export_hdf5 is called.
- Parameters
path (str) – path for file
progress – True to display a progress bar, or a callable that takes one argument (a floating point value between 0 and 1) indicating the progress, calculations are cancelled when this callable returns False
chunk_size (int) – Number of rows to be written to disk in a single iteration, if supported.
parallel (bool) – Evaluate the (virtual) columns in parallel
fs_options (dict) – see vaex.open(), e.g. for S3 {“profile”: “myproject”}
- Returns
- export_arrow(to, progress=None, chunk_size=1048576, parallel=True, reduce_large=True, fs_options=None, fs=None, as_stream=True)[source]¶
Exports the DataFrame to a file or stream written with Arrow.
- Parameters
to – filename, file object, or pyarrow.RecordBatchStreamWriter, pyarrow.RecordBatchFileWriter or pyarrow.parquet.ParquetWriter
progress – True to display a progress bar, or a callable that takes one argument (a floating point value between 0 and 1) indicating the progress, calculations are cancelled when this callable returns False
chunk_size (int) – Number of rows to be written to disk in a single iteration
parallel (bool) – Evaluate the (virtual) columns in parallel
reduce_large (bool) – If True, convert arrow large_string type to string type
as_stream (bool) – Write as an Arrow stream if true, else a file. see also https://arrow.apache.org/docs/format/Columnar.html?highlight=arrow1#ipc-file-format
fs_options (dict) – see vaex.open(), e.g. for S3 {“profile”: “myproject”}
- Returns
- export_csv(path, progress=None, chunk_size=1048576, parallel=True, backend='pandas', **kwargs)[source]¶
Exports the DataFrame to a CSV file.
- Parameters
path (str) – path to the file
progress – True to display a progress bar, or a callable that takes one argument (a floating point value between 0 and 1) indicating the progress, calculations are cancelled when this callable returns False
chunk_size (int) – Number of rows to be written to disk in a single iteration
parallel (bool) – Evaluate the (virtual) columns in parallel
backend (str) – Which backend to use, either ‘pandas’ or ‘arrow’. Arrow is considerably faster, but pandas is more flexible.
kwargs – additional keyword arguments are passed to the backends. See DataFrameLocal.export_csv_pandas() and DataFrameLocal.export_csv_arrow() for more details.
- export_csv_arrow(to, progress=None, chunk_size=1048576, parallel=True, fs_options=None, fs=None)[source]¶
Exports the DataFrame to a CSV file via PyArrow.
- Parameters
to (str) – path to the file
progress – True to display a progress bar, or a callable that takes one argument (a floating point value between 0 and 1) indicating the progress, calculations are cancelled when this callable returns False
chunk_size (int) – Number of rows to be written to disk in a single iteration
parallel (bool) – Evaluate the (virtual) columns in parallel
fs_options (dict) – see vaex.open(), e.g. for S3 {“profile”: “myproject”}
fs – Pass a file system object directly, see vaex.open()
- export_csv_pandas(path, progress=None, chunk_size=1048576, parallel=True, **kwargs)[source]¶
Exports the DataFrame to a CSV file via pandas.
- Parameters
path (str) – Path for file
progress – True to display a progress bar, or a callable that takes one argument (a floating point value between 0 and 1) indicating the progress, calculations are cancelled when this callable returns False
chunk_size (int) – Number of rows to be written to disk in a single iteration
parallel – Evaluate the (virtual) columns in parallel
kwargs – Extra keyword arguments to be passed on to pandas.DataFrame.to_csv()
- export_feather(to, parallel=True, reduce_large=True, compression='lz4', fs_options=None, fs=None)[source]¶
Exports the DataFrame to an arrow file using the feather file format version 2
- Feather is exactly represented as the Arrow IPC file format on disk, but also supports compression.
- Parameters
to – filename or file object
parallel (bool) – Evaluate the (virtual) columns in parallel
reduce_large (bool) – If True, convert arrow large_string type to string type
compression – Can be one of ‘zstd’, ‘lz4’ or ‘uncompressed’
fs_options – see vaex.open(), e.g. for S3 {“profile”: “myproject”}
fs – Pass a file system object directly, see vaex.open()
- Returns
- export_fits(path, progress=None)[source]¶
Exports the DataFrame to a fits file that is compatible with TOPCAT colfits format
- Parameters
path (str) – path for file
progress – True to display a progress bar, or a callable that takes one argument (a floating point value between 0 and 1) indicating the progress, calculations are cancelled when this callable returns False
- Returns
- export_hdf5(path, byteorder='=', progress=None, chunk_size=1048576, parallel=True, column_count=1, writer_threads=0, group='/table', mode='w')[source]¶
Exports the DataFrame to a vaex hdf5 file
- Parameters
path (str) – path for file
byteorder (str) – = for native, < for little endian and > for big endian
progress – True to display a progress bar, or a callable that takes one argument (a floating point value between 0 and 1) indicating the progress, calculations are cancelled when this callable returns False
parallel (bool) – Evaluate the (virtual) columns in parallel
column_count (int) – How many columns to evaluate and export in parallel (>1 requires fast random access, like an SSD drive).
writer_threads (int) – Use threads for writing or not, only useful when column_count > 1.
group (str) – Write the data into a custom group in the hdf5 file.
mode (str) – If set to “w” (write), an existing file will be overwritten. If set to “a”, one can append additional data to the hdf5 file, but it needs to be in a different group.
- Returns
- export_json(to, progress=None, chunk_size=1048576, parallel=True, fs_options=None, fs=None)[source]¶
Exports the DataFrame to a JSON file.
- Parameters
to – filename or file object
progress – True to display a progress bar, or a callable that takes one argument (a floating point value between 0 and 1) indicating the progress, calculations are cancelled when this callable returns False
chunk_size (int) – Number of rows to be written to disk in a single iteration
parallel – Evaluate the (virtual) columns in parallel
fs_options – see vaex.open(), e.g. for S3 {“profile”: “myproject”}
fs – Pass a file system object directly, see vaex.open()
- Returns
- export_many(path, progress=None, chunk_size=1048576, parallel=True, max_workers=None, fs_options=None, fs=None, **export_kwargs)[source]¶
Export the DataFrame to multiple files of the same type in parallel.
The path will be formatted using the i parameter (which is the chunk index).
Example:
>>> import vaex
>>> df = vaex.open('my_big_dataset.hdf5')
>>> print(f'number of rows: {len(df):,}')
number of rows: 193,938,982
>>> df.export_many(path='my/destination/folder/chunk-{i:05}.arrow')
>>> df_single_chunk = vaex.open('my/destination/folder/chunk-00001.arrow')
>>> print(f'number of rows: {len(df_single_chunk):,}')
number of rows: 1,048,576
>>> df_all_chunks = vaex.open('my/destination/folder/chunk-*.arrow')
>>> print(f'number of rows: {len(df_all_chunks):,}')
number of rows: 193,938,982
- Parameters
path (str) – Path for file, formatted by chunk index i (e.g. ‘chunk-{i:05}.parquet’)
progress – True to display a progress bar, or a callable that takes one argument (a floating point value between 0 and 1) indicating the progress, calculations are cancelled when this callable returns False
chunk_size (int) – Number of rows to be written to disk in a single iteration
parallel (bool) – Evaluate the (virtual) columns in parallel
max_workers (int) – Number of workers/threads to use for writing in parallel
fs_options (dict) – see vaex.open(), e.g. for S3 {“profile”: “myproject”}
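How the path template and chunk_size determine the output file names can be sketched with ordinary string formatting. This is not vaex's implementation; the helper name is hypothetical and numbering chunks from 0 is an assumption.

```python
import math


def chunk_paths(path_template, n_rows, chunk_size=1_048_576):
    """List the file names produced when n_rows are split into chunks."""
    n_chunks = math.ceil(n_rows / chunk_size)
    # str.format fills the {i} placeholder; {i:05} zero-pads to five digits
    return [path_template.format(i=i) for i in range(n_chunks)]


paths = chunk_paths('my/destination/folder/chunk-{i:05}.arrow', 193_938_982)
n_files = len(paths)
second_file = paths[1]
```

The zero padding keeps the files in chunk order when they are sorted by name, which matters when they are later opened with a glob pattern such as `chunk-*.arrow`.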
- export_parquet(path, progress=None, chunk_size=1048576, parallel=True, fs_options=None, fs=None, **kwargs)[source]¶
Exports the DataFrame to a parquet file.
Note: This may require that all of the data fits into memory (memory mapped data is an exception).
- Parameters
path (str) – path for file
progress – True to display a progress bar, or a callable that takes one argument (a floating point value between 0 and 1) indicating the progress, calculations are cancelled when this callable returns False
chunk_size (int) – Number of rows to be written to disk in a single iteration
parallel (bool) – Evaluate the (virtual) columns in parallel
fs_options (dict) – see vaex.open(), e.g. for S3 {“profile”: “myproject”}
fs – Pass a file system object directly, see vaex.open()
kwargs – Extra keyword arguments to be passed on to pyarrow.parquet.ParquetWriter.
- Returns
- export_partitioned(path, by, directory_format='{key}={value}', progress=None, chunk_size=1048576, parallel=True, fs_options={}, fs=None)[source]¶
Experimental: export files using Hive partitioning.
If no extension is found in the path, we assume parquet files. Otherwise you can specify the format like a format string, where {i} is a zero-based index, {uuid} a unique id, and {subdir} the Hive key=value directory.
- Example paths:
‘/some/dir/{subdir}/{i}.parquet’
‘/some/dir/{subdir}/fixed_name.parquet’
‘/some/dir/{subdir}/{uuid}.parquet’
- Parameters
path – directory where to write the files to.
by (str or list of str) – Which column(s) to partition by.
directory_format (str) – format string for directories, default ‘{key}={value}’ for Hive layout.
progress – True to display a progress bar, or a callable that takes one argument (a floating point value between 0 and 1) indicating the progress, calculations are cancelled when this callable returns False
chunk_size (int) – Number of rows to be written to disk in a single iteration
parallel (bool) – Evaluate the (virtual) columns in parallel
fs_options (dict) – see vaex.open(), e.g. for S3 {“profile”: “myproject”}
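How directory_format and the path template combine into a Hive-style path can be sketched with string formatting alone. This is an illustration, not vaex code: the helper name and the example column names are hypothetical, and nesting one directory per partition key follows the common Hive convention rather than anything stated above.

```python
def partition_path(path_template, by, row, i=0, directory_format="{key}={value}"):
    """Build the output path for one partition, identified by the values in `row`."""
    # One key=value directory per partition column, nested in the order of `by`
    subdir = "/".join(
        directory_format.format(key=key, value=row[key]) for key in by
    )
    return path_template.format(subdir=subdir, i=i)


path = partition_path(
    '/some/dir/{subdir}/{i}.parquet',
    by=["country", "year"],
    row={"country": "NL", "year": 2020},
)
```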
- groupby(by=None, agg=None, sort=False, ascending=True, assume_sparse='auto', row_limit=None, copy=True, progress=None, delay=False)[source]¶
Return a GroupBy or DataFrame object when agg is not None.
Examples:
>>> import vaex
>>> import numpy as np
>>> np.random.seed(42)
>>> x = np.random.randint(1, 5, 10)
>>> y = x**2
>>> df = vaex.from_arrays(x=x, y=y)
>>> df.groupby(df.x, agg='count')
  #    x    y_count
  0    3          4
  1    4          2
  2    1          3
  3    2          1
>>> df.groupby(df.x, agg=[vaex.agg.count('y'), vaex.agg.mean('y')])
  #    x    y_count    y_mean
  0    3          4         9
  1    4          2        16
  2    1          3         1
  3    2          1         4
>>> df.groupby(df.x, agg={'z': [vaex.agg.count('y'), vaex.agg.mean('y')]})
  #    x    z_count    z_mean
  0    3          4         9
  1    4          2        16
  2    1          3         1
  3    2          1         4
Example using datetime:
>>> import vaex
>>> import numpy as np
>>> t = np.arange('2015-01-01', '2015-02-01', dtype=np.datetime64)
>>> y = np.arange(len(t))
>>> df = vaex.from_arrays(t=t, y=y)
>>> df.groupby(vaex.BinnerTime.per_week(df.t)).agg({'y' : 'sum'})
  #  t                      y
  0  2015-01-01 00:00:00   21
  1  2015-01-08 00:00:00   70
  2  2015-01-15 00:00:00  119
  3  2015-01-22 00:00:00  168
  4  2015-01-29 00:00:00   87
- Parameters
agg (dict, list or agg) – Aggregate operation in the form of a string, vaex.agg object, a dictionary where the keys indicate the target column names and the values the operations, or a list of aggregates. When not given, it will return the groupby object.
sort (bool) – Sort columns for which we group by.
ascending (bool or list of bools) – ascending (default, True) or descending (False).
assume_sparse (bool or str) – Assume that when grouping by multiple keys, that the existing pairs are sparse compared to the cartesian product. If ‘auto’, let vaex decide (e.g. a groupby with 10_000 rows but only 4*3=12 combinations does not matter much to compress into say 8 existing combinations, and will save another pass over the data)
row_limit (int) – Limits the resulting dataframe to the number of rows (default is not to check, only works when assume_sparse is True). Throws a vaex.RowLimitException when the condition is not met.
copy (bool) – Copy the dataframe (shallow, does not cost memory) so that the fingerprint of the original dataframe is not modified.
delay (bool) – Do not return the result, but a proxy for asynchronous calculations (currently only for internal use)
progress – True to display a progress bar, or a callable that takes one argument (a floating point value between 0 and 1) indicating the progress, calculations are cancelled when this callable returns False
- Returns
DataFrame or GroupBy object.
- hashed(inplace=False) vaex.dataframe.DataFrame [source]¶
Return a DataFrame with a hashed dataset
- is_local()[source]¶
The local implementation of DataFrame.is_local(), always returns True.
- join(other, on=None, left_on=None, right_on=None, lprefix='', rprefix='', lsuffix='', rsuffix='', how='left', allow_duplication=False, prime_growth=False, cardinality_other=None, inplace=False)[source]¶
Return a DataFrame joined with other DataFrames, matched by columns/expression on/left_on/right_on
If none of on/left_on/right_on is given, the join is done by simply adding the columns (i.e. on the implicit row index).
Note: The filters will be ignored when joining, the full DataFrame will be joined (since filters may change). If either DataFrame is heavily filtered (contains just a small number of rows) consider running DataFrame.extract() first.
Example:
>>> a = np.array(['a', 'b', 'c'])
>>> x = np.arange(1,4)
>>> ds1 = vaex.from_arrays(a=a, x=x)
>>> b = np.array(['a', 'b', 'd'])
>>> y = x**2
>>> ds2 = vaex.from_arrays(b=b, y=y)
>>> ds1.join(ds2, left_on='a', right_on='b')
- Parameters
other – Other DataFrame to join with (the right side)
on – default key for the left table (self)
left_on – key for the left table (self), overrides on
right_on – key for the right table (other), overrides on
lprefix – prefix to add to the left column names in case of a name collision
rprefix – similar for the right
lsuffix – suffix to add to the left column names in case of a name collision
rsuffix – similar for the right
how – how to join: ‘left’ keeps all rows on the left and adds columns (with possible missing values); ‘right’ is similar, with self and other swapped; ‘inner’ will only return rows which overlap.
allow_duplication (bool) – Allow duplication of rows when the joined column contains non-unique values.
cardinality_other (int) – Number of unique elements (or estimate of) for the other table.
prime_growth (bool) – Growth strategy for the hashmaps used internally, can improve performance in some cases (e.g. integers with low bits unused).
inplace – If True, make modifications to self, otherwise return a new DataFrame
- Returns
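The semantics of how=’left’ can be sketched on plain lists of row dictionaries, using the same data as the example above. This is an illustration only, not how vaex performs joins: the function name is hypothetical, `None` stands in for a missing value, and keeping only the first matching right-hand row mirrors the case where row duplication is not allowed.

```python
def left_join(left_rows, right_rows, left_on, right_on):
    """Keep every left row; add the right-hand columns where the keys match."""
    lookup = {}
    for row in right_rows:
        # Keep only the first right-hand row per key (no row duplication)
        lookup.setdefault(row[right_on], row)
    right_names = list(right_rows[0]) if right_rows else []
    missing = {name: None for name in right_names}
    return [{**row, **lookup.get(row[left_on], missing)} for row in left_rows]


ds1 = [{"a": "a", "x": 1}, {"a": "b", "x": 2}, {"a": "c", "x": 3}]
ds2 = [{"b": "a", "y": 1}, {"b": "b", "y": 4}, {"b": "d", "y": 9}]
joined = left_join(ds1, ds2, left_on="a", right_on="b")
```

The row with key ‘c’ has no partner on the right, so its right-hand columns are missing; the right-hand row with key ‘d’ does not appear at all.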
- label_encode(column, values=None, inplace=False, lazy=False)¶
Deprecated: use ordinal_encode
Encode column as ordinal values and mark it as categorical.
The existing column is renamed to a hidden column and replaced by a numerical column with values between [0, len(values)-1].
- Parameters
lazy – When False, it will materialize the ordinal codes.
- length(selection=False)[source]¶
Get the length of the DataFrame, for the selection or for the whole DataFrame.
If selection is False, it returns len(df).
TODO: Implement this in DataFrameRemote, and move the method up in DataFrame.length()
- Parameters
selection – When True, will return the number of selected rows
- Returns
- ordinal_encode(column, values=None, inplace=False, lazy=False)[source]¶
Encode column as ordinal values and mark it as categorical.
The existing column is renamed to a hidden column and replaced by a numerical column with values between [0, len(values)-1].
- Parameters
lazy – When False, it will materialize the ordinal codes.
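The idea of replacing a column by codes in [0, len(values)-1] can be sketched in plain Python. This is not vaex's implementation; the helper name is hypothetical, and using the sorted unique values when `values` is not given is an assumption made for the sketch.

```python
def ordinal_encode_sketch(column, values=None):
    """Return (codes, values) with codes[i] the position of column[i] in values."""
    if values is None:
        values = sorted(set(column))  # assumption: sorted unique values
    code_of = {value: code for code, value in enumerate(values)}
    return [code_of[v] for v in column], list(values)


codes, values = ordinal_encode_sketch(["red", "green", "red", "blue"])
# Decoding with the value list recovers the original column
decoded = [values[c] for c in codes]
```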
- selected_length(selection='default')[source]¶
The local implementation of DataFrame.selected_length()
- shallow_copy(virtual=True, variables=True)[source]¶
Creates a (shallow) copy of the DataFrame.
It will link to the same data, but will have its own state, e.g. virtual columns, variables, selection etc.
- property values¶
Gives a full memory copy of the DataFrame into a 2d numpy array of shape (n_rows, n_columns). Note that the memory order is fortran, so all values of 1 column are contiguous in memory for performance reasons.
Note this returns the same result as:
>>> np.array(ds)
If any of the columns contain masked arrays, the masks are ignored (i.e. the masked elements are returned as well).
Date/time operations¶
- class vaex.expression.DateTime(expression)[source]¶
Bases:
object
DateTime operations
Usually accessed using e.g. df.birthday.dt.dayofweek
- __weakref__¶
list of weak references to the object (if defined)
- property date¶
Return the date part of the datetime value
- Returns
an expression containing the date portion of a datetime value
Example:
>>> import vaex
>>> import numpy as np
>>> date = np.array(['2009-10-12T03:31:00', '2016-02-11T10:17:34', '2015-11-12T11:34:22'], dtype=np.datetime64)
>>> df = vaex.from_arrays(date=date)
>>> df
  #  date
  0  2009-10-12 03:31:00
  1  2016-02-11 10:17:34
  2  2015-11-12 11:34:22
>>> df.date.dt.date
Expression = dt_date(date)
Length: 3 dtype: datetime64[D] (expression)
-------------------------------------------
0  2009-10-12
1  2016-02-11
2  2015-11-12
- property day¶
Extracts the day from a datetime sample.
- Returns
an expression containing the day extracted from a datetime column.
Example:
>>> import vaex
>>> import numpy as np
>>> date = np.array(['2009-10-12T03:31:00', '2016-02-11T10:17:34', '2015-11-12T11:34:22'], dtype=np.datetime64)
>>> df = vaex.from_arrays(date=date)
>>> df
  #  date
  0  2009-10-12 03:31:00
  1  2016-02-11 10:17:34
  2  2015-11-12 11:34:22
>>> df.date.dt.day
Expression = dt_day(date)
Length: 3 dtype: int64 (expression)
-----------------------------------
0  12
1  11
2  12
- property day_name¶
Returns the day names of a datetime sample in English.
- Returns
an expression containing the day names extracted from a datetime column.
Example:
>>> import vaex
>>> import numpy as np
>>> date = np.array(['2009-10-12T03:31:00', '2016-02-11T10:17:34', '2015-11-12T11:34:22'], dtype=np.datetime64)
>>> df = vaex.from_arrays(date=date)
>>> df
  #  date
  0  2009-10-12 03:31:00
  1  2016-02-11 10:17:34
  2  2015-11-12 11:34:22
>>> df.date.dt.day_name
Expression = dt_day_name(date)
Length: 3 dtype: str (expression)
---------------------------------
0  Monday
1  Thursday
2  Thursday
- property dayofweek¶
Obtain the day of the week with Monday=0 and Sunday=6
- Returns
an expression containing the day of week.
Example:
>>> import vaex
>>> import numpy as np
>>> date = np.array(['2009-10-12T03:31:00', '2016-02-11T10:17:34', '2015-11-12T11:34:22'], dtype=np.datetime64)
>>> df = vaex.from_arrays(date=date)
>>> df
  #  date
  0  2009-10-12 03:31:00
  1  2016-02-11 10:17:34
  2  2015-11-12 11:34:22
>>> df.date.dt.dayofweek
Expression = dt_dayofweek(date)
Length: 3 dtype: int64 (expression)
-----------------------------------
0  0
1  3
2  3
- property dayofyear¶
The ordinal day of the year.
- Returns
an expression containing the ordinal day of the year.
Example:
>>> import vaex
>>> import numpy as np
>>> date = np.array(['2009-10-12T03:31:00', '2016-02-11T10:17:34', '2015-11-12T11:34:22'], dtype=np.datetime64)
>>> df = vaex.from_arrays(date=date)
>>> df
  #  date
  0  2009-10-12 03:31:00
  1  2016-02-11 10:17:34
  2  2015-11-12 11:34:22
>>> df.date.dt.dayofyear
Expression = dt_dayofyear(date)
Length: 3 dtype: int64 (expression)
-----------------------------------
0  285
1  42
2  316
- floor(freq, *args)¶
Perform floor operation on an expression for a given frequency.
- Parameters
freq – The frequency level to floor the index to. Must be a fixed frequency like ‘S’ (second), or ‘H’ (hour), but not ‘ME’ (month end).
- Returns
an expression containing the floored datetime column.
Example:
>>> import vaex
>>> import numpy as np
>>> date = np.array(['2009-10-12T03:31:00', '2016-02-11T10:17:34', '2015-11-12T11:34:22'], dtype=np.datetime64)
>>> df = vaex.from_arrays(date=date)
>>> df
  #  date
  0  2009-10-12 03:31:00
  1  2016-02-11 10:17:34
  2  2015-11-12 11:34:22
>>> df.date.dt.floor("H")
Expression = dt_floor(date, 'H')
Length: 3 dtype: datetime64[ns] (expression)
--------------------------------------------
0  2009-10-12 03:00:00.000000000
1  2016-02-11 10:00:00.000000000
2  2015-11-12 11:00:00.000000000
- property halfyear¶
Return the half-year of the date. Values can be 1 and 2, for the first and second half of the year respectively.
- Returns
an expression containing the half-year extracted from the datetime column.
Example:
>>> import vaex
>>> import numpy as np
>>> date = np.array(['2009-10-12T03:31:00', '2016-02-11T10:17:34', '2015-11-12T11:34:22'], dtype=np.datetime64)
>>> df = vaex.from_arrays(date=date)
>>> df
  #  date
  0  2009-10-12 03:31:00
  1  2016-02-11 10:17:34
  2  2015-11-12 11:34:22
>>> df.date.dt.halfyear
Expression = dt_halfyear(date)
Length: 3 dtype: int64 (expression)
-----------------------------------
0  2
1  1
2  2
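The half-year follows directly from the month number: months 1 to 6 map to 1 and months 7 to 12 map to 2. A minimal plain-Python sketch of that arithmetic (the function name is hypothetical, and this is not vaex code):

```python
def halfyear(month):
    """Map a month number (1-12) to its half-year (1 or 2)."""
    return (month - 1) // 6 + 1


# Months of the three example timestamps: October, February, November
halves = [halfyear(m) for m in (10, 2, 11)]
```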
- property hour¶
Extracts the hour out of a datetime sample.
- Returns
an expression containing the hour extracted from a datetime column.
Example:
>>> import vaex
>>> import numpy as np
>>> date = np.array(['2009-10-12T03:31:00', '2016-02-11T10:17:34', '2015-11-12T11:34:22'], dtype=np.datetime64)
>>> df = vaex.from_arrays(date=date)
>>> df
  #  date
  0  2009-10-12 03:31:00
  1  2016-02-11 10:17:34
  2  2015-11-12 11:34:22
>>> df.date.dt.hour
Expression = dt_hour(date)
Length: 3 dtype: int64 (expression)
-----------------------------------
0  3
1  10
2  11
- property is_leap_year¶
Check whether a year is a leap year.
- Returns
an expression which evaluates to True if a year is a leap year, and to False otherwise.
Example:
>>> import vaex
>>> import numpy as np
>>> date = np.array(['2009-10-12T03:31:00', '2016-02-11T10:17:34', '2015-11-12T11:34:22'], dtype=np.datetime64)
>>> df = vaex.from_arrays(date=date)
>>> df
  #  date
  0  2009-10-12 03:31:00
  1  2016-02-11 10:17:34
  2  2015-11-12 11:34:22
>>> df.date.dt.is_leap_year
Expression = dt_is_leap_year(date)
Length: 3 dtype: bool (expression)
----------------------------------
0  False
1  True
2  False
- property minute¶
Extracts the minute out of a datetime sample.
- Returns
an expression containing the minute extracted from a datetime column.
Example:
>>> import vaex
>>> import numpy as np
>>> date = np.array(['2009-10-12T03:31:00', '2016-02-11T10:17:34', '2015-11-12T11:34:22'], dtype=np.datetime64)
>>> df = vaex.from_arrays(date=date)
>>> df
  #  date
  0  2009-10-12 03:31:00
  1  2016-02-11 10:17:34
  2  2015-11-12 11:34:22
>>> df.date.dt.minute
Expression = dt_minute(date)
Length: 3 dtype: int64 (expression)
-----------------------------------
0  31
1  17
2  34
- property month¶
Extracts the month out of a datetime sample.
- Returns
an expression containing the month extracted from a datetime column.
Example:
>>> import vaex >>> import numpy as np >>> date = np.array(['2009-10-12T03:31:00', '2016-02-11T10:17:34', '2015-11-12T11:34:22'], dtype=np.datetime64) >>> df = vaex.from_arrays(date=date) >>> df # date 0 2009-10-12 03:31:00 1 2016-02-11 10:17:34 2 2015-11-12 11:34:22
>>> df.date.dt.month Expression = dt_month(date) Length: 3 dtype: int64 (expression) ----------------------------------- 0 10 1 2 2 11
- property month_name¶
Returns the month names of a datetime sample in English.
- Returns
an expression containing the month names extracted from a datetime column.
Example:
>>> import vaex >>> import numpy as np >>> date = np.array(['2009-10-12T03:31:00', '2016-02-11T10:17:34', '2015-11-12T11:34:22'], dtype=np.datetime64) >>> df = vaex.from_arrays(date=date) >>> df # date 0 2009-10-12 03:31:00 1 2016-02-11 10:17:34 2 2015-11-12 11:34:22
>>> df.date.dt.month_name Expression = dt_month_name(date) Length: 3 dtype: str (expression) --------------------------------- 0 October 1 February 2 November
- property quarter¶
Return the quarter of the date. Values range from 1-4.
- Returns
an expression containing the quarter extracted from a datetime column.
Example:
>>> import vaex >>> import numpy as np >>> date = np.array(['2009-10-12T03:31:00', '2016-02-11T10:17:34', '2015-11-12T11:34:22'], dtype=np.datetime64) >>> df = vaex.from_arrays(date=date) >>> df # date 0 2009-10-12 03:31:00 1 2016-02-11 10:17:34 2 2015-11-12 11:34:22 >>> df.date.dt.quarter Expression = dt_quarter(date) Length: 3 dtype: int64 (expression) ----------------------------------- 0 4 1 1 2 4
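As a rough illustration of how a quarter relates to the month, the sketch below derives it with integer arithmetic in plain Python. It assumes calendar quarters starting in January, which reproduces the output above; the half-year property follows the same pattern with a divisor of 6. This is not the library's implementation.

```python
from datetime import datetime

# Same timestamps as in the example above.
stamps = ['2009-10-12T03:31:00', '2016-02-11T10:17:34', '2015-11-12T11:34:22']
months = [datetime.fromisoformat(s).month for s in stamps]

# Months 1-3 -> quarter 1, 4-6 -> quarter 2, and so on.
quarters = [(m - 1) // 3 + 1 for m in months]

# Months 1-6 -> half-year 1, 7-12 -> half-year 2.
half_years = [(m - 1) // 6 + 1 for m in months]

print(quarters)    # [4, 1, 4]
print(half_years)  # [2, 1, 2]
```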
- property second¶
Extracts the second out of a datetime sample.
- Returns
an expression containing the second extracted from a datetime column.
Example:
>>> import vaex >>> import numpy as np >>> date = np.array(['2009-10-12T03:31:00', '2016-02-11T10:17:34', '2015-11-12T11:34:22'], dtype=np.datetime64) >>> df = vaex.from_arrays(date=date) >>> df # date 0 2009-10-12 03:31:00 1 2016-02-11 10:17:34 2 2015-11-12 11:34:22
>>> df.date.dt.second Expression = dt_second(date) Length: 3 dtype: int64 (expression) ----------------------------------- 0 0 1 34 2 22
- strftime(date_format)¶
Returns a formatted string from a datetime sample.
- Returns
an expression containing a formatted string extracted from a datetime column.
Example:
>>> import vaex >>> import numpy as np >>> date = np.array(['2009-10-12T03:31:00', '2016-02-11T10:17:34', '2015-11-12T11:34:22'], dtype=np.datetime64) >>> df = vaex.from_arrays(date=date) >>> df # date 0 2009-10-12 03:31:00 1 2016-02-11 10:17:34 2 2015-11-12 11:34:22
>>> df.date.dt.strftime("%Y-%m") Expression = dt_strftime(date, '%Y-%m') Length: 3 dtype: object (expression) ------------------------------------ 0 2009-10 1 2016-02 2 2015-11
- property weekofyear¶
Returns the week ordinal of the year.
- Returns
an expression containing the week ordinal of the year, extracted from a datetime column.
Example:
>>> import vaex >>> import numpy as np >>> date = np.array(['2009-10-12T03:31:00', '2016-02-11T10:17:34', '2015-11-12T11:34:22'], dtype=np.datetime64) >>> df = vaex.from_arrays(date=date) >>> df # date 0 2009-10-12 03:31:00 1 2016-02-11 10:17:34 2 2015-11-12 11:34:22
>>> df.date.dt.weekofyear Expression = dt_weekofyear(date) Length: 3 dtype: int64 (expression) ----------------------------------- 0 42 1 6 2 46
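The week ordinals in the example agree with ISO 8601 week numbering. The sketch below reproduces them with the standard library, assuming (this is an assumption, not something stated above) that the property follows the ISO definition.

```python
from datetime import datetime

# Same timestamps as in the example above.
stamps = ['2009-10-12T03:31:00', '2016-02-11T10:17:34', '2015-11-12T11:34:22']

# isocalendar() returns (ISO year, ISO week number, ISO weekday).
weeks = [datetime.fromisoformat(s).isocalendar()[1] for s in stamps]

print(weeks)  # [42, 6, 46]
```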
- property year¶
Extracts the year out of a datetime sample.
- Returns
an expression containing the year extracted from a datetime column.
Example:
>>> import vaex >>> import numpy as np >>> date = np.array(['2009-10-12T03:31:00', '2016-02-11T10:17:34', '2015-11-12T11:34:22'], dtype=np.datetime64) >>> df = vaex.from_arrays(date=date) >>> df # date 0 2009-10-12 03:31:00 1 2016-02-11 10:17:34 2 2015-11-12 11:34:22
>>> df.date.dt.year Expression = dt_year(date) Length: 3 dtype: int64 (expression) ----------------------------------- 0 2009 1 2016 2 2015
Expression class¶
- class vaex.expression.Expression(ds, expression, ast=None, _selection=False)[source]¶
Bases:
object
Expression class
- __bool__()[source]¶
Cast expression to boolean. Only supports (<expr1> == <expr2> and <expr1> != <expr2>)
The main use case for this is to support assigning to traitlets. e.g.:
>>> bool(expr1 == expr2)
This will return True when expr1 and expr2 are exactly the same (in string representation). And similarly for:
>>> bool(expr1 != expr2)
All other cases will return True.
- __eq__(b)¶
Return self==value.
- __ge__(b)¶
Return self>=value.
- __getitem__(slicer)[source]¶
Provides row and optional field access (struct arrays) via bracket notation.
Examples:
>>> import vaex >>> import pyarrow as pa >>> array = pa.StructArray.from_arrays(arrays=[[1, 2, 3], ["a", "b", "c"]], names=["col1", "col2"]) >>> df = vaex.from_arrays(array=array, integer=[5, 6, 7]) >>> df # array integer 0 {'col1': 1, 'col2': 'a'} 5 1 {'col1': 2, 'col2': 'b'} 6 2 {'col1': 3, 'col2': 'c'} 7
>>> df.integer[1:] Expression = integer Length: 2 dtype: int64 (column) ------------------------------- 0 6 1 7
>>> df.array[1:] Expression = array Length: 2 dtype: struct<col1: int64, col2: string> (column) ----------------------------------------------------------- 0 {'col1': 2, 'col2': 'b'} 1 {'col1': 3, 'col2': 'c'}
>>> df.array[:, "col1"] Expression = struct_get(array, 'col1') Length: 3 dtype: int64 (expression) ----------------------------------- 0 1 1 2 2 3
>>> df.array[1:, ["col1"]] Expression = struct_project(array, ['col1']) Length: 2 dtype: struct<col1: int64> (expression) ------------------------------------------------- 0 {'col1': 2} 1 {'col1': 3}
>>> df.array[1:, ["col2", "col1"]] Expression = struct_project(array, ['col2', 'col1']) Length: 2 dtype: struct<col2: string, col1: int64> (expression) --------------------------------------------------------------- 0 {'col2': 'b', 'col1': 2} 1 {'col2': 'c', 'col1': 3}
- __gt__(b)¶
Return self>value.
- __hash__ = None¶
- __le__(b)¶
Return self<=value.
- __lt__(b)¶
Return self<value.
- __ne__(b)¶
Return self!=value.
- __weakref__¶
list of weak references to the object (if defined)
- abs(**kwargs)¶
Lazy wrapper around
numpy.abs
- apply(f, vectorize=False, multiprocessing=True)[source]¶
Apply a function along all values of an Expression.
Shorthand for df.apply(f, arguments=[expression]), see DataFrame.apply()
Example:
>>> df = vaex.example() >>> df.x Expression = x Length: 330,000 dtype: float64 (column) --------------------------------------- 0 -0.777471 1 3.77427 2 1.37576 3 -7.06738 4 0.243441
>>> def func(x): ... return x**2
>>> df.x.apply(func) Expression = lambda_function(x) Length: 330,000 dtype: float64 (expression) ------------------------------------------- 0 0.604461 1 14.2451 2 1.89272 3 49.9478 4 0.0592637
- Parameters
f – A function to be applied on the Expression values
vectorize – Call f with arrays instead of scalars (for better performance).
multiprocessing (bool) – Use multiple processes to avoid the GIL (Global interpreter lock).
- Returns
A function that is lazily evaluated when called.
- arccos(**kwargs)¶
Lazy wrapper around
numpy.arccos
- arccosh(**kwargs)¶
Lazy wrapper around
numpy.arccosh
- arcsin(**kwargs)¶
Lazy wrapper around
numpy.arcsin
- arcsinh(**kwargs)¶
Lazy wrapper around
numpy.arcsinh
- arctan(**kwargs)¶
Lazy wrapper around
numpy.arctan
- arctan2(**kwargs)¶
Lazy wrapper around
numpy.arctan2
- arctanh(**kwargs)¶
Lazy wrapper around
numpy.arctanh
- as_arrow()¶
Lazily convert to Apache Arrow array type
- as_numpy(strict=False)¶
Lazily convert to NumPy ndarray type
- property ast¶
Returns the abstract syntax tree (AST) of the expression
- clip(**kwargs)¶
Lazy wrapper around
numpy.clip
- copy(df=None)[source]¶
Efficiently copies an expression.
Expression objects have both a string and AST representation. Creating the AST representation involves parsing the expression, which is expensive.
Using copy will deepcopy the AST when the expression was already parsed.
- Parameters
df – DataFrame for which the expression will be evaluated (self.df if None)
- cosh(**kwargs)¶
Lazy wrapper around
numpy.cosh
- count(binby=[], limits=None, shape=128, selection=False, delay=False, edges=False, progress=None)[source]¶
Shortcut for ds.count(expression, …), see Dataset.count
- countna()[source]¶
Returns the number of Not Available (N/A) values in the expression. This includes missing values and np.nan values.
- deg2rad(**kwargs)¶
Lazy wrapper around
numpy.deg2rad
- digitize(**kwargs)¶
Lazy wrapper around
numpy.digitize
- dot_product(b)¶
Compute the dot product between a and b.
- Parameters
a – A list of Expressions or a list of values (e.g. a vector)
b – A list of Expressions or a list of values (e.g. a vector)
- Returns
Vaex expression
- expand(stop=[])[source]¶
Expand the expression such that no virtual columns occur, only normal columns.
Example:
>>> df = vaex.example() >>> r = np.sqrt(df.data.x**2 + df.data.y**2) >>> r.expand().expression 'sqrt(((x ** 2) + (y ** 2)))'
- expm1(**kwargs)¶
Lazy wrapper around
numpy.expm1
- fillmissing(value)[source]¶
Returns an array where missing values are replaced by value.
See ismissing for the definition of missing values.
- fillna(value)¶
Returns an array where NA values are replaced by value. See isna for the definition of NA values.
- fillnan(value)¶
Returns an array where NaN values are replaced by value. See isnan for the definition of NaN values.
- format(format)¶
Uses http://www.cplusplus.com/reference/string/to_string/ for formatting
- hashmap_apply(hashmap, check_missing=False)¶
Apply values to hashmap. If check_missing is True, missing values in the hashmap will be translated to null/missing values.
- isfinite(**kwargs)¶
Lazy wrapper around
numpy.isfinite
- isin(values, use_hashmap=True)[source]¶
Lazily tests if each value in the expression is present in values.
- Parameters
values – List/array of values to check
use_hashmap – use a hashmap or not (especially faster when values contains many elements)
- Returns
Expression
with the lazy expression.
- isinf(**kwargs)¶
Lazy wrapper around
numpy.isinf
- ismissing()¶
Returns True where there are missing values (masked arrays), missing strings or None
- isna()¶
Returns a boolean expression indicating if the values are Not Available (missing or NaN).
- isnan()¶
Returns a boolean array that is True where there are NaN values
- kurtosis(binby=[], limits=None, shape=128, selection=False, delay=False, progress=None)[source]¶
Shortcut for df.kurtosis(expression, …), see DataFrame.kurtosis
- log10(**kwargs)¶
Lazy wrapper around
numpy.log10
- log1p(**kwargs)¶
Lazy wrapper around
numpy.log1p
- map(mapper, nan_value=None, missing_value=None, default_value=None, allow_missing=False, axis=None)[source]¶
Map values of an expression or in memory column according to an input dictionary or a custom callable function.
Example:
>>> import vaex >>> df = vaex.from_arrays(color=['red', 'red', 'blue', 'red', 'green']) >>> mapper = {'red': 1, 'blue': 2, 'green': 3} >>> df['color_mapped'] = df.color.map(mapper) >>> df # color color_mapped 0 red 1 1 red 1 2 blue 2 3 red 1 4 green 3
>>> import numpy as np >>> df = vaex.from_arrays(type=[0, 1, 2, 2, 2, np.nan]) >>> df['role'] = df['type'].map({0: 'admin', 1: 'maintainer', 2: 'user', np.nan: 'unknown'}) >>> df # type role 0 0 admin 1 1 maintainer 2 2 user 3 2 user 4 2 user 5 nan unknown
>>> import vaex >>> import numpy as np >>> df = vaex.from_arrays(type=[0, 1, 2, 2, 2, 4]) >>> df['role'] = df['type'].map({0: 'admin', 1: 'maintainer', 2: 'user'}, default_value='unknown') >>> df # type role 0 0 admin 1 1 maintainer 2 2 user 3 2 user 4 2 user 5 4 unknown
- Parameters
mapper – dict like object used to map the values from keys to values
nan_value – value to be used when a nan is present (and not in the mapper)
missing_value – value to be used when there is a missing value
default_value – value to be used when a value is not in the mapper (like dict.get(key, default))
allow_missing – used to signal that values in the mapper should map to a masked array with missing values, assumed True when default_value is not None.
axis (bool) – Axis over which to determine the unique elements (None will flatten arrays or lists)
- Returns
A vaex expression
- Return type
vaex.expression.Expression
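The default_value behaviour is described above as being like dict.get(key, default). A minimal plain-Python sketch of that lookup, using the data from the last example, might look as follows; map_values is a hypothetical helper, and NaN and missing-value handling are deliberately left out.

```python
def map_values(values, mapper, default_value=None):
    """Look up every value in mapper, falling back to default_value.

    Hypothetical helper for illustration only; it mirrors dict.get and
    ignores the nan_value / missing_value handling described above.
    """
    return [mapper.get(value, default_value) for value in values]


types = [0, 1, 2, 2, 2, 4]
roles = map_values(types, {0: 'admin', 1: 'maintainer', 2: 'user'},
                   default_value='unknown')

print(roles)
# ['admin', 'maintainer', 'user', 'user', 'user', 'unknown']
```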
- property masked¶
Alias to df.is_masked(expression)
- max(binby=[], limits=None, shape=128, selection=False, delay=False, progress=None)[source]¶
Shortcut for ds.max(expression, …), see Dataset.max
- maximum(**kwargs)¶
Lazy wrapper around
numpy.maximum
- mean(binby=[], limits=None, shape=128, selection=False, delay=False, progress=None)[source]¶
Shortcut for ds.mean(expression, …), see Dataset.mean
- min(binby=[], limits=None, shape=128, selection=False, delay=False, progress=None)[source]¶
Shortcut for ds.min(expression, …), see Dataset.min
- minimum(**kwargs)¶
Lazy wrapper around
numpy.minimum
- minmax(binby=[], limits=None, shape=128, selection=False, delay=False, progress=None)[source]¶
Shortcut for ds.minmax(expression, …), see Dataset.minmax
- nop()[source]¶
Evaluates the expression and drops the result. Useful for benchmarking, since vaex is usually lazy.
- notna()¶
Opposite of isna
- nunique(dropna=False, dropnan=False, dropmissing=False, selection=None, axis=None, limit=None, limit_raise=True, progress=None, delay=False)[source]¶
Counts number of unique values, i.e. len(df.x.unique()) == df.x.nunique().
- Parameters
dropna – Drop rows with Not Available (NA) values (NaN or missing values).
dropnan – Drop rows with NaN values
dropmissing – Drop rows with missing values
selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections
axis (bool) – Axis over which to determine the unique elements (None will flatten arrays or lists)
limit (int) – Limit the amount of results
limit_raise (bool) – Raise vaex.RowLimitException when limit is exceeded, or return at maximum ‘limit’ amount of results.
progress – True to display a progress bar, or a callable that takes one argument (a floating point value between 0 and 1) indicating the progress, calculations are cancelled when this callable returns False
delay (bool) – Do not return the result, but a proxy for asynchronous calculations (currently only for internal use)
- rad2deg(**kwargs)¶
Lazy wrapper around
numpy.rad2deg
- round(**kwargs)¶
Lazy wrapper around
numpy.round
- searchsorted(**kwargs)¶
Lazy wrapper around
numpy.searchsorted
- sinc(**kwargs)¶
Lazy wrapper around
numpy.sinc
- sinh(**kwargs)¶
Lazy wrapper around
numpy.sinh
- skew(binby=[], limits=None, shape=128, selection=False, delay=False, progress=None)[source]¶
Shortcut for df.skew(expression, …), see DataFrame.skew
- sqrt(**kwargs)¶
Lazy wrapper around
numpy.sqrt
- std(binby=[], limits=None, shape=128, selection=False, delay=False, progress=None)[source]¶
Shortcut for ds.std(expression, …), see Dataset.std
- property str¶
Gives access to string operations via
StringOperations
- property str_pandas¶
Gives access to string operations via
StringOperationsPandas
(using Pandas Series)
- property struct¶
Gives access to struct operations via
StructOperations
- sum(axis=None, binby=[], limits=None, shape=128, selection=False, delay=False, progress=None)[source]¶
Sum elements over given axis.
If no axis is given, it will sum over all axes.
For non list elements, this is a shortcut for ds.sum(expression, …), see Dataset.sum.
>>> list_data = [1, 2, None], None, [], [1, 3, 4, 5] >>> df = vaex.from_arrays(some_list=pa.array(list_data)) >>> df.some_list.sum().item() # will sum over all axis 16 >>> df.some_list.sum(axis=1).tolist() # sums the list elements [3, None, 0, 13]
- Parameters
axis (int) – Axis over which to determine the unique elements (None will flatten arrays or lists)
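To make the two results in the example concrete, the sketch below redoes the sums in plain Python. It assumes, as the example output suggests, that missing elements inside a list are skipped and that a missing row stays missing in the per-row result; row_sum is a hypothetical helper.

```python
# Same data as in the example above: a list, a missing row, an empty list, a list.
list_data = [[1, 2, None], None, [], [1, 3, 4, 5]]


def row_sum(row):
    """Sum one list element, skipping missing values; a missing row stays None."""
    if row is None:
        return None
    return sum(value for value in row if value is not None)


# Analogue of sum(axis=1): one result per row.
per_row = [row_sum(row) for row in list_data]

# Analogue of sum() with no axis: everything flattened into a single number.
total = sum(value for value in per_row if value is not None)

print(per_row)  # [3, None, 0, 13]
print(total)    # 16
```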
- tanh(**kwargs)¶
Lazy wrapper around
numpy.tanh
- to_arrow(convert_to_native=False)[source]¶
Convert to Apache Arrow array (will byteswap/copy if convert_to_native=True).
- to_pandas_series()[source]¶
Return a pandas.Series representation of the expression.
Note: Pandas is likely to make a memory copy of the data.
- to_string()¶
Cast/convert to string, same as expression.astype(‘str’)
- property transient¶
If this expression is not transient (e.g. on disk) optimizations can be made
- unique(dropna=False, dropnan=False, dropmissing=False, selection=None, axis=None, limit=None, limit_raise=True, array_type='list', progress=None, delay=False)[source]¶
Returns all unique values.
- Parameters
dropna – Drop rows with Not Available (NA) values (NaN or missing values).
dropnan – Drop rows with NaN values
dropmissing – Drop rows with missing values
selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections
axis (bool) – Axis over which to determine the unique elements (None will flatten arrays or lists)
limit (int) – Limit the amount of results
limit_raise (bool) – Raise vaex.RowLimitException when limit is exceeded, or return at maximum ‘limit’ amount of results.
array_type (str) – Type of output array, possible values are None (keep as is), “numpy” (ndarray), “xarray” for a xarray.DataArray, or “list”/“python” for a Python list
progress – True to display a progress bar, or a callable that takes one argument (a floating point value between 0 and 1) indicating the progress, calculations are cancelled when this callable returns False
delay (bool) – Do not return the result, but a proxy for asynchronous calculations (currently only for internal use)
- value_counts(dropna=False, dropnan=False, dropmissing=False, ascending=False, progress=False, axis=None, delay=False)[source]¶
Computes counts of unique values.
- WARNING:
If the expression/column is not categorical, it will be converted on the fly
dropna is False by default, it is True by default in pandas
- Parameters
dropna – Drop rows with Not Available (NA) values (NaN or missing values).
dropnan – Drop rows with NaN values
dropmissing – Drop rows with missing values
ascending – when False (default) it will report the most frequently occurring item first
progress – True to display a progress bar, or a callable that takes one argument (a floating point value between 0 and 1) indicating the progress, calculations are cancelled when this callable returns False
axis (bool) – Axis over which to determine the unique elements (None will flatten arrays or lists)
delay (bool) – Do not return the result, but a proxy for asynchronous calculations (currently only for internal use)
- Returns
Pandas series containing the counts
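The effect of the ascending flag can be pictured with a small standard-library sketch. It only mirrors the ordering described above using collections.Counter; the variable names are illustrative and no vaex or pandas objects are involved.

```python
from collections import Counter

colors = ['red', 'red', 'blue', 'red', 'green']
counts = Counter(colors)

# ascending=False (the default): most frequent item first.
most_frequent_first = sorted(counts.items(), key=lambda item: item[1], reverse=True)

# ascending=True: least frequent item first.
least_frequent_first = sorted(counts.items(), key=lambda item: item[1])

print(most_frequent_first[0])    # ('red', 3)
print(least_frequent_first[-1])  # ('red', 3)
```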
- var(binby=[], limits=None, shape=128, selection=False, delay=False, progress=None)[source]¶
Shortcut for ds.var(expression, …), see Dataset.var
- variables(ourself=False, expand_virtual=True, include_virtual=True)[source]¶
Return a set of variables this expression depends on.
Example:
>>> df = vaex.example() >>> r = np.sqrt(df.data.x**2 + df.data.y**2) >>> r.variables() {'x', 'y'}
- where(x, y, dtype=None)¶
Return the values row-wise chosen from x or y depending on the condition.
This is a useful function when you want to create a new column based on some condition. If the condition is True, the value from x is taken, and otherwise the value from y is taken. An easy way to think about the syntax is df.func.where(“if”, “then”, “else”). Please see the example below.
Note: if the function is used as a method of an expression, that expression is assumed to be the condition.
- Parameters
condition – A boolean expression
x – A single value or an expression, the value passed on if the condition is satisfied.
y – A single value or an expression, the value passed on if the condition is not satisfied.
dtype – Optionally specify the dtype of the resulting expression
- Return type
Example:
>>> import vaex >>> df = vaex.from_arrays(x=[0, 1, 2, 3]) >>> df['y'] = df.func.where(df.x >=2, df.x, -1) >>> df # x y 0 0 -1 1 1 -1 2 2 2 3 3 3
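Read as “if, then, else”, the row-wise selection in the example is equivalent to the following plain-Python conditional, shown here only to spell out the semantics (the real function works lazily on whole columns).

```python
x = [0, 1, 2, 3]

# For every row: take x when the condition (x >= 2) holds, otherwise take -1.
y = [value if value >= 2 else -1 for value in x]

print(y)  # [-1, -1, 2, 3]
```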
Geo operations¶
- class vaex.geo.DataFrameAccessorGeo(df)[source]¶
Bases:
object
Geometry/geographic helper methods
Example:
>>> df_xyz = df.geo.spherical2cartesian(df.longitude, df.latitude, df.distance) >>> df_xyz.x.mean()
- __weakref__¶
list of weak references to the object (if defined)
- bearing(lon1, lat1, lon2, lat2, bearing='bearing', inplace=False)[source]¶
Calculates a bearing, based on http://www.movable-type.co.uk/scripts/latlong.html
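For orientation, the initial-bearing formula given on the referenced page can be sketched in plain Python as below. The function name initial_bearing, the degree units and the normalisation to the range [0, 360) are assumptions made for this illustration and may differ from what the method itself does.

```python
import math


def initial_bearing(lon1, lat1, lon2, lat2):
    """Initial great-circle bearing from point 1 to point 2, in degrees.

    Hypothetical helper: inputs are in degrees, output is normalised
    to [0, 360) with 0 = north and 90 = east.
    """
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    delta_lon = math.radians(lon2 - lon1)
    y = math.sin(delta_lon) * math.cos(phi2)
    x = (math.cos(phi1) * math.sin(phi2)
         - math.sin(phi1) * math.cos(phi2) * math.cos(delta_lon))
    return math.degrees(math.atan2(y, x)) % 360.0


print(initial_bearing(0, 0, 90, 0))   # due east along the equator
print(initial_bearing(0, 0, 0, 10))   # due north
print(initial_bearing(0, 0, 0, -10))  # due south
```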
- cartesian2spherical(x='x', y='y', z='z', alpha='l', delta='b', distance='distance', radians=False, center=None, center_name='solar_position', inplace=False)[source]¶
Convert cartesian to spherical coordinates.
- Parameters
x –
y –
z –
alpha –
delta – name for polar angle, ranges from -90 to 90 (or -pi/2 to pi/2 when radians is True).
distance –
radians –
center –
center_name –
- Returns
- cartesian_to_polar(x='x', y='y', radius_out='r_polar', azimuth_out='phi_polar', propagate_uncertainties=False, radians=False, inplace=False)[source]¶
Convert cartesian to polar coordinates
- Parameters
x – expression for x
y – expression for y
radius_out – name for the virtual column for the radius
azimuth_out – name for the virtual column for the azimuth angle
propagate_uncertainties – If true, will propagate errors for the new virtual columns, see propagate_uncertainties() for details
radians – if True, azimuth is in radians, defaults to degrees
- Returns
- inside_polygon(y, px, py)¶
Test if points defined by x and y are inside the polygon px, py
Example:
>>> import vaex >>> import numpy as np >>> df = vaex.from_arrays(x=[1, 2, 3], y=[2, 3, 4]) >>> px = np.array([1.5, 2.5, 2.5, 1.5]) >>> py = np.array([2.5, 2.5, 3.5, 3.5]) >>> df['inside'] = df.geo.inside_polygon(df.x, df.y, px, py) >>> df # x y inside 0 1 2 False 1 2 3 True 2 3 4 False
- Parameters
x – {expression_one}
y – {expression_one}
px – list of x coordinates for the polygon
py – list of y coordinates for the polygon
- Returns
Expression, which is true if point is inside, else false.
- inside_polygons(y, pxs, pys, any=True)¶
Test if points defined by x and y are inside all or any of the polygons pxs, pys
Example:
>>> import vaex >>> import numpy as np >>> df = vaex.from_arrays(x=[1, 2, 3], y=[2, 3, 4]) >>> px = np.array([1.5, 2.5, 2.5, 1.5]) >>> py = np.array([2.5, 2.5, 3.5, 3.5]) >>> df['inside'] = df.geo.inside_polygons(df.x, df.y, [px, px + 1], [py, py + 1], any=True) >>> df # x y inside 0 1 2 False 1 2 3 True 2 3 4 True
- Parameters
x – {expression_one}
y – {expression_one}
pxs – list of N ndarrays with x coordinates for the polygon, N is the number of polygons
pys – list of N ndarrays with y coordinates for the polygon
any – return true if in any polygon, or all polygons
- Returns
Expression, which is true if point is inside, else false.
- inside_which_polygon(y, pxs, pys)¶
Find in which polygon (0 based index) a point resides
Example:
>>> import vaex >>> import numpy as np >>> df = vaex.from_arrays(x=[1, 2, 3], y=[2, 3, 4]) >>> px = np.array([1.5, 2.5, 2.5, 1.5]) >>> py = np.array([2.5, 2.5, 3.5, 3.5]) >>> df['polygon_index'] = df.geo.inside_which_polygon(df.x, df.y, [px, px + 1], [py, py + 1]) >>> df # x y polygon_index 0 1 2 -- 1 2 3 0 2 3 4 1
- Parameters
x – {expression_one}
y – {expression_one}
pxs – list of N ndarrays with x coordinates for the polygon, N is the number of polygons
pys – list of N ndarrays with y coordinates for the polygon
- Returns
Expression, 0 based index to which polygon the point belongs (or missing/masked value)
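The three polygon tests above can be pictured with a small pure-Python sketch based on the even-odd (ray casting) rule. This is one common way to implement a point-in-polygon test and is not necessarily the algorithm the library uses; the functions below are stand-ins that share names with the methods only for readability, and any_polygon is a hypothetical parameter name.

```python
def inside_polygon(x, y, px, py):
    """Even-odd ray casting for one point against one polygon."""
    inside = False
    j = len(px) - 1  # index of the previous vertex
    for i in range(len(px)):
        # Does the edge from vertex j to vertex i cross the horizontal line at y?
        if (py[i] > y) != (py[j] > y):
            # x coordinate at which the edge crosses that line
            x_cross = px[i] + (y - py[i]) * (px[j] - px[i]) / (py[j] - py[i])
            if x < x_cross:
                inside = not inside
        j = i
    return inside


def inside_polygons(x, y, pxs, pys, any_polygon=True):
    """True if the point is in any (or, with any_polygon=False, all) polygons."""
    hits = [inside_polygon(x, y, px, py) for px, py in zip(pxs, pys)]
    return any(hits) if any_polygon else all(hits)


def inside_which_polygon(x, y, pxs, pys):
    """0-based index of the first polygon containing the point, else None."""
    for index, (px, py) in enumerate(zip(pxs, pys)):
        if inside_polygon(x, y, px, py):
            return index
    return None


# Same points and polygons as in the examples above.
points = [(1, 2), (2, 3), (3, 4)]
px = [1.5, 2.5, 2.5, 1.5]
py = [2.5, 2.5, 3.5, 3.5]
pxs = [px, [v + 1 for v in px]]
pys = [py, [v + 1 for v in py]]

single = [inside_polygon(x, y, px, py) for x, y in points]
either = [inside_polygons(x, y, pxs, pys) for x, y in points]
which = [inside_which_polygon(x, y, pxs, pys) for x, y in points]

print(single)  # [False, True, False]
print(either)  # [False, True, True]
print(which)   # [None, 0, 1]
```

Here None plays the role of the missing/masked value shown as -- in the examples.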
- inside_which_polygons(x, y, pxss, pyss=None, any=True)[source]¶
Find in which set of polygons (0 based index) a point resides.
If any=True, it will be the first matching polygon set index, if any=False, it will be the first index that matches all polygons in the set.
>>> import vaex >>> import numpy as np >>> df = vaex.from_arrays(x=[1, 2, 3], y=[2, 3, 4]) >>> px = np.array([1.5, 2.5, 2.5, 1.5]) >>> py = np.array([2.5, 2.5, 3.5, 3.5]) >>> polygonA = [px, py] >>> polygonB = [px + 1, py + 1] >>> pxs = [[polygonA, polygonB], [polygonA]] >>> df['polygon_index'] = df.geo.inside_which_polygons(df.x, df.y, pxs, any=True) >>> df # x y polygon_index 0 1 2 -- 1 2 3 0 2 3 4 0 >>> df['polygon_index'] = df.geo.inside_which_polygons(df.x, df.y, pxs, any=False) >>> df # x y polygon_index 0 1 2 -- 1 2 3 1 2 3 4 --
- Parameters
x – expression in the form of a string, e.g. ‘x’ or ‘x+y’ or vaex expression object, e.g. df.x or df.x+df.y
y – expression in the form of a string, e.g. ‘x’ or ‘x+y’ or vaex expression object, e.g. df.x or df.x+df.y
pxss – list of N ndarrays with x coordinates for the polygon, N is the number of polygons
pyss – list of N ndarrays with y coordinates for the polygon, if None, the shape of the ndarrays of the last dimension of the x arrays should be 2 (i.e. have the x and y coordinates)
any – test if point is in any polygon (logically or), or all polygons (logically and)
- Returns
Expression, 0 based index to which polygon the point belongs (or missing/masked value)
- project_aitoff(alpha, delta, x, y, radians=True, inplace=False)[source]¶
Add aitoff (https://en.wikipedia.org/wiki/Aitoff_projection) projection
- Parameters
alpha – azimuth angle
delta – polar angle
x – output name for x coordinate
y – output name for y coordinate
radians – input and output in radians (True), or degrees (False)
- Returns
- project_gnomic(alpha, delta, alpha0=0, delta0=0, x='x', y='y', radians=False, postfix='', inplace=False)[source]¶
Adds a gnomic projection to the DataFrame
- rotation_2d(x, y, xnew, ynew, angle_degrees, propagate_uncertainties=False, inplace=False)[source]¶
Rotation in 2d.
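A 2d rotation is usually written as a rotation matrix applied to (x, y). The sketch below assumes the standard counter-clockwise convention for positive angles; the sign convention of the method is not stated here, so treat the direction as an assumption. rotate_2d is a hypothetical helper.

```python
import math


def rotate_2d(x, y, angle_degrees):
    """Rotate the point (x, y) counter-clockwise by angle_degrees."""
    angle = math.radians(angle_degrees)
    x_new = x * math.cos(angle) - y * math.sin(angle)
    y_new = x * math.sin(angle) + y * math.cos(angle)
    return x_new, y_new


# Rotating the unit vector along x by 90 degrees gives the unit vector along y.
x_new, y_new = rotate_2d(1.0, 0.0, 90.0)
print(round(x_new, 12), round(y_new, 12))
```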
- spherical2cartesian(alpha, delta, distance, xname='x', yname='y', zname='z', propagate_uncertainties=False, center=[0, 0, 0], radians=False, inplace=False)[source]¶
Convert spherical to cartesian coordinates.
- Parameters
alpha –
delta – polar angle, ranging from the -90 (south pole) to 90 (north pole)
distance – radial distance, determines the units of x, y and z
xname –
yname –
zname –
propagate_uncertainties – If true, will propagate errors for the new virtual columns, see propagate_uncertainties() for details
center –
radians –
- Returns
New dataframe (if inplace is False) with new x,y,z columns
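With delta measured from the equator (-90 at the south pole, 90 at the north pole), the usual conversion formulas can be sketched as below. The helper names, the choice of measuring alpha from the x axis in the x-y plane, and the omission of the center offset are assumptions made for this illustration.

```python
import math


def spherical_to_cartesian(alpha, delta, distance):
    """Convert (alpha, delta, distance), angles in degrees, to (x, y, z)."""
    a, d = math.radians(alpha), math.radians(delta)
    x = distance * math.cos(d) * math.cos(a)
    y = distance * math.cos(d) * math.sin(a)
    z = distance * math.sin(d)
    return x, y, z


def cartesian_to_spherical(x, y, z):
    """Inverse conversion; returns (alpha, delta, distance) with angles in degrees."""
    distance = math.sqrt(x * x + y * y + z * z)
    alpha = math.degrees(math.atan2(y, x))
    delta = math.degrees(math.asin(z / distance))
    return alpha, delta, distance


# Round trip: converting there and back recovers the original coordinates.
xyz = spherical_to_cartesian(30.0, 45.0, 2.0)
round_trip = cartesian_to_spherical(*xyz)
print(round_trip)
```

The distance sets the units of x, y and z, as noted in the parameter list.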
- velocity_cartesian2polar(x='x', y='y', vx='vx', radius_polar=None, vy='vy', vr_out='vr_polar', vazimuth_out='vphi_polar', propagate_uncertainties=False, inplace=False)[source]¶
Convert cartesian to polar velocities.
- Parameters
x –
y –
vx –
radius_polar – Optional expression for the radius, may lead to a better performance when given.
vy –
vr_out –
vazimuth_out –
propagate_uncertainties – If true, will propagate errors for the new virtual columns, see
propagate_uncertainties()
for details
- Returns
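A common way to obtain polar velocity components is to project the Cartesian velocity onto the radial and azimuthal unit vectors. The sketch below follows that textbook definition, with the azimuthal component positive in the counter-clockwise direction; whether the method uses the same sign convention is an assumption, and cartesian_to_polar_velocity is a hypothetical helper.

```python
import math


def cartesian_to_polar_velocity(x, y, vx, vy, radius_polar=None):
    """Return (vr, vphi) for a point (x, y) moving with velocity (vx, vy).

    radius_polar may be passed in when it is already known, which
    avoids recomputing it. The point must not be at the origin.
    """
    r = math.hypot(x, y) if radius_polar is None else radius_polar
    vr = (x * vx + y * vy) / r    # component along the radial direction
    vphi = (x * vy - y * vx) / r  # component along the azimuthal direction
    return vr, vphi


print(cartesian_to_polar_velocity(1.0, 0.0, 0.0, 1.0))  # purely azimuthal motion
print(cartesian_to_polar_velocity(0.0, 2.0, 0.0, 3.0))  # purely radial motion
```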
- velocity_cartesian2spherical(x='x', y='y', z='z', vx='vx', vy='vy', vz='vz', vr='vr', vlong='vlong', vlat='vlat', distance=None, inplace=False)[source]¶
Convert velocities from a cartesian to a spherical coordinate system
TODO: uncertainty propagation
- Parameters
x – name of x column (input)
y – y
z – z
vx – vx
vy – vy
vz – vz
vr – name of the column for the radial velocity in the r direction (output)
vlong – name of the column for the velocity component in the longitude direction (output)
vlat – name of the column for the velocity component in the latitude direction, positive points to the north pole (output)
distance – Expression for distance, if not given defaults to sqrt(x**2+y**2+z**2), but if this column already exists, passing this expression may lead to a better performance
- Returns
- velocity_polar2cartesian(x='x', y='y', azimuth=None, vr='vr_polar', vazimuth='vphi_polar', vx_out='vx', vy_out='vy', propagate_uncertainties=False, inplace=False)[source]¶
Convert cylindrical polar velocities to Cartesian.
- Parameters
x –
y –
azimuth – Optional expression for the azimuth in degrees, may lead to a better performance when given.
vr –
vazimuth –
vx_out –
vy_out –
propagate_uncertainties – If true, will propagate errors for the new virtual columns, see
propagate_uncertainties()
for details
Logging¶
Sets up logging for vaex.
See configuration of logging for how to configure logging.
String operations¶
- class vaex.expression.StringOperations(expression)[source]¶
Bases:
object
String operations.
Usually accessed using e.g. df.name.str.lower()
- __weakref__¶
list of weak references to the object (if defined)
- byte_length()¶
Returns the number of bytes in a string sample.
- Returns
an expression containing the number of bytes in each sample of a string column.
Example:
>>> import vaex >>> text = ['Something', 'very pretty', 'is coming', 'our', 'way.'] >>> df = vaex.from_arrays(text=text) >>> df # text 0 Something 1 very pretty 2 is coming 3 our 4 way.
>>> df.text.str.byte_length() Expression = str_byte_length(text) Length: 5 dtype: int64 (expression) ----------------------------------- 0 9 1 11 2 9 3 3 4 4
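Byte length and character length coincide for the ASCII strings in the example but differ for other text. The plain-Python sketch below assumes UTF-8 encoding (an assumption for this illustration) and adds one non-ASCII string to show the difference.

```python
text = ['Something', 'very pretty', 'is coming', 'our', 'way.']

# For pure ASCII text every character occupies exactly one byte.
byte_lengths = [len(s.encode('utf-8')) for s in text]
print(byte_lengths)  # [9, 11, 9, 3, 4]

# A non-ASCII character such as 'é' takes two bytes in UTF-8.
word = 'café'
print(len(word), len(word.encode('utf-8')))  # 4 5
```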
- capitalize()¶
Capitalize the first letter of a string sample.
- Returns
an expression containing the capitalized strings.
Example:
>>> import vaex >>> text = ['Something', 'very pretty', 'is coming', 'our', 'way.'] >>> df = vaex.from_arrays(text=text) >>> df # text 0 Something 1 very pretty 2 is coming 3 our 4 way.
>>> df.text.str.capitalize() Expression = str_capitalize(text) Length: 5 dtype: str (expression) --------------------------------- 0 Something 1 Very pretty 2 Is coming 3 Our 4 Way.
- cat(other)¶
Concatenate two string columns on a row-by-row basis.
- Parameters
other (expression) – The expression of the other column to be concatenated.
- Returns
an expression containing the concatenated columns.
Example:
>>> import vaex >>> text = ['Something', 'very pretty', 'is coming', 'our', 'way.'] >>> df = vaex.from_arrays(text=text) >>> df # text 0 Something 1 very pretty 2 is coming 3 our 4 way.
>>> df.text.str.cat(df.text) Expression = str_cat(text, text) Length: 5 dtype: str (expression) --------------------------------- 0 SomethingSomething 1 very prettyvery pretty 2 is comingis coming 3 ourour 4 way.way.
- center(width, fillchar=' ')¶
Fills the left and right side of the strings with additional characters, such that the sample has a total of width characters.
- Parameters
- Returns
an expression containing the filled strings.
Example:
>>> import vaex >>> text = ['Something', 'very pretty', 'is coming', 'our', 'way.'] >>> df = vaex.from_arrays(text=text) >>> df # text 0 Something 1 very pretty 2 is coming 3 our 4 way.
>>> df.text.str.center(width=11, fillchar='!') Expression = str_center(text, width=11, fillchar='!') Length: 5 dtype: str (expression) --------------------------------- 0 !Something! 1 very pretty 2 !is coming! 3 !!!!our!!!! 4 !!!!way.!!!
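Python's built-in str.center produces the same strings as the output above for this data, so it can serve as a quick mental model of the padding behaviour. Strings that already have width characters or more are returned unchanged.

```python
text = ['Something', 'very pretty', 'is coming', 'our', 'way.']

# Pad both sides with '!' until each string is 11 characters wide.
centered = [s.center(11, '!') for s in text]

print(centered)
# ['!Something!', 'very pretty', '!is coming!', '!!!!our!!!!', '!!!!way.!!!']
```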
- contains(pattern, regex=True)¶
Check if a string pattern or regex is contained within a sample of a string column.
- Parameters
- Returns
an expression which is evaluated to True if the pattern is found in a given sample, and it is False otherwise.
Example:
>>> import vaex >>> text = ['Something', 'very pretty', 'is coming', 'our', 'way.'] >>> df = vaex.from_arrays(text=text) >>> df # text 0 Something 1 very pretty 2 is coming 3 our 4 way.
>>> df.text.str.contains('very') Expression = str_contains(text, 'very') Length: 5 dtype: bool (expression) ---------------------------------- 0 False 1 True 2 False 3 False 4 False
- count(pat, regex=False)¶
Count the occurrences of a pattern in each sample of a string column.
- Parameters
- Returns
an expression containing the number of times a pattern is found in each sample.
Example:
>>> import vaex >>> text = ['Something', 'very pretty', 'is coming', 'our', 'way.'] >>> df = vaex.from_arrays(text=text) >>> df # text 0 Something 1 very pretty 2 is coming 3 our 4 way.
>>> df.text.str.count(pat="et", regex=False) Expression = str_count(text, pat='et', regex=False) Length: 5 dtype: int64 (expression) ----------------------------------- 0 1 1 1 2 0 3 0 4 0
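The difference between a literal and a regex pattern can be sketched in plain Python with str.count and re.findall. This only mirrors the counting semantics for non-overlapping matches; it is not how the library evaluates the expression, and count_occurrences is a hypothetical helper.

```python
import re


def count_occurrences(sample, pat, regex=False):
    """Count non-overlapping occurrences of pat in sample."""
    if regex:
        return len(re.findall(pat, sample))
    return sample.count(pat)


text = ['Something', 'very pretty', 'is coming', 'our', 'way.']

literal_counts = [count_occurrences(s, 'et') for s in text]
print(literal_counts)  # [1, 1, 0, 0, 0]

# As a literal, 't' occurs twice in 'very pretty'; as the regex 't+' the run 'tt' counts once.
print(count_occurrences('very pretty', 't'))               # 2
print(count_occurrences('very pretty', 't+', regex=True))  # 1
```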
- endswith(pat)¶
Check if the end of each string sample matches the specified pattern.
- Parameters
pat (str) – A string pattern or a regex
- Returns
an expression evaluated to True if the pattern is found at the end of a given sample, False otherwise.
Example:
>>> import vaex >>> text = ['Something', 'very pretty', 'is coming', 'our', 'way.'] >>> df = vaex.from_arrays(text=text) >>> df # text 0 Something 1 very pretty 2 is coming 3 our 4 way.
>>> df.text.str.endswith(pat="ing") Expression = str_endswith(text, pat='ing') Length: 5 dtype: bool (expression) ---------------------------------- 0 True 1 False 2 True 3 False 4 False
- equals(y)¶
Tests if strings x and y are the same
- Returns
a boolean expression
Example:
>>> import vaex >>> text = ['Something', 'very pretty', 'is coming', 'our', 'way.'] >>> df = vaex.from_arrays(text=text) >>> df # text 0 Something 1 very pretty 2 is coming 3 our 4 way.
>>> df.text.str.equals(df.text) Expression = str_equals(text, text) Length: 5 dtype: bool (expression) ---------------------------------- 0 True 1 True 2 True 3 True 4 True
>>> df.text.str.equals('our') Expression = str_equals(text, 'our') Length: 5 dtype: bool (expression) ---------------------------------- 0 False 1 False 2 False 3 True 4 False
- extract_regex(pattern)¶
Extract substrings defined by a regular expression using Apache Arrow (Google RE2 library).
- Parameters
pattern (str) – A regular expression which needs to contain named capture groups, e.g. ‘letter’ and ‘digit’ for the regular expression ‘(?P<letter>[ab])(?P<digit>\d)’.
- Returns
an expression containing a struct with field names corresponding to capture group identifiers.
Example:
>>> import vaex >>> email = ["foo@bar.org", "bar@foo.org", "open@source.org", "invalid@address.com"] >>> df = vaex.from_arrays(email=email) >>> df # email 0 foo@bar.org 1 bar@foo.org 2 open@source.org 3 invalid@address.com
>>> pattern = "(?P<name>.*)@(?P<address>.*)\.org" >>> df.email.str.extract_regex(pattern=pattern) Expression = str_extract_regex(email, pattern='(?P<name>.*)@(?P<addres... Length: 4 dtype: struct<name: string, address: string> (expression) ------------------------------------------------------------------- 0 {'name': 'foo', 'address': 'bar'} 1 {'name': 'bar', 'address': 'foo'} 2 {'name': 'open', 'address': 'source'} 3 --
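The mapping from named capture groups to struct fields can be sketched with Python's re module. This is only an approximation: the method itself uses Apache Arrow with the RE2 library, whose syntax and matching rules are not identical to Python's, and the extract helper is hypothetical.

```python
import re

email = ["foo@bar.org", "bar@foo.org", "open@source.org", "invalid@address.com"]
pattern = re.compile(r"(?P<name>.*)@(?P<address>.*)\.org")

def extract(sample):
    """Return a dict keyed by capture-group name, or None when nothing matches."""
    match = pattern.search(sample)
    return match.groupdict() if match else None

extracted = [extract(s) for s in email]
print(extracted)
```

The last sample does not end in .org, so it yields no match, corresponding to the missing value in the output above.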
- find(sub, start=0, end=None)¶
Returns the lowest index in each string of a column at which the provided substring is fully contained within the sample. If the substring is not found, -1 is returned.
- Parameters
- Returns
an expression containing the lowest indices specifying the start of the substring.
Example:
>>> import vaex >>> text = ['Something', 'very pretty', 'is coming', 'our', 'way.'] >>> df = vaex.from_arrays(text=text) >>> df # text 0 Something 1 very pretty 2 is coming 3 our 4 way.
>>> df.text.str.find(sub="et") Expression = str_find(text, sub='et') Length: 5 dtype: int64 (expression) ----------------------------------- 0 3 1 7 2 -1 3 -1 4 -1
- get(i)¶
Extract a character from each sample at the specified position from a string column. Note that if the specified position is out of bounds for the string sample, this method returns ‘’, while pandas returns nan.
- Parameters
i (int) – The index location, at which to extract the character.
- Returns
an expression containing the extracted characters.
Example:
>>> import vaex >>> text = ['Something', 'very pretty', 'is coming', 'our', 'way.'] >>> df = vaex.from_arrays(text=text) >>> df # text 0 Something 1 very pretty 2 is coming 3 our 4 way.
>>> df.text.str.get(5) Expression = str_get(text, 5) Length: 5 dtype: str (expression) --------------------------------- 0 h 1 p 2 m 3 4
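The out-of-bounds behaviour described above can be mimicked in plain Python with a one-character slice, which yields an empty string instead of raising. The get helper is a hypothetical stand-in and only covers non-negative positions.

```python
text = ['Something', 'very pretty', 'is coming', 'our', 'way.']

def get(sample, i):
    """One-character slice: '' when i is past the end (non-negative i only)."""
    return sample[i:i + 1]

chars = [get(s, 5) for s in text]
print(chars)
```

Plain indexing such as sample[5] would raise IndexError for the two short strings, which is why the sketch slices instead.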
- index(sub, start=0, end=None)¶
Returns the lowest index in each string of a column at which the provided substring is fully contained within the sample. If the substring is not found, -1 is returned. It is the same as str.find.
- Parameters
- Returns
an expression containing the lowest indices specifying the start of the substring.
Example:
>>> import vaex >>> text = ['Something', 'very pretty', 'is coming', 'our', 'way.'] >>> df = vaex.from_arrays(text=text) >>> df # text 0 Something 1 very pretty 2 is coming 3 our 4 way.
>>> df.text.str.index(sub="et") Expression = str_find(text, sub='et') Length: 5 dtype: int64 (expression) ----------------------------------- 0 3 1 7 2 -1 3 -1 4 -1
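For readers used to Python's built-in string methods, the sketch below contrasts the two. It is plain Python and assumes the per-sample semantics of find match str.find, which agrees with the output above; index_or_minus_one is a hypothetical helper showing the documented "-1 instead of an error" behaviour.

```python
text = ['Something', 'very pretty', 'is coming', 'our', 'way.']

# str.find already returns -1 when the substring is absent.
found = [s.find('et') for s in text]

def index_or_minus_one(sample, sub):
    """Python's str.index raises ValueError on a miss; map that to -1."""
    try:
        return sample.index(sub)
    except ValueError:
        return -1

indexed = [index_or_minus_one(s, 'et') for s in text]
print(found, indexed)
```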
- isalnum(ascii=False)¶
Check if all characters in a string sample are alphanumeric.
- Parameters
ascii (bool) – Transform only ascii characters (usually faster).
- Returns
an expression evaluated to True if a sample contains only alphanumeric characters, otherwise False.
Example:
>>> import vaex >>> text = ['Something', 'very pretty', 'is coming', 'our', 'way.'] >>> df = vaex.from_arrays(text=text) >>> df # text 0 Something 1 very pretty 2 is coming 3 our 4 way.
>>> df.text.str.isalnum() Expression = str_isalnum(text) Length: 5 dtype: bool (expression) ---------------------------------- 0 True 1 False 2 False 3 True 4 False
- isalpha()¶
Check if all characters in a string sample are alphabetic.
- Returns
an expression evaluated to True if a sample contains only alphabetic characters, otherwise False.
Example:
>>> import vaex >>> text = ['Something', 'very pretty', 'is coming', 'our', 'way.'] >>> df = vaex.from_arrays(text=text) >>> df # text 0 Something 1 very pretty 2 is coming 3 our 4 way.
>>> df.text.str.isalpha() Expression = str_isalpha(text) Length: 5 dtype: bool (expression) ---------------------------------- 0 True 1 False 2 False 3 True 4 False
- isdigit()¶
Check if all characters in a string sample are digits.
- Returns
an expression evaluated to True if a sample contains only digits, otherwise False.
Example:
>>> import vaex >>> text = ['Something', 'very pretty', 'is coming', 'our', '6'] >>> df = vaex.from_arrays(text=text) >>> df # text 0 Something 1 very pretty 2 is coming 3 our 4 6
>>> df.text.str.isdigit() Expression = str_isdigit(text) Length: 5 dtype: bool (expression) ---------------------------------- 0 False 1 False 2 False 3 False 4 True
- islower()¶
Check if all characters in a string sample are lowercase characters.
- Returns
an expression evaluated to True if a sample contains only lowercase characters, otherwise False.
Example:
>>> import vaex >>> text = ['Something', 'very pretty', 'is coming', 'our', 'way.'] >>> df = vaex.from_arrays(text=text) >>> df # text 0 Something 1 very pretty 2 is coming 3 our 4 way.
>>> df.text.str.islower() Expression = str_islower(text) Length: 5 dtype: bool (expression) ---------------------------------- 0 False 1 True 2 True 3 True 4 True
- isspace()¶
Check if all characters in a string sample are whitespaces.
- Returns
an expression evaluated to True if a sample contains only whitespaces, otherwise False.
Example:
>>> import vaex >>> text = ['Something', 'very pretty', 'is coming', ' ', ' '] >>> df = vaex.from_arrays(text=text) >>> df # text 0 Something 1 very pretty 2 is coming 3 4
>>> df.text.str.isspace() Expression = str_isspace(text) Length: 5 dtype: bool (expression) ---------------------------------- 0 False 1 False 2 False 3 True 4 True
- istitle(ascii=False)¶
TODO
- isupper()¶
Check if all characters in a string sample are uppercase characters.
- Returns
an expression evaluated to True if a sample contains only uppercase characters, otherwise False.
Example:
>>> import vaex >>> text = ['SOMETHING', 'very pretty', 'is coming', 'our', 'way.'] >>> df = vaex.from_arrays(text=text) >>> df # text 0 SOMETHING 1 very pretty 2 is coming 3 our 4 way.
>>> df.text.str.isupper() Expression = str_isupper(text) Length: 5 dtype: bool (expression) ---------------------------------- 0 True 1 False 2 False 3 False 4 False
- join(sep)¶
Join the strings of each list-valued sample into a single string, using sep as the separator.
- len()¶
Returns the length of a string sample.
- Returns
an expression containing the length of each sample of a string column.
Example:
>>> import vaex >>> text = ['Something', 'very pretty', 'is coming', 'our', 'way.'] >>> df = vaex.from_arrays(text=text) >>> df # text 0 Something 1 very pretty 2 is coming 3 our 4 way.
>>> df.text.str.len() Expression = str_len(text) Length: 5 dtype: int64 (expression) ----------------------------------- 0 9 1 11 2 9 3 3 4 4
- ljust(width, fillchar=' ')¶
Fills the right side of string samples with a specified character such that the strings are left-hand justified.
- Parameters
- Returns
an expression containing the filled strings.
Example:
>>> import vaex >>> text = ['Something', 'very pretty', 'is coming', 'our', 'way.'] >>> df = vaex.from_arrays(text=text) >>> df # text 0 Something 1 very pretty 2 is coming 3 our 4 way.
>>> df.text.str.ljust(width=10, fillchar='!') Expression = str_ljust(text, width=10, fillchar='!') Length: 5 dtype: str (expression) --------------------------------- 0 Something! 1 very pretty 2 is coming! 3 our!!!!!!! 4 way.!!!!!!
- lower()¶
Converts string samples to lower case.
- Returns
an expression containing the converted strings.
Example:
>>> import vaex >>> text = ['Something', 'very pretty', 'is coming', 'our', 'way.'] >>> df = vaex.from_arrays(text=text) >>> df # text 0 Something 1 very pretty 2 is coming 3 our 4 way.
>>> df.text.str.lower() Expression = str_lower(text) Length: 5 dtype: str (expression) --------------------------------- 0 something 1 very pretty 2 is coming 3 our 4 way.
- lstrip(to_strip=None)¶
Remove leading characters from a string sample.
- Parameters
to_strip (str) – The string to be removed
- Returns
an expression containing the modified string column.
Example:
>>> import vaex >>> text = ['Something', 'very pretty', 'is coming', 'our', 'way.'] >>> df = vaex.from_arrays(text=text) >>> df # text 0 Something 1 very pretty 2 is coming 3 our 4 way.
>>> df.text.str.lstrip(to_strip='very ') Expression = str_lstrip(text, to_strip='very ') Length: 5 dtype: str (expression) --------------------------------- 0 Something 1 pretty 2 is coming 3 our 4 way.
- match(pattern)¶
Check if a string sample matches a given regular expression.
- Parameters
pattern (str) – a string or regex to match to a string sample.
- Returns
an expression which is evaluated to True if a match is found, False otherwise.
Example:
>>> import vaex >>> text = ['Something', 'very pretty', 'is coming', 'our', 'way.'] >>> df = vaex.from_arrays(text=text) >>> df # text 0 Something 1 very pretty 2 is coming 3 our 4 way.
>>> df.text.str.match(pattern='our') Expression = str_match(text, pattern='our') Length: 5 dtype: bool (expression) ---------------------------------- 0 False 1 False 2 False 3 True 4 False
- notequals(y)¶
Tests if strings x and y are not the same
- Returns
a boolean expression
Example:
>>> import vaex >>> text = ['Something', 'very pretty', 'is coming', 'our', 'way.'] >>> df = vaex.from_arrays(text=text) >>> df # text 0 Something 1 very pretty 2 is coming 3 our 4 way.
>>> df.text.str.notequals(df.text) Expression = str_notequals(text, text) Length: 5 dtype: bool (expression) ---------------------------------- 0 False 1 False 2 False 3 False 4 False
>>> df.text.str.notequals('our') Expression = str_notequals(text, 'our') Length: 5 dtype: bool (expression) ---------------------------------- 0 True 1 True 2 True 3 False 4 True
- pad(width, side='left', fillchar=' ')¶
Pad strings in a given column.
- Parameters
- Returns
an expression containing the padded strings.
Example:
>>> import vaex >>> text = ['Something', 'very pretty', 'is coming', 'our', 'way.'] >>> df = vaex.from_arrays(text=text) >>> df # text 0 Something 1 very pretty 2 is coming 3 our 4 way.
>>> df.text.str.pad(width=10, side='left', fillchar='!') Expression = str_pad(text, width=10, side='left', fillchar='!') Length: 5 dtype: str (expression) --------------------------------- 0 !Something 1 very pretty 2 !is coming 3 !!!!!!!our 4 !!!!!!way.
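The relationship between pad and the justification methods can be sketched in plain Python. Only side='left' is confirmed by the example above; the 'right' and 'both' values are an assumption borrowed from the pandas API, and the pad helper is hypothetical.

```python
text = ['Something', 'very pretty', 'is coming', 'our', 'way.']

def pad(sample, width, side='left', fillchar=' '):
    """Hypothetical per-sample sketch (not vaex code)."""
    if side == 'left':    # fill on the left, so the text ends up right-justified
        return sample.rjust(width, fillchar)
    if side == 'right':   # assumed value: fill on the right
        return sample.ljust(width, fillchar)
    if side == 'both':    # assumed value: fill on both sides
        return sample.center(width, fillchar)
    raise ValueError("side must be 'left', 'right' or 'both'")

left = [pad(s, 10, 'left', '!') for s in text]
right = [pad(s, 10, 'right', '!') for s in text]
print(left)
print(right)
```

Under this reading, padding on the left gives the same result as rjust and padding on the right the same as ljust.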
- repeat(repeats)¶
Duplicate each string in a column.
- Parameters
repeats (int) – number of times each string sample is to be duplicated.
- Returns
an expression containing the duplicated strings
Example:
>>> import vaex >>> text = ['Something', 'very pretty', 'is coming', 'our', 'way.'] >>> df = vaex.from_arrays(text=text) >>> df # text 0 Something 1 very pretty 2 is coming 3 our 4 way.
>>> df.text.str.repeat(3) Expression = str_repeat(text, 3) Length: 5 dtype: str (expression) --------------------------------- 0 SomethingSomethingSomething 1 very prettyvery prettyvery pretty 2 is comingis comingis coming 3 ourourour 4 way.way.way.
- replace(pat, repl, n=-1, flags=0, regex=False)¶
Replace occurrences of a pattern/regex in a column with some other string.
- Parameters
- Returns
an expression containing the string replacements.
Example:
>>> import vaex >>> text = ['Something', 'very pretty', 'is coming', 'our', 'way.'] >>> df = vaex.from_arrays(text=text) >>> df # text 0 Something 1 very pretty 2 is coming 3 our 4 way.
>>> df.text.str.replace(pat='et', repl='__') Expression = str_replace(text, pat='et', repl='__') Length: 5 dtype: str (expression) --------------------------------- 0 Som__hing 1 very pr__ty 2 is coming 3 our 4 way.
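A plain-Python sketch of the literal (regex=False) case follows. It assumes that n limits the number of replacements in the same way as the count argument of Python's str.replace, with -1 meaning "replace all"; that reading is an assumption, since the parameter descriptions are not shown here.

```python
text = ['Something', 'very pretty', 'is coming', 'our', 'way.']

# Replace every literal occurrence of 'et' (the n=-1 default, under the assumption above).
replaced = [s.replace('et', '__') for s in text]

# Replace only the first occurrence (what n=1 would mean under the same assumption).
limited = 'pretty pretty'.replace('et', '__', 1)

print(replaced, limited)
```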
- rfind(sub, start=0, end=None)¶
Returns the highest index in each string of a column at which the provided substring is fully contained within the sample. If the substring is not found, -1 is returned.
- Parameters
- Returns
an expression containing the highest indices specifying the start of the substring.
Example:
>>> import vaex >>> text = ['Something', 'very pretty', 'is coming', 'our', 'way.'] >>> df = vaex.from_arrays(text=text) >>> df # text 0 Something 1 very pretty 2 is coming 3 our 4 way.
>>> df.text.str.rfind(sub="et") Expression = str_rfind(text, sub='et') Length: 5 dtype: int64 (expression) ----------------------------------- 0 3 1 7 2 -1 3 -1 4 -1
- rindex(sub, start=0, end=None)¶
Returns the highest index in each string of a column at which the provided substring is fully contained within the sample. If the substring is not found, -1 is returned. Same as str.rfind.
- Parameters
- Returns
an expression containing the highest indices specifying the start of the substring.
Example:
>>> import vaex >>> text = ['Something', 'very pretty', 'is coming', 'our', 'way.'] >>> df = vaex.from_arrays(text=text) >>> df # text 0 Something 1 very pretty 2 is coming 3 our 4 way.
>>> df.text.str.rindex(sub="et") Expression = str_rindex(text, sub='et') Length: 5 dtype: int64 (expression) ----------------------------------- 0 3 1 7 2 -1 3 -1 4 -1
- rjust(width, fillchar=' ')¶
Fills the left side of string samples with a specified character such that the strings are right-hand justified.
- Parameters
- Returns
an expression containing the filled strings.
Example:
>>> import vaex >>> text = ['Something', 'very pretty', 'is coming', 'our', 'way.'] >>> df = vaex.from_arrays(text=text) >>> df # text 0 Something 1 very pretty 2 is coming 3 our 4 way.
>>> df.text.str.rjust(width=10, fillchar='!') Expression = str_rjust(text, width=10, fillchar='!') Length: 5 dtype: str (expression) --------------------------------- 0 !Something 1 very pretty 2 !is coming 3 !!!!!!!our 4 !!!!!!way.
- rstrip(to_strip=None)¶
Remove trailing characters from a string sample.
- Parameters
to_strip (str) – The string to be removed
- Returns
an expression containing the modified string column.
Example:
>>> import vaex >>> text = ['Something', 'very pretty', 'is coming', 'our', 'way.'] >>> df = vaex.from_arrays(text=text) >>> df # text 0 Something 1 very pretty 2 is coming 3 our 4 way.
>>> df.text.str.rstrip(to_strip='ing') Expression = str_rstrip(text, to_strip='ing') Length: 5 dtype: str (expression) --------------------------------- 0 Someth 1 very pretty 2 is com 3 our 4 way.
- slice(start=0, stop=None)¶
Slice substrings from each string element in a column.
- Parameters
- Returns
an expression containing the sliced substrings.
Example:
>>> import vaex >>> text = ['Something', 'very pretty', 'is coming', 'our', 'way.'] >>> df = vaex.from_arrays(text=text) >>> df # text 0 Something 1 very pretty 2 is coming 3 our 4 way.
>>> df.text.str.slice(start=2, stop=5) Expression = str_pandas_slice(text, start=2, stop=5) Length: 5 dtype: str (expression) --------------------------------- 0 met 1 ry 2 co 3 r 4 y.
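The start and stop arguments behave like an ordinary Python slice applied to every sample, as the following vaex-free sketch shows (assuming the same half-open [start, stop) convention, which matches the output above).

```python
text = ['Something', 'very pretty', 'is coming', 'our', 'way.']

# Half-open slice: characters at positions 2, 3 and 4, where they exist.
sliced = [s[2:5] for s in text]
print(sliced)
```

Samples shorter than stop are simply truncated at their end; no error is raised.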
- startswith(pat)¶
Check if the start of a string matches a pattern.
- Parameters
pat (str) – A string pattern. Regular expressions are not supported.
- Returns
an expression which is evaluated to True if the pattern is found at the start of a string sample, False otherwise.
Example:
>>> import vaex >>> text = ['Something', 'very pretty', 'is coming', 'our', 'way.'] >>> df = vaex.from_arrays(text=text) >>> df # text 0 Something 1 very pretty 2 is coming 3 our 4 way.
>>> df.text.str.startswith(pat='is') Expression = str_startswith(text, pat='is') Length: 5 dtype: bool (expression) ---------------------------------- 0 False 1 False 2 True 3 False 4 False
- strip(to_strip=None)¶
Removes leading and trailing characters.
Strips whitespaces (including new lines), or a set of specified characters, from each string sample in a column, from both the left and right sides.
- Parameters
to_strip (str) – The characters to be removed. All combinations of the characters will be removed. If None, it removes whitespaces.
- Returns
an expression containing the modified string samples.
Example:
>>> import vaex >>> text = ['Something', 'very pretty', 'is coming', 'our', 'way.'] >>> df = vaex.from_arrays(text=text) >>> df # text 0 Something 1 very pretty 2 is coming 3 our 4 way.
>>> df.text.str.strip(to_strip='very') Expression = str_strip(text, to_strip='very') Length: 5 dtype: str (expression) --------------------------------- 0 Something 1 prett 2 is coming 3 ou 4 way.
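Because to_strip is treated as a set of characters rather than as a prefix or suffix, the result can be surprising. The plain-Python sketch below assumes the per-sample behaviour matches str.strip, which agrees with the output above.

```python
text = ['Something', 'very pretty', 'is coming', 'our', 'way.']

# Any run of the characters 'v', 'e', 'r', 'y' is removed from both ends.
stripped = [s.strip('very') for s in text]
print(stripped)

# The order of the characters in to_strip does not matter.
same_set = 'very pretty'.strip('yrev')
```

Note that 'very pretty' keeps its inner space and loses its final 'y', and that 'our' loses its trailing 'r' even though it does not contain the word 'very'.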
- title()¶
Converts all string samples to titlecase.
- Returns
an expression containing the converted strings.
Example:
>>> import vaex >>> text = ['Something', 'very pretty', 'is coming', 'our', 'way.'] >>> df = vaex.from_arrays(text=text) >>> df # text 0 Something 1 very pretty 2 is coming 3 our 4 way.
>>> df.text.str.title() Expression = str_title(text) Length: 5 dtype: str (expression) --------------------------------- 0 Something 1 Very Pretty 2 Is Coming 3 Our 4 Way.
- upper(ascii=False)¶
Converts all strings in a column to uppercase.
- Parameters
ascii (bool) – Transform only ascii characters (usually faster).
- Returns
an expression containing the converted strings.
Example:
>>> import vaex >>> text = ['Something', 'very pretty', 'is coming', 'our', 'way.'] >>> df = vaex.from_arrays(text=text) >>> df # text 0 Something 1 very pretty 2 is coming 3 our 4 way.
>>> df.text.str.upper() Expression = str_upper(text) Length: 5 dtype: str (expression) --------------------------------- 0 SOMETHING 1 VERY PRETTY 2 IS COMING 3 OUR 4 WAY.
- zfill(width)¶
Pad strings in a column by prepending “0” characters.
- Parameters
width (int) – The minimum length of the resulting string. Strings shorter than width will be prepended with zeros.
- Returns
an expression containing the modified strings.
Example:
>>> import vaex >>> text = ['Something', 'very pretty', 'is coming', 'our', 'way.'] >>> df = vaex.from_arrays(text=text) >>> df # text 0 Something 1 very pretty 2 is coming 3 our 4 way.
>>> df.text.str.zfill(width=12) Expression = str_zfill(text, width=12) Length: 5 dtype: str (expression) --------------------------------- 0 000Something 1 0very pretty 2 000is coming 3 000000000our 4 00000000way.
String (pandas) operations¶
- class vaex.expression.StringOperationsPandas(expression)[source]¶
Bases:
object
String operations using Pandas Series (much slower)
- __weakref__¶
list of weak references to the object (if defined)
- byte_length(**kwargs)¶
Wrapper around pandas.Series.byte_length
- capitalize(**kwargs)¶
Wrapper around pandas.Series.capitalize
- cat(**kwargs)¶
Wrapper around pandas.Series.cat
- center(**kwargs)¶
Wrapper around pandas.Series.center
- contains(**kwargs)¶
Wrapper around pandas.Series.contains
- count(**kwargs)¶
Wrapper around pandas.Series.count
- endswith(**kwargs)¶
Wrapper around pandas.Series.endswith
- equals(**kwargs)¶
Wrapper around pandas.Series.equals
- extract_regex(**kwargs)¶
Wrapper around pandas.Series.extract_regex
- find(**kwargs)¶
Wrapper around pandas.Series.find
- get(**kwargs)¶
Wrapper around pandas.Series.get
- index(**kwargs)¶
Wrapper around pandas.Series.index
- isalnum(**kwargs)¶
Wrapper around pandas.Series.isalnum
- isalpha(**kwargs)¶
Wrapper around pandas.Series.isalpha
- isdigit(**kwargs)¶
Wrapper around pandas.Series.isdigit
- islower(**kwargs)¶
Wrapper around pandas.Series.islower
- isspace(**kwargs)¶
Wrapper around pandas.Series.isspace
- istitle(**kwargs)¶
Wrapper around pandas.Series.istitle
- isupper(**kwargs)¶
Wrapper around pandas.Series.isupper
- join(**kwargs)¶
Wrapper around pandas.Series.join
- len(**kwargs)¶
Wrapper around pandas.Series.len
- ljust(**kwargs)¶
Wrapper around pandas.Series.ljust
- lower(**kwargs)¶
Wrapper around pandas.Series.lower
- lstrip(**kwargs)¶
Wrapper around pandas.Series.lstrip
- match(**kwargs)¶
Wrapper around pandas.Series.match
- notequals(**kwargs)¶
Wrapper around pandas.Series.notequals
- pad(**kwargs)¶
Wrapper around pandas.Series.pad
- repeat(**kwargs)¶
Wrapper around pandas.Series.repeat
- replace(**kwargs)¶
Wrapper around pandas.Series.replace
- rfind(**kwargs)¶
Wrapper around pandas.Series.rfind
- rindex(**kwargs)¶
Wrapper around pandas.Series.rindex
- rjust(**kwargs)¶
Wrapper around pandas.Series.rjust
- rsplit(**kwargs)¶
Wrapper around pandas.Series.rsplit
- rstrip(**kwargs)¶
Wrapper around pandas.Series.rstrip
- slice(**kwargs)¶
Wrapper around pandas.Series.slice
- split(**kwargs)¶
Wrapper around pandas.Series.split
- startswith(**kwargs)¶
Wrapper around pandas.Series.startswith
- strip(**kwargs)¶
Wrapper around pandas.Series.strip
- title(**kwargs)¶
Wrapper around pandas.Series.title
- upper(**kwargs)¶
Wrapper around pandas.Series.upper
- zfill(**kwargs)¶
Wrapper around pandas.Series.zfill
Struct (arrow) operations¶
- class vaex.expression.StructOperations(expression)[source]¶
Bases:
collections.abc.Mapping
Struct Array operations.
Usually accessed using e.g. df.name.struct.get('field1')
- __getitem__(key)[source]¶
Return struct field by either field name (string) or index position (integer).
In case of ambiguous field names, a LookupError is raised.
- __weakref__¶
list of weak references to the object (if defined)
- property dtypes¶
Return all field names along with corresponding types.
- Returns
a pandas series with keys as index and types as values.
Example:
>>> import vaex >>> import pyarrow as pa >>> array = pa.StructArray.from_arrays(arrays=[[1,2], ["a", "b"]], names=["col1", "col2"]) >>> df = vaex.from_arrays(array=array) >>> df # array 0 {'col1': 1, 'col2': 'a'} 1 {'col1': 2, 'col2': 'b'}
>>> df.array.struct.dtypes col1 int64 col2 string dtype: object
- get(field)¶
Return a single field from a struct array. You may also use the shorthand notation df.name[:, 'field'].
Please note that with duplicated field labels, a field cannot be uniquely identified by name; use index-position-based access instead. To get the corresponding field indices, use {idx: key for idx, key in enumerate(df.array.struct)}.
- Parameters
field ({str, int}) – A string (label) or integer (index position) identifying a struct field.
- Returns
an expression containing a struct field.
Example:
>>> import vaex >>> import pyarrow as pa >>> array = pa.StructArray.from_arrays(arrays=[[1,2], ["a", "b"]], names=["col1", "col2"]) >>> df = vaex.from_arrays(array=array) >>> df # array 0 {'col1': 1, 'col2': 'a'} 1 {'col1': 2, 'col2': 'b'}
>>> df.array.struct.get("col1") Expression = struct_get(array, 'col1') Length: 2 dtype: int64 (expression) ----------------------------------- 0 1 1 2
>>> df.array.struct.get(0) Expression = struct_get(array, 0) Length: 2 dtype: int64 (expression) ----------------------------------- 0 1 1 2
>>> df.array[:, 'col1'] Expression = struct_get(array, 'col1') Length: 2 dtype: int64 (expression) ----------------------------------- 0 1 1 2
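The index lookup suggested above for duplicated labels can be tried out without vaex. In this sketch a plain list of field names, with a deliberately duplicated label, stands in for iterating over df.array.struct; the names are hypothetical.

```python
# Hypothetical field labels of a struct array; 'col1' appears twice.
field_names = ['col1', 'col2', 'col1']

# Same comprehension as in the note above, applied to the stand-in list.
index_to_key = {idx: key for idx, key in enumerate(field_names)}

# Positions that can be passed to get() instead of the ambiguous label.
positions = [idx for idx, key in index_to_key.items() if key == 'col1']
print(index_to_key, positions)
```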
- items()[source]¶
Return all fields with names along with corresponding vaex expressions.
- Returns
list of tuples with field names and fields as vaex expressions.
Example:
>>> import vaex >>> import pyarrow as pa >>> array = pa.StructArray.from_arrays(arrays=[[1,2], ["a", "b"]], names=["col1", "col2"]) >>> df = vaex.from_arrays(array=array) >>> df # array 0 {'col1': 1, 'col2': 'a'} 1 {'col1': 2, 'col2': 'b'}
>>> df.array.struct.items() [('col1', Expression = struct_get(array, 0) Length: 2 dtype: int64 (expression) ----------------------------------- 0 1 1 2), ('col2', Expression = struct_get(array, 1) Length: 2 dtype: string (expression) ------------------------------------ 0 a 1 b)]
- keys()[source]¶
Return all field names contained in struct array.
- Returns
list of field names.
Example:
>>> import vaex >>> import pyarrow as pa >>> array = pa.StructArray.from_arrays(arrays=[[1,2], ["a", "b"]], names=["col1", "col2"]) >>> df = vaex.from_arrays(array=array) >>> df # array 0 {'col1': 1, 'col2': 'a'} 1 {'col1': 2, 'col2': 'b'}
>>> df.array.struct.keys() ["col1", "col2"]
- project(fields)¶
Project one or more fields of a struct array to a new struct array. You may also use the shorthand notation df.name[:, ['field1', 'field2']].
- Parameters
fields (list) – A list of strings (label) or integers (index position) identifying one or more fields.
- Returns
an expression containing a struct array.
Example:
>>> import vaex >>> import pyarrow as pa >>> array = pa.StructArray.from_arrays(arrays=[[1,2], ["a", "b"], [3, 4]], names=["col1", "col2", "col3"]) >>> df = vaex.from_arrays(array=array) >>> df # array 0 {'col1': 1, 'col2': 'a', 'col3': 3} 1 {'col1': 2, 'col2': 'b', 'col3': 4}
>>> df.array.struct.project(["col3", "col1"]) Expression = struct_project(array, ['col3', 'col1']) Length: 2 dtype: struct<col3: int64, col1: int64> (expression) -------------------------------------------------------------- 0 {'col3': 3, 'col1': 1} 1 {'col3': 4, 'col1': 2}
>>> df.array.struct.project([2, 0]) Expression = struct_project(array, [2, 0]) Length: 2 dtype: struct<col3: int64, col1: int64> (expression) -------------------------------------------------------------- 0 {'col3': 3, 'col1': 1} 1 {'col3': 4, 'col1': 2}
>>> df.array[:, ["col3", "col1"]] Expression = struct_project(array, ['col3', 'col1']) Length: 2 dtype: struct<col3: int64, col1: int64> (expression) -------------------------------------------------------------- 0 {'col3': 3, 'col1': 1} 1 {'col3': 4, 'col1': 2}
- values()[source]¶
Return all fields as vaex expressions.
- Returns
list of vaex expressions corresponding to each field in struct.
Example:
>>> import vaex >>> import pyarrow as pa >>> array = pa.StructArray.from_arrays(arrays=[[1,2], ["a", "b"]], names=["col1", "col2"]) >>> df = vaex.from_arrays(array=array) >>> df # array 0 {'col1': 1, 'col2': 'a'} 1 {'col1': 2, 'col2': 'b'}
>>> df.array.struct.values() [Expression = struct_get(array, 0) Length: 2 dtype: int64 (expression) ----------------------------------- 0 1 1 2, Expression = struct_get(array, 1) Length: 2 dtype: string (expression) ------------------------------------ 0 a 1 b]
Timedelta operations¶
- class vaex.expression.TimeDelta(expression)[source]¶
Bases:
object
TimeDelta operations
Usually accessed using e.g. df.delay.td.days
- __weakref__¶
list of weak references to the object (if defined)
- property days¶
Number of days in each timedelta sample.
- Returns
an expression containing the number of days in a timedelta sample.
Example:
>>> import vaex >>> import numpy as np >>> delta = np.array([17658720110, 11047049384039, 40712636304958, -18161254954], dtype='timedelta64[s]') >>> df = vaex.from_arrays(delta=delta) >>> df # delta 0 204 days +9:12:00 1 1 days +6:41:10 2 471 days +5:03:56 3 -22 days +23:31:15
>>> df.delta.td.days Expression = td_days(delta) Length: 4 dtype: int64 (expression) ----------------------------------- 0 204 1 1 2 471 3 -22
- property microseconds¶
Number of microseconds (>= 0 and less than 1 second) in each timedelta sample.
- Returns
an expression containing the number of microseconds in a timedelta sample.
Example:
>>> import vaex >>> import numpy as np >>> delta = np.array([17658720110, 11047049384039, 40712636304958, -18161254954], dtype='timedelta64[s]') >>> df = vaex.from_arrays(delta=delta) >>> df # delta 0 204 days +9:12:00 1 1 days +6:41:10 2 471 days +5:03:56 3 -22 days +23:31:15
>>> df.delta.td.microseconds Expression = td_microseconds(delta) Length: 4 dtype: int64 (expression) ----------------------------------- 0 290448 1 978582 2 19583 3 709551
- property nanoseconds¶
Number of nanoseconds (>= 0 and less than 1 microsecond) in each timedelta sample.
- Returns
an expression containing the number of nanoseconds in a timedelta sample.
Example:
>>> import vaex >>> import numpy as np >>> delta = np.array([17658720110, 11047049384039, 40712636304958, -18161254954], dtype='timedelta64[s]') >>> df = vaex.from_arrays(delta=delta) >>> df # delta 0 204 days +9:12:00 1 1 days +6:41:10 2 471 days +5:03:56 3 -22 days +23:31:15
>>> df.delta.td.nanoseconds Expression = td_nanoseconds(delta) Length: 4 dtype: int64 (expression) ----------------------------------- 0 384 1 16 2 488 3 616
- property seconds¶
Number of seconds (>= 0 and less than 1 day) in each timedelta sample.
- Returns
an expression containing the number of seconds in a timedelta sample.
Example:
>>> import vaex >>> import numpy as np >>> delta = np.array([17658720110, 11047049384039, 40712636304958, -18161254954], dtype='timedelta64[s]') >>> df = vaex.from_arrays(delta=delta) >>> df # delta 0 204 days +9:12:00 1 1 days +6:41:10 2 471 days +5:03:56 3 -22 days +23:31:15
>>> df.delta.td.seconds Expression = td_seconds(delta) Length: 4 dtype: int64 (expression) ----------------------------------- 0 30436 1 39086 2 28681 3 23519
- total_seconds()¶
Total duration of each timedelta sample expressed in seconds.
- Returns
an expression containing the total number of seconds in a timedelta sample.
Example:
>>> import vaex >>> import numpy as np >>> delta = np.array([17658720110, 11047049384039, 40712636304958, -18161254954], dtype='timedelta64[s]') >>> df = vaex.from_arrays(delta=delta) >>> df # delta 0 204 days +9:12:00 1 1 days +6:41:10 2 471 days +5:03:56 3 -22 days +23:31:15
>>> df.delta.td.total_seconds() Expression = td_total_seconds(delta) Length: 4 dtype: float64 (expression) ------------------------------------- 0 -7.88024e+08 1 -2.55032e+09 2 6.72134e+08 3 2.85489e+08
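The difference between the component properties and total_seconds can be illustrated with the standard library. This sketch assumes the accessor follows the same convention as datetime.timedelta (days may be negative, while the seconds component is always between 0 and one day), which is what the property descriptions above suggest; the two values are made up for illustration.

```python
from datetime import timedelta

positive = timedelta(days=204, hours=9, minutes=12)
negative = timedelta(days=-22, hours=23, minutes=31, seconds=15)

# Components: days carries the sign, seconds stays within [0, 86400).
components = [(d.days, d.seconds, d.microseconds) for d in (positive, negative)]

# total_seconds() folds everything into one signed number.
totals = [d.total_seconds() for d in (positive, negative)]
print(components, totals)
```

A negative duration therefore shows up as a negative day count plus a non-negative seconds remainder, not as negative seconds.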
vaex-graphql¶
- class vaex.graphql.DataFrameAccessorGraphQL(df)[source]¶
Bases:
object
Exposes a GraphQL layer to a DataFrame
See the GraphQL example for more usage.
The easiest way to learn to use the GraphQL language/vaex interface is to launch a server, and play with the GraphiQL graphical interface, its autocomplete, and the schema explorer.
We try to stay close to the Hasura API: https://docs.hasura.io/1.0/graphql/manual/api-reference/graphql-api/query.html
- __weakref__¶
list of weak references to the object (if defined)
vaex-jupyter¶
- class vaex.jupyter.DataFrameAccessorWidget(df)[source]¶
Bases:
object
- __weakref__¶
list of weak references to the object (if defined)
- data_array(axes=[], selection=None, shared=False, display_function=<function display>, **kwargs)[source]¶
Create a vaex.jupyter.model.DataArray() model and a vaex.jupyter.view.DataArray() widget, and link them.
This is a convenience method to create the model and view, and hook them up.
- execute_debounced()[source]¶
Schedules an execution of dataframe tasks in the near future (debounced).
- expression(value=None, label='Custom expression')[source]¶
Create a widget to edit a vaex expression.
If value is a vaex.jupyter.model.Axis object, its expression will be (bi-directionally) linked to the widget.
- Parameters
value – Valid expression (string or Expression object), or Axis
- vaex.jupyter.debounced(delay_seconds=0.5, skip_gather=False, on_error=None, reentrant=True)[source]¶
A decorator to debounce many method/function calls into a single call.
Note: this only works in an async environment, such as a Jupyter notebook context. Outside of this context, calling flush() will execute pending calls.
- Parameters
delay_seconds (float) – The amount of seconds that should pass without any call, before the (final) call will be executed.
method (bool) – The decorator should know if the callable is a a method or not, otherwise the debounced is on a per-class basis.
skip_gather (bool) – The decorated function will be be waited for when calling vaex.jupyter.gather()
on_error – callback function that takes an exception as argument.
reentrant (bool) – reentrant function or not
- vaex.jupyter.flush(recursive_counts=-1, ignore_exceptions=False, all=False)[source]¶
Run all non-executed debounced functions.
If execution of debounced calls leads to the scheduling of new calls, they will be recursively executed, with a limit of recursive_counts calls. recursive_counts=-1 means infinite.
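The debouncing idea described above can be illustrated with a small, self-contained asyncio sketch. This is not vaex's implementation: the decorator below only borrows the name and the delay_seconds parameter for readability, and it does not model flush(), skip_gather, on_error or reentrant. It shows a burst of calls collapsing into a single execution once no new call has arrived for delay_seconds.

```python
import asyncio


def debounced(delay_seconds=0.5):
    """Illustrative decorator: collapse a burst of calls into one call."""
    def decorator(func):
        pending = None  # the asyncio.Task currently waiting to run, if any

        async def _run_later(args, kwargs):
            # Wait for the quiet period, then perform the (final) call.
            await asyncio.sleep(delay_seconds)
            func(*args, **kwargs)

        def wrapper(*args, **kwargs):
            nonlocal pending
            # A new call cancels the one still waiting and restarts the timer.
            if pending is not None and not pending.done():
                pending.cancel()
            pending = asyncio.ensure_future(_run_later(args, kwargs))
            return pending

        return wrapper
    return decorator


calls = []


@debounced(delay_seconds=0.05)
def record(value):
    calls.append(value)


async def main():
    for i in range(5):         # five calls in quick succession
        record(i)
    await asyncio.sleep(0.2)   # wait until the quiet period has passed


asyncio.run(main())
print(calls)  # only the last call of the burst has been executed
```

Because the sketch needs a running event loop to schedule the delayed call, it also shows why debouncing of this kind only works in an async environment.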
vaex.jupyter.model¶
- class vaex.jupyter.model.Axis(*, bin_centers=None, df, exception=None, expression=None, max=None, min=None, shape=None, shape_default=64, slice=None, status=Status.NO_LIMITS, **kwargs)[source]¶
Bases:
vaex.jupyter.model._HasState
- class Status(value)[source]¶
Bases:
enum.Enum
State transitions: NO_LIMITS -> STAGED_CALCULATING_LIMITS -> CALCULATING_LIMITS -> CALCULATED_LIMITS -> READY
- when expression changes:
- STAGED_CALCULATING_LIMITS:
calculation.cancel() -> NO_LIMITS
- CALCULATING_LIMITS:
calculation.cancel() -> NO_LIMITS
- when min/max changes:
- STAGED_CALCULATING_LIMITS:
calculation.cancel() -> NO_LIMITS
- CALCULATING_LIMITS:
calculation.cancel() -> NO_LIMITS
- ABORTED = 7¶
- CALCULATED_LIMITS = 4¶
- CALCULATING_LIMITS = 3¶
- EXCEPTION = 6¶
- NO_LIMITS = 1¶
- READY = 5¶
- STAGED_CALCULATING_LIMITS = 2¶
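The transitions listed for this enumeration can be written down as a small lookup table. The sketch below is a stand-alone illustration, not the vaex.jupyter.model code: the helper names advance and on_change are hypothetical, and leaving the state unchanged when a change arrives in any other state is an assumption, since the docstring only describes the two cancellable states.

```python
import enum


class Status(enum.Enum):
    NO_LIMITS = 1
    STAGED_CALCULATING_LIMITS = 2
    CALCULATING_LIMITS = 3
    CALCULATED_LIMITS = 4
    READY = 5
    EXCEPTION = 6
    ABORTED = 7


# Forward path given in the docstring.
NEXT = {
    Status.NO_LIMITS: Status.STAGED_CALCULATING_LIMITS,
    Status.STAGED_CALCULATING_LIMITS: Status.CALCULATING_LIMITS,
    Status.CALCULATING_LIMITS: Status.CALCULATED_LIMITS,
    Status.CALCULATED_LIMITS: Status.READY,
}

# States in which a change of expression or min/max cancels the calculation.
CANCELLABLE = {Status.STAGED_CALCULATING_LIMITS, Status.CALCULATING_LIMITS}


def advance(status):
    """Move one step along the forward path; other states are left as-is."""
    return NEXT.get(status, status)


def on_change(status):
    """Hypothetical handler for an expression or min/max change."""
    # Assumption: states outside CANCELLABLE are not affected.
    return Status.NO_LIMITS if status in CANCELLABLE else status


state = Status.NO_LIMITS
for _ in range(4):
    state = advance(state)
print(state)                                  # Status.READY
print(on_change(Status.CALCULATING_LIMITS))   # Status.NO_LIMITS
```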
- bin_centers¶
A trait which allows any value.
- df¶
A trait whose value must be an instance of a specified class.
The value can also be an instance of a subclass of the specified class.
Subclasses can declare default classes by overriding the klass attribute
- exception¶
A trait which allows any value.
- expression¶
- property has_missing_limit¶
- max¶
A casting version of the float trait.
- min¶
A casting version of the float trait.
- on_change_limits¶
- shape¶
A casting version of the int trait.
- shape_default¶
A casting version of the int trait.
- slice¶
A casting version of the int trait.
- status¶
Use a Enum class as model for the data type description. Note that if no default-value is provided, the first enum-value is used as default-value.
# -- SINCE: Python 3.4 (or install backport: pip install enum34)
import enum
from traitlets import HasTraits, UseEnum

class Color(enum.Enum):
    red = 1          # -- IMPLICIT: default_value
    blue = 2
    green = 3

class MyEntity(HasTraits):
    color = UseEnum(Color, default_value=Color.blue)

entity = MyEntity(color=Color.red)
entity.color = Color.green    # USE: Enum-value (preferred)
entity.color = "green"        # USE: name (as string)
entity.color = "Color.green"  # USE: scoped-name (as string)
entity.color = 3              # USE: number (as int)
assert entity.color is Color.green
- class vaex.jupyter.model.DataArray(*, axes, df, exception=None, grid, grid_sliced, selection=None, shape=64, status=Status.MISSING_LIMITS, status_text='Initializing', **kwargs)[source]¶
Bases:
vaex.jupyter.model._HasState
- class Status(value)[source]¶
Bases:
enum.Enum
An enumeration.
- CALCULATED_GRID = 9¶
- CALCULATED_LIMITS = 5¶
- CALCULATING_GRID = 8¶
- CALCULATING_LIMITS = 4¶
- EXCEPTION = 11¶
- MISSING_LIMITS = 1¶
- NEEDS_CALCULATING_GRID = 6¶
- READY = 10¶
- STAGED_CALCULATING_GRID = 7¶
- STAGED_CALCULATING_LIMITS = 3¶
- axes¶
An instance of a Python list.
- df¶
A trait whose value must be an instance of a specified class.
The value can also be an instance of a subclass of the specified class.
Subclasses can declare default classes by overriding the klass attribute
- exception¶
A trait which allows any value.
- grid¶
A trait whose value must be an instance of a specified class.
The value can also be an instance of a subclass of the specified class.
Subclasses can declare default classes by overriding the klass attribute
- grid_sliced¶
A trait whose value must be an instance of a specified class.
The value can also be an instance of a subclass of the specified class.
Subclasses can declare default classes by overriding the klass attribute
- property has_missing_limits¶
- selection¶
A trait which allows any value.
- shape¶
A casting version of the int trait.
- status¶
Use a Enum class as model for the data type description. Note that if no default-value is provided, the first enum-value is used as default-value.
- status_text¶
A trait for unicode strings.
- class vaex.jupyter.model.GridCalculator(**kwargs: Any)[source]¶
Bases:
vaex.jupyter.model._HasState
A grid is responsible for scheduling the grid calculations and possible slicing
- class Status(value)[source]¶
Bases:
enum.Enum
An enumeration.
- CALCULATING = 4¶
- READY = 9¶
- STAGED_CALCULATION = 3¶
- VOID = 1¶
- df¶
A trait whose value must be an instance of a specified class.
The value can also be an instance of a subclass of the specified class.
Subclasses can declare default classes by overriding the klass attribute
- models¶
An instance of a Python list.
- status¶
Use a Enum class as model for the data type description. Note that if no default-value is provided, the first enum-value is used as default-value.
- class vaex.jupyter.model.Heatmap(*, axes, df, exception=None, grid, grid_sliced, selection=None, shape=64, status=Status.MISSING_LIMITS, status_text='Initializing', **kwargs)[source]¶
Bases:
vaex.jupyter.model.DataArray
- x¶
A trait whose value must be an instance of a specified class.
The value can also be an instance of a subclass of the specified class.
Subclasses can declare default classes by overriding the klass attribute
- y¶
A trait whose value must be an instance of a specified class.
The value can also be an instance of a subclass of the specified class.
Subclasses can declare default classes by overriding the klass attribute
- class vaex.jupyter.model.Histogram(*, axes, df, exception=None, grid, grid_sliced, selection=None, shape=64, status=Status.MISSING_LIMITS, status_text='Initializing', **kwargs)[source]¶
Bases:
vaex.jupyter.model.DataArray
- x¶
A trait whose value must be an instance of a specified class.
The value can also be an instance of a subclass of the specified class.
Subclasses can declare default classes by overriding the klass attribute
vaex.jupyter.view¶
- class vaex.jupyter.view.DataArray(**kwargs: Any)[source]¶
Bases:
vaex.jupyter.view.ViewBase
Will display a DataArray interactively, with an optional custom display_function.
By default, it will simply display(…) the DataArray, using xarray’s default display mechanism.
Public constructor
- clear_output¶
Clear output each time the data changes
- display_function¶
A trait which allows any value.
- matplotlib_autoshow¶
Will call plt.show() inside output context if open figure handles exist
- model¶
A trait whose value must be an instance of a specified class.
The value can also be an instance of a subclass of the specified class.
Subclasses can declare default classes by overriding the klass attribute
- numpy_errstate¶
Default numpy errstate during display to avoid showing error messages, see numpy.errstate
- class vaex.jupyter.view.Heatmap(**kwargs: Any)[source]¶
Bases:
vaex.jupyter.view.ViewBase
Public constructor
- TOOLS_SUPPORTED = ['pan-zoom', 'select-rect', 'select-x']¶
- blend¶
A trait for unicode strings.
- colormap¶
A trait for unicode strings.
- dimension_alternative¶
A trait for unicode strings.
- dimension_facets¶
A trait for unicode strings.
- dimension_fade¶
A trait for unicode strings.
- model¶
A trait whose value must be an instance of a specified class.
The value can also be an instance of a subclass of the specified class.
Subclasses can declare default classes by overriding the klass attribute
- normalize¶
A boolean (True, False) trait.
- supports_normalize = False¶
- supports_transforms = True¶
- tool¶
A trait for unicode strings.
- transform¶
A trait for unicode strings.
- class vaex.jupyter.view.Histogram(**kwargs: Any)[source]¶
Bases:
vaex.jupyter.view.ViewBase
Public constructor
- TOOLS_SUPPORTED = ['pan-zoom', 'select-x']¶
- dimension_facets¶
A trait for unicode strings.
- dimension_groups¶
A trait for unicode strings.
- dimension_overplot¶
A trait for unicode strings.
- model¶
A trait whose value must be an instance of a specified class.
The value can also be an instance of a subclass of the specified class.
Subclasses can declare default classes by overriding the klass attribute
- normalize¶
A boolean (True, False) trait.
- supports_normalize = True¶
- supports_transforms = False¶
- transform¶
A trait for unicode strings.
- class vaex.jupyter.view.PieChart(**kwargs: Any)[source]¶
Bases:
vaex.jupyter.view.Histogram
Public constructor
- radius_split_fraction = 0.8¶
vaex.jupyter.widgets¶
- class vaex.jupyter.widgets.ColumnExpressionAdder(**kwargs: Any)[source]¶
Bases:
vaex.jupyter.widgets.ColumnPicker
Public constructor
- component¶
A trait which allows any value.
- target¶
A trait for unicode strings.
- class vaex.jupyter.widgets.ColumnList(**kwargs: Any)[source]¶
Bases: ipyvuetify.VuetifyTemplate.VuetifyTemplate, vaex.jupyter.traitlets.ColumnsMixin
Public constructor
- column_filter¶
A trait for unicode strings.
- dialog_open¶
A boolean (True, False) trait.
- editor¶
A trait which allows any value.
- editor_open¶
A boolean (True, False) trait.
- template¶
A trait for unicode strings.
- tooltip¶
A trait for unicode strings.
- valid_expression¶
A boolean (True, False) trait.
- class vaex.jupyter.widgets.ColumnPicker(**kwargs: Any)[source]¶
Bases: ipyvuetify.VuetifyTemplate.VuetifyTemplate, vaex.jupyter.traitlets.ColumnsMixin
Public constructor
- label¶
A trait for unicode strings.
- template¶
A trait for unicode strings.
- value¶
- class vaex.jupyter.widgets.ColumnSelectionAdder(**kwargs: Any)[source]¶
Bases:
vaex.jupyter.widgets.ColumnPicker
Public constructor
- component¶
A trait which allows any value.
- target¶
A trait for unicode strings.
- class vaex.jupyter.widgets.ContainerCard(**kwargs: Any)[source]¶
Bases:
ipyvuetify.VuetifyTemplate.VuetifyTemplate
Public constructor
- card_props¶
An instance of a Python dict.
One or more traits can be passed to the constructor to validate the keys and/or values of the dict. If you need more detailed validation, you may use a custom validator method.
Changed in version 5.0: Added key_trait for validating dict keys.
Changed in version 5.0: Deprecated ambiguous trait, traits args in favor of value_trait, per_key_traits.
- controls¶
An instance of a Python list.
- main¶
A trait which allows any value.
- main_props¶
An instance of a Python dict.
One or more traits can be passed to the constructor to validate the keys and/or values of the dict. If you need more detailed validation, you may use a custom validator method.
Changed in version 5.0: Added key_trait for validating dict keys.
Changed in version 5.0: Deprecated ambiguous trait, traits args in favor of value_trait, per_key_traits.
- show_controls¶
A boolean (True, False) trait.
- subtitle¶
A trait for unicode strings.
- text¶
A trait for unicode strings.
- title¶
A trait for unicode strings.
- class vaex.jupyter.widgets.Counter(**kwargs: Any)[source]¶
Bases:
ipyvuetify.VuetifyTemplate.VuetifyTemplate
Public constructor
- characters¶
An instance of a Python list.
- format¶
A trait for unicode strings.
- postfix¶
A trait for unicode strings.
- prefix¶
A trait for unicode strings.
- template¶
A trait for unicode strings.
- value¶
An int trait.
- class vaex.jupyter.widgets.Expression(**kwargs: Any)[source]¶
Bases:
ipyvuetify.generated.TextField.TextField
Public constructor
- df¶
A trait which allows any value.
- valid¶
A boolean (True, False) trait.
- value¶
- class vaex.jupyter.widgets.ExpressionSelectionTextArea(**kwargs: Any)[source]¶
Bases:
vaex.jupyter.widgets.Expression
Public constructor
- selection_name¶
A trait which allows any value.
- update_custom_selection¶
- vaex.jupyter.widgets.ExpressionTextArea¶
alias of
vaex.jupyter.widgets.Expression
- class vaex.jupyter.widgets.Html(**kwargs: Any)[source]¶
Bases:
ipyvuetify.Html.Html
Public constructor
- class vaex.jupyter.widgets.LinkList(**kwargs: Any)[source]¶
Bases:
vaex.jupyter.widgets.VuetifyTemplate
Public constructor
- items¶
An instance of a Python list.
- class vaex.jupyter.widgets.PlotTemplate(**kwargs: Any)[source]¶
Bases:
ipyvuetify.VuetifyTemplate.VuetifyTemplate
Public constructor
- button_text¶
A trait for unicode strings.
- clipped¶
A boolean (True, False) trait.
- components¶
An instance of a Python dict.
One or more traits can be passed to the constructor to validate the keys and/or values of the dict. If you need more detailed validation, you may use a custom validator method.
Changed in version 5.0: Added key_trait for validating dict keys.
Changed in version 5.0: Deprecated ambiguous trait, traits args in favor of value_trait, per_key_traits.
- dark¶
A boolean (True, False) trait.
- drawer¶
A boolean (True, False) trait.
- drawers¶
A trait which allows any value.
- floating¶
A boolean (True, False) trait.
- items¶
An instance of a Python list.
- mini¶
A boolean (True, False) trait.
- model¶
A trait which allows any value.
- new_output¶
A boolean (True, False) trait.
- show_output¶
A boolean (True, False) trait.
- template¶
A trait for unicode strings.
- title¶
A trait for unicode strings.
- type¶
A trait for unicode strings.
- class vaex.jupyter.widgets.ProgressCircularNoAnimation(**kwargs: Any)[source]¶
Bases:
ipyvuetify.VuetifyTemplate.VuetifyTemplate
v-progress-circular that avoids animations
Public constructor
- color¶
A trait for unicode strings.
A boolean (True, False) trait.
- parts¶
An instance of a Python list.
- size¶
An int trait.
- template¶
A trait for unicode strings.
- text¶
A trait for unicode strings.
- value¶
A float trait.
- width¶
An int trait.
- class vaex.jupyter.widgets.Selection(**kwargs: Any)[source]¶
Bases:
ipyvuetify.VuetifyTemplate.VuetifyTemplate
Public constructor
- df¶
A trait which allows any value.
- name¶
A trait for unicode strings.
- value¶
A trait for unicode strings.
- class vaex.jupyter.widgets.SelectionEditor(**kwargs: Any)[source]¶
Bases:
ipyvuetify.VuetifyTemplate.VuetifyTemplate
Public constructor
- adder¶
A trait which allows any value.
- components¶
An instance of a Python dict.
One or more traits can be passed to the constructor to validate the keys and/or values of the dict. If you need more detailed validation, you may use a custom validator method.
Changed in version 5.0: Added key_trait for validating dict keys.
Changed in version 5.0: Deprecated ambiguous trait, traits args in favor of value_trait, per_key_traits.
- df¶
A trait which allows any value.
- input¶
A trait which allows any value.
- on_close¶
A trait which allows any value.
- template¶
A trait for unicode strings.
- class vaex.jupyter.widgets.SelectionToggleList(**kwargs: Any)[source]¶
Bases:
ipyvuetify.VuetifyTemplate.VuetifyTemplate
Public constructor
- df¶
A trait which allows any value.
- selection_names¶
An instance of a Python list.
- title¶
A trait for unicode strings.
- value¶
An instance of a Python list.
- class vaex.jupyter.widgets.SettingsEditor(**kwargs: Any)[source]¶
Bases:
ipyvuetify.VuetifyTemplate.VuetifyTemplate
Public constructor
- schema¶
An instance of a Python dict.
One or more traits can be passed to the constructor to validate the keys and/or values of the dict. If you need more detailed validation, you may use a custom validator method.
Changed in version 5.0: Added key_trait for validating dict keys.
Changed in version 5.0: Deprecated ambiguous trait, traits args in favor of value_trait, per_key_traits.
- template_file = '/home/docs/checkouts/readthedocs.org/user_builds/vaex/envs/latest/lib/python3.7/site-packages/vaex/jupyter/vue/vjsf.vue'¶
- valid¶
A boolean (True, False) trait.
- values¶
An instance of a Python dict.
One or more traits can be passed to the constructor to validate the keys and/or values of the dict. If you need more detailed validation, you may use a custom validator method.
Changed in version 5.0: Added key_trait for validating dict keys.
Changed in version 5.0: Deprecated ambiguous trait, traits args in favor of value_trait, per_key_traits.
- vjsf_loaded¶
A boolean (True, False) trait.
- class vaex.jupyter.widgets.Status(**kwargs: Any)[source]¶
Bases:
ipyvuetify.VuetifyTemplate.VuetifyTemplate
Public constructor
- template¶
A trait for unicode strings.
- value¶
A trait for unicode strings.
- class vaex.jupyter.widgets.ToolsSpeedDial(**kwargs: Any)[source]¶
Bases:
ipyvuetify.VuetifyTemplate.VuetifyTemplate
Public constructor
- children¶
An instance of a Python list.
- expand¶
A boolean (True, False) trait.
- items¶
A trait which allows any value.
- template¶
A trait for unicode strings.
- value¶
A trait for unicode strings.
- class vaex.jupyter.widgets.ToolsToolbar(**kwargs: Any)[source]¶
Bases:
ipyvuetify.VuetifyTemplate.VuetifyTemplate
Public constructor
- interact_items¶
A trait which allows any value.
- interact_value¶
A trait for unicode strings.
- normalize¶
A boolean (True, False) trait.
- selection_mode¶
A trait for unicode strings.
- selection_mode_items¶
A trait which allows any value.
- supports_normalize¶
A boolean (True, False) trait.
- supports_transforms¶
A boolean (True, False) trait.
- transform_items¶
An instance of a Python list.
- transform_value¶
A trait for unicode strings.
- z_normalize¶
A boolean (True, False) trait.
- class vaex.jupyter.widgets.UsesVaexComponents(**kwargs: Any)[source]¶
Bases:
traitlets.traitlets.HasTraits
- class vaex.jupyter.widgets.VirtualColumnEditor(**kwargs: Any)[source]¶
Bases:
ipyvuetify.VuetifyTemplate.VuetifyTemplate
Public constructor
- adder¶
A trait which allows any value.
- column_name¶
A trait for unicode strings.
- components¶
An instance of a Python dict.
One or more traits can be passed to the constructor to validate the keys and/or values of the dict. If you need more detailed validation, you may use a custom validator method.
Changed in version 5.0: Added key_trait for validating dict keys.
Changed in version 5.0: Deprecated ambiguous trait, traits args in favor of value_trait, per_key_traits.
- df¶
A trait which allows any value.
- editor¶
A trait which allows any value.
- on_close¶
A trait which allows any value.
- template¶
A trait for unicode strings.
vaex-ml¶
See the ML tutorial for an introduction, and the ML examples for more advanced usage.
Transformers & Encoders¶
|
Encode categorical columns by the frequency of their respective samples. |
|
Encode categorical columns with integer values between 0 and num_classes-1. |
|
Scale features by their maximum absolute value. |
|
Will scale a set of features to a given range. |
|
Encode categorical columns according to the One-Hot scheme. |
|
Encode categorical columns according to a binary multi-hot scheme. |
|
Transform a set of features using a Principal Component Analysis. |
|
The RobustScaler removes the median and scales the data according to a given percentile range. |
|
Standardize features by removing their mean and scaling them to unit variance. |
|
A strategy for transforming cyclical features (e.g. angles, time). |
|
Encode categorical variables with a Bayesian Target Encoder. |
|
Encode categorical variables with a Weight of Evidence Encoder. |
|
Bin continuous features into discrete bins. |
|
The GroupByTransformer creates aggregations via the groupby operation, which are joined to a DataFrame. |
- class vaex.ml.transformations.FrequencyEncoder(**kwargs: Any)[source]¶
Bases:
vaex.ml.transformations.Transformer
Encode categorical columns by the frequency of their respective samples.
Example:
>>> import vaex
>>> df = vaex.from_arrays(color=['red', 'green', 'green', 'blue', 'red', 'green'])
>>> df
  #  color
  0  red
  1  green
  2  green
  3  blue
  4  red
  5  green
>>> encoder = vaex.ml.FrequencyEncoder(features=['color'])
>>> encoder.fit_transform(df)
  #  color      frequency_encoded_color
  0  red                       0.333333
  1  green                     0.5
  2  green                     0.5
  3  blue                      0.166667
  4  red                       0.333333
  5  green                     0.5
- Parameters
features – List of features to transform.
prefix – Prefix for the names of the transformed features.
unseen – Strategy to deal with unseen values.
- prefix¶
Prefix for the names of the transformed features.
- transform(df)[source]¶
Transform a DataFrame with a fitted FrequencyEncoder.
- Parameters
df – A vaex DataFrame.
- Returns
A shallow copy of the DataFrame that includes the encodings.
- Return type
- unseen¶
Strategy to deal with unseen values.
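The arithmetic behind the example above can be reproduced without vaex. The following plain-Python sketch is an illustration of frequency encoding in general, not the library's code; the encode helper and its nan fallback for unseen values are hypothetical choices made for this sketch.

```python
from collections import Counter

colors = ['red', 'green', 'green', 'blue', 'red', 'green']

# "Fit": relative frequency of each category in the data.
counts = Counter(colors)
frequency = {value: n / len(colors) for value, n in counts.items()}


def encode(value, unseen=float('nan')):
    """Look up the fitted frequency; `unseen` is an assumed fallback value."""
    return frequency.get(value, unseen)


# "Transform": replace each sample by the frequency of its category.
encoded = [encode(c) for c in colors]
print([round(v, 6) for v in encoded])
# [0.333333, 0.5, 0.5, 0.166667, 0.333333, 0.5]
```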
- class vaex.ml.transformations.LabelEncoder(**kwargs: Any)[source]¶
Bases:
vaex.ml.transformations.Transformer
Encode categorical columns with integer values between 0 and num_classes-1.
Example:
>>> import vaex
>>> df = vaex.from_arrays(color=['red', 'green', 'green', 'blue', 'red'])
>>> df
  #  color
  0  red
  1  green
  2  green
  3  blue
  4  red
>>> encoder = vaex.ml.LabelEncoder(features=['color'])
>>> encoder.fit_transform(df)
  #  color      label_encoded_color
  0  red                          2
  1  green                        1
  2  green                        1
  3  blue                         0
  4  red                          2
- Parameters
allow_unseen – If True, unseen values will be encoded with -1, otherwise an error is raised
features – List of features to transform.
prefix – Prefix for the names of the transformed features.
- allow_unseen¶
If True, unseen values will be encoded with -1, otherwise an error is raised
- labels_¶
The encoded labels of each feature.
- prefix¶
Prefix for the names of the transformed features.
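A plain-Python sketch of the same idea follows. It assumes that labels are assigned in sorted order of the unique values, which agrees with the example output above but is not stated in this section; the encode helper is hypothetical and only mirrors the documented meaning of allow_unseen.

```python
colors = ['red', 'green', 'green', 'blue', 'red']

# "Fit": map each unique value to an integer in 0 .. num_classes-1.
# Assumption: the order of the labels follows the sorted unique values.
labels = {value: i for i, value in enumerate(sorted(set(colors)))}


def encode(value, allow_unseen=False):
    """Return the integer label; unseen values give -1 or raise."""
    if value in labels:
        return labels[value]
    if allow_unseen:
        return -1
    raise ValueError(f'unseen value: {value!r}')


encoded = [encode(c) for c in colors]
print(encoded)                               # [2, 1, 1, 0, 2]
print(encode('purple', allow_unseen=True))   # -1
```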
- class vaex.ml.transformations.MaxAbsScaler(**kwargs: Any)[source]¶
Bases:
vaex.ml.transformations.Transformer
Scale features by their maximum absolute value.
Example:
>>> import vaex
>>> df = vaex.from_arrays(x=[2,5,7,2,15], y=[-2,3,0,0,10])
>>> df
  #    x    y
  0    2   -2
  1    5    3
  2    7    0
  3    2    0
  4   15   10
>>> scaler = vaex.ml.MaxAbsScaler(features=['x', 'y'])
>>> scaler.fit_transform(df)
  #    x    y    absmax_scaled_x    absmax_scaled_y
  0    2   -2           0.133333               -0.2
  1    5    3           0.333333                0.3
  2    7    0           0.466667                0
  3    2    0           0.133333                0
  4   15   10           1                       1
- Parameters
features – List of features to transform.
prefix – Prefix for the names of the transformed features.
- absmax_¶
The maximum absolute value of a feature.
- prefix¶
Prefix for the names of the transformed features.
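The scaling in the example above is a division by the largest absolute value of each column. A minimal stand-alone sketch of that arithmetic, with a hypothetical helper name and assuming the column is not all zeros:

```python
x = [2, 5, 7, 2, 15]
y = [-2, 3, 0, 0, 10]


def max_abs_scale(values):
    """Divide every value by the maximum absolute value of the column."""
    absmax = max(abs(v) for v in values)  # assumed to be non-zero
    return [v / absmax for v in values]


print([round(v, 6) for v in max_abs_scale(x)])
# [0.133333, 0.333333, 0.466667, 0.133333, 1.0]
print(max_abs_scale(y))
# [-0.2, 0.3, 0.0, 0.0, 1.0]
```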
- class vaex.ml.transformations.MinMaxScaler(**kwargs: Any)[source]¶
Bases:
vaex.ml.transformations.Transformer
Will scale a set of features to a given range.
Example:
>>> import vaex
>>> df = vaex.from_arrays(x=[2,5,7,2,15], y=[-2,3,0,0,10])
>>> df
  #    x    y
  0    2   -2
  1    5    3
  2    7    0
  3    2    0
  4   15   10
>>> scaler = vaex.ml.MinMaxScaler(features=['x', 'y'])
>>> scaler.fit_transform(df)
  #    x    y    minmax_scaled_x    minmax_scaled_y
  0    2   -2           0                  0
  1    5    3           0.230769           0.416667
  2    7    0           0.384615           0.166667
  3    2    0           0                  0.166667
  4   15   10           1                  1
- Parameters
feature_range – The range the features are scaled to.
features – List of features to transform.
prefix – Prefix for the names of the transformed features.
- feature_range¶
The range the features are scaled to.
- fmax_¶
The maximum value of a feature.
- fmin_¶
The minimum value of a feature.
- prefix¶
Prefix for the names of the transformed features.
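The numbers in the example above follow from the usual min-max formula. The sketch below is a plain-Python illustration; the helper name is hypothetical, the default range of (0, 1) is inferred from the example output, and the column is assumed not to be constant.

```python
x = [2, 5, 7, 2, 15]
y = [-2, 3, 0, 0, 10]


def min_max_scale(values, feature_range=(0.0, 1.0)):
    """Map the column linearly so that min -> low and max -> high."""
    fmin, fmax = min(values), max(values)   # assumed fmax != fmin
    low, high = feature_range
    return [low + (v - fmin) / (fmax - fmin) * (high - low) for v in values]


print([round(v, 6) for v in min_max_scale(x)])
# [0.0, 0.230769, 0.384615, 0.0, 1.0]
print([round(v, 6) for v in min_max_scale(y)])
# [0.0, 0.416667, 0.166667, 0.166667, 1.0]
```

Passing another feature_range, for example (-1, 1), stretches the same linear map to that interval.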
- class vaex.ml.transformations.OneHotEncoder(**kwargs: Any)[source]¶
Bases:
vaex.ml.transformations.Transformer
Encode categorical columns according to the One-Hot scheme.
Example:
>>> import vaex
>>> df = vaex.from_arrays(color=['red', 'green', 'green', 'blue', 'red'])
>>> df
  #  color
  0  red
  1  green
  2  green
  3  blue
  4  red
>>> encoder = vaex.ml.OneHotEncoder(features=['color'])
>>> encoder.fit_transform(df)
  #  color      color_blue    color_green    color_red
  0  red                 0              0            1
  1  green               0              1            0
  2  green               0              1            0
  3  blue                1              0            0
  4  red                 0              0            1
- Parameters
features – List of features to transform.
one – Value to encode when a category is present.
prefix – Prefix for the names of the transformed features.
zero – Value to encode when category is absent.
- one¶
Value to encode when a category is present.
- prefix¶
Prefix for the names of the transformed features.
- transform(df)[source]¶
Transform a DataFrame with a fitted OneHotEncoder.
- Parameters
df – A vaex DataFrame.
- Returns
A shallow copy of the DataFrame that includes the encodings.
- Return type
- uniques_¶
The unique elements found in each feature.
- zero¶
Value to encode when category is absent.
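As a stand-alone illustration of the scheme, the sketch below builds one indicator column per unique value. The one_hot helper is hypothetical; its one and zero arguments only mirror the documented parameters, and the color_ column names are taken from the example output above.

```python
colors = ['red', 'green', 'green', 'blue', 'red']

# "Fit": the unique values determine the new columns.
uniques = sorted(set(colors))


def one_hot(value, one=1, zero=0):
    """Return {column_name: indicator} for a single sample."""
    return {f'color_{u}': (one if value == u else zero) for u in uniques}


rows = [one_hot(c) for c in colors]
print(rows[0])  # {'color_blue': 0, 'color_green': 0, 'color_red': 1}
print(rows[3])  # {'color_blue': 1, 'color_green': 0, 'color_red': 0}
```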
- class vaex.ml.transformations.MultiHotEncoder(**kwargs: Any)[source]¶
Bases:
vaex.ml.transformations.Transformer
Encode categorical columns according to a binary multi-hot scheme.
With the Multi-Hot Encoder (sometimes called Binary Encoder), the categorical variables are first ordinal encoded, and those encodings are converted to binary numbers. Each digit of the binary number becomes a separate column, containing either a “0” or a “1”. This can be considered an improvement over the One-Hot encoder, as it guards against generating too many new columns when the cardinality of the categorical column is high, while effectively removing the ordinality that an Ordinal Encoder would introduce.
Example:
>>> import vaex
>>> import vaex.ml
>>> df = vaex.from_arrays(color=['red', 'green', 'green', 'blue', 'red'])
>>> df
  #  color
  0  red
  1  green
  2  green
  3  blue
  4  red
>>> encoder = vaex.ml.MultiHotEncoder(features=['color'])
>>> encoder.fit_transform(df)
  #  color      color_0    color_1    color_2
  0  red              0          1          1
  1  green            0          1          0
  2  green            0          1          0
  3  blue             0          0          1
  4  red              0          1          1
- Parameters
features – List of features to transform.
prefix – Prefix for the names of the transformed features.
- labels_¶
The ordinal-encoded labels of each feature.
- prefix¶
Prefix for the names of the transformed features.
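The two steps described above (ordinal encoding, then binary digits as columns) can be sketched in plain Python. Two assumptions are made purely so that the result lines up with the example output: ordinal codes start at 1 in sorted order, and the number of binary columns is fixed at 3. How the library itself chooses the codes and the column count is not stated in this section.

```python
colors = ['red', 'green', 'green', 'blue', 'red']

# Step 1: ordinal encoding (assumed: sorted order, starting at 1).
ordinal = {value: i for i, value in enumerate(sorted(set(colors)), start=1)}

# Step 2: binary digits of the ordinal code, one column per digit.
N_BITS = 3  # assumed width, chosen to match the example above


def multi_hot(value):
    """Return the binary digits of the ordinal code, most significant first."""
    bits = format(ordinal[value], f'0{N_BITS}b')   # e.g. 3 -> '011'
    return [int(b) for b in bits]


for c in colors:
    print(c, multi_hot(c))
# red [0, 1, 1]
# green [0, 1, 0]
# green [0, 1, 0]
# blue [0, 0, 1]
# red [0, 1, 1]
```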
- class vaex.ml.transformations.PCA(**kwargs: Any)[source]¶
Bases:
vaex.ml.transformations.Transformer
Transform a set of features using a Principal Component Analysis.
Example:
>>> import vaex
>>> import vaex.ml
>>> df = vaex.from_arrays(x=[2,5,7,2,15], y=[-2,3,0,0,10])
>>> df
  #    x    y
  0    2   -2
  1    5    3
  2    7    0
  3    2    0
  4   15   10
>>> pca = vaex.ml.PCA(n_components=2, features=['x', 'y'])
>>> pca.fit_transform(df)
  #    x    y       PCA_0      PCA_1
  0    2   -2     5.92532   0.413011
  1    5    3    0.380494   -1.39112
  2    7    0    0.840049    2.18502
  3    2    0     4.61287   -1.09612
  4   15   10    -11.7587  -0.110794
- Parameters
features – List of features to transform.
n_components – Number of components to retain. If None, all the components will be retained.
prefix – Prefix for the names of the transformed features.
whiten – If True, perform whitening, i.e. remove the relative variance scale of the transformed components.
- eigen_values_¶
The eigen values that correspond to each feature.
- eigen_vectors_¶
The eigen vectors corresponding to each feature
- explained_variance_¶
Variance explained by each of the components. Same as the eigen values.
- explained_variance_ratio_¶
Percentage of variance explained by each of the selected components.
- fit(df, progress=None)[source]¶
Fit the PCA model to the DataFrame.
- Parameters
df – A vaex DataFrame.
progress – If True or ‘widget’, display a progressbar of the fitting process.
- means_¶
The mean of each feature
- n_components¶
Number of components to retain. If None, all the components will be retained.
- prefix¶
Prefix for the names of the transformed features.
- transform(df, n_components=None)[source]¶
Apply the PCA transformation to the DataFrame.
- Parameters
df – A vaex DataFrame.
n_components – The number of PCA components to retain.
- Returns
A shallow copy of the DataFrame that includes the PCA components.
- Return type
- whiten¶
If True, perform whitening, i.e. remove the relative variance scale of the transformed components.
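For two features, the steps behind a PCA (centre the data, take the eigen decomposition of the covariance matrix, project onto the eigenvectors) can be worked through in plain Python with the closed-form solution for a symmetric 2x2 matrix. This is a sketch of the general technique, not the library's code. The covariance is normalised by n here, which is an assumption that affects the eigenvalues but not the projections, and the sign of each component is arbitrary, so projected values may differ in sign from the example output above.

```python
import math

x = [2, 5, 7, 2, 15]
y = [-2, 3, 0, 0, 10]
n = len(x)

# Centre each feature on its mean.
mean_x, mean_y = sum(x) / n, sum(y) / n
cx = [v - mean_x for v in x]
cy = [v - mean_y for v in y]

# Covariance matrix [[a, b], [b, c]] (normalised by n: an assumption).
a = sum(u * u for u in cx) / n
b = sum(u * v for u, v in zip(cx, cy)) / n
c = sum(v * v for v in cy) / n

# Eigenvalues of a symmetric 2x2 matrix, largest first.
half_trace = (a + c) / 2
radius = math.hypot((a - c) / 2, b)
eigen_values = [half_trace + radius, half_trace - radius]


def eigen_vector(lam):
    """Unit eigenvector for eigenvalue lam (valid because b != 0 here)."""
    vx, vy = b, lam - a
    norm = math.hypot(vx, vy)
    return vx / norm, vy / norm


components = [eigen_vector(lam) for lam in eigen_values]

# Project the centred data onto each component.
projected = [[u * ex + v * ey for ex, ey in components]
             for u, v in zip(cx, cy)]

explained_variance_ratio = [lam / sum(eigen_values) for lam in eigen_values]

print([round(abs(p), 5) for p in projected[0]])
print([round(r, 4) for r in explained_variance_ratio])
```

The variance of each projected column equals the corresponding eigenvalue, which is why the eigenvalues double as the explained variance.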
- class vaex.ml.transformations.RobustScaler(**kwargs: Any)[source]¶
Bases:
vaex.ml.transformations.Transformer
The RobustScaler removes the median and scales the data according to a given percentile range. By default, the scaling is done between the 25th and the 75th percentile. Centering and scaling happen independently for each feature (column).
Example:
>>> import vaex
>>> df = vaex.from_arrays(x=[2,5,7,2,15], y=[-2,3,0,0,10])
>>> df
  #    x    y
  0    2   -2
  1    5    3
  2    7    0
  3    2    0
  4   15   10
>>> scaler = vaex.ml.RobustScaler(features=['x', 'y'])
>>> scaler.fit_transform(df)
  #    x    y    robust_scaled_x    robust_scaled_y
  0    2   -2       -0.333686             -0.266302
  1    5    3       -0.000596934           0.399453
  2    7    0        0.221462              0
  3    2    0       -0.333686              0
  4   15   10        1.1097                1.33151
- Parameters
features – List of features to transform.
percentile_range – The percentile range to which each feature is scaled.
prefix – Prefix for the names of the transformed features.
with_centering – If True, remove the median.
with_scaling – If True, scale each feature between the specified percentile range.
- center_¶
The median of each feature.
- percentile_range¶
The percentile range to which each feature is scaled.
- prefix¶
Prefix for the names of the transformed features.
- scale_¶
The percentile range for each feature.
- transform(df)[source]¶
Transform a DataFrame with a fitted RobustScaler.
- Parameters
df – A vaex DataFrame.
- Returns
A shallow copy of the DataFrame that includes the scaled features.
- Return type
- with_centering¶
If True, remove the median.
- with_scaling¶
If True, scale each feature between the specified percentile range.
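The idea of robust scaling (subtract the median, divide by the spread between two percentiles) can be sketched with the standard library. The helper below is hypothetical and handles only the default 25th to 75th percentile range, using exact quantiles from statistics.quantiles. Its results are not expected to reproduce the figures in the example above, since the way the percentiles are estimated there is not described in this section.

```python
import statistics

x = [2, 5, 7, 2, 15]


def robust_scale(values, with_centering=True, with_scaling=True):
    """Centre on the median and scale by the interquartile range."""
    # Exact quartiles of the data (requires Python 3.8 or later).
    q1, median, q3 = statistics.quantiles(values, n=4, method='inclusive')
    center = median if with_centering else 0.0
    scale = (q3 - q1) if with_scaling else 1.0   # assumed to be non-zero
    return [(v - center) / scale for v in values]


print(robust_scale(x))
# [-0.6, 0.0, 0.4, -0.6, 2.0]
```

Because the median and the quartiles barely move when a single extreme value is added, the outlier 15 does not compress the remaining values the way it would under min-max scaling.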
- class vaex.ml.transformations.StandardScaler(**kwargs: Any)[source]¶
Bases:
vaex.ml.transformations.Transformer
Standardize features by removing their mean and scaling them to unit variance.
Example:
>>> import vaex
>>> df = vaex.from_arrays(x=[2,5,7,2,15], y=[-2,3,0,0,10])
>>> df
  #    x    y
  0    2   -2
  1    5    3
  2    7    0
  3    2    0
  4   15   10
>>> scaler = vaex.ml.StandardScaler(features=['x', 'y'])
>>> scaler.fit_transform(df)
  #    x    y    standard_scaled_x    standard_scaled_y
  0    2   -2            -0.876523            -0.996616
  1    5    3            -0.250435             0.189832
  2    7    0             0.166957            -0.522037
  3    2    0            -0.876523            -0.522037
  4   15   10             1.83652              1.85086
- Parameters
features – List of features to transform.
prefix – Prefix for the names of the transformed features.
with_mean – If True, remove the mean from each feature.
with_std – If True, scale each feature to unit variance.
- mean_¶
The mean of each feature
- prefix¶
Prefix for the names of the transformed features.
- std_¶
The standard deviation of each feature.
- transform(df)[source]¶
Transform a DataFrame with a fitted StandardScaler.
- Parameters
df – A vaex DataFrame.
- Returns
A shallow copy of the DataFrame that includes the scaled features.
- Return type
- with_mean¶
If True, remove the mean from each feature.
- with_std¶
If True, scale each feature to unit variance.
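The values in the example above can be checked by hand with the standard library. The sketch below uses the population standard deviation (statistics.pstdev), a choice made here because it reproduces the example output; the helper name is hypothetical, and the with_mean and with_std arguments only mirror the documented parameters.

```python
import statistics

x = [2, 5, 7, 2, 15]
y = [-2, 3, 0, 0, 10]


def standard_scale(values, with_mean=True, with_std=True):
    """Subtract the mean and divide by the population standard deviation."""
    mean = statistics.fmean(values) if with_mean else 0.0
    std = statistics.pstdev(values) if with_std else 1.0  # assumed non-zero
    return [(v - mean) / std for v in values]


print([round(v, 6) for v in standard_scale(x)])
print([round(v, 6) for v in standard_scale(y)])
```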
- class vaex.ml.transformations.CycleTransformer(**kwargs: Any)[source]¶
Bases:
vaex.ml.transformations.Transformer
A strategy for transforming cyclical features (e.g. angles, time).
Think of each feature as an angle on a unit circle in polar coordinates, then obtain the x and y coordinate projections, that is, the cos and sin components respectively.
Suitable for a variety of machine learning tasks. It preserves the cyclical continuity of the feature. Inspired by: http://blog.davidkaleko.com/feature-engineering-cyclical-features.html
>>> df = vaex.from_arrays(days=[0, 1, 2, 3, 4, 5, 6])
>>> cyctrans = vaex.ml.CycleTransformer(n=7, features=['days'])
>>> cyctrans.fit_transform(df)
  #    days     days_x     days_y
  0       0   1          0
  1       1   0.62349    0.781831
  2       2  -0.222521   0.974928
  3       3  -0.900969   0.433884
  4       4  -0.900969  -0.433884
  5       5  -0.222521  -0.974928
  6       6   0.62349   -0.781831
- Parameters
features – List of features to transform.
n – The number of elements in one cycle.
prefix_x – Prefix for the x-component of the transformed features.
prefix_y – Prefix for the y-component of the transformed features.
suffix_x – Suffix for the x-component of the transformed features.
suffix_y – Suffix for the y-component of the transformed features.
- fit(df)[source]¶
Fit a CycleTransformer to the DataFrame.
This is a dummy method, as it is not needed for the transformation to be applied.
- Parameters
df – A vaex DataFrame.
- n¶
The number of elements in one cycle.
- prefix_x¶
Prefix for the x-component of the transformed features.
- prefix_y¶
Prefix for the y-component of the transformed features.
- suffix_x¶
Suffix for the x-component of the transformed features.
- suffix_y¶
Suffix for the y-component of the transformed features.
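The projection described above can be written out directly. This is a plain-Python sketch of the idea rather than the vaex code; `cycle_transform` is a hypothetical name, and the formula (angle = 2π · value / n) is an assumption that matches the example output.

```python
import math


def cycle_transform(values, n):
    """Hypothetical helper: map each value to a point on the unit circle."""
    xs = [math.cos(2 * math.pi * v / n) for v in values]
    ys = [math.sin(2 * math.pi * v / n) for v in values]
    return xs, ys


days = [0, 1, 2, 3, 4, 5, 6]
days_x, days_y = cycle_transform(days, n=7)
print(round(days_x[1], 6), round(days_y[1], 6))
```

Because day 6 and day 0 end up next to each other on the circle, a model sees them as close, which a raw integer encoding would not convey.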
- class vaex.ml.transformations.BayesianTargetEncoder(**kwargs: Any)[source]¶
Bases:
vaex.ml.transformations.Transformer
Encode categorical variables with a Bayesian Target Encoder.
The categories are encoded by the mean of their target value, which is adjusted by the global mean value of the target variable using a Bayesian schema. For a larger weight value, the target encodings are smoothed toward the global mean, while for a weight of 0, the encodings are just the mean target value per class.
Reference: https://www.wikiwand.com/en/Bayes_estimator#/Practical_example_of_Bayes_estimators
Example:
>>> import vaex
>>> import vaex.ml
>>> df = vaex.from_arrays(x=['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'],
...                       y=[1, 1, 1, 0, 0, 0, 0, 1])
>>> target_encoder = vaex.ml.BayesianTargetEncoder(target='y', features=['x'], weight=4)
>>> target_encoder.fit_transform(df)
  #    x    y    mean_encoded_x
  0    a    1    0.625
  1    a    1    0.625
  2    a    1    0.625
  3    a    0    0.625
  4    b    0    0.375
  5    b    0    0.375
  6    b    0    0.375
  7    b    1    0.375
- Parameters
features – List of features to transform.
prefix – Prefix for the names of the transformed features.
target – The name of the column containing the target variable.
unseen – Strategy to deal with unseen values.
weight – Weight to be applied to the mean encodings (smoothing parameter).
- prefix¶
Prefix for the names of the transformed features.
- target¶
The name of the column containing the target variable.
- transform(df)[source]¶
Transform a DataFrame with a fitted BayesianTargetEncoder.
- Parameters
df – A vaex DataFrame.
- Returns
A shallow copy of the DataFrame that includes the encodings.
- Return type
DataFrame
- unseen¶
Strategy to deal with unseen values.
- weight¶
Weight to be applied to the mean encodings (smoothing parameter).
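The smoothing can be illustrated with a short plain-Python sketch. It is not the vaex implementation; `bayesian_target_encode` is a hypothetical helper, and the formula used, (count · class_mean + weight · global_mean) / (count + weight), is an assumption that reproduces the numbers in the example above.

```python
from collections import defaultdict


def bayesian_target_encode(categories, targets, weight):
    """Hypothetical helper: per-category target mean shrunk toward the global mean."""
    global_mean = sum(targets) / len(targets)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for category, target in zip(categories, targets):
        sums[category] += target
        counts[category] += 1
    # sums[c] equals count * class_mean, so this is the smoothed mean.
    return {
        c: (sums[c] + weight * global_mean) / (counts[c] + weight)
        for c in counts
    }


x = ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b']
y = [1, 1, 1, 0, 0, 0, 0, 1]
smoothed = bayesian_target_encode(x, y, weight=4)
unsmoothed = bayesian_target_encode(x, y, weight=0)
print(smoothed, unsmoothed)
```

With `weight=0` the encodings fall back to the plain per-class means, as the description states.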
- class vaex.ml.transformations.WeightOfEvidenceEncoder(**kwargs: Any)[source]¶
Bases:
vaex.ml.transformations.Transformer
Encode categorical variables with a Weight of Evidence Encoder.
Weight of Evidence measures how well a particular feature supports the given hypothesis (i.e. the target variable). With this encoder, each category in a categorical feature is encoded by its “strength”, i.e. its Weight of Evidence value. The target feature can be a boolean or numerical column, where True/1 is seen as ‘Good’, and False/0 is seen as ‘Bad’.
Reference: https://www.listendata.com/2015/03/weight-of-evidence-woe-and-information.html
Example:
>>> import vaex
>>> import vaex.ml
>>> df = vaex.from_arrays(x=['a', 'a', 'b', 'b', 'b', 'c', 'c'],
...                       y=[1, 1, 0, 0, 1, 1, 0])
>>> woe_encoder = vaex.ml.WeightOfEvidenceEncoder(target='y', features=['x'])
>>> woe_encoder.fit_transform(df)
  #    x    y    mean_encoded_x
  0    a    1    13.8155
  1    a    1    13.8155
  2    b    0    -0.693147
  3    b    0    -0.693147
  4    b    1    -0.693147
  5    c    1    0
  6    c    0    0
- Parameters
epsilon – Small value taken as minimum for the negatives, to avoid a division by zero.
features – List of features to transform.
prefix – Prefix for the names of the transformed features.
target – The name of the column containing the target variable.
unseen – Strategy to deal with unseen values.
- epsilon¶
Small value taken as minimum for the negatives, to avoid a division by zero.
- prefix¶
Prefix for the names of the transformed features.
- target¶
The name of the column containing the target variable.
- transform(df)[source]¶
Transform a DataFrame with a fitted WeightOfEvidenceEncoder.
- Parameters
df – A vaex DataFrame.
- Returns
A shallow copy of the DataFrame that includes the encodings.
- Return type
DataFrame
- unseen¶
Strategy to deal with unseen values.
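The values in the example can be reproduced with a small plain-Python sketch. This is not the vaex source: `woe_encode` is a hypothetical helper, the encoding is taken to be the log-odds of the target within each category, and the default `epsilon=1e-6` is an assumption inferred from the value 13.8155 (that is, ln(10^6)) in the example output.

```python
import math
from collections import defaultdict


def woe_encode(categories, targets, epsilon=1e-6):
    """Hypothetical helper: log(positive share / negative share) per category."""
    positives = defaultdict(int)
    totals = defaultdict(int)
    for category, target in zip(categories, targets):
        positives[category] += int(bool(target))
        totals[category] += 1
    encoding = {}
    for category in totals:
        share = positives[category] / totals[category]
        # Clip both shares so neither the ratio nor the log blows up.
        positive = max(share, epsilon)
        negative = max(1.0 - share, epsilon)
        encoding[category] = math.log(positive / negative)
    return encoding


x = ['a', 'a', 'b', 'b', 'b', 'c', 'c']
y = [1, 1, 0, 0, 1, 1, 0]
woe = woe_encode(x, y)
print({k: round(v, 6) for k, v in woe.items()})
```

Category `a` has no negatives at all, so its encoding is determined entirely by `epsilon`.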
- class vaex.ml.transformations.KBinsDiscretizer(**kwargs: Any)[source]¶
Bases:
vaex.ml.transformations.Transformer
Bin continuous features into discrete bins.
A strategy to encode continuous features into discrete bins. The transformed columns contain the bin label each sample falls into. In a way this transformer Label/Ordinal encodes continuous features.
Example:
>>> import vaex
>>> import vaex.ml
>>> df = vaex.from_arrays(x=[0, 2.5, 5, 7.5, 10, 12.5, 15])
>>> bin_trans = vaex.ml.KBinsDiscretizer(features=['x'], n_bins=3, strategy='uniform')
>>> bin_trans.fit_transform(df)
  #    x       binned_x
  0    0       0
  1    2.5     0
  2    5       1
  3    7.5     1
  4    10      2
  5    12.5    2
  6    15      2
- Parameters
epsilon – Tiny value added to the bin edges ensuring samples close to the bin edges are binned correctly.
features – List of features to transform.
n_bins – Number of bins. Must be greater than 1.
prefix – Prefix for the names of the transformed features.
strategy – Strategy used to define the widths of the bins. Can be either “uniform”, “quantile” or “kmeans”.
- bin_edges_¶
The bin edges for each binned feature
- epsilon¶
Tiny value added to the bin edges ensuring samples close to the bin edges are binned correctly.
- n_bins¶
Number of bins. Must be greater than 1.
- n_bins_¶
Number of bins per feature.
- prefix¶
Prefix for the names of the transformed features.
- strategy¶
Strategy used to define the widths of the bins. Can be either “uniform”, “quantile” or “kmeans”.
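For the “uniform” strategy, the labels in the example follow from equal-width bins between the minimum and maximum. The sketch below is plain Python and only illustrates that case; `uniform_bin` is a hypothetical helper, it assumes the values are not all equal, and it does not model the `epsilon` adjustment or the “quantile” and “kmeans” strategies.

```python
def uniform_bin(values, n_bins):
    """Hypothetical helper: equal-width bin label for each value."""
    lowest, highest = min(values), max(values)
    width = (highest - lowest) / n_bins
    labels = []
    for value in values:
        index = int((value - lowest) // width)
        # The maximum lands exactly on the last edge; keep it in the last bin.
        labels.append(min(index, n_bins - 1))
    return labels


x = [0, 2.5, 5, 7.5, 10, 12.5, 15]
binned_x = uniform_bin(x, n_bins=3)
print(binned_x)
```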
- class vaex.ml.transformations.GroupByTransformer(**kwargs: Any)[source]¶
Bases:
vaex.ml.transformations.Transformer
The GroupByTransformer creates aggregations via the groupby operation, which are joined to a DataFrame. This is useful for creating aggregate features.
Example:
>>> import vaex
>>> import vaex.ml
>>> df_train = vaex.from_arrays(x=['dog', 'dog', 'dog', 'cat', 'cat'], y=[2, 3, 4, 10, 20])
>>> df_test = vaex.from_arrays(x=['dog', 'cat', 'dog', 'mouse'], y=[5, 5, 5, 5])
>>> group_trans = vaex.ml.GroupByTransformer(by='x', agg={'mean_y': vaex.agg.mean('y')}, rsuffix='_agg')
>>> group_trans.fit_transform(df_train)
  #    x      y    x_agg    mean_y
  0    dog    2    dog      3
  1    dog    3    dog      3
  2    dog    4    dog      3
  3    cat    10   cat      15
  4    cat    20   cat      15
>>> group_trans.transform(df_test)
  #    x        y    x_agg    mean_y
  0    dog      5    dog      3.0
  1    cat      5    cat      15.0
  2    dog      5    dog      3.0
  3    mouse    5    --       --
- Parameters
agg – Dict where the keys are feature names and the values are vaex.agg objects.
by – The feature on which to do the grouping.
features – List of features to transform.
rprefix – Prefix for the names of the aggregate features in case of a collision.
rsuffix – Suffix for the names of the aggregate features in case of a collision.
- agg¶
Dict where the keys are feature names and the values are vaex.agg objects.
- by¶
The feature on which to do the grouping.
- rprefix¶
Prefix for the names of the aggregate features in case of a collision.
- rsuffix¶
Suffix for the names of the aggregate features in case of a collision.
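Conceptually, fitting computes the aggregates per group and transforming joins them back onto the rows. The following plain-Python sketch mirrors the example above under that reading; the helper names are hypothetical, and a missing group is represented by `None` where vaex shows `--`.

```python
from collections import defaultdict


def fit_group_means(keys, values):
    """Hypothetical helper: mean of `values` for each distinct key."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for key, value in zip(keys, values):
        sums[key] += value
        counts[key] += 1
    return {key: sums[key] / counts[key] for key in counts}


def join_group_means(keys, group_means):
    """Hypothetical helper: left join; unseen keys get None."""
    return [group_means.get(key) for key in keys]


train_x = ['dog', 'dog', 'dog', 'cat', 'cat']
train_y = [2, 3, 4, 10, 20]
test_x = ['dog', 'cat', 'dog', 'mouse']

group_means = fit_group_means(train_x, train_y)
mean_y_test = join_group_means(test_x, group_means)
print(mean_y_test)
```

The aggregates come from the training data only, so the test rows receive the training means and the unseen `mouse` group stays empty.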
Clustering¶
|
The KMeans clustering algorithm. |
- class vaex.ml.cluster.KMeans(**kwargs: Any)[source]¶
Bases:
vaex.ml.transformations.Transformer
The KMeans clustering algorithm.
Example:
>>> import vaex.ml
>>> import vaex.ml.cluster
>>> df = vaex.datasets.iris()
>>> features = ['sepal_width', 'petal_length', 'sepal_length', 'petal_width']
>>> cls = vaex.ml.cluster.KMeans(n_clusters=3, features=features, init='random', max_iter=10)
>>> cls.fit(df)
>>> df = cls.transform(df)
>>> df.head(5)
  #    sepal_width    petal_length    sepal_length    petal_width    class_    prediction_kmeans
  0    3              4.2             5.9             1.5            1         2
  1    3              4.6             6.1             1.4            1         2
  2    2.9            4.6             6.6             1.3            1         2
  3    3.3            5.7             6.7             2.1            2         0
  4    4.2            1.4             5.5             0.2            0         1
- Parameters
cluster_centers – Coordinates of cluster centers.
features – List of features to cluster.
inertia – Sum of squared distances of samples to their closest cluster center.
init – Method for initializing the centroids.
max_iter – Maximum number of iterations of the KMeans algorithm for a single run.
n_clusters – Number of clusters to form.
n_init – Number of centroid initializations. The KMeans algorithm will be run for each initialization, and the final results will be the best output of the n_init consecutive runs in terms of inertia.
prediction_label – The name of the virtual column that houses the cluster labels for each point.
random_state – Random number generation for centroid initialization. If an int is specified, the randomness becomes deterministic.
verbose – If True, enable verbosity mode.
- cluster_centers¶
Coordinates of cluster centers.
- features¶
List of features to cluster.
- fit(dataframe)[source]¶
Fit the KMeans model to the dataframe.
- Parameters
dataframe – A vaex DataFrame.
- inertia¶
Sum of squared distances of samples to their closest cluster center.
- init¶
Method for initializing the centroids.
- max_iter¶
Maximum number of iterations of the KMeans algorithm for a single run.
- n_clusters¶
Number of clusters to form.
- n_init¶
Number of centroid initializations. The KMeans algorithm will be run for each initialization, and the final results will be the best output of the n_init consecutive runs in terms of inertia.
- prediction_label¶
The name of the virtual column that houses the cluster labels for each point.
- random_state¶
Random number generation for centroid initialization. If an int is specified, the randomness becomes deterministic.
- transform(dataframe)[source]¶
Label a DataFrame with a fitted KMeans model.
- Parameters
dataframe – A vaex DataFrame.
- Returns
A shallow copy of the DataFrame that includes the cluster labels.
- Return type
DataFrame
- verbose¶
If True, enable verbosity mode.
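The labeling step and the inertia definition can be made concrete with a short plain-Python sketch. It is not the vaex implementation: `assign_clusters` is a hypothetical helper, the points and centers are made-up illustration data, and the centers are taken as given rather than fitted.

```python
def assign_clusters(points, centers):
    """Hypothetical helper: nearest-center label per point, plus the inertia."""
    labels = []
    inertia = 0.0
    for point in points:
        squared = [
            sum((a - b) ** 2 for a, b in zip(point, center))
            for center in centers
        ]
        nearest = min(range(len(centers)), key=squared.__getitem__)
        labels.append(nearest)
        # Inertia accumulates the squared distance to the closest center.
        inertia += squared[nearest]
    return labels, inertia


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
centers = [(0.0, 0.5), (10.0, 10.5)]
labels, inertia = assign_clusters(points, centers)
print(labels, inertia)
```

With several initializations (`n_init`), the run with the lowest inertia would be the one kept.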
Metrics¶
- class vaex.ml.metrics.DataFrameAccessorMetrics(ml)[source]¶
Bases:
object
Common metrics for evaluating machine learning tasks.
This DataFrame Accessor contains a number of common machine learning evaluation metrics. The idea is that the metrics can be evaluated out-of-core, and without the need to materialize the target and predicted columns.
See https://vaex.io/docs/api.html#metrics for a list of supported evaluation metrics.
- accuracy_score(y_true, y_pred, selection=None, array_type='python')[source]¶
Calculates the accuracy classification score.
- Parameters
y_true – expression in the form of a string, e.g. ‘x’ or ‘x+y’ or vaex expression object, e.g. df.x or df.x+df.y
y_pred – expression in the form of a string, e.g. ‘x’ or ‘x+y’ or vaex expression object, e.g. df.x or df.x+df.y
selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections
array_type – Type of output array, possible values are None (keep as is), “numpy” (ndarray), “xarray” for a xarray.DataArray, or “list”/”python” for a Python list
- Returns
The accuracy score.
Example:
>>> import vaex
>>> import vaex.ml.metrics
>>> df = vaex.from_arrays(y_true=[1, 1, 0, 1, 0], y_pred=[1, 0, 0, 1, 1])
>>> df.ml.metrics.accuracy_score(df.y_true, df.y_pred)
0.6
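As a cross-check of the example, accuracy is the fraction of rows where the prediction equals the true label. The sketch below is plain Python on in-memory lists, so it does not reflect the out-of-core evaluation vaex performs; `accuracy` is a hypothetical helper name.

```python
def accuracy(y_true, y_pred):
    """Hypothetical helper: share of positions where both labels agree."""
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / len(y_true)


y_true = [1, 1, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1]
score = accuracy(y_true, y_pred)
print(score)
```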
- classification_report(y_true, y_pred, average='binary', decimals=3)[source]¶
Returns a text report showing the main classification metrics.
The accuracy, precision, recall, and F1-score are shown.
Example:
>>> import vaex
>>> import vaex.ml.metrics
>>> df = vaex.from_arrays(y_true=[1, 1, 0, 1, 0, 1], y_pred=[1, 0, 0, 1, 1, 1])
>>> report = df.ml.metrics.classification_report(df.y_true, df.y_pred)
>>> print(report)
Classification report:
Accuracy: 0.667
Precision: 0.75
Recall: 0.75
F1: 0.75
- confusion_matrix(y_true, y_pred, selection=None, array_type=None)[source]¶
Calculates the confusion matrix.
- Parameters
y_true – expression in the form of a string, e.g. ‘x’ or ‘x+y’ or vaex expression object, e.g. df.x or df.x+df.y
y_pred – expression in the form of a string, e.g. ‘x’ or ‘x+y’ or vaex expression object, e.g. df.x or df.x+df.y
selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections
array_type – Type of output array, possible values are None (keep as is), “numpy” (ndarray), “xarray” for a xarray.DataArray, or “list”/”python” for a Python list
- Returns
The confusion matrix
Example:
>>> import vaex
>>> import vaex.ml.metrics
>>> df = vaex.from_arrays(y_true=[1, 1, 0, 1, 0, 1], y_pred=[1, 0, 0, 1, 1, 1])
>>> df.ml.metrics.confusion_matrix(df.y_true, df.y_pred)
array([[1, 1],
       [1, 3]])
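The layout of the matrix in the example can be verified by counting. This plain-Python sketch assumes binary 0/1 labels, with rows indexed by the true label and columns by the predicted label (an assumption consistent with the example output); `binary_confusion_matrix` is a hypothetical helper.

```python
def binary_confusion_matrix(y_true, y_pred):
    """Hypothetical helper: [[tn, fp], [fn, tp]] for 0/1 labels."""
    matrix = [[0, 0], [0, 0]]
    for true_label, predicted_label in zip(y_true, y_pred):
        matrix[true_label][predicted_label] += 1
    return matrix


y_true = [1, 1, 0, 1, 0, 1]
y_pred = [1, 0, 0, 1, 1, 1]
matrix = binary_confusion_matrix(y_true, y_pred)
print(matrix)
```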
- f1_score(y_true, y_pred, average='binary', selection=None, array_type=None)[source]¶
Calculates the F1 score.
This is the harmonic mean of the precision and the recall.
For a binary classification problem, average should be set to “binary”. In this case it is assumed that the input data is encoded in 0 and 1 integers, where the class of importance is labeled as 1.
For multiclass classification problems, average should be set to “macro”. The “macro” average is the unweighted mean of a metric for each label. For multiclass problems the data can be ordinal encoded, but class names are also supported.
- Parameters
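y_true – expression in the form of a string, e.g. ‘x’ or ‘x+y’ or vaex expression object, e.g. df.x or df.x+df.y
y_pred – expression in the form of a string, e.g. ‘x’ or ‘x+y’ or vaex expression object, e.g. df.x or df.x+df.y
average – Should be either ‘binary’ or ‘macro’.
selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections
array_type – Type of output array, possible values are None (keep as is), “numpy” (ndarray), “xarray” for a xarray.DataArray, or “list”/”python” for a Python list
- Returns
The F1 score
Example:
>>> import vaex
>>> import vaex.ml.metrics
>>> df = vaex.from_arrays(y_true=[1, 1, 0, 1, 0, 1], y_pred=[1, 0, 0, 1, 1, 1])
>>> df.ml.metrics.f1_score(df.y_true, df.y_pred)
0.75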
- matthews_correlation_coefficient(y_true, y_pred, selection=None, array_type=None)[source]¶
Calculates the Matthews correlation coefficient.
This metric can be used for both binary and multiclass classification problems.
- Parameters
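y_true – expression in the form of a string, e.g. ‘x’ or ‘x+y’ or vaex expression object, e.g. df.x or df.x+df.y
y_pred – expression in the form of a string, e.g. ‘x’ or ‘x+y’ or vaex expression object, e.g. df.x or df.x+df.y
selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections
array_type – Type of output array, possible values are None (keep as is), “numpy” (ndarray), “xarray” for a xarray.DataArray, or “list”/”python” for a Python list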
- Returns
The Matthews correlation coefficient.
Example:
>>> import vaex
>>> import vaex.ml.metrics
>>> df = vaex.from_arrays(y_true=[1, 1, 0, 1, 0, 1], y_pred=[1, 0, 0, 1, 1, 1])
>>> df.ml.metrics.matthews_correlation_coefficient(df.y_true, df.y_pred)
0.25
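For the binary case, the value in the example follows from the standard formula (tp·tn − fp·fn) / sqrt((tp+fp)(tp+fn)(tn+fp)(tn+fn)). The plain-Python sketch below covers only that binary case; `binary_mcc` is a hypothetical helper and makes no claim about how vaex handles multiclass input.

```python
import math


def binary_mcc(y_true, y_pred):
    """Hypothetical helper: Matthews correlation coefficient for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention assumed here: return 0 when any marginal count is zero.
    return (tp * tn - fp * fn) / denominator if denominator else 0.0


y_true = [1, 1, 0, 1, 0, 1]
y_pred = [1, 0, 0, 1, 1, 1]
mcc = binary_mcc(y_true, y_pred)
print(mcc)
```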
- mean_absolute_error(y_true, y_pred, selection=None, array_type='python')[source]¶
Calculate the mean absolute error.
- Parameters
y_true – expression in the form of a string, e.g. ‘x’ or ‘x+y’ or vaex expression object, e.g. df.x or df.x+df.y
y_pred – expression in the form of a string, e.g. ‘x’ or ‘x+y’ or vaex expression object, e.g. df.x or df.x+df.y
selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections
array_type (str) – Type of output array, possible values are None (keep as is), “numpy” (ndarray), “xarray” for a xarray.DataArray, or “list”/”python” for a Python list
- Returns
The mean absolute error
Example:
>>> import vaex
>>> import vaex.ml.metrics
>>> df = vaex.datasets.iris()
>>> df.ml.metrics.mean_absolute_error(df.sepal_length, df.petal_length)
2.0846666666666667
- mean_squared_error(y_true, y_pred, selection=None, array_type='python')[source]¶
Calculates the mean squared error.
- Parameters
y_true – expression in the form of a string, e.g. ‘x’ or ‘x+y’ or vaex expression object, e.g. df.x or df.x+df.y
y_pred – expression in the form of a string, e.g. ‘x’ or ‘x+y’ or vaex expression object, e.g. df.x or df.x+df.y
selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections
array_type (str) – Type of output array, possible values are None (keep as is), “numpy” (ndarray), “xarray” for a xarray.DataArray, or “list”/”python” for a Python list
- Returns
The mean squared error
Example:
>>> import vaex
>>> import vaex.ml.metrics
>>> df = vaex.datasets.iris()
>>> df.ml.metrics.mean_squared_error(df.sepal_length, df.petal_length)
5.589000000000001
- precision_recall_fscore(y_true, y_pred, average='binary', selection=None, array_type=None)[source]¶
Calculates the precision, recall and f1 score for a classification problem.
These metrics are defined as follows:
- precision = tp / (tp + fp)
- recall = tp / (tp + fn)
- f1 = tp / (tp + 0.5 * (fp + fn))
where “tp” are true positives, “fp” are false positives, and “fn” are false negatives.
For a binary classification problem, average should be set to “binary”. In this case it is assumed that the input data is encoded in 0 and 1 integers, where the class of importance is labeled as 1.
For multiclass classification problems, average should be set to “macro”. The “macro” average is the unweighted mean of a metric for each label. For multiclass problems the data can be ordinal encoded, but class names are also supported.
- Parameters
y_true – expression in the form of a string, e.g. ‘x’ or ‘x+y’ or vaex expression object, e.g. df.x or df.x+df.y
y_pred – expression in the form of a string, e.g. ‘x’ or ‘x+y’ or vaex expression object, e.g. df.x or df.x+df.y
average – Should be either ‘binary’ or ‘macro’.
selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections
array_type – Type of output array, possible values are None (keep as is), “numpy” (ndarray), “xarray” for a xarray.DataArray, or “list”/”python” for a Python list
- Returns
The precision, recall and f1 score
Example:
>>> import vaex
>>> import vaex.ml.metrics
>>> df = vaex.from_arrays(y_true=[1, 1, 0, 1, 0, 1], y_pred=[1, 0, 0, 1, 1, 1])
>>> df.ml.metrics.precision_recall_fscore(df.y_true, df.y_pred)
(0.75, 0.75, 0.75)
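The three definitions given for this method can be applied by hand to the example data. The sketch below is plain Python for the binary case only, with class 1 as the class of importance; `binary_precision_recall_f1` is a hypothetical helper and assumes at least one predicted and one true positive so that no denominator is zero.

```python
def binary_precision_recall_f1(y_true, y_pred):
    """Hypothetical helper: (precision, recall, f1) for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = tp / (tp + 0.5 * (fp + fn))
    return precision, recall, f1


y_true = [1, 1, 0, 1, 0, 1]
y_pred = [1, 0, 0, 1, 1, 1]
scores = binary_precision_recall_f1(y_true, y_pred)
print(scores)
```

Here tp = 3, fp = 1 and fn = 1, so all three metrics coincide at 0.75.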
- precision_score(y_true, y_pred, average='binary', selection=None, array_type=None)[source]¶
Calculates the precision classification score.
For a binary classification problem, average should be set to “binary”. In this case it is assumed that the input data is encoded in 0 and 1 integers, where the class of importance is labeled as 1.
For multiclass classification problems, average should be set to “macro”. The “macro” average is the unweighted mean of a metric for each label. For multiclass problems the data can be ordinal encoded, but class names are also supported.
- Parameters
y_true – expression in the form of a string, e.g. ‘x’ or ‘x+y’ or vaex expression object, e.g. df.x or df.x+df.y
y_pred – expression in the form of a string, e.g. ‘x’ or ‘x+y’ or vaex expression object, e.g. df.x or df.x+df.y
average – Should be either ‘binary’ or ‘macro’.
selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections
array_type – Type of output array, possible values are None (keep as is), “numpy” (ndarray), “xarray” for a xarray.DataArray, or “list”/”python” for a Python list
- Returns
The precision score
Example:
>>> import vaex
>>> import vaex.ml.metrics
>>> df = vaex.from_arrays(y_true=[1, 1, 0, 1, 0, 1], y_pred=[1, 0, 0, 1, 1, 1])
>>> df.ml.metrics.precision_score(df.y_true, df.y_pred)
0.75
- r2_score(y_true, y_pred)[source]¶
Calculates the R**2 (coefficient of determination) regression score function.
- Parameters
y_true – expression in the form of a string, e.g. ‘x’ or ‘x+y’ or vaex expression object, e.g. df.x or df.x+df.y
y_pred – expression in the form of a string, e.g. ‘x’ or ‘x+y’ or vaex expression object, e.g. df.x or df.x+df.y
selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections
array_type (str) – Type of output array, possible values are None (keep as is), “numpy” (ndarray), “xarray” for a xarray.DataArray, or “list”/”python” for a Python list
- Returns
The R**2 score
Example:
>>> import vaex
>>> import vaex.ml.metrics
>>> df = vaex.datasets.iris()
>>> df.ml.metrics.r2_score(df.sepal_length, df.petal_length)
-7.205575765485069
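The three regression metrics in this section are closely related, which a small plain-Python sketch can show side by side. The helper `regression_metrics` is hypothetical, the four data points are made up for illustration, and the textbook definitions are used (R² = 1 − SS_res / SS_tot); a negative R², as in the iris example, means the predictions do worse than always predicting the mean of y_true.

```python
def regression_metrics(y_true, y_pred):
    """Hypothetical helper: (mean absolute error, mean squared error, R**2)."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot
    return mae, mse, r2


# Made-up illustration data, not taken from the vaex documentation.
y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]
mae, mse, r2 = regression_metrics(y_true, y_pred)
print(mae, mse, round(r2, 4))
```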
- recall_score(y_true, y_pred, average='binary', selection=None, array_type=None)[source]¶
Calculates the recall classification score.
For a binary classification problem, average should be set to “binary”. In this case it is assumed that the input data is encoded in 0 and 1 integers, where the class of importance is labeled as 1.
For multiclass classification problems, average should be set to “macro”. The “macro” average is the unweighted mean of a metric for each label. For multiclass problems the data can be ordinal encoded, but class names are also supported.
- Parameters
y_true – expression in the form of a string, e.g. ‘x’ or ‘x+y’ or vaex expression object, e.g. df.x or df.x+df.y
y_pred – expression in the form of a string, e.g. ‘x’ or ‘x+y’ or vaex expression object, e.g. df.x or df.x+df.y
average – Should be either ‘binary’ or ‘macro’.
selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections
array_type – Type of output array, possible values are None (keep as is), “numpy” (ndarray), “xarray” for a xarray.DataArray, or “list”/”python” for a Python list
- Returns
The recall score
Example:
>>> import vaex
>>> import vaex.ml.metrics
>>> df = vaex.from_arrays(y_true=[1, 1, 0, 1, 0, 1], y_pred=[1, 0, 0, 1, 1, 1])
>>> df.ml.metrics.recall_score(df.y_true, df.y_pred)
0.75
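The “macro” average mentioned for the multiclass case is the unweighted mean of the per-label metric. The following plain-Python sketch shows this for recall with class names as labels; `macro_recall` is a hypothetical helper and the labels are made-up illustration data.

```python
def macro_recall(y_true, y_pred):
    """Hypothetical helper: unweighted mean of the per-label recall."""
    labels = sorted(set(y_true))
    recalls = []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        recalls.append(tp / (tp + fn))
    return sum(recalls) / len(recalls)


# Made-up illustration data with class names instead of ordinal codes.
y_true = ['a', 'a', 'b', 'b', 'c', 'c']
y_pred = ['a', 'b', 'b', 'b', 'c', 'a']
score = macro_recall(y_true, y_pred)
print(round(score, 4))
```

The per-label recalls are 0.5, 1.0 and 0.5, so every class contributes equally regardless of how many rows it has.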
Scikit-learn¶
|
This class wraps any scikit-learn estimator (a.k.a. predictor) that has a .partial_fit method, and makes it a vaex pipeline object. |
|
This class wraps any scikit-learn estimator (a.k.a predictor) making it a vaex pipeline object. |
- class vaex.ml.sklearn.IncrementalPredictor(**kwargs: Any)[source]¶
Bases:
vaex.ml.state.HasState
This class wraps any scikit-learn estimator (a.k.a. predictor) that has a .partial_fit method, and makes it a vaex pipeline object.
By wrapping “on-line” scikit-learn estimators with this class, they become vaex pipeline objects. Thus, they can take full advantage of the serialization and pipeline system of vaex. While the underlying estimator is trained through its .partial_fit method, this class exposes the standard .fit method, and the batching happens behind the scenes. One can also iterate over the data multiple times (epochs), and optionally shuffle each batch before it is sent to the estimator. The predict method returns a numpy array, while the transform method adds the prediction as a virtual column to a vaex DataFrame.
Note: the .fit method will use as much memory as needed to copy one batch of data, while the .predict method will require as much memory as needed to output the predictions as a numpy array. The transform method is evaluated lazily, and no memory copies are made.
Note: we are using normal sklearn without modifications here.
Example:
>>> import vaex
>>> import vaex.ml
>>> from vaex.ml.sklearn import IncrementalPredictor
>>> from sklearn.linear_model import SGDRegressor
>>>
>>> df = vaex.example()
>>>
>>> features = df.column_names[:6]
>>> target = 'FeH'
>>>
>>> standard_scaler = vaex.ml.StandardScaler(features=features)
>>> df = standard_scaler.fit_transform(df)
>>>
>>> features = df.get_column_names(regex='^standard')
>>> model = SGDRegressor(learning_rate='constant', eta0=0.01, random_state=42)
>>>
>>> incremental = IncrementalPredictor(model=model,
...                                    features=features,
...                                    target=target,
...                                    batch_size=10_000,
...                                    num_epochs=3,
...                                    shuffle=True,
...                                    prediction_name='pred_FeH')
>>> incremental.fit(df=df)
>>> df = incremental.transform(df)
>>> df.head(5)[['FeH', 'pred_FeH']]
  #    FeH          pred_FeH
  0    -2.30923     -1.66226
  1    -1.78874     -1.68218
  2    -0.761811    -1.59562
  3    -1.52088     -1.62225
  4    -2.65534     -1.61991
- Parameters
batch_size – Number of samples to be sent to the model in each batch.
features – List of features to use.
model – A scikit-learn estimator with a .partial_fit method.
num_epochs – Number of times each batch is sent to the model.
partial_fit_kwargs – A dictionary of keyword arguments to be passed on to the partial_fit method of the model.
prediction_name – The name of the virtual column housing the predictions.
prediction_type – Which method to use to get the predictions. Can be “predict”, “predict_proba” or “predict_log_proba”.
shuffle – If True, shuffle the samples before sending them to the model.
target – The name of the target column.
- batch_size¶
Number of samples to be sent to the model in each batch.
- features¶
List of features to use.
- fit(df, progress=None)[source]¶
Fit the IncrementalPredictor to the DataFrame.
- Parameters
df – A vaex DataFrame containing the features and target on which to train the model.
progress – If True, display a progressbar which tracks the training progress.
- model¶
A scikit-learn estimator with a .partial_fit method.
- num_epochs¶
Number of times each batch is sent to the model.
- partial_fit_kwargs¶
A dictionary of keyword arguments to be passed on to the partial_fit method of the model.
- predict(df)[source]¶
Get an in-memory numpy array with the predictions of the Predictor
- Parameters
df – A vaex DataFrame, containing the input features.
- Returns
An in-memory numpy array containing the Predictor predictions.
- Return type
numpy.array
- prediction_name¶
The name of the virtual column housing the predictions.
- prediction_type¶
Which method to use to get the predictions. Can be “predict”, “predict_proba” or “predict_log_proba”.
- shuffle¶
If True, shuffle the samples before sending them to the model.
- snake_name = 'sklearn_incremental_predictor'¶
- target¶
The name of the target column.
- transform(df)[source]¶
Transform a DataFrame such that it contains the predictions of the IncrementalPredictor in the form of a virtual column.
- Parameters
df – A vaex DataFrame.
- Returns
A shallow copy of the DataFrame that includes the IncrementalPredictor prediction as a virtual column.
- Return type
DataFrame
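The batching behaviour described for this class (several epochs, fixed-size batches, optional shuffling within each batch) can be pictured with a plain-Python sketch. This is not the vaex or scikit-learn code: `RunningMean` is a hypothetical stand-in for an estimator exposing partial_fit, and `fit_in_batches` is a hypothetical driver loop.

```python
import random


class RunningMean:
    """Hypothetical stand-in for an estimator that exposes partial_fit."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0

    def partial_fit(self, X, y):
        # Update the running mean of the targets one sample at a time.
        for target in y:
            self.count += 1
            self.mean += (target - self.mean) / self.count
        return self


def fit_in_batches(model, X, y, batch_size, num_epochs=1, shuffle=False, seed=0):
    """Hypothetical driver: feed the data to partial_fit batch by batch."""
    rng = random.Random(seed)
    for _ in range(num_epochs):
        for start in range(0, len(y), batch_size):
            indices = list(range(start, min(start + batch_size, len(y))))
            if shuffle:
                rng.shuffle(indices)  # shuffle within the batch only
            model.partial_fit([X[i] for i in indices], [y[i] for i in indices])
    return model


X = [[float(i)] for i in range(10)]
y = [float(i) for i in range(10)]
model = fit_in_batches(RunningMean(), X, y, batch_size=4, num_epochs=3, shuffle=True)
print(model.count, model.mean)
```

Only one batch is held in memory at a time, which is the property the memory note above relies on.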
- class vaex.ml.sklearn.Predictor(**kwargs: Any)[source]¶
Bases:
vaex.ml.state.HasState
This class wraps any scikit-learn estimator (a.k.a predictor) making it a vaex pipeline object.
By wrapping any scikit-learn estimator with this class, it becomes a vaex pipeline object. Thus, it can take full advantage of the serialization and pipeline system of vaex. One can use the predict method to get a numpy array as an output of a fitted estimator, or the transform method to add such a prediction to a vaex DataFrame as a virtual column.
Note that a full memory copy of the data used is created when fit and predict are called. The transform method is evaluated lazily.
The scikit-learn estimators themselves are not modified at all, they are taken from your local installation of scikit-learn.
Example:
>>> import vaex.ml
>>> from vaex.ml.sklearn import Predictor
>>> from sklearn.linear_model import LinearRegression
>>> df = vaex.datasets.iris()
>>> features = ['sepal_width', 'petal_length', 'sepal_length']
>>> df_train, df_test = df.ml.train_test_split()
>>> model = Predictor(model=LinearRegression(), features=features, target='petal_width', prediction_name='pred')
>>> model.fit(df_train)
>>> df_train = model.transform(df_train)
>>> df_train.head(3)
  #    sepal_length    sepal_width    petal_length    petal_width    class_    pred
  0    5.4             3              4.5             1.5            1         1.64701
  1    4.8             3.4            1.6             0.2            0         0.352236
  2    6.9             3.1            4.9             1.5            1         1.59336
>>> df_test = model.transform(df_test)
>>> df_test.head(3)
  #    sepal_length    sepal_width    petal_length    petal_width    class_    pred
  0    5.9             3              4.2             1.5            1         1.39437
  1    6.1             3              4.6             1.4            1         1.56469
  2    6.6             2.9            4.6             1.3            1         1.44276
- Parameters
features – List of features to use.
model – A scikit-learn estimator.
prediction_name – The name of the virtual column housing the predictions.
prediction_type – Which method to use to get the predictions. Can be “predict”, “predict_proba” or “predict_log_proba”.
target – The name of the target column.
- features¶
List of features to use.
- fit(df, **kwargs)[source]¶
Fit the Predictor to the DataFrame.
- Parameters
df – A vaex DataFrame containing the features and target on which to train the model.
- model¶
A scikit-learn estimator.
- predict(df)[source]¶
Get an in-memory numpy array with the predictions of the Predictor.
- Parameters
df – A vaex DataFrame, containing the input features.
- Returns
An in-memory numpy array containing the Predictor predictions.
- Return type
numpy.array
- prediction_name¶
The name of the virtual column housing the predictions.
- prediction_type¶
Which method to use to get the predictions. Can be “predict”, “predict_proba” or “predict_log_proba”.
- snake_name = 'sklearn_predictor'¶
- target¶
The name of the target column.
Boosted trees¶
|
The LightGBM algorithm. |
|
The XGBoost algorithm. |
|
The CatBoost algorithm. |
- class vaex.ml.lightgbm.LightGBMModel(**kwargs: Any)[source]¶
Bases:
vaex.ml.state.HasState
The LightGBM algorithm.
This class provides an interface to the LightGBM algorithm, with some optimizations for better memory efficiency when training large datasets. The algorithm itself is not modified at all.
LightGBM is a fast gradient boosting algorithm based on decision trees and is mainly used for classification, regression and ranking tasks. It is under the umbrella of the Distributed Machine Learning Toolkit (DMTK) project of Microsoft. For more information, please visit https://github.com/Microsoft/LightGBM/.
Example:
>>> import vaex.ml
>>> import vaex.ml.lightgbm
>>> df = vaex.datasets.iris()
>>> features = ['sepal_width', 'petal_length', 'sepal_length', 'petal_width']
>>> df_train, df_test = df.ml.train_test_split()
>>> params = {
...     'boosting': 'gbdt',
...     'max_depth': 5,
...     'learning_rate': 0.1,
...     'application': 'multiclass',
...     'num_class': 3,
...     'subsample': 0.80,
...     'colsample_bytree': 0.80}
>>> booster = vaex.ml.lightgbm.LightGBMModel(features=features, target='class_', num_boost_round=100, params=params)
>>> booster.fit(df_train)
>>> df_train = booster.transform(df_train)
>>> df_train.head(3)
  #    sepal_width    petal_length    sepal_length    petal_width    class_    lightgbm_prediction
  0    3              4.5             5.4             1.5            1         [0.00165619 0.98097899 0.01736482]
  1    3.4            1.6             4.8             0.2            0         [9.99803930e-01 1.17346471e-04 7.87235133e-05]
  2    3.1            4.9             6.9             1.5            1         [0.00107541 0.9848717 0.01405289]
>>> df_test = booster.transform(df_test)
>>> df_test.head(3)
  #    sepal_width    petal_length    sepal_length    petal_width    class_    lightgbm_prediction
  0    3              4.2             5.9             1.5            1         [0.00208904 0.9821348 0.01577616]
  1    3              4.6             6.1             1.4            1         [0.00182039 0.98491357 0.01326604]
  2    2.9            4.6             6.6             1.3            1         [2.50915444e-04 9.98431777e-01 1.31730785e-03]
- Parameters
features – List of features to use when fitting the LightGBMModel.
num_boost_round – Number of boosting iterations.
params – Parameters to be passed on to the LightGBM model.
prediction_name – The name of the virtual column housing the predictions.
target – The name of the target column.
- features¶
List of features to use when fitting the LightGBMModel.
- fit(df, valid_sets=None, valid_names=None, early_stopping_rounds=None, evals_result=None, verbose_eval=None, **kwargs)[source]¶
Fit the LightGBMModel to the DataFrame.
The model will train until the validation score stops improving. The validation score needs to improve at least once every early_stopping_rounds rounds for training to continue. This requires at least one validation DataFrame and a specified metric. If there is more than one validation set, all of them are checked, but the training data is ignored anyway. If early stopping occurs, the model will add a best_iteration field to the booster object.
- Parameters
df – A vaex DataFrame containing the features and target on which to train the model.
valid_sets (list) – A list of DataFrames to be used for validation.
valid_names (list) – A list of strings to label the validation sets.
early_stopping_rounds (int) – Activates early stopping.
evals_result (dict) – A dictionary storing the evaluation results of all valid_sets.
verbose_eval (bool) – Requires at least one item in valid_sets. If verbose_eval is True then the evaluation metric on the validation set is printed at each boosting stage.
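The early stopping rule described above can be sketched in plain Python. This only illustrates the stopping logic, not LightGBM's implementation: `find_best_iteration` is a hypothetical helper, the validation scores are made-up numbers, a lower score is assumed to be better, and iterations are counted from zero.

```python
def find_best_iteration(validation_scores, early_stopping_rounds):
    """Hypothetical helper: return (best_iteration, last_iteration_run)."""
    best_score = float('inf')
    best_iteration = -1
    for iteration, score in enumerate(validation_scores):
        if score < best_score:
            best_score = score
            best_iteration = iteration
        elif iteration - best_iteration >= early_stopping_rounds:
            # No improvement for early_stopping_rounds rounds: stop here.
            return best_iteration, iteration
    return best_iteration, len(validation_scores) - 1


# Made-up validation losses; the last value is never reached.
scores = [0.9, 0.7, 0.6, 0.61, 0.62, 0.63, 0.5]
best_iteration, stopped_at = find_best_iteration(scores, early_stopping_rounds=3)
print(best_iteration, stopped_at)
```

Training halts three rounds after the last improvement, and the best iteration is what would be recorded on the booster.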
- num_boost_round¶
Number of boosting iterations.
- params¶
Parameters to be passed on to the LightGBM model.
- predict(df, **kwargs)[source]¶
Get an in-memory numpy array with the predictions of the LightGBMModel on a vaex DataFrame. This method accepts the keyword arguments of the predict method from LightGBM.
- Parameters
df – A vaex DataFrame.
- Returns
An in-memory numpy array containing the LightGBMModel predictions.
- Return type
numpy.array
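For a multiclass objective, each prediction is a vector of per-class probabilities, as in the lightgbm_prediction column of the class example. A minimal sketch of turning such vectors into class labels, using plain lists and a hypothetical helper name:

```python
# Rows copied from the lightgbm_prediction column in the class example.
predictions = [
    [0.00165619, 0.98097899, 0.01736482],
    [9.99803930e-01, 1.17346471e-04, 7.87235133e-05],
    [0.00107541, 0.9848717, 0.01405289],
]


def most_likely_class(probabilities):
    """Index of the largest probability (hypothetical helper)."""
    return max(range(len(probabilities)), key=probabilities.__getitem__)


labels = [most_likely_class(row) for row in predictions]
print(labels)  # [1, 0, 1]
```

The labels agree with the class_ column of the same rows. On the numpy array returned by predict, the equivalent operation would be an argmax along axis 1.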
- prediction_name¶
The name of the virtual column housing the predictions.
- target¶
The name of the target column.
- class vaex.ml.xgboost.XGBoostModel(**kwargs: Any)[source]¶
Bases:
vaex.ml.state.HasState
The XGBoost algorithm.
XGBoost is an optimized distributed gradient boosting library designed to be highly efficient, flexible and portable. It implements machine learning algorithms under the Gradient Boosting framework. XGBoost provides a parallel tree boosting (also known as GBDT, GBM) that solves many data science problems in a fast and accurate way. (https://github.com/dmlc/xgboost)
Example:
>>> import vaex
>>> import vaex.ml.xgboost
>>> df = vaex.datasets.iris()
>>> features = ['sepal_width', 'petal_length', 'sepal_length', 'petal_width']
>>> df_train, df_test = df.ml.train_test_split()
>>> params = {
...     'max_depth': 5,
...     'learning_rate': 0.1,
...     'objective': 'multi:softmax',
...     'num_class': 3,
...     'subsample': 0.80,
...     'colsample_bytree': 0.80,
...     'silent': 1}
>>> booster = vaex.ml.xgboost.XGBoostModel(features=features, target='class_', num_boost_round=100, params=params)
>>> booster.fit(df_train)
>>> df_train = booster.transform(df_train)
>>> df_train.head(3)
  #    sepal_length    sepal_width    petal_length    petal_width    class_    xgboost_prediction
  0             5.4            3               4.5            1.5         1                     1
  1             4.8            3.4             1.6            0.2         0                     0
  2             6.9            3.1             4.9            1.5         1                     1
>>> df_test = booster.transform(df_test)
>>> df_test.head(3)
  #    sepal_length    sepal_width    petal_length    petal_width    class_    xgboost_prediction
  0             5.9            3               4.2            1.5         1                     1
  1             6.1            3               4.6            1.4         1                     1
  2             6.6            2.9             4.6            1.3         1                     1
- Parameters
features – List of features to use when fitting the XGBoostModel.
num_boost_round – Number of boosting iterations.
params – A dictionary of parameters to be passed on to the XGBoost model.
prediction_name – The name of the virtual column housing the predictions.
target – The name of the target column.
- features¶
List of features to use when fitting the XGBoostModel.
- fit(df, evals=(), early_stopping_rounds=None, evals_result=None, verbose_eval=False, **kwargs)[source]¶
Fit the XGBoost model given a DataFrame.
This method accepts all keyword arguments for the xgboost.train method.
- Parameters
df – A vaex DataFrame containing the features and target on which to train the model.
evals – A list of (DataFrame, string) pairs to be evaluated during training. This allows the user to watch performance on the validation set.
early_stopping_rounds (int) – Activates early stopping. Validation error needs to decrease at least every early_stopping_rounds round(s) to continue training. Requires at least one item in evals. If there’s more than one, will use the last. Returns the model from the last iteration (not the best one).
evals_result (dict) – A dictionary storing the evaluation results of all the items in evals.
verbose_eval (bool) – Requires at least one item in evals. If verbose_eval is True then the evaluation metric on the validation set is printed at each boosting stage.
- num_boost_round¶
Number of boosting iterations.
- params¶
A dictionary of parameters to be passed on to the XGBoost model.
- predict(df, **kwargs)[source]¶
Provided a vaex DataFrame, get an in-memory numpy array with the predictions from the XGBoost model. This method accepts the keyword arguments of the predict method from XGBoost.
- Returns
An in-memory numpy array containing the XGBoostModel predictions.
- Return type
numpy.array
- prediction_name¶
The name of the virtual column housing the predictions.
- target¶
The name of the target column.
- transform(df)[source]¶
Transform a DataFrame such that it contains the predictions of the XGBoostModel in form of a virtual column.
- Parameters
df – A vaex DataFrame. It should have the same columns as the DataFrame used to train the model.
- Return copy
A shallow copy of the DataFrame that includes the XGBoostModel prediction as a virtual column.
- Return type
- class vaex.ml.catboost.CatBoostModel(**kwargs: Any)[source]¶
Bases:
vaex.ml.state.HasState
The CatBoost algorithm.
This class provides an interface to the CatBoost algorithm. CatBoost is a fast, scalable, high performance Gradient Boosting on Decision Trees library, used for ranking, classification, regression and other machine learning tasks. For more information please visit https://github.com/catboost/catboost
Example:
>>> import vaex
>>> import vaex.ml.catboost
>>> df = vaex.datasets.iris()
>>> features = ['sepal_width', 'petal_length', 'sepal_length', 'petal_width']
>>> df_train, df_test = df.ml.train_test_split()
>>> params = {
...     'leaf_estimation_method': 'Gradient',
...     'learning_rate': 0.1,
...     'max_depth': 3,
...     'bootstrap_type': 'Bernoulli',
...     'objective': 'MultiClass',
...     'eval_metric': 'MultiClass',
...     'subsample': 0.8,
...     'random_state': 42,
...     'verbose': 0}
>>> booster = vaex.ml.catboost.CatBoostModel(features=features, target='class_', num_boost_round=100, params=params)
>>> booster.fit(df_train)
>>> df_train = booster.transform(df_train)
>>> df_train.head(3)
  #    sepal_length    sepal_width    petal_length    petal_width    class_  catboost_prediction
  0             5.4            3               4.5            1.5         1  [0.00615039 0.98024259 0.01360702]
  1             4.8            3.4             1.6            0.2         0  [0.99034267 0.00526382 0.0043935 ]
  2             6.9            3.1             4.9            1.5         1  [0.00688241 0.95190908 0.04120851]
>>> df_test = booster.transform(df_test)
>>> df_test.head(3)
  #    sepal_length    sepal_width    petal_length    petal_width    class_  catboost_prediction
  0             5.9            3               4.2            1.5         1  [0.00464228 0.98883351 0.00652421]
  1             6.1            3               4.6            1.4         1  [0.00350424 0.9882139 0.00828186]
  2             6.6            2.9             4.6            1.3         1  [0.00325705 0.98891631 0.00782664]
- Parameters
batch_size – If provided, will train in batches of this size.
batch_weights – Weights to sum models at the end of training in batches.
ctr_merge_policy – Strategy for summing up models. Only used when training in batches. See the CatBoost documentation for more info.
evals_result – Evaluation results
features – List of features to use when fitting the CatBoostModel.
num_boost_round – Number of boosting iterations.
params – A dictionary of parameters to be passed on to the CatBoostModel model.
pool_params – A dictionary of parameters to be passed to the Pool data object construction
prediction_name – The name of the virtual column housing the predictions.
prediction_type – The form of the predictions. Can be “RawFormulaVal”, “Probability” or “Class”.
target – The name of the target column.
- batch_size¶
If provided, will train in batches of this size.
- batch_weights¶
Weights to sum models at the end of training in batches.
- ctr_merge_policy¶
Strategy for summing up models. Only used when training in batches. See the CatBoost documentation for more info.
- evals_result_¶
Evaluation results
- features¶
List of features to use when fitting the CatBoostModel.
- fit(df, evals=None, early_stopping_rounds=None, verbose_eval=None, plot=False, progress=None, **kwargs)[source]¶
Fit the CatBoostModel model given a DataFrame. This method accepts all keyword arguments for the catboost.train method.
- Parameters
df – A vaex DataFrame containing the features and target on which to train the model.
evals – A list of DataFrames to be evaluated during training. This allows the user to watch performance on the validation sets.
early_stopping_rounds (int) – Activates early stopping.
verbose_eval (bool) – Requires at least one item in evals. If verbose_eval is True then the evaluation metric on the validation set is printed at each boosting stage.
plot (bool) – If True, display an interactive widget in the Jupyter notebook of how the train and validation sets score on each boosting iteration.
progress – If True, display a progress bar when the training is done in batches.
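When batch_size is set, training proceeds over consecutive chunks of rows and the per-batch models are combined at the end using batch_weights. The sketch below only shows how a row count splits into batches; the helper name is hypothetical, and the equal default weights are an assumption rather than documented behaviour.

```python
def batch_ranges(n_rows, batch_size):
    """Yield (start, end) row ranges covering n_rows in chunks of batch_size."""
    for start in range(0, n_rows, batch_size):
        yield start, min(start + batch_size, n_rows)


ranges = list(batch_ranges(n_rows=150, batch_size=64))
print(ranges)  # [(0, 64), (64, 128), (128, 150)]

# Assumption: one equal weight per batch when no batch_weights are supplied.
weights = [1.0 / len(ranges)] * len(ranges)
```

Note that the last batch is smaller than batch_size whenever the row count is not an exact multiple of it. The number of entries in batch_weights should match the number of batches.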
- num_boost_round¶
Number of boosting iterations.
- params¶
A dictionary of parameters to be passed on to the CatBoostModel model.
- pool_params¶
A dictionary of parameters to be passed to the Pool data object construction
- predict(df, **kwargs)[source]¶
Provided a vaex DataFrame, get an in-memory numpy array with the predictions from the CatBoostModel model. This method accepts the keyword arguments of the predict method from catboost.
- Parameters
df – A vaex DataFrame.
- Returns
An in-memory numpy array containing the CatBoostModel predictions.
- Return type
numpy.array
- prediction_name¶
The name of the virtual column housing the predictions.
- prediction_type¶
The form of the predictions. Can be “RawFormulaVal”, “Probability” or “Class”.
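The three prediction types are related. For a multiclass objective, the probabilities are the softmax of the raw scores, and the predicted class is the one with the highest probability. A library-free sketch with made-up raw scores (for a binary objective a sigmoid would take the place of the softmax):

```python
import math


def softmax(raw_scores):
    """Convert raw per-class scores into probabilities that sum to 1."""
    largest = max(raw_scores)
    exps = [math.exp(score - largest) for score in raw_scores]  # shifted for stability
    total = sum(exps)
    return [value / total for value in exps]


raw = [-1.2, 3.4, 0.1]  # hypothetical "RawFormulaVal" output for one row
probability = softmax(raw)  # corresponds to "Probability"
predicted_class = probability.index(max(probability))  # corresponds to "Class"
print(predicted_class)  # 1
```

The class index here is only illustrative; with real data the "Class" output is expressed in terms of the target's labels.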
- target¶
The name of the target column.
- transform(df)[source]¶
Transform a DataFrame such that it contains the predictions of the CatBoostModel in form of a virtual column.
- Parameters
df – A vaex DataFrame. It should have the same columns as the DataFrame used to train the model.
- Return copy
A shallow copy of the DataFrame that includes the CatBoostModel prediction as a virtual column.
- Return type
Tensorflow¶
Incubator/experimental¶
These models are in the incubator phase and may disappear in the future
- class vaex.ml.incubator.annoy.ANNOYModel(**kwargs: Any)[source]¶
Bases:
vaex.ml.state.HasState
- Parameters
features – List of features to use.
metric – Metric to use for distance calculations
n_neighbours – How many neighbours
n_trees – Number of trees to build.
predcition_name – Output column name for the neighbours when transforming a DataFrame
prediction_name – Output column name for the neighbours when transforming a DataFrame
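search_k – Number of nodes to inspect during a search (Annoy's search_k parameter). Larger values give more accurate results but take longer.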
- features¶
List of features to use.
- metric¶
Metric to use for distance calculations
- n_neighbours¶
How many neighbours
- n_trees¶
Number of trees to build.
- predcition_name¶
Output column name for the neighbours when transforming a DataFrame
- prediction_name¶
Output column name for the neighbours when transforming a DataFrame
- search_k¶
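Number of nodes to inspect during a search (Annoy's search_k parameter). Larger values give more accurate results but take longer.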
- class vaex.ml.incubator.river.RiverModel(**kwargs: Any)[source]¶
Bases:
vaex.ml.state.HasState
This class wraps River (github.com/online-ml/river) estimators, making them vaex pipeline objects.
This class conveniently wraps River models, making them vaex pipeline objects. Thus they take full advantage of the serialization and pipeline system of vaex. Only the River models that implement the learn_many method are compatible. One can also wrap an entire River pipeline, as long as each pipeline step implements the learn_many method. With the wrapper one can iterate over the data multiple times (epochs), and optionally shuffle each batch before it is sent to the estimator. The predict method will require as much memory as needed to output the predictions as a numpy array, while the transform method is evaluated lazily, and no memory copies are made.
Example:
>>> import vaex
>>> import vaex.ml
>>> from vaex.ml.incubator.river import RiverModel
>>> from river.linear_model import LinearRegression
>>> from river import optim
>>>
>>> df = vaex.example()
>>>
>>> features = df.column_names[:6]
>>> target = 'FeH'
>>>
>>> df = df.ml.standard_scaler(features=features, prefix='scaled_')
>>>
>>> features = df.get_column_names(regex='^scaled_')
>>> model = LinearRegression(optimizer=optim.SGD(lr=0.1), intercept_lr=0.1)
>>>
>>> river_model = RiverModel(model=model,
...                          features=features,
...                          target=target,
...                          batch_size=10_000,
...                          num_epochs=3,
...                          shuffle=True,
...                          prediction_name='pred_FeH')
>>>
>>> river_model.fit(df=df)
>>> df = river_model.transform(df)
>>> df.head(5)[['FeH', 'pred_FeH']]
  #       FeH    pred_FeH
  0  -1.00539    -1.6332
  1  -1.70867    -1.56632
  2  -1.83361    -1.55338
  3  -1.47869    -1.60646
  4  -1.85705    -1.5996
- Parameters
batch_size – Number of samples to be sent to the model in each batch.
features – List of features to use.
model – A River model which implements the learn_many method.
num_epochs – Number of times each batch is sent to the model.
prediction_name – The name of the virtual column housing the predictions.
prediction_type – Which method to use to get the predictions. Can be “predict” or “predict_proba”, which correspond to “predict_many” and “predict_proba_many” in a River model respectively.
shuffle – If True, shuffle the samples before sending them to the model.
target – The name of the target column.
- batch_size¶
Number of samples to be sent to the model in each batch.
- features¶
List of features to use.
- fit(df, progress=None)[source]¶
Fit the RiverModel to the DataFrame.
- Parameters
df – A vaex DataFrame containing the features and target on which to train the model.
progress – If True, display a progressbar which tracks the training progress.
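How batch_size, num_epochs and shuffle interact during fitting can be sketched without vaex or River. The RunningMean class is a toy stand-in, not a River estimator; real River models receive pandas objects in learn_many, whereas plain lists are used here.

```python
import random


class RunningMean:
    """Toy stand-in for an estimator: predicts the mean of the targets seen."""

    def __init__(self):
        self.total = 0.0
        self.count = 0

    def learn_many(self, X, y):
        self.total += sum(y)
        self.count += len(y)

    def predict_many(self, X):
        mean = self.total / self.count
        return [mean for _ in X]


def fit_in_batches(model, X, y, batch_size, num_epochs, shuffle, seed=0):
    """Send the data to the model batch by batch, num_epochs times."""
    rng = random.Random(seed)
    for _ in range(num_epochs):
        for start in range(0, len(X), batch_size):
            rows = list(range(start, min(start + batch_size, len(X))))
            if shuffle:
                rng.shuffle(rows)  # shuffle within the batch only
            model.learn_many([X[i] for i in rows], [y[i] for i in rows])
    return model


X = [[float(i)] for i in range(10)]
y = [2.0 * i for i in range(10)]
model = fit_in_batches(RunningMean(), X, y, batch_size=4, num_epochs=3, shuffle=True)
print(model.count)  # 30: each of the 10 rows is seen once per epoch
print(model.predict_many(X[:2]))  # [9.0, 9.0]
```

Because the model is updated incrementally, only one batch needs to be held in memory at a time, which is the point of requiring learn_many.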
- model¶
A River model which implements the learn_many method.
- num_epochs¶
Number of times each batch is sent to the model.
- predict(df)[source]¶
Get an in-memory numpy array with the predictions of the Model
- Parameters
df – A vaex DataFrame containing the input features
- Returns
An in-memory numpy array containing the Model predictions
- Return type
numpy.array
- prediction_name¶
The name of the virtual column housing the predictions.
- prediction_type¶
Which method to use to get the predictions. Can be “predict” or “predict_proba”, which correspond to “predict_many” and “predict_proba_many” in a River model respectively.
- shuffle¶
If True, shuffle the samples before sending them to the model.
- target¶
The name of the target column.
vaex-viz¶
- class vaex.viz.DataFrameAccessorViz(df)[source]¶
Bases:
object
- __weakref__¶
list of weak references to the object (if defined)
- healpix_heatmap(healpix_expression='source_id/34359738368', healpix_max_level=12, healpix_level=8, what='count(*)', selection=None, grid=None, healpix_input='equatorial', healpix_output='galactic', f=None, colormap='afmhot', grid_limits=None, image_size=800, nest=True, figsize=None, interactive=False, title='', smooth=None, show=False, colorbar=True, rotation=(0, 0, 0), **kwargs)¶
Viz data in 2d using a healpix column.
- Parameters
healpix_expression – {healpix_max_level}
healpix_max_level – {healpix_max_level}
healpix_level – {healpix_level}
what – {what}
selection – {selection}
grid – {grid}
healpix_input – Specify if the healpix index is in “equatorial”, “galactic” or “ecliptic”.
healpix_output – Plot in “equatorial”, “galactic” or “ecliptic”.
f – function to apply to the data
colormap – matplotlib colormap
grid_limits – Optional sequence [minvalue, maxvalue] that determines the min and max values that map to the colormap (values below and above these are clipped to the min/max). (default is [min(f(grid)), max(f(grid))])
image_size – size for the image that healpy uses for rendering
nest – If the healpix data is in nested (True) or ring (False)
figsize – If given, modify the matplotlib figure size. Example (14,9)
interactive – (Experimental) uses healpy.mollzoom if True
title – Title of figure
smooth – apply gaussian smoothing, in degrees
show – Call matplotlib’s show (True) or not (False, default)
rotation – Rotate the plot, in format (lon, lat, psi) such that (lon, lat) is the center, and rotate on the screen by angle psi. All angles are in degrees.
- Returns
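The default arguments suggest how a healpix index is derived. 34359738368 equals 2**35, the divisor that turns a Gaia-style source_id into a level 12 index in the nested scheme. In that scheme each pixel has four children, so moving to a coarser level is an integer division by a power of four. The sketch below is not vaex's code: the helper names and the sample source_id are hypothetical.

```python
def healpix_pixel_count(level):
    """Number of pixels on the sphere at a HEALPix level (nside = 2**level)."""
    nside = 2 ** level
    return 12 * nside * nside


def degrade_nested_index(index, from_level, to_level):
    """Map a nested-scheme pixel index to its parent pixel at a coarser level."""
    return index // 4 ** (from_level - to_level)


source_id = 1000 * 34359738368 + 123  # hypothetical identifier
index_level_12 = source_id // 34359738368  # same divisor as the default healpix_expression
index_level_8 = degrade_nested_index(index_level_12, from_level=12, to_level=8)

print(index_level_12, index_level_8)  # 1000 3
print(healpix_pixel_count(8))  # 786432
```

A lower healpix_level therefore gives fewer, larger pixels, each aggregating 4**(healpix_max_level - healpix_level) pixels of the finest level.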
- heatmap(x=None, y=None, z=None, what='count(*)', vwhat=None, reduce=['colormap'], f=None, normalize='normalize', normalize_axis='what', vmin=None, vmax=None, shape=256, vshape=32, limits=None, grid=None, colormap='afmhot', figsize=None, xlabel=None, ylabel=None, aspect='auto', tight_layout=True, interpolation='nearest', show=False, colorbar=True, colorbar_label=None, selection=None, selection_labels=None, title=None, background_color='white', pre_blend=False, background_alpha=1.0, visual={'column': 'what', 'fade': 'selection', 'layer': 'z', 'row': 'subspace', 'x': 'x', 'y': 'y'}, smooth_pre=None, smooth_post=None, wrap=True, wrap_columns=4, return_extra=False, hardcopy=None)¶
Viz data in a 2d histogram/heatmap.
Declarative plotting of statistical plots using matplotlib, supports subplots, selections, layers.
Instead of passing x and y, pass a list as the x argument for multiple panels. Give what a list of options to have multiple panels. When both are present, the panels are organized in a column/row order.
This method creates a 6-dimensional ‘grid’, where each dimension can map to a visual dimension. The grid dimensions are:
x: shape determined by shape, content by x argument or the first dimension of each space
y: shape determined by shape, content by y argument or the second dimension of each space
z: related to the z argument
selection: shape equals length of selection argument
what: shape equals length of what argument
space: shape equals length of x argument if multiple values are given
By default, its shape is (1, 1, 1, 1, shape, shape) (where x is the last dimension)
The visual dimensions are
x: x coordinate on a plot / image (default maps to grid’s x)
y: y coordinate on a plot / image (default maps to grid’s y)
layer: each image in this dimension is blended together to one image (default maps to z)
fade: each image is shown faded after the next image (default maps to selection)
row: rows of subplots (default maps to space)
column: columns of subplots (default maps to what)
All these mappings can be changed by the visual argument, some examples:
>>> df.viz.heatmap('x', 'y', what=['mean(x)', 'correlation(vx, vy)'])
Will plot each ‘what’ as a column.
>>> df.viz.heatmap('x', 'y', selection=['FeH < -3', '(FeH >= -3) & (FeH < -2)'], visual=dict(column='selection'))
Will plot each selection as a column, instead of faded on top of each other.
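The grid and visual dimensions described above can be summarised in a short sketch. The axis order (space, what, selection, z, y, x) is inferred from the statement that x is the last dimension and is therefore an assumption; the helper name is hypothetical, and the mapping dictionary mirrors the default of the visual argument in the signature.

```python
def grid_shape(n_spaces=1, n_what=1, n_selections=1, z_shape=1, shape=256):
    """Shape of the 6-dimensional grid, ordered so that x is the last axis."""
    return (n_spaces, n_what, n_selections, z_shape, shape, shape)


# Default mapping from visual dimensions to grid dimensions (from the signature).
default_visual = {
    "x": "x",
    "y": "y",
    "layer": "z",
    "fade": "selection",
    "row": "subspace",
    "column": "what",
}

print(grid_shape())  # (1, 1, 1, 1, 256, 256)
print(grid_shape(n_what=2))  # (1, 2, 1, 1, 256, 256): two statistics, two columns
```

Passing a list to what, selection or x lengthens the corresponding axis, and the visual mapping decides whether that axis becomes rows, columns, layers or faded overlays.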
- Parameters
x – Expression to bin in the x direction (by default maps to x), or list of pairs, like [[‘x’, ‘y’], [‘x’, ‘z’]], if multiple pairs are given, this dimension maps to rows by default
y – y (by default maps to y)
z – Expression to bin in the z direction, followed by a :start,end,shape signature, e.g. ‘FeH:-3,1,5’ will produce 5 layers between -3 and 1 (by default maps to layer)
what – What to plot, count(*) will show a N-d histogram, mean(‘x’), the mean of the x column, sum(‘x’) the sum, std(‘x’) the standard deviation, correlation(‘vx’, ‘vy’) the correlation coefficient. Can also be a list of values, like [‘count(x)’, ‘std(vx)’], (by default maps to column)
reduce –
f – transform values by: ‘identity’ does nothing ‘log’ or ‘log10’ will show the log of the value
normalize – normalization function, currently only ‘normalize’ is supported
normalize_axis – which axes to normalize on, None means normalize by the global maximum.
vmin – instead of automatic normalization (using normalize and normalize_axis), scale the data between vmin and vmax to [0, 1]
vmax – see vmin
shape – shape for the array where the statistic is calculated on, if only an integer is given, it is used for all dimensions, e.g. shape=128, shape=[128, 256]
limits – description for the min and max values for the expressions, e.g. ‘minmax’ (default), ‘99.7%’, [0, 10], or a list of, e.g. [[0, 10], [0, 20], ‘minmax’]
grid – If grid is given, instead of computing a statistic given by what, use this Nd-numpy array instead. This is often useful when a custom computation/statistic is calculated, but you still want to use the plotting machinery.
colormap – matplotlib colormap to use
figsize – (x, y) tuple passed to plt.figure for setting the figure size
xlabel –
ylabel –
aspect –
tight_layout – call plt.tight_layout or not
colorbar – plot a colorbar or not
selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False)
interpolation – interpolation for imshow, possible options are: ‘nearest’, ‘bilinear’, ‘bicubic’, see matplotlib for more
return_extra –
- Returns
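The combined effect of f, vmin and vmax on the values that reach the colormap can be sketched as follows. This is an illustration, not vaex's implementation: the helper names are hypothetical, and clipping values outside [vmin, vmax] is an assumption.

```python
import math


def apply_f(values, f="identity"):
    """Transform values before they are mapped to colours."""
    if f == "identity":
        return list(values)
    if f == "log10":
        return [math.log10(value) for value in values]
    raise ValueError(f"unsupported transform in this sketch: {f}")


def scale_to_unit(values, vmin, vmax):
    """Scale values linearly so that vmin maps to 0 and vmax maps to 1."""
    # Assumption: values outside [vmin, vmax] are clipped to the ends.
    return [min(1.0, max(0.0, (value - vmin) / (vmax - vmin))) for value in values]


counts = [1, 10, 100, 1000]
logged = apply_f(counts, f="log10")  # approximately [0, 1, 2, 3]
scaled = scale_to_unit(logged, vmin=0.0, vmax=2.0)  # approximately [0, 0.5, 1, 1]
print(scaled)
```

Fixing vmin and vmax this way makes several plots directly comparable, because the same value always maps to the same colour.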
- histogram(x=None, what='count(*)', grid=None, shape=64, facet=None, limits=None, figsize=None, f='identity', n=None, normalize_axis=None, xlabel=None, ylabel=None, label=None, selection=None, show=False, tight_layout=True, hardcopy=None, progress=None, **kwargs)¶
Plot a histogram.
Example:
>>> df.histogram(df.x)
>>> df.histogram(df.x, limits=[0, 100], shape=100)
>>> df.histogram(df.x, what='mean(y)', limits=[0, 100], shape=100)
If you want to do a computation yourself, pass the grid argument, but you are responsible for passing the same limits arguments:
>>> means = df.mean(df.y, binby=df.x, limits=[0, 100], shape=100)/100.
>>> df.histogram(df.x, limits=[0, 100], shape=100, grid=means, label='mean(y)/100')
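What the grid argument has to contain can be illustrated without vaex: one value per bin, computed with the same limits and shape that are later passed to the plotting call. The function below is a plain Python sketch of a binned mean with a hypothetical name, not the algorithm vaex uses.

```python
def binned_mean(x, y, limits, shape):
    """Mean of y in each of `shape` equal-width bins of x between the limits."""
    xmin, xmax = limits
    sums = [0.0] * shape
    counts = [0] * shape
    for xi, yi in zip(x, y):
        if not (xmin <= xi < xmax):
            continue  # rows outside the limits are ignored
        index = int((xi - xmin) / (xmax - xmin) * shape)
        sums[index] += yi
        counts[index] += 1
    # Empty bins have no defined mean.
    return [s / c if c else float("nan") for s, c in zip(sums, counts)]


x = [5, 15, 15, 95, 120]
y = [1.0, 2.0, 4.0, 8.0, 16.0]
grid = binned_mean(x, y, limits=[0, 100], shape=10)
print(grid[0], grid[1], grid[9])  # 1.0 3.0 8.0
```

If the limits or shape used for the computation differ from those given to the plot, the values end up drawn at the wrong x positions, which is why both must be passed consistently.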
- Parameters
x – expression in the form of a string, e.g. ‘x’ or ‘x+y’ or vaex expression object, e.g. df.x or df.x+df.y
what – What to plot, count(*) will show a N-d histogram, mean(‘x’), the mean of the x column, sum(‘x’) the sum
grid – If grid is given, instead of computing a statistic given by what, use this Nd-numpy array instead. This is often useful when a custom computation/statistic is calculated, but you still want to use the plotting machinery.
shape – shape for the array where the statistic is calculated on, if only an integer is given, it is used for all dimensions, e.g. shape=128, shape=[128, 256]
facet – Expression to produce facetted plots ( facet=’x:0,1,12’ will produce 12 plots with x in a range between 0 and 1)
limits – description for the min and max values for the expressions, e.g. ‘minmax’ (default), ‘99.7%’, [0, 10], or a list of, e.g. [[0, 10], [0, 20], ‘minmax’]
figsize – (x, y) tuple passed to plt.figure for setting the figure size
f – transform values by: ‘identity’ does nothing ‘log’ or ‘log10’ will show the log of the value
n – normalization function, currently only ‘normalize’ is supported, or None for no normalization
normalize_axis – which axes to normalize on, None means normalize by the global maximum.
selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections
xlabel – String for label on x axis (may contain latex)
ylabel – Same for y axis
kwargs – extra argument passed to plt.plot
tight_layout – call plt.tight_layout or not
- Returns
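The facet argument uses a compact "expression:start,end,shape" string. A minimal sketch of how such a string breaks down into an expression, a range and a number of bins; the parser below is hypothetical and may differ from how vaex reads the string.

```python
def parse_binning_spec(spec):
    """Split 'expression:start,end,shape' into (expression, (start, end), shape)."""
    expression, _, binning = spec.rpartition(":")
    start, end, shape = binning.split(",")
    return expression, (float(start), float(end)), int(shape)


expression, limits, shape = parse_binning_spec("x:0,1,12")
print(expression, limits, shape)  # x (0.0, 1.0) 12

# Bin edges implied by the spec: 12 equal-width bins between 0 and 1.
width = (limits[1] - limits[0]) / shape
edges = [limits[0] + i * width for i in range(shape + 1)]
```

With facet=’x:0,1,12’ each of the 12 plots then covers one of the intervals between consecutive edges.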
- scatter(x, y, xerr=None, yerr=None, cov=None, corr=None, s_expr=None, c_expr=None, labels=None, selection=None, length_limit=50000, length_check=True, label=None, xlabel=None, ylabel=None, errorbar_kwargs={}, ellipse_kwargs={}, **kwargs)¶
Viz (small amounts) of data in 2d using a scatter plot
Convenience wrapper around plt.scatter for working with small DataFrames or selections
- Parameters
x – expression in the form of a string, e.g. ‘x’ or ‘x+y’ or vaex expression object, e.g. df.x or df.x+df.y
y – expression in the form of a string, e.g. ‘x’ or ‘x+y’ or vaex expression object, e.g. df.x or df.x+df.y
s_expr – When given, use it for the s (size) argument of plt.scatter
c_expr – When given, use it for the c (color) argument of plt.scatter
labels – Annotate the points with these text values
selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False)
length_limit – maximum number of rows it will plot
length_check – should we do the maximum row check or not?
label – label for the legend
xlabel – label for x axis, if None .label(x) is used
ylabel – label for y axis, if None .label(y) is used
errorbar_kwargs – extra dict with arguments passed to plt.errorbar
kwargs – extra arguments passed to plt.scatter
- Returns
- class vaex.viz.ExpressionAccessorViz(expression)[source]¶
Bases:
object
- __weakref__¶
list of weak references to the object (if defined)
- histogram(what='count(*)', grid=None, shape=64, facet=None, limits=None, figsize=None, f='identity', n=None, normalize_axis=None, xlabel=None, ylabel=None, label=None, selection=None, show=False, tight_layout=True, hardcopy=None, progress=None, **kwargs)[source]¶
Plot a histogram of the expression. This is a convenience method for df.histogram(…)
Example:
>>> df.x.histogram()
>>> df.x.histogram(limits=[0, 100], shape=100)
>>> df.x.histogram(what='mean(y)', limits=[0, 100], shape=100)
If you want to do a computation yourself, pass the grid argument, but you are responsible for passing the same limits arguments:
>>> means = df.mean(df.y, binby=df.x, limits=[0, 100], shape=100)/100.
>>> df.plot1d(df.x, limits=[0, 100], shape=100, grid=means, label='mean(y)/100')
- Parameters
x – Expression to bin in the x direction
what – What to plot, count(*) will show a N-d histogram, mean(‘x’), the mean of the x column, sum(‘x’) the sum
grid – If the binning is done before by yourself, you can pass it
facet – Expression to produce facetted plots ( facet=’x:0,1,12’ will produce 12 plots with x in a range between 0 and 1)
limits – list of [xmin, xmax], or a description such as ‘minmax’, ‘99%’
figsize – (x, y) tuple passed to plt.figure for setting the figure size
f – transform values by: ‘identity’ does nothing ‘log’ or ‘log10’ will show the log of the value
n – normalization function, currently only ‘normalize’ is supported, or None for no normalization
normalize_axis – which axes to normalize on, None means normalize by the global maximum.
xlabel – String for label on x axis (may contain latex)
ylabel – Same for y axis
kwargs – extra argument passed to plt.plot
tight_layout – call plt.tight_layout or not
- Returns
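The limits argument accepts either explicit numbers or a description such as ‘minmax’ or a percentage. The sketch below shows one plausible reading, in which a percentage is the central interval containing that share of the data. That reading, the helper name and the sort-based percentile are assumptions; vaex computes such limits over the full dataset by its own means.

```python
def resolve_limits(values, limits="minmax"):
    """Turn a limits description into a numeric [xmin, xmax] pair."""
    ordered = sorted(values)
    if limits == "minmax":
        return [ordered[0], ordered[-1]]
    if isinstance(limits, str) and limits.endswith("%"):
        fraction = float(limits[:-1]) / 100.0
        tail = (1.0 - fraction) / 2.0  # share of data cut off on each side
        last = len(ordered) - 1
        low = ordered[int(round(tail * last))]
        high = ordered[int(round((1.0 - tail) * last))]
        return [low, high]
    xmin, xmax = limits  # explicit [xmin, xmax]
    return [xmin, xmax]


values = list(range(101))
print(resolve_limits(values))  # [0, 100]
print(resolve_limits(values, "90%"))  # [5, 95]
print(resolve_limits(values, [0, 10]))  # [0, 10]
```

A percentage is useful when a few outliers would otherwise stretch the ‘minmax’ range and squeeze most of the data into a handful of bins.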