Using vaex with arrow

Vaex supports Arrow. We will demonstrate vaex+arrow by giving a quick look at a large dataset that does not fit into memory. The NYC taxi dataset for the year 2015 contains about 150 million rows containing information about taxi trips in New York, and is about 23GB in size. You can download it here:

In case you want to convert it to the arrow format, use the code below:

ds_hdf5 = vaex.open('/Users/maartenbreddels/datasets/nytaxi/nyc_taxi2015.hdf5')
# this may take a while to export
ds_hdf5.export('./nyc_taxi2015.arrow')

Also make sure you install vaex-arrow:

$ pip install vaex-arrow
[1]:
!ls -alh /Users/maartenbreddels/datasets/nytaxi/nyc_taxi2015.arrow
-rw-r--r--  1 maartenbreddels  staff    23G Oct 31 18:56 /Users/maartenbreddels/datasets/nytaxi/nyc_taxi2015.arrow
[3]:
import vaex

Opens instantly

Opening the file goes instantly, since nothing is being copied to memory. The data is only memory mapped, a technique that will only read the data when needed.

[4]:
%time
df = vaex.open('/Users/maartenbreddels/datasets/nytaxi/nyc_taxi2015.arrow')
CPU times: user 3 µs, sys: 1 µs, total: 4 µs
Wall time: 6.91 µs
[5]:
df
[5]:
#VendorIDdropoff_dayofweekdropoff_hourdropoff_latitudedropoff_longitudeextrafare_amountimprovement_surchargemta_taxpassenger_countpayment_typepickup_dayofweekpickup_hourpickup_latitudepickup_longitudetip_amounttolls_amounttotal_amounttpep_dropoff_datetimetpep_pickup_datetimetrip_distance
023.019.040.75061798095703-73.974784851074221.012.00.30.5113.019.040.7501106262207-73.9938964843753.250.017.05numpy.datetime64('2015-01-15T19:23:42.000000000')numpy.datetime64('2015-01-15T19:05:39.000000000')1.59
115.020.040.75910949707031-73.994415283203120.514.50.30.5115.020.040.7242431640625-74.001647949218752.00.017.8numpy.datetime64('2015-01-10T20:53:28.000000000')numpy.datetime64('2015-01-10T20:33:38.000000000')3.3
215.020.040.82441329956055-73.951820373535160.59.50.30.5125.020.040.80278778076172-73.963340759277340.00.010.8numpy.datetime64('2015-01-10T20:43:41.000000000')numpy.datetime64('2015-01-10T20:33:38.000000000')1.8
315.020.040.71998596191406-74.004325866699230.53.50.30.5125.020.040.71381759643555-74.009086608886720.00.04.8numpy.datetime64('2015-01-10T20:35:31.000000000')numpy.datetime64('2015-01-10T20:33:39.000000000')0.5
415.020.040.742652893066406-74.004180908203120.515.00.30.5125.020.040.762428283691406-73.971176147460940.00.016.3numpy.datetime64('2015-01-10T20:52:58.000000000')numpy.datetime64('2015-01-10T20:33:39.000000000')3.0
..................................................................
146,112,98424.00.040.722469329833984-73.986213684082030.57.50.30.5513.023.040.72087097167969-73.993812561035161.760.010.56numpy.datetime64('2016-01-01T00:08:18.000000000')numpy.datetime64('2015-12-31T23:59:56.000000000')1.2
146,112,98514.00.040.75238800048828-73.939514160156250.57.50.30.5223.023.040.76028060913085-73.965270996093750.00.08.8numpy.datetime64('2016-01-01T00:05:19.000000000')numpy.datetime64('2015-12-31T23:59:58.000000000')2.0
146,112,98614.00.040.69329833984375-73.98867034912110.513.50.30.5223.023.040.73907852172852-73.987297058105470.00.014.8numpy.datetime64('2016-01-01T00:12:55.000000000')numpy.datetime64('2015-12-31T23:59:59.000000000')3.8
146,112,98724.00.040.705322265625-74.017120361328120.58.50.30.5123.023.040.72569274902344-73.997558593750.00.09.8numpy.datetime64('2016-01-01T00:10:26.000000000')numpy.datetime64('2015-12-31T23:59:59.000000000')1.96
146,112,98824.00.040.76057052612305-73.990982055664060.513.50.30.5113.023.040.76725769042969-73.984397888183582.960.017.76numpy.datetime64('2016-01-01T00:21:30.000000000')numpy.datetime64('2015-12-31T23:59:59.000000000')1.06

Quick viz of 146 million rows

As can be seen, this dataset contains 146 million rows. Using plot, we can generate a quick overview what the data contains. The pickup locations nicely outline Manhattan.

[6]:
df.plot(df.pickup_longitude, df.pickup_latitude, f='log');
_images/example_arrow_8_0.png
[7]:
df.total_amount.minmax()
[7]:
array([-4.9630000e+02,  3.9506116e+06])

Data cleansing: outliers

As can be seen from the total_amount columns (how much people payed), this dataset contains outliers. From a quick 1d plot, we can see reasonable ways to filter the data

[8]:
df.plot1d(df.total_amount, shape=100, limits=[0, 100])
[8]:
[<matplotlib.lines.Line2D at 0x121d26320>]
_images/example_arrow_11_1.png
[9]:
# filter the dataset
dff = df[(df.total_amount >= 0) & (df.total_amount < 100)]

Shallow copies

This filtered dataset did not copy any data (otherwise it would have costed us about ~23GB of RAM). Shallow copies of the data are made instead and a booleans mask tracks which rows should be used.

[10]:
dff['ratio'] = dff.tip_amount/dff.total_amount

Virtual column

The new column ratio does not do any computation yet, it only stored the expression and does not waste any memory. However, the new (virtual) column can be used in calculations as if it were a normal column.

[11]:
dff.ratio.mean()
<string>:1: RuntimeWarning: invalid value encountered in true_divide
[11]:
0.09601926650107262

Result

Our final result, the percentage of the tip, can be easily calcualted for this large dataset, it did not require any excessive amount of memory.

Interoperability

Since the data lives as Arrow arrays, we can pass them around to other libraries such as pandas, or even pass it to other processes.

[12]:
arrow_table = df.to_arrow_table()
arrow_table
[12]:
pyarrow.Table
VendorID: int64
dropoff_dayofweek: double
dropoff_hour: double
dropoff_latitude: double
dropoff_longitude: double
extra: double
fare_amount: double
improvement_surcharge: double
mta_tax: double
passenger_count: int64
payment_type: int64
pickup_dayofweek: double
pickup_hour: double
pickup_latitude: double
pickup_longitude: double
tip_amount: double
tolls_amount: double
total_amount: double
tpep_dropoff_datetime: timestamp[ns]
tpep_pickup_datetime: timestamp[ns]
trip_distance: double
[13]:
# Although you can 'convert' (pass the data) in to pandas,
# some memory will be wasted (at least an index will be created by pandas)
# here we just pass a subset of the data
df_pandas = df[:10000].to_pandas_df()
df_pandas
[13]:
VendorID dropoff_dayofweek dropoff_hour dropoff_latitude dropoff_longitude extra fare_amount improvement_surcharge mta_tax passenger_count ... pickup_dayofweek pickup_hour pickup_latitude pickup_longitude tip_amount tolls_amount total_amount tpep_dropoff_datetime tpep_pickup_datetime trip_distance
0 2 3.0 19.0 40.750618 -73.974785 1.0 12.0 0.3 0.5 1 ... 3.0 19.0 40.750111 -73.993896 3.25 0.00 17.05 2015-01-15 19:23:42 2015-01-15 19:05:39 1.59
1 1 5.0 20.0 40.759109 -73.994415 0.5 14.5 0.3 0.5 1 ... 5.0 20.0 40.724243 -74.001648 2.00 0.00 17.80 2015-01-10 20:53:28 2015-01-10 20:33:38 3.30
2 1 5.0 20.0 40.824413 -73.951820 0.5 9.5 0.3 0.5 1 ... 5.0 20.0 40.802788 -73.963341 0.00 0.00 10.80 2015-01-10 20:43:41 2015-01-10 20:33:38 1.80
3 1 5.0 20.0 40.719986 -74.004326 0.5 3.5 0.3 0.5 1 ... 5.0 20.0 40.713818 -74.009087 0.00 0.00 4.80 2015-01-10 20:35:31 2015-01-10 20:33:39 0.50
4 1 5.0 20.0 40.742653 -74.004181 0.5 15.0 0.3 0.5 1 ... 5.0 20.0 40.762428 -73.971176 0.00 0.00 16.30 2015-01-10 20:52:58 2015-01-10 20:33:39 3.00
5 1 5.0 20.0 40.758194 -73.986977 0.5 27.0 0.3 0.5 1 ... 5.0 20.0 40.774048 -73.874374 6.70 5.33 40.33 2015-01-10 20:53:52 2015-01-10 20:33:39 9.00
6 1 5.0 20.0 40.749634 -73.992470 0.5 14.0 0.3 0.5 1 ... 5.0 20.0 40.726009 -73.983276 0.00 0.00 15.30 2015-01-10 20:58:31 2015-01-10 20:33:39 2.20
7 1 5.0 20.0 40.726326 -73.995010 0.5 7.0 0.3 0.5 3 ... 5.0 20.0 40.734142 -74.002663 1.66 0.00 9.96 2015-01-10 20:42:20 2015-01-10 20:33:39 0.80
8 1 5.0 21.0 40.759357 -73.987595 0.0 52.0 0.3 0.5 3 ... 5.0 20.0 40.644356 -73.783043 0.00 5.33 58.13 2015-01-10 21:11:35 2015-01-10 20:33:39 18.20
9 1 5.0 20.0 40.759365 -73.985916 0.5 6.5 0.3 0.5 2 ... 5.0 20.0 40.767948 -73.985588 1.55 0.00 9.35 2015-01-10 20:40:44 2015-01-10 20:33:40 0.90
10 1 5.0 20.0 40.728584 -74.004395 0.5 7.0 0.3 0.5 1 ... 5.0 20.0 40.723103 -73.988617 1.66 0.00 9.96 2015-01-10 20:41:39 2015-01-10 20:33:40 0.90
11 1 5.0 20.0 40.757217 -73.967407 0.5 7.5 0.3 0.5 1 ... 5.0 20.0 40.751419 -73.993782 1.00 0.00 9.80 2015-01-10 20:43:26 2015-01-10 20:33:41 1.10
12 1 5.0 20.0 40.707726 -74.009773 0.5 3.0 0.3 0.5 1 ... 5.0 20.0 40.704376 -74.008362 0.00 0.00 4.30 2015-01-10 20:35:23 2015-01-10 20:33:41 0.30
13 1 5.0 21.0 40.735210 -73.997345 0.5 19.0 0.3 0.5 1 ... 5.0 20.0 40.760448 -73.973946 3.00 0.00 23.30 2015-01-10 21:03:04 2015-01-10 20:33:41 3.10
14 1 5.0 20.0 40.739895 -73.995216 0.5 6.0 0.3 0.5 1 ... 5.0 20.0 40.731777 -74.006721 0.00 0.00 7.30 2015-01-10 20:39:23 2015-01-10 20:33:41 1.10
15 2 3.0 19.0 40.757889 -73.983978 1.0 16.5 0.3 0.5 1 ... 3.0 19.0 40.739811 -73.976425 4.38 0.00 22.68 2015-01-15 19:32:00 2015-01-15 19:05:39 2.38
16 2 3.0 19.0 40.786858 -73.955124 1.0 12.5 0.3 0.5 5 ... 3.0 19.0 40.754246 -73.968704 0.00 0.00 14.30 2015-01-15 19:21:00 2015-01-15 19:05:40 2.83
17 2 3.0 19.0 40.785782 -73.952713 1.0 26.0 0.3 0.5 5 ... 3.0 19.0 40.769581 -73.863060 8.08 5.33 41.21 2015-01-15 19:28:18 2015-01-15 19:05:40 8.33
18 2 3.0 19.0 40.786083 -73.980850 1.0 11.5 0.3 0.5 1 ... 3.0 19.0 40.779423 -73.945541 0.00 0.00 13.30 2015-01-15 19:20:36 2015-01-15 19:05:41 2.37
19 2 3.0 19.0 40.718590 -73.952377 1.0 21.5 0.3 0.5 2 ... 3.0 19.0 40.774010 -73.874458 4.50 0.00 27.80 2015-01-15 19:20:22 2015-01-15 19:05:41 7.13
20 2 3.0 19.0 40.714596 -73.998924 1.0 17.5 0.3 0.5 1 ... 3.0 19.0 40.751896 -73.976601 0.00 0.00 19.30 2015-01-15 19:31:00 2015-01-15 19:05:41 3.60
21 2 3.0 19.0 40.734650 -73.999939 1.0 5.5 0.3 0.5 1 ... 3.0 19.0 40.745079 -73.994957 1.62 0.00 8.92 2015-01-15 19:10:22 2015-01-15 19:05:41 0.89
22 2 3.0 19.0 40.735512 -74.003563 1.0 5.5 0.3 0.5 1 ... 3.0 19.0 40.747063 -74.000938 1.30 0.00 8.60 2015-01-15 19:10:55 2015-01-15 19:05:41 0.96
23 2 3.0 19.0 40.704220 -74.007919 1.0 6.5 0.3 0.5 2 ... 3.0 19.0 40.717892 -74.002777 1.50 0.00 9.80 2015-01-15 19:12:36 2015-01-15 19:05:41 1.25
24 2 3.0 19.0 40.761856 -73.978172 1.0 11.5 0.3 0.5 5 ... 3.0 19.0 40.736362 -73.997459 2.50 0.00 15.80 2015-01-15 19:22:11 2015-01-15 19:05:41 2.11
25 2 3.0 19.0 40.811089 -73.953339 1.0 7.5 0.3 0.5 5 ... 3.0 19.0 40.823994 -73.952278 1.70 0.00 11.00 2015-01-15 19:14:05 2015-01-15 19:05:41 1.15
26 2 3.0 19.0 40.734890 -73.988609 1.0 9.0 0.3 0.5 1 ... 3.0 19.0 40.750080 -73.991127 0.00 0.00 10.80 2015-01-15 19:16:18 2015-01-15 19:05:42 1.53
27 2 3.0 19.0 40.743530 -73.985603 0.0 52.0 0.3 0.5 1 ... 3.0 19.0 40.644127 -73.786575 6.00 5.33 64.13 2015-01-15 19:49:07 2015-01-15 19:05:42 18.06
28 2 3.0 19.0 40.757721 -73.994514 1.0 10.0 0.3 0.5 1 ... 3.0 19.0 40.741447 -73.993668 2.36 0.00 14.16 2015-01-15 19:18:33 2015-01-15 19:05:42 1.76
29 2 3.0 19.0 40.704689 -74.009079 1.0 17.5 0.3 0.5 6 ... 3.0 19.0 40.744083 -73.985291 3.70 0.00 23.00 2015-01-15 19:21:40 2015-01-15 19:05:42 5.19
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
9970 1 4.0 11.0 40.719917 -73.955521 0.0 20.0 0.3 0.5 1 ... 4.0 10.0 40.725979 -74.009071 4.00 0.00 24.80 2015-01-30 11:20:08 2015-01-30 10:51:40 3.70
9971 1 4.0 10.0 40.720398 -73.984940 1.0 6.5 0.3 0.5 1 ... 4.0 10.0 40.732452 -73.985001 1.65 0.00 9.95 2015-01-30 10:58:58 2015-01-30 10:51:40 1.10
9972 1 4.0 11.0 40.755405 -74.002457 0.0 8.5 0.3 0.5 2 ... 4.0 10.0 40.751358 -73.990479 1.00 0.00 10.30 2015-01-30 11:03:41 2015-01-30 10:51:41 0.70
9973 2 1.0 19.0 40.763626 -73.969666 1.0 24.5 0.3 0.5 1 ... 1.0 18.0 40.708790 -74.017281 5.10 0.00 31.40 2015-01-13 19:22:18 2015-01-13 18:55:41 7.08
9974 2 1.0 19.0 40.772366 -73.960800 1.0 5.5 0.3 0.5 5 ... 1.0 18.0 40.780003 -73.954681 1.00 0.00 8.30 2015-01-13 19:02:03 2015-01-13 18:55:41 0.64
9975 2 1.0 19.0 40.733429 -73.984154 1.0 9.0 0.3 0.5 1 ... 1.0 18.0 40.749680 -73.991531 0.00 0.00 10.80 2015-01-13 19:06:56 2015-01-13 18:55:41 1.67
9976 2 1.0 19.0 40.774780 -73.957779 1.0 20.0 0.3 0.5 3 ... 1.0 18.0 40.751801 -74.002327 2.00 0.00 23.80 2015-01-13 19:18:39 2015-01-13 18:55:42 5.28
9977 2 1.0 19.0 40.751698 -73.989746 1.0 8.5 0.3 0.5 2 ... 1.0 18.0 40.768433 -73.986137 0.00 0.00 10.30 2015-01-13 19:06:38 2015-01-13 18:55:42 1.38
9978 2 1.0 19.0 40.752941 -73.977470 1.0 7.5 0.3 0.5 1 ... 1.0 18.0 40.745071 -73.987068 1.00 0.00 10.30 2015-01-13 19:05:34 2015-01-13 18:55:42 0.88
9979 2 1.0 19.0 40.735130 -73.976120 1.0 8.5 0.3 0.5 1 ... 1.0 18.0 40.751259 -73.977814 0.00 0.00 10.30 2015-01-13 19:05:41 2015-01-13 18:55:42 1.58
9980 2 1.0 19.0 40.745541 -73.984383 1.0 8.5 0.3 0.5 1 ... 1.0 18.0 40.731110 -74.001350 0.00 0.00 10.30 2015-01-13 19:05:32 2015-01-13 18:55:42 1.58
9981 2 1.0 19.0 40.793671 -73.974327 1.0 5.0 0.3 0.5 2 ... 1.0 18.0 40.791222 -73.965118 0.00 0.00 6.80 2015-01-13 19:00:05 2015-01-13 18:55:42 0.63
9982 2 1.0 19.0 40.754639 -73.986343 1.0 11.0 0.3 0.5 1 ... 1.0 18.0 40.764175 -73.968994 1.00 0.00 13.80 2015-01-13 19:11:57 2015-01-13 18:55:43 1.63
9983 2 1.0 18.0 40.723721 -73.989494 1.0 4.5 0.3 0.5 1 ... 1.0 18.0 40.714985 -73.992409 2.00 0.00 8.30 2015-01-13 18:59:19 2015-01-13 18:55:43 0.70
9984 2 1.0 19.0 40.774590 -73.963249 1.0 5.5 0.3 0.5 5 ... 1.0 18.0 40.764881 -73.968529 1.30 0.00 8.60 2015-01-13 19:01:19 2015-01-13 18:55:44 0.94
9985 2 1.0 19.0 40.774872 -73.982613 1.0 7.0 0.3 0.5 1 ... 1.0 18.0 40.762344 -73.985695 1.60 0.00 10.40 2015-01-13 19:03:54 2015-01-13 18:55:44 1.04
9986 2 1.0 19.0 40.787998 -73.953888 1.0 5.0 0.3 0.5 2 ... 1.0 18.0 40.779526 -73.957619 1.20 0.00 8.00 2015-01-13 19:00:06 2015-01-13 18:55:44 0.74
9987 2 1.0 19.0 40.790218 -73.975128 1.0 11.5 0.3 0.5 1 ... 1.0 18.0 40.762226 -73.985916 2.50 0.00 15.80 2015-01-13 19:10:46 2015-01-13 18:55:44 2.19
9988 2 1.0 19.0 40.739487 -73.989059 1.0 9.5 0.3 0.5 1 ... 1.0 18.0 40.725056 -73.984329 2.10 0.00 13.40 2015-01-13 19:08:40 2015-01-13 18:55:44 1.48
9989 2 1.0 19.0 40.780548 -73.959030 1.0 8.5 0.3 0.5 1 ... 1.0 18.0 40.778542 -73.981949 1.00 0.00 11.30 2015-01-13 19:04:44 2015-01-13 18:55:45 1.83
9990 2 1.0 19.0 40.761524 -73.960602 1.0 15.0 0.3 0.5 1 ... 1.0 18.0 40.746319 -74.001114 0.00 0.00 16.80 2015-01-13 19:14:59 2015-01-13 18:55:45 3.27
9991 2 1.0 19.0 40.720646 -73.989716 1.0 8.0 0.3 0.5 1 ... 1.0 18.0 40.738167 -73.987434 1.00 0.00 10.80 2015-01-13 19:04:58 2015-01-13 18:55:45 1.56
9992 2 1.0 19.0 40.795898 -73.972610 1.0 20.5 0.3 0.5 1 ... 1.0 18.0 40.740582 -73.989738 4.30 0.00 26.60 2015-01-13 19:18:18 2015-01-13 18:55:45 5.40
9993 2 1.0 18.0 40.769939 -73.981316 1.0 4.5 0.3 0.5 1 ... 1.0 18.0 40.772015 -73.979416 1.10 0.00 7.40 2015-01-13 18:59:40 2015-01-13 18:55:45 0.34
9994 2 4.0 18.0 40.773521 -73.955353 1.0 31.0 0.3 0.5 1 ... 4.0 18.0 40.713215 -74.013542 5.00 0.00 37.80 2015-01-23 18:59:52 2015-01-23 18:22:55 9.05
9995 2 4.0 18.0 40.774670 -73.947845 1.0 11.5 0.3 0.5 1 ... 4.0 18.0 40.773186 -73.978043 0.00 0.00 13.30 2015-01-23 18:37:44 2015-01-23 18:22:55 2.32
9996 2 4.0 18.0 40.758148 -73.985626 1.0 8.5 0.3 0.5 2 ... 4.0 18.0 40.752003 -73.973198 0.00 0.00 10.30 2015-01-23 18:34:48 2015-01-23 18:22:56 0.92
9997 2 4.0 18.0 40.768131 -73.964516 1.0 10.5 0.3 0.5 1 ... 4.0 18.0 40.740456 -73.986252 2.46 0.00 14.76 2015-01-23 18:33:58 2015-01-23 18:22:56 2.36
9998 2 4.0 18.0 40.759171 -73.975189 1.0 6.5 0.3 0.5 3 ... 4.0 18.0 40.770500 -73.981323 2.08 0.00 10.38 2015-01-23 18:29:22 2015-01-23 18:22:56 1.05
9999 2 4.0 18.0 40.752113 -73.975189 1.0 5.0 0.3 0.5 1 ... 4.0 18.0 40.761505 -73.968452 0.00 0.00 6.80 2015-01-23 18:27:58 2015-01-23 18:22:57 0.75

10000 rows × 21 columns

Tutorial

If you want to learn more on vaex, take a look at the tutorials to see what is possible.