DataStream.with_columns
Creates new columns from existing columns in the DataStream. This is similar to Polars with_columns, Spark withColumns, etc. As usual, this function is not in-place: it returns a new DataStream with the new columns added.
This is a separate API from transform because its semantics allow projection and predicate pushdown through this node, since the original columns are all preserved. Use this instead of transform if possible.
The arguments are a bit different from Polars with_columns. You need to specify a dictionary where each key is a new column name and each value is either a Quokka Expression or a Python function (lambda function or regular function). I think this is better than the Polars way, and it removes the possibility of column names colliding.
The preferred way is to supply Quokka Expressions for things that the Expression syntax supports. In case a function is supplied, it must assume a single input, which is a Polars DataFrame. Please look at the examples.
A DataStream is implemented as a stream of batches. At runtime, your function is applied to each of those batches. The function must take a Polars DataFrame as input, and its output must be a Polars Series (or a DataFrame with one column)! This is a different mental model from Pandas or Polars df.apply, where the function is written for each row. Of course you can call Polars or Pandas apply inside of this function if you have to do things row by row.
You can use whatever libraries you have installed in your Python environment in this function. If you are using this on a cloud cluster, you have to make sure the necessary libraries are installed on each machine. You can use the utils package in pyquokka to help you do this; in particular, you can use manager.install_python_package(cluster, package_name).
Importantly, your function can take full advantage of Polars' columnar APIs to make use of SIMD and other forms of speedy goodness. You can even use Polars LazyFrame abstractions inside of this function. Of course, for ultimate flexibility, you are more than welcome to convert the Polars DataFrame to a Pandas DataFrame and use df.apply. Just remember to convert it back to a Polars DataFrame with only the result column in the end!
Parameters:
Name | Type | Description | Default |
---|---|---|---|
new_columns | dict | A dictionary of column names to UDFs or Expressions. The UDFs must take as input a Polars DataFrame and output a Polars DataFrame. The UDFs must not have expectations on the length of their input. | required |
required_columns | list or set | The names of the columns that are required for all your functions. If this is not specified, then Quokka assumes all the columns are required for your functions, and early projection past this function becomes impossible. If you specify this and you got it wrong, you will get an error at runtime. This is only required for UDFs. If you use Quokka Expressions, then Quokka will automatically figure out the required columns. | set() |
foldable | bool | Whether or not the function can be executed as part of the batch post-processing of the previous operation in the execution graph. This is set to True by default. Correctly setting this flag requires some insight into how Quokka works. Lightweight | True |
Return
A new DataStream with new columns made by the user defined functions.
Examples:
>>> d = qc.read_csv("lineitem.csv")
You can use Polars APIs inside of custom lambda functions:
>>> d = d.with_columns({
...     "high": lambda x: (x["o_orderpriority"] == "1-URGENT") | (x["o_orderpriority"] == "2-HIGH"),
...     "low": lambda x: (x["o_orderpriority"] == "5-LOW") | (x["o_orderpriority"] == "4-NOT SPECIFIED"),
... }, required_columns={"o_orderpriority"})
You can also use Quokka Expressions. You don't need to specify required columns if you use only Quokka Expressions:
>>> d = d.with_columns({
...     "high": (d["o_orderpriority"] == "1-URGENT") | (d["o_orderpriority"] == "2-HIGH"),
...     "low": (d["o_orderpriority"] == "5-LOW") | (d["o_orderpriority"] == "4-NOT SPECIFIED"),
... })
Or mix the two. You then have to specify required columns again. It is the set of columns required for all your functions:
>>> d = d.with_columns({
...     "high": lambda x: (x["o_orderpriority"] == "1-URGENT") | (x["o_orderpriority"] == "2-HIGH"),
...     "low": (d["o_orderpriority"] == "5-LOW") | (d["o_orderpriority"] == "4-NOT SPECIFIED"),
... }, required_columns={"o_orderpriority"})
Source code in pyquokka/datastream.py