DataStream.with_columns

This creates new columns from existing columns in the DataStream. It is similar to Polars `with_columns`, Spark `withColumns`, etc. As usual, this function is not in-place; it returns a new DataStream with the new columns added.

This is a separate API from `transform` because the semantics allow for projection and predicate pushdown through this node, since the original columns are all preserved. Use this instead of `transform` if possible.

The arguments are a bit different from Polars `with_columns`. You need to specify a dictionary where each key is a new column name and each value is either a Quokka Expression or a Python function (a lambda or a regular function). I think this is better than the Polars way, since it removes the possibility of column name collisions.

The preferred way is to supply Quokka Expressions for anything the Expression syntax supports. If a function is supplied instead, it must take a single input, which is a Polars DataFrame. Please look at the examples.

A DataStream is implemented as a stream of batches. At runtime, your function will be applied to each of those batches. The function must take a Polars DataFrame as input. This is a different mental model from Pandas or Polars `df.apply`, where the function is written for each row. The function's output must be a Polars Series (or a DataFrame with a single column)! Of course, you can call Polars or Pandas apply inside of this function if you have to do things row by row.
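For instance, a minimal batch UDF might look like this (a sketch mirroring the examples below; it consumes a whole Polars DataFrame batch and returns a boolean Polars Series):

>>> def high(batch):
...     # operates on the entire batch at once, not row by row
...     return (batch["o_orderpriority"] == "1-URGENT") | (batch["o_orderpriority"] == "2-HIGH")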

You can use whatever libraries you have installed in your Python environment inside this function. If you are using this on a cloud cluster, you have to make sure the necessary libraries are installed on each machine. You can use the `utils` package in pyquokka to help you do this, in particular `manager.install_python_package(cluster, package_name)`.
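A usage sketch (this assumes you manage the cluster with a `QuokkaClusterManager` from `pyquokka.utils`; adapt to however you created your cluster):

>>> from pyquokka.utils import QuokkaClusterManager
>>> manager = QuokkaClusterManager()
>>> manager.install_python_package(cluster, "numpy")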

Importantly, your function can take full advantage of Polars' columnar APIs to make use of SIMD and other forms of speedy goodness. You can even use Polars LazyFrame abstractions inside of this function. Of course, for ultimate flexibility, you are more than welcome to convert the Polars DataFrame to a Pandas DataFrame and use `df.apply`. Just remember to convert the result back to a Polars DataFrame with only the result column at the end!
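For example, a row-by-row UDF via Pandas could look like this (a sketch; `priority_len` is an illustrative name, and the final `from_pandas` call converts the single result column back to Polars):

>>> def priority_len(batch):
...     pdf = batch.to_pandas()                       # Polars DataFrame -> Pandas DataFrame
...     result = pdf["o_orderpriority"].apply(len)    # row-by-row apply in Pandas
...     return polars.from_pandas(result.to_frame("priority_len"))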

Parameters:

| Name | Type | Description | Default |
| ---- | ---- | ----------- | ------- |
| `new_columns` | dict | A dictionary of column names to UDFs or Expressions. The UDFs must take as input a Polars DataFrame and output a Polars Series (or a DataFrame with a single column). The UDFs must not have expectations on the length of their input. | *required* |
| `required_columns` | list or set | The names of the columns that are required for *all* your functions. If this is not specified, Quokka assumes all the columns are required, and early projection past this node becomes impossible. If you specify this and get it wrong, you will get an error at runtime. This is only needed for UDFs; if you use only Quokka Expressions, Quokka will automatically figure out the required columns. | `set()` |
| `foldable` | bool | Whether or not the function can be executed as part of the batch post-processing of the previous operation in the execution graph. Correctly setting this flag requires some insight into how Quokka works. | `True` |
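If you suspect a heavyweight UDF should run as its own stage rather than be folded into the previous operation, you can disable folding. A sketch (hedged: `my_heavy_udf` is a hypothetical function, and whether disabling folding actually helps depends on your pipeline):

>>> d = d.with_columns({"result": my_heavy_udf}, required_columns = {"o_orderpriority"}, foldable = False)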
Return

A new DataStream with the new columns created by the user-defined functions or Expressions.

Examples:

>>> d = qc.read_csv("lineitem.csv")

You can use Polars APIs inside of custom lambda functions:

>>> d = d.with_columns({"high": lambda x:(x["o_orderpriority"] == "1-URGENT") | (x["o_orderpriority"] == "2-HIGH"), "low": lambda x:(x["o_orderpriority"] == "5-LOW") | (x["o_orderpriority"] == "4-NOT SPECIFIED")}, required_columns = {"o_orderpriority"})

You can also use Quokka Expressions. You don't need to specify required columns if you use only Quokka Expressions:

>>> d = d.with_columns({"high": (d["o_orderpriority"] == "1-URGENT") | (d["o_orderpriority"] == "2-HIGH"), "low": (d["o_orderpriority"] == "5-LOW") | (d["o_orderpriority"] == "4-NOT SPECIFIED")})

Or mix the two. You then have to specify required columns again. It is the set of columns required for *all* your functions.

>>> d = d.with_columns({"high": lambda x: (x["o_orderpriority"] == "1-URGENT") | (x["o_orderpriority"] == "2-HIGH"), "low": (d["o_orderpriority"] == "5-LOW") | (d["o_orderpriority"] == "4-NOT SPECIFIED")}, required_columns = {"o_orderpriority"})
Source code in pyquokka/datastream.py
def with_columns(self, new_columns: dict, required_columns=set(), foldable=True):

    """

    This creates new columns from existing columns in the DataStream. It is similar to Polars `with_columns`, Spark `withColumns`, etc. As usual,
    this function is not in-place; it returns a new DataStream with the new columns added.

    This is a separate API from `transform` because the semantics allow for projection and predicate pushdown through this node, 
    since the original columns are all preserved. Use this instead of `transform` if possible.

    The arguments are a bit different from Polars `with_columns`. You need to specify a dictionary where each key is a new column name and each value is either a Quokka
    Expression or a Python function (a lambda or a regular function). I think this is better than the Polars way, since it removes the possibility of
    column name collisions.

    The preferred way is to supply Quokka Expressions for anything the Expression syntax supports.
    If a function is supplied instead, it must take a single input, which is a Polars DataFrame. Please look at the examples.

    A DataStream is implemented as a stream of batches. At runtime, your function will be applied to each of those batches. The function must
    take a Polars DataFrame as input. This is a different mental model from Pandas or Polars `df.apply`,
    where the function is written for each row. **The function's output must be a Polars Series (or a DataFrame with a single column)!**
    Of course, you can call Polars or Pandas apply inside of this function if you have to do things row by row.

    You can use whatever libraries you have installed in your Python environment in this function. If you are using this on a
    cloud cluster, you have to make sure the necessary libraries are installed on each machine. You can use the `utils` package in pyquokka to help
    you do this, in particular you can use `manager.install_python_package(cluster, package_name)`.

    Importantly, your function can take full advantage of Polars' columnar APIs to make use of SIMD and other forms of speedy goodness. 
    You can even use Polars LazyFrame abstractions inside of this function. Of course, for ultimate flexibility, you are more than welcome to convert 
    the Polars DataFrame to a Pandas DataFrame and use `df.apply`. Just remember to convert it back to a Polars DataFrame with only the result column in the end!

    Args:
        new_columns (dict): A dictionary of column names to UDFs or Expressions. The UDFs must take as input a Polars DataFrame and output a Polars Series
            (or a DataFrame with a single column). The UDFs must not have expectations on the length of their input.
        required_columns (list or set): The names of the columns that are required for all your functions. If this is not specified then Quokka assumes
            all the columns are required for your function, and early projection past this node becomes impossible. If you specify this and get it wrong,
            you will get an error at runtime. This is only needed for UDFs. If you use Quokka Expressions, then Quokka will automatically figure out the required columns.
        foldable (bool): Whether or not the function can be executed as part of the batch post-processing of the previous operation in the
            execution graph. This is set to True by default. Correctly setting this flag requires some insight into how Quokka works.

    Return:
        A new DataStream with the new columns created by the user-defined functions or Expressions.

    Examples:

        >>> d = qc.read_csv("lineitem.csv")

        You can use Polars APIs inside of custom lambda functions: 

        >>> d = d.with_columns({"high": lambda x:(x["o_orderpriority"] == "1-URGENT") | (x["o_orderpriority"] == "2-HIGH"), "low": lambda x:(x["o_orderpriority"] == "5-LOW") | (x["o_orderpriority"] == "4-NOT SPECIFIED")}, required_columns = {"o_orderpriority"})

        You can also use Quokka Expressions. You don't need to specify required columns if you use only Quokka Expressions:

        >>> d = d.with_columns({"high": (d["o_orderpriority"] == "1-URGENT") | (d["o_orderpriority"] == "2-HIGH"), "low": (d["o_orderpriority"] == "5-LOW") | (d["o_orderpriority"] == "4-NOT SPECIFIED")})

        Or mix the two. You then have to specify required columns again. It is the set of columns required for *all* your functions.

        >>> d = d.with_columns({"high": lambda x: (x["o_orderpriority"] == "1-URGENT") | (x["o_orderpriority"] == "2-HIGH"), "low": (d["o_orderpriority"] == "5-LOW") | (d["o_orderpriority"] == "4-NOT SPECIFIED")}, required_columns = {"o_orderpriority"})
    """

    assert type(required_columns) == set

    # fix the new column ordering
    new_column_names = list(new_columns.keys())

    sql_statement = "select *"

    for new_column in new_columns:
        assert new_column not in self.schema, "For now new columns cannot have same names as existing columns"
        transform = new_columns[new_column]
        assert type(transform) == type(lambda x:1) or type(transform) == Expression, "Transform must be a function or a Quokka Expression"
        if type(transform) == Expression:
            required_columns = required_columns.union(transform.required_columns())
            sql_statement += ", " + transform.sql() + " as " + new_column 

        else:
            # detected an arbitrary function. If required columns are not specified, assume all columns are required
            if len(required_columns) == 0:
                required_columns = set(self.schema)

    def polars_func(batch):

        con = duckdb.connect().execute('PRAGMA threads=%d' % 8)
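        # Expression columns are evaluated in a single DuckDB SQL pass over the
        # batch's Arrow representation.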
        if sql_statement != "select *": # if there are any columns to add
            batch_arrow = batch.to_arrow()
            batch = polars.from_arrow(con.execute(sql_statement + " from batch_arrow").arrow())

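        # UDF columns are appended as Series by applying each Python function to
        # the whole batch.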
        batch = batch.with_columns([
            polars.Series(name=column_name, values=new_columns[column_name](batch))
            for column_name in new_column_names
            if type(new_columns[column_name]) == type(lambda x: 1)
        ])
        return batch

    return self.quokka_context.new_stream(
        sources={0: self},
        partitioners={0: PassThroughPartitioner()},
        node=MapNode(
            schema=self.schema + new_column_names,
            schema_mapping={
                **{new_column: {-1: new_column} for new_column in new_column_names}, **{col: {0: col} for col in self.schema}},
            required_columns={0: required_columns},
            function=polars_func,
            foldable=foldable),
        schema=self.schema + new_column_names,
        sorted = self.sorted
        )