DataStream.with_columns

This creates new columns from existing columns in the DataStream. It is similar to Polars `with_columns`, Spark `withColumns`, etc. As usual, this function is not in-place; it returns a new DataStream with the new columns added.

This is a separate API from `transform` because the semantics allow for projection and predicate pushdown through this node, since the original columns are all preserved. Use this instead of `transform` if possible.

The arguments are a bit different from Polars `with_columns`. You need to specify a dictionary where each key is a new column name and each value is either a Quokka Expression or a Python function (a lambda or a regular function). I think this is better than the Polars way, since it removes the possibility of column name collisions.

The preferred way is to supply Quokka Expressions for anything the Expression syntax supports. If a function is supplied instead, it must take a single input, which is a Polars DataFrame. Please look at the examples.

A DataStream is implemented as a stream of batches. At runtime, your function will be applied to each of those batches. The function must take a Polars DataFrame as input. This is a different mental model from Pandas or Polars `df.apply`, where the function is written for each row. The function's output must be a Polars Series (or a DataFrame with a single column)! Of course, you can call Polars or Pandas apply inside of this function if you have to do things row by row.
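For instance, a minimal batch UDF might look like this (a sketch mirroring the examples below; it consumes a whole Polars DataFrame batch and returns a boolean Polars Series):

>>> def high(batch):
...     # operates on the entire batch at once, not row by row
...     return (batch["o_orderpriority"] == "1-URGENT") | (batch["o_orderpriority"] == "2-HIGH")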

You can use whatever libraries you have installed in your Python environment inside this function. If you are using this on a cloud cluster, you have to make sure the necessary libraries are installed on each machine. You can use the `utils` package in pyquokka to help you do this, in particular `manager.install_python_package(cluster, package_name)`.
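A usage sketch (this assumes you manage the cluster with a `QuokkaClusterManager` from `pyquokka.utils`; adapt to however you created your cluster):

>>> from pyquokka.utils import QuokkaClusterManager
>>> manager = QuokkaClusterManager()
>>> manager.install_python_package(cluster, "numpy")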

Importantly, your function can take full advantage of Polars' columnar APIs to make use of SIMD and other forms of speedy goodness. You can even use Polars LazyFrame abstractions inside of this function. Of course, for ultimate flexibility, you are more than welcome to convert the Polars DataFrame to a Pandas DataFrame and use `df.apply`. Just remember to convert the result back to a Polars DataFrame with only the result column at the end!
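For example, a row-by-row UDF via Pandas could look like this (a sketch; `priority_len` is an illustrative name, and the final `from_pandas` call converts the single result column back to Polars):

>>> def priority_len(batch):
...     pdf = batch.to_pandas()                       # Polars DataFrame -> Pandas DataFrame
...     result = pdf["o_orderpriority"].apply(len)    # row-by-row apply in Pandas
...     return polars.from_pandas(result.to_frame("priority_len"))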

Parameters:

| Name | Type | Description | Default |
| ---- | ---- | ----------- | ------- |
| `new_columns` | dict | A dictionary of column names to UDFs or Expressions. The UDFs must take as input a Polars DataFrame and output a Polars Series (or a DataFrame with a single column). The UDFs must not have expectations on the length of their input. | *required* |
| `required_columns` | list or set | The names of the columns that are required for *all* your functions. If this is not specified, Quokka assumes all the columns are required, and early projection past this node becomes impossible. If you specify this and get it wrong, you will get an error at runtime. This is only needed for UDFs; if you use only Quokka Expressions, Quokka will automatically figure out the required columns. | `set()` |
| `foldable` | bool | Whether or not the function can be executed as part of the batch post-processing of the previous operation in the execution graph. Correctly setting this flag requires some insight into how Quokka works. | `True` |
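If you suspect a heavyweight UDF should run as its own stage rather than be folded into the previous operation, you can disable folding. A sketch (hedged: `my_heavy_udf` is a hypothetical function, and whether disabling folding actually helps depends on your pipeline):

>>> d = d.with_columns({"result": my_heavy_udf}, required_columns = {"o_orderpriority"}, foldable = False)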
Return

A new DataStream with the new columns created by the user-defined functions or Expressions.

Examples:

>>> d = qc.read_csv("lineitem.csv")

You can use Polars APIs inside of custom lambda functions:

>>> d = d.with_columns({"high": lambda x:(x["o_orderpriority"] == "1-URGENT") | (x["o_orderpriority"] == "2-HIGH"), "low": lambda x:(x["o_orderpriority"] == "5-LOW") | (x["o_orderpriority"] == "4-NOT SPECIFIED")}, required_columns = {"o_orderpriority"})

You can also use Quokka Expressions. You don't need to specify required columns if you use only Quokka Expressions:

>>> d = d.with_columns({"high": (d["o_orderpriority"] == "1-URGENT") | (d["o_orderpriority"] == "2-HIGH"), "low": (d["o_orderpriority"] == "5-LOW") | (d["o_orderpriority"] == "4-NOT SPECIFIED")})

Or mix the two. You then have to specify required columns again. It is the set of columns required for *all* your functions.

>>> d = d.with_columns({"high": lambda x: (x["o_orderpriority"] == "1-URGENT") | (x["o_orderpriority"] == "2-HIGH"), "low": (d["o_orderpriority"] == "5-LOW") | (d["o_orderpriority"] == "4-NOT SPECIFIED")}, required_columns = {"o_orderpriority"})
Source code in pyquokka/datastream.py
def with_columns(self, new_columns: dict, required_columns=set(), foldable=True):

    """

    This creates new columns from existing columns in the DataStream. It is similar to Polars `with_columns`, Spark `withColumns`, etc. As usual,
    this function is not in-place; it returns a new DataStream with the new columns added.

    This is a separate API from `transform` because the semantics allow for projection and predicate pushdown through this node, 
    since the original columns are all preserved. Use this instead of `transform` if possible.

    The arguments are a bit different from Polars `with_columns`. You need to specify a dictionary where each key is a new column name and each value is either a Quokka
    Expression or a Python function (a lambda or a regular function). I think this is better than the Polars way, since it removes the possibility of
    column name collisions.

    The preferred way is to supply Quokka Expressions for anything the Expression syntax supports.
    If a function is supplied instead, it must take a single input, which is a Polars DataFrame. Please look at the examples.

    A DataStream is implemented as a stream of batches. At runtime, your function will be applied to each of those batches. The function must
    take a Polars DataFrame as input. This is a different mental model from Pandas or Polars `df.apply`,
    where the function is written for each row. **The function's output must be a Polars Series (or a DataFrame with a single column)!**
    Of course, you can call Polars or Pandas apply inside of this function if you have to do things row by row.

    You can use whatever libraries you have installed in your Python environment in this function. If you are using this on a
    cloud cluster, you have to make sure the necessary libraries are installed on each machine. You can use the `utils` package in pyquokka to help
    you do this, in particular you can use `manager.install_python_package(cluster, package_name)`.

    Importantly, your function can take full advantage of Polars' columnar APIs to make use of SIMD and other forms of speedy goodness. 
    You can even use Polars LazyFrame abstractions inside of this function. Of course, for ultimate flexibility, you are more than welcome to convert 
    the Polars DataFrame to a Pandas DataFrame and use `df.apply`. Just remember to convert it back to a Polars DataFrame with only the result column in the end!

    Args:
        new_columns (dict): A dictionary of column names to UDFs or Expressions. The UDFs must take as input a Polars DataFrame and output a Polars Series
            (or a DataFrame with a single column). The UDFs must not have expectations on the length of their input.
        required_columns (list or set): The names of the columns that are required for all your functions. If this is not specified then Quokka assumes
            all the columns are required for your function, and early projection past this node becomes impossible. If you specify this and get it wrong,
            you will get an error at runtime. This is only needed for UDFs. If you use Quokka Expressions, then Quokka will automatically figure out the required columns.
        foldable (bool): Whether or not the function can be executed as part of the batch post-processing of the previous operation in the
            execution graph. This is set to True by default. Correctly setting this flag requires some insight into how Quokka works.

    Return:
        A new DataStream with the new columns created by the user-defined functions or Expressions.

    Examples:

        >>> d = qc.read_csv("lineitem.csv")

        You can use Polars APIs inside of custom lambda functions: 

        >>> d = d.with_columns({"high": lambda x:(x["o_orderpriority"] == "1-URGENT") | (x["o_orderpriority"] == "2-HIGH"), "low": lambda x:(x["o_orderpriority"] == "5-LOW") | (x["o_orderpriority"] == "4-NOT SPECIFIED")}, required_columns = {"o_orderpriority"})

        You can also use Quokka Expressions. You don't need to specify required columns if you use only Quokka Expressions:

        >>> d = d.with_columns({"high": (d["o_orderpriority"] == "1-URGENT") | (d["o_orderpriority"] == "2-HIGH"), "low": (d["o_orderpriority"] == "5-LOW") | (d["o_orderpriority"] == "4-NOT SPECIFIED")})

        Or mix the two. You then have to specify required columns again. It is the set of columns required for *all* your functions.

        >>> d = d.with_columns({"high": lambda x: (x["o_orderpriority"] == "1-URGENT") | (x["o_orderpriority"] == "2-HIGH"), "low": (d["o_orderpriority"] == "5-LOW") | (d["o_orderpriority"] == "4-NOT SPECIFIED")}, required_columns = {"o_orderpriority"})
    """

    assert type(required_columns) == set

    # fix the new column ordering
    new_column_names = list(new_columns.keys())

    sql_statement = "select *"

    for new_column in new_columns:
        assert new_column not in self.schema, "For now new columns cannot have same names as existing columns"
        transform = new_columns[new_column]
        assert type(transform) == type(lambda x:1) or type(transform) == Expression, "Transform must be a function or a Quokka Expression"
        if type(transform) == Expression:
            required_columns = required_columns.union(transform.required_columns())
            sql_statement += ", " + transform.sql() + " as " + new_column 

        else:
            # detected an arbitrary function. If required columns are not specified, assume all columns are required
            if len(required_columns) == 0:
                required_columns = set(self.schema)

    def polars_func(batch):

        con = duckdb.connect().execute('PRAGMA threads=%d' % 8)
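        # Expression columns are evaluated in a single DuckDB SQL pass over the
        # batch's Arrow representation.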
        if sql_statement != "select *": # if there are any columns to add
            batch_arrow = batch.to_arrow()
            batch = polars.from_arrow(con.execute(sql_statement + " from batch_arrow").arrow())

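        # UDF columns are appended as Series by applying each Python function to
        # the whole batch.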
        batch = batch.with_columns([
            polars.Series(name=column_name, values=new_columns[column_name](batch))
            for column_name in new_column_names
            if type(new_columns[column_name]) == type(lambda x: 1)
        ])
        return batch

    return self.quokka_context.new_stream(
        sources={0: self},
        partitioners={0: PassThroughPartitioner()},
        node=MapNode(
            schema=self.schema + new_column_names,
            schema_mapping={
                **{new_column: {-1: new_column} for new_column in new_column_names}, **{col: {0: col} for col in self.schema}},
            required_columns={0: required_columns},
            function=polars_func,
            foldable=foldable),
        schema=self.schema + new_column_names,
        sorted = self.sorted
        )