DataStream.filter

This will filter the DataStream to contain only rows that match a certain predicate specified in SQL syntax. You can write any SQL clause you would generally put in a WHERE statement containing arbitrary conjunctions and disjunctions. The columns in your statement must be in the schema of this DataStream!

Since a DataStream is implemented as a stream of batches, you might be tempted to think of a filtered DataStream as a stream of batches where each batch directly results from a filter being applied to a batch in the source DataStream. While this certainly may be the case, filters are aggressively optimized by Quokka and is most likely pushed all the way down to the input readers. As a result, you typically should not see a filter node in a Quokka execution plan shown by explain().

Parameters:

Name	Type	Description	Default
`predicate`	`Expression`	an Expression.	required

Return

A DataStream consisting of rows from the source DataStream that match the predicate.

Examples:

>>> f = qc.read_csv("lineitem.csv")

Filter for all the rows where l_orderkey smaller than 10 and l_partkey greater than 5

>>> f = f.filter((f["l_orderkey"] < 10) & (f["l_partkey"] > 5"))

Nested conditions are supported.

>>> f = f.filter(f["l_orderkey"] < 10 & (f["l_partkey"] > 5 or f["l_partkey"] < 1))

You can do some really complicated stuff! For details on the .str and .dt namespaces see the API reference. Quokka strives to support all the functionality of Polars, so if you see something you need that is not supported, please file an issue on Github.

>>> f = f.filter((f["l_shipdate"].str.strptime().dt.offset_by(1, "M").dt.week() == 3) & (f["l_orderkey"] < 1000))

This will fail! Assuming c_custkey is not in f.schema

>>> f = f.filter(f["c_custkey"] > 10)

Source code in pyquokka/datastream.py

def filter(self, predicate: Expression):

    """
    This will filter the DataStream to contain only rows that match a certain predicate specified in SQL syntax. 
    You can write any SQL clause you would generally put in a WHERE statement containing arbitrary conjunctions and 
    disjunctions. The columns in your statement must be in the schema of this DataStream! 

    Since a DataStream is implemented as a stream of batches, you might be tempted to think of a filtered DataStream
    as a stream of batches where each batch directly results from a filter being applied to a batch in the source DataStream. 
    While this certainly may be the case, filters are aggressively optimized by Quokka and is most likely pushed all the way down
    to the input readers. As a result, you typically should not see a filter node in a Quokka execution plan shown by `explain()`. 

    Args:
        predicate (Expression): an Expression.

    Return:
        A DataStream consisting of rows from the source DataStream that match the predicate.

    Examples:

        >>> f = qc.read_csv("lineitem.csv")

        Filter for all the rows where l_orderkey smaller than 10 and l_partkey greater than 5

        >>> f = f.filter((f["l_orderkey"] < 10) & (f["l_partkey"] > 5")) 

        Nested conditions are supported.

        >>> f = f.filter(f["l_orderkey"] < 10 & (f["l_partkey"] > 5 or f["l_partkey"] < 1)) 

        You can do some really complicated stuff! For details on the .str and .dt namespaces see the API reference.
        Quokka strives to support all the functionality of Polars, so if you see something you need that is not supported, please
        file an issue on Github.

        >>> f = f.filter((f["l_shipdate"].str.strptime().dt.offset_by(1, "M").dt.week() == 3) & (f["l_orderkey"] < 1000))

        This will fail! Assuming c_custkey is not in f.schema

        >>> f = f.filter(f["c_custkey"] > 10)
    """

    assert type(predicate) == Expression, "Must supply an Expression."
    return self.filter_sql(predicate.sql())