Skip to content

DataStream.filter

This will filter the DataStream to contain only rows that match a certain predicate specified in SQL syntax. You can write any SQL clause you would generally put in a WHERE statement containing arbitrary conjunctions and disjunctions. The columns in your statement must be in the schema of this DataStream!

Since a DataStream is implemented as a stream of batches, you might be tempted to think of a filtered DataStream as a stream of batches where each batch directly results from a filter being applied to a batch in the source DataStream. While this certainly may be the case, filters are aggressively optimized by Quokka and is most likely pushed all the way down to the input readers. As a result, you typically should not see a filter node in a Quokka execution plan shown by explain().

Parameters:

Name Type Description Default
predicate Expression

an Expression.

required
Return

A DataStream consisting of rows from the source DataStream that match the predicate.

Examples:

>>> f = qc.read_csv("lineitem.csv")

Filter for all the rows where l_orderkey smaller than 10 and l_partkey greater than 5

>>> f = f.filter((f["l_orderkey"] < 10) & (f["l_partkey"] > 5")) 

Nested conditions are supported.

>>> f = f.filter(f["l_orderkey"] < 10 & (f["l_partkey"] > 5 or f["l_partkey"] < 1)) 

You can do some really complicated stuff! For details on the .str and .dt namespaces see the API reference. Quokka strives to support all the functionality of Polars, so if you see something you need that is not supported, please file an issue on Github.

>>> f = f.filter((f["l_shipdate"].str.strptime().dt.offset_by(1, "M").dt.week() == 3) & (f["l_orderkey"] < 1000))

This will fail! Assuming c_custkey is not in f.schema

>>> f = f.filter(f["c_custkey"] > 10)
Source code in pyquokka/datastream.py
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
def filter(self, predicate: Expression):

    """
    This will filter the DataStream to contain only rows that match a certain predicate specified in SQL syntax. 
    You can write any SQL clause you would generally put in a WHERE statement containing arbitrary conjunctions and 
    disjunctions. The columns in your statement must be in the schema of this DataStream! 

    Since a DataStream is implemented as a stream of batches, you might be tempted to think of a filtered DataStream
    as a stream of batches where each batch directly results from a filter being applied to a batch in the source DataStream. 
    While this certainly may be the case, filters are aggressively optimized by Quokka and is most likely pushed all the way down
    to the input readers. As a result, you typically should not see a filter node in a Quokka execution plan shown by `explain()`. 

    Args:
        predicate (Expression): an Expression.

    Return:
        A DataStream consisting of rows from the source DataStream that match the predicate.

    Examples:

        >>> f = qc.read_csv("lineitem.csv")

        Filter for all the rows where l_orderkey smaller than 10 and l_partkey greater than 5

        >>> f = f.filter((f["l_orderkey"] < 10) & (f["l_partkey"] > 5")) 

        Nested conditions are supported.

        >>> f = f.filter(f["l_orderkey"] < 10 & (f["l_partkey"] > 5 or f["l_partkey"] < 1)) 

        You can do some really complicated stuff! For details on the .str and .dt namespaces see the API reference.
        Quokka strives to support all the functionality of Polars, so if you see something you need that is not supported, please
        file an issue on Github.

        >>> f = f.filter((f["l_shipdate"].str.strptime().dt.offset_by(1, "M").dt.week() == 3) & (f["l_orderkey"] < 1000))

        This will fail! Assuming c_custkey is not in f.schema

        >>> f = f.filter(f["c_custkey"] > 10)
    """

    assert type(predicate) == Expression, "Must supply an Expression."
    return self.filter_sql(predicate.sql())