DataStream.filter
This will filter the DataStream to contain only rows that match a certain predicate specified in SQL syntax. You can write any SQL clause you would generally put in a WHERE statement containing arbitrary conjunctions and disjunctions. The columns in your statement must be in the schema of this DataStream!
Since a DataStream is implemented as a stream of batches, you might be tempted to think of a filtered DataStream
as a stream of batches where each batch directly results from a filter being applied to a batch in the source DataStream.
While this certainly may be the case, filters are aggressively optimized by Quokka and is most likely pushed all the way down
to the input readers. As a result, you typically should not see a filter node in a Quokka execution plan shown by explain()
.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
predicate |
Expression
|
an Expression. |
required |
Return
A DataStream consisting of rows from the source DataStream that match the predicate.
Examples:
>>> f = qc.read_csv("lineitem.csv")
Filter for all the rows where l_orderkey smaller than 10 and l_partkey greater than 5
>>> f = f.filter((f["l_orderkey"] < 10) & (f["l_partkey"] > 5"))
Nested conditions are supported.
>>> f = f.filter(f["l_orderkey"] < 10 & (f["l_partkey"] > 5 or f["l_partkey"] < 1))
You can do some really complicated stuff! For details on the .str and .dt namespaces see the API reference. Quokka strives to support all the functionality of Polars, so if you see something you need that is not supported, please file an issue on Github.
>>> f = f.filter((f["l_shipdate"].str.strptime().dt.offset_by(1, "M").dt.week() == 3) & (f["l_orderkey"] < 1000))
This will fail! Assuming c_custkey is not in f.schema
>>> f = f.filter(f["c_custkey"] > 10)
Source code in pyquokka/datastream.py
278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 |
|