
DataStream.distinct

Return a new DataStream that contains only the specified columns and whose rows are unique. This is like SELECT DISTINCT(KEYS) FROM ... in SQL.

Note that all other columns are dropped, since their values would be unspecified after deduplication. To deduplicate entire rows, use this operator with keys set to all the columns.
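
The semantics described above can be modeled in plain Python. This is only an illustrative sketch, not Quokka's implementation: `distinct_rows` is a hypothetical helper, rows are modeled as dictionaries, and the sample values are made up.

```python
def distinct_rows(rows, keys):
    """Project each row onto `keys` and keep the first occurrence of each combination."""
    seen = set()
    result = []
    for row in rows:
        combo = tuple(row[k] for k in keys)
        if combo not in seen:
            seen.add(combo)
            # Only the key columns survive; every other column is dropped.
            result.append(dict(zip(keys, combo)))
    return result


# Made-up sample rows using column names from the example on this page.
rows = [
    {"l_orderkey": 1, "l_orderdate": "1996-01-02", "l_comment": "a"},
    {"l_orderkey": 1, "l_orderdate": "1996-01-02", "l_comment": "b"},
    {"l_orderkey": 2, "l_orderdate": "1996-12-01", "l_comment": "a"},
]

# Distinct on two columns: l_comment disappears from the result.
by_key = distinct_rows(rows, ["l_orderdate", "l_orderkey"])

# Full-row deduplication: pass every column as a key.
dedup = distinct_rows(rows + rows[:1], list(rows[0].keys()))
```

In this model, `by_key` has two rows and no `l_comment` column, while `dedup` removes only the exact duplicate that was appended.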

The same result can be obtained with groupby().agg(), but distinct is generally faster because it is nonblocking, whereas a groupby is blocking. Quokka favors nonblocking operators because it can pipeline them with other operators.
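
To illustrate the difference between nonblocking and blocking, here is a minimal plain-Python sketch. It does not use Quokka and makes no claim about how DistinctExecutor works internally; `streaming_distinct`, `blocking_groupby`, and `source` are hypothetical names, and the batches are made-up data.

```python
def streaming_distinct(batches, keys):
    """Yield previously unseen key combinations as soon as each batch arrives."""
    seen = set()  # the only state carried between batches
    for batch in batches:
        fresh = []
        for row in batch:
            combo = tuple(row[k] for k in keys)
            if combo not in seen:
                seen.add(combo)
                fresh.append(dict(zip(keys, combo)))
        if fresh:
            yield fresh


def blocking_groupby(batches, keys):
    """Collect every batch into groups before returning anything."""
    groups = {}
    for batch in batches:
        for row in batch:
            groups.setdefault(tuple(row[k] for k in keys), []).append(row)
    return [dict(zip(keys, combo)) for combo in groups]


data = [
    [{"k": 1, "v": "a"}, {"k": 2, "v": "b"}],
    [{"k": 1, "v": "c"}, {"k": 3, "v": "d"}],
]
batches_read = []


def source():
    """Hypothetical batch source that records which batches were pulled."""
    for i, batch in enumerate(data):
        batches_read.append(i)
        yield batch


stream = streaming_distinct(source(), ["k"])
first_output = next(stream)                   # produced from the first batch alone
read_before_first_output = len(batches_read)  # how much input was needed so far
remaining_output = list(stream)
grouped = blocking_groupby(source(), ["k"])   # must read all batches before returning
```

The streaming version emits its first output after reading a single batch, so a downstream operator could start working immediately. The grouping version has nothing to hand over until the input is exhausted.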

Parameters:

Name  Type  Description                               Default
keys  list  a list of columns to select distinct on.  required

Return:

A transformed DataStream whose columns are those in keys and whose rows are unique.

Examples:

>>> f = qc.read_csv("lineitem.csv")

Select only the l_orderdate and l_orderkey columns and return only unique rows.

>>> f = f.distinct(["l_orderdate", "l_orderkey"])

This will now fail, since l_comment is no longer in f's schema.

>>> f = f.select(["l_comment"])
Source code in pyquokka/datastream.py
def distinct(self, keys: list):

    """
    Return a new DataStream that contains only the specified columns and whose rows are unique. This is like
    `SELECT DISTINCT(KEYS) FROM ...` in SQL.

    Note that all other columns are dropped, since their values would be unspecified after deduplication. To deduplicate
    entire rows, use this operator with keys set to all the columns.

    The same result can be obtained with `groupby().agg()`, but `distinct` is generally faster because it is nonblocking,
    whereas a groupby is blocking. Quokka favors nonblocking operators because it can pipeline them with other operators.

    Args:
        keys (list): a list of columns to select distinct on. A single column name given as a str is wrapped in a list.

    Return:
        A transformed DataStream whose columns are in keys and whose rows are unique.

    Examples:

        >>> f = qc.read_csv("lineitem.csv")

        Select only the l_orderdate and l_orderkey columns and return only unique rows.

        >>> f = f.distinct(["l_orderdate", "l_orderkey"])

        This will now fail, since l_comment is no longer in f's schema.

        >>> f = f.select(["l_comment"])
    """

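    # Accept a single column name for convenience and normalize it to a list.
    if isinstance(keys, str):
        keys = [keys]
    assert isinstance(keys, list), "keys must be a list of column names"
    # Every requested key must already exist in this DataStream's schema.
    assert all(key in self.schema for key in keys), "keys must be a subset of the columns in the DataStream"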

    select_stream = self.select(keys)

    return self.quokka_context.new_stream(
        sources={0: select_stream},
        partitioners={0: HashPartitioner(keys[0])},
        node=StatefulNode(
            schema=keys,
            # this is a stateful node, but predicates and projections can be pushed down.
            schema_mapping={col: {0: col} for col in keys},
            required_columns={0: set(keys)},
            operator=DistinctExecutor(keys)
        ),
        schema=keys,
    )
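
The listing above partitions the input on `keys[0]` only. The following plain-Python sketch suggests why that is still enough for a correct result: rows that agree on every key also agree on the first key, so they land in the same partition and can be deduplicated locally. Python's built-in `hash` stands in for the partitioner here as an assumption; it is not Quokka's HashPartitioner, and `partition_by_first_key` and `local_distinct` are hypothetical helpers operating on made-up rows.

```python
def partition_by_first_key(rows, keys, num_partitions):
    """Route each row to a partition using only the first key column."""
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        # Built-in hash() is a stand-in for a real partitioning function.
        partitions[hash(row[keys[0]]) % num_partitions].append(row)
    return partitions


def local_distinct(rows, keys):
    """Return the set of distinct key combinations within one partition."""
    return {tuple(row[k] for k in keys) for row in rows}


keys = ["l_orderdate", "l_orderkey"]
rows = [
    {"l_orderdate": "1996-01-02", "l_orderkey": 1},
    {"l_orderdate": "1996-01-02", "l_orderkey": 1},
    {"l_orderdate": "1996-01-02", "l_orderkey": 7},
    {"l_orderdate": "1996-12-01", "l_orderkey": 2},
    {"l_orderdate": "1996-12-01", "l_orderkey": 2},
]

partitions = partition_by_first_key(rows, keys, num_partitions=4)
per_partition = [local_distinct(part, keys) for part in partitions]

# No combination appears in two partitions, so the union of local results
# equals the distinct set computed over all rows at once.
combined = set().union(*per_partition)
expected = local_distinct(rows, keys)
```

One trade-off of hashing on a single column, at least in this simplified model, is that a first key with few distinct values spreads the work over few partitions.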