DataStream.distinct

Return a new DataStream with specified columns and unique rows. This is like SELECT DISTINCT(KEYS) FROM ... in SQL.

Note all the other columns will be dropped, since their behavior is unspecified. If you want to do deduplication, you can use this operator with keys set to all the columns.

This could be accomplished by using groupby().agg() but using distinct is generally faster because it is nonblocking, compared to a groupby. Quokka really likes nonblocking operations because it can then pipeline it with other operators.

Parameters:

Name	Type	Description	Default
`keys`	`list`	a list of columns to select distinct on.	required

Return

A transformed DataStream whose columns are in keys and whose rows are unique.

Examples:

>>> f = qc.read_csv("lineitem.csv")

Select only the l_orderdate and l_orderkey columns, return only unique rows.

>>> f = f.distinct(["l_orderdate", "l_orderkey"])

This will now fail, since l_comment is no longer in f's schema.

>>> f = f.select(["l_comment"])

Source code in pyquokka/datastream.py

def distinct(self, keys: list):

    """
    Return a new DataStream with specified columns and unique rows. This is like `SELECT DISTINCT(KEYS) FROM ...` in SQL.

    Note all the other columns will be dropped, since their behavior is unspecified. If you want to do deduplication, you can use
    this operator with keys set to all the columns.

    This could be accomplished by using `groupby().agg()` but using `distinct` is generally faster because it is nonblocking, 
    compared to a groupby. Quokka really likes nonblocking operations because it can then pipeline it with other operators.

    Args:
        keys (list): a list of columns to select distinct on.

    Return:
        A transformed DataStream whose columns are in keys and whose rows are unique.

    Examples:

        >>> f = qc.read_csv("lineitem.csv")

        Select only the l_orderdate and l_orderkey columns, return only unique rows.

        >>> f = f.distinct(["l_orderdate", "l_orderkey"])

        This will now fail, since l_comment is no longer in f's schema.

        >>> f = f.select(["l_comment"])
    """

    if type(keys) == str:
        keys = [keys]
    assert type(keys) == list, "keys must be a list of column names"
    assert all([key in self.schema for key in keys]), "keys must be a subset of the columns in the DataStream"

    select_stream = self.select(keys)

    return self.quokka_context.new_stream(
        sources={0: select_stream},
        partitioners={0: HashPartitioner(keys[0])},
        node=StatefulNode(
            schema=keys,
            # this is a stateful node, but predicates and projections can be pushed down.
            schema_mapping={col: {0: col} for col in keys},
            required_columns={0: set(keys)},
            operator=DistinctExecutor(keys)
        ),
        schema=keys,

    )