DataStream.drop

Think of this as the inverse of select: instead of naming the columns to keep, you name the columns to drop. Quokka implements it by selecting the columns in the DataStream's schema that are not being dropped.
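The "select the complement" idea can be sketched in plain Python, independent of Quokka. The helper name `columns_after_drop` and the sample schema below are hypothetical and chosen only for illustration; they are not part of the Quokka API.

```python
def columns_after_drop(schema, cols_to_drop):
    """Return the columns of `schema` that survive dropping `cols_to_drop`.

    Hypothetical helper: schema order is preserved, and names in
    `cols_to_drop` that are not in `schema` have no effect.
    """
    return [col for col in schema if col not in cols_to_drop]


# Illustrative schema using a few lineitem-style column names.
schema = ["l_orderkey", "l_orderdate", "l_quantity", "l_extendedprice"]
kept = columns_after_drop(schema, ["l_orderdate", "l_orderkey"])
print(kept)  # ['l_quantity', 'l_extendedprice']
```

The resulting list is what would then be passed to select, so drop needs no separate execution path for the non-materialized case.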

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| cols_to_drop | list | a list of columns to drop from the source DataStream | required |

Returns:

A DataStream consisting of all columns in the source DataStream that are not in cols_to_drop.

Examples:

>>> f = qc.read_csv("lineitem.csv")

Drop the l_orderdate and l_orderkey columns

>>> f = f.drop(["l_orderdate", "l_orderkey"])

This will now fail, since you dropped l_orderdate

>>> f = f.select(["l_orderdate"])
Source code in pyquokka/datastream.py
def drop(self, cols_to_drop: list):

    """
    Think of this as the anti-operator to select. Instead of selecting columns, this will drop columns. 
    This is implemented in Quokka as selecting the columns in the DataStream's schema that are not dropped.

    Args:
        cols_to_drop (list): a list of columns to drop from the source DataStream

    Return:
        A DataStream consisting of all columns in the source DataStream that are not in `cols_to_drop`.

    Examples:
        >>> f = qc.read_csv("lineitem.csv")

        Drop the l_orderdate and l_orderkey columns

        >>> f = f.drop(["l_orderdate", "l_orderkey"])

        This will now fail, since you dropped l_orderdate

        >>> f = f.select(["l_orderdate"])
    """
    assert type(cols_to_drop) == list
    actual_cols_to_drop = []
    for col in cols_to_drop:
        if col in self.schema:
            actual_cols_to_drop.append(col)
        if self.sorted is not None:
            assert col not in self.sorted, "cannot drop a sort key!"
    if len(actual_cols_to_drop) == 0:
        return self
    else:
        if self.materialized:
            df = self._get_materialized_df().drop(actual_cols_to_drop)
            return self.quokka_context.from_polars(df)
        else:
            return self.select([col for col in self.schema if col not in cols_to_drop])
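Reading the source above, three behaviours stand out that the docstring does not spell out: names that are not in the schema are ignored, a call that drops nothing returns the same object, and dropping a sort key trips an assertion. The toy class below is a minimal sketch that mirrors only that control flow so it can be run without Quokka. `ToyStream` is hypothetical, the materialized branch is omitted, representing `sorted` as a list of column names is an assumption, and raising `KeyError` from `select` is likewise an assumption about how a missing column might be reported.

```python
class ToyStream:
    """Hypothetical stand-in for a DataStream: tracks only a schema and sort keys."""

    def __init__(self, schema, sorted_keys=None):
        self.schema = list(schema)
        # Assumption: None when unsorted, otherwise a collection of column names.
        self.sorted = sorted_keys

    def select(self, columns):
        # Assumption: the toy reports unknown columns with KeyError.
        missing = [c for c in columns if c not in self.schema]
        if missing:
            raise KeyError(f"columns not in schema: {missing}")
        return ToyStream(columns, self.sorted)

    def drop(self, cols_to_drop):
        assert isinstance(cols_to_drop, list)
        present = []
        for col in cols_to_drop:
            if col in self.schema:
                present.append(col)
            if self.sorted is not None:
                assert col not in self.sorted, "cannot drop a sort key!"
        if not present:
            # Nothing to drop: hand back the same object unchanged.
            return self
        return self.select([c for c in self.schema if c not in present])


stream = ToyStream(["l_orderkey", "l_orderdate", "l_quantity"],
                   sorted_keys=["l_orderkey"])

same = stream.drop(["no_such_column"])   # unknown name is ignored
smaller = stream.drop(["l_orderdate"])   # schema shrinks by one column
print(same is stream, smaller.schema)
```

In the toy, `stream.drop(["l_orderkey"])` raises `AssertionError` because `l_orderkey` is a sort key. Since the check in the source is also a plain `assert`, it would be skipped if Python runs with `-O`.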