
QuokkaContext.read_dataset

Convert a DataSet back to a DataStream to process it.

Parameters:

Name     Type     Description                                                                        Default
dataset  DataSet  The dataset to read from. Note this is a Quokka DataSet, not a Ray Data dataset!  required

Returns:

Type        Description
DataStream  The data stream.

Examples:

>>> ds = qc.read_parquet("s3://my-bucket/my-data.parquet").compute()

ds will be a DataSet, i.e. a collection of Ray ObjectRefs.

>>> ds = qc.read_dataset(ds)

ds will now be a DataStream.

Source code in pyquokka/df.py
def read_dataset(self, dataset):

    """
    Convert a DataSet back to a DataStream to process it.

    Args:
        dataset (DataSet): The dataset to read from. Note this is a Quokka DataSet, not a Ray Data dataset!

    Returns:
        DataStream: The data stream.

    Examples:

        >>> ds = qc.read_parquet("s3://my-bucket/my-data.parquet").compute()

        `ds` will be a DataSet, i.e. a collection of Ray ObjectRefs. 

        >>> ds = qc.read_dataset(ds)

        `ds` will now be a DataStream.
    """

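    # Register the dataset as a new input node, keyed by the current latest_node_id.
    self.nodes[self.latest_node_id] = InputRayDatasetNode(dataset)
    self.latest_node_id += 1
    # The returned DataStream points at the node just registered (the id before the increment).
    return DataStream(self, dataset.schema, self.latest_node_id - 1)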
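
The node bookkeeping in the method body can be tried out without Quokka or Ray using a minimal stand-alone sketch. Every class below (StubDataSet, StubInputNode, StubDataStream, StubContext) is a hypothetical stand-in written for illustration, not Quokka's real implementation; only the register, increment, return pattern is taken from the listing above.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class StubDataSet:
    """Hypothetical stand-in for a Quokka DataSet: a schema plus opaque partitions."""
    schema: List[str]
    partitions: List[Any] = field(default_factory=list)


@dataclass
class StubInputNode:
    """Hypothetical stand-in for InputRayDatasetNode: it only remembers its dataset."""
    dataset: StubDataSet


@dataclass
class StubDataStream:
    """Hypothetical stand-in for DataStream: context, schema, and source node id."""
    context: "StubContext"
    schema: List[str]
    source_node_id: int


class StubContext:
    """Hypothetical stand-in for QuokkaContext, keeping only the node bookkeeping."""

    def __init__(self) -> None:
        self.nodes: Dict[int, StubInputNode] = {}
        self.latest_node_id = 0

    def read_dataset(self, dataset: StubDataSet) -> StubDataStream:
        # Same three steps as the listing above: register, advance, return.
        node_id = self.latest_node_id
        self.nodes[node_id] = StubInputNode(dataset)
        self.latest_node_id += 1
        return StubDataStream(self, dataset.schema, node_id)


# Each call registers one new input node and hands back a stream bound to it.
ctx = StubContext()
first = ctx.read_dataset(StubDataSet(schema=["a", "b"]))
second = ctx.read_dataset(StubDataSet(schema=["c"]))
print(first.source_node_id, second.source_node_id, ctx.latest_node_id)
```

The sketch stores the id in a local variable before incrementing, which is equivalent to the `self.latest_node_id - 1` expression in the original and only makes the intent easier to read. The stream also inherits its schema directly from the dataset, as in the listing.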