QuokkaContext.set_config

Sets a config value for the entire cluster. In general, call this at the very start of your program.

The following keys are supported:

  1. optimize_joins: bool, whether to optimize joins based on cardinality estimates. Defaults to True.

  2. s3_csv_materialize_threshold: int, the threshold in bytes for when to materialize a CSV file in S3

  3. disk_csv_materialize_threshold: int, the threshold in bytes for when to materialize a CSV file on disk

  4. s3_parquet_materialize_threshold: int, the threshold in bytes for when to materialize a Parquet file in S3

  5. disk_parquet_materialize_threshold: int, the threshold in bytes for when to materialize a Parquet file on disk

  6. hbq_path: str, the disk spill directory. Defaults to "/data".

  7. fault_tolerance: bool, whether to enable fault tolerance. Defaults to False.
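Because set_config accepts any value, a wrong type would not be caught at the call site. The following is a minimal sketch of a pre-check built from the key list above. The names EXPECTED_TYPES, check_config_value, and apply_settings are hypothetical helpers for illustration and are not part of pyquokka; the only call made on the context is set_config(key, value).

```python
# Hypothetical helpers, not part of pyquokka. The expected types are taken
# from the list of supported keys in the set_config documentation.
EXPECTED_TYPES = {
    "optimize_joins": bool,
    "s3_csv_materialize_threshold": int,
    "disk_csv_materialize_threshold": int,
    "s3_parquet_materialize_threshold": int,
    "disk_parquet_materialize_threshold": int,
    "hbq_path": str,
    "fault_tolerance": bool,
}


def check_config_value(key, value):
    """Raise KeyError for an undocumented key, TypeError for a wrong type."""
    if key not in EXPECTED_TYPES:
        raise KeyError(f"unsupported config key: {key!r}")
    expected = EXPECTED_TYPES[key]
    # bool is a subclass of int in Python, so reject True/False for int keys.
    if expected is int and isinstance(value, bool):
        raise TypeError(f"{key} expects int, got bool")
    if not isinstance(value, expected):
        raise TypeError(
            f"{key} expects {expected.__name__}, got {type(value).__name__}"
        )


def apply_settings(qc, settings):
    """Validate every entry first, then call qc.set_config for each one."""
    for key, value in settings.items():
        check_config_value(key, value)
    for key, value in settings.items():
        qc.set_config(key, value)
```

Validating the whole dictionary before the first set_config call means a bad entry does not leave the cluster with only some of the settings applied. With a real context, usage would look like apply_settings(qc, {"optimize_joins": False, "hbq_path": "/tmp/spill"}), where the path is only an example value.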

Parameters:

Name     Type    Description         Default
key      str     the key to set      required
value    any     the value to set    required

Returns:

Type Description

None

Examples:

>>> from pyquokka.df import *
>>> qc = QuokkaContext()

Turn on join order optimization.

>>> qc.set_config("optimize_joins", True)

Turn off fault tolerance.

>>> qc.set_config("fault_tolerance", False)
Source code in pyquokka/df.py
def set_config(self, key, value):

    """
    Sets a config value for the entire cluster. In general, call this at the very start of your program.

    The following keys are supported:

    1. optimize_joins: bool, whether to optimize joins based on cardinality estimates. Defaults to True.

    2. s3_csv_materialize_threshold: int, the threshold in bytes for when to materialize a CSV file in S3

    3. disk_csv_materialize_threshold: int, the threshold in bytes for when to materialize a CSV file on disk

    4. s3_parquet_materialize_threshold: int, the threshold in bytes for when to materialize a Parquet file in S3

    5. disk_parquet_materialize_threshold: int, the threshold in bytes for when to materialize a Parquet file on disk

    6. hbq_path: str, the disk spill directory. Defaults to "/data".

    7. fault_tolerance: bool, whether to enable fault tolerance. Defaults to False.

    Args:
        key (str): the key to set
        value (any): the value to set

    Returns:
        None

    Examples:

        >>> from pyquokka.df import *
        >>> qc = QuokkaContext()

        Turn on join order optimization.

        >>> qc.set_config("optimize_joins", True)

        Turn off fault tolerance. 

        >>> qc.set_config("fault_tolerance", False)

    """

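    if key in self.sql_config:
        # Key lives in sql_config: update it on the context only.
        self.sql_config[key] = value
    elif key in self.exec_config:
        # Key lives in exec_config: update it locally, then push it to every
        # task manager and block until all of them acknowledge.
        self.exec_config[key] = value
        assert all(ray.get([task_manager.set_config.remote(key, value) for task_manager in self.task_managers.values()]))
    else:
        # Unknown keys are rejected rather than silently added.
        raise Exception("key not found in config")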
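The dispatch in the source above can be tried without a cluster. The following is a minimal sketch, not the pyquokka implementation: SketchContext and FakeTaskManager are hypothetical stand-ins, the Ray remote call is replaced by a plain method call, and the split of keys between sql_config and exec_config is an assumption made for illustration. Only the three defaults stated in the documentation are used.

```python
class FakeTaskManager:
    """Hypothetical stand-in for a remote task manager (no Ray involved)."""

    def __init__(self):
        self.received = {}

    def set_config(self, key, value):
        self.received[key] = value
        return True  # acknowledge, like the truthy result the real code asserts on


class SketchContext:
    """Hypothetical stand-in that mirrors the branching of set_config."""

    def __init__(self, n_managers=2):
        # Assumption: which key lives in which dictionary is illustrative only.
        self.sql_config = {"optimize_joins": True}
        self.exec_config = {"hbq_path": "/data", "fault_tolerance": False}
        self.task_managers = {i: FakeTaskManager() for i in range(n_managers)}

    def set_config(self, key, value):
        if key in self.sql_config:
            # Stored on the context only.
            self.sql_config[key] = value
        elif key in self.exec_config:
            # Stored locally, then pushed to every task manager.
            self.exec_config[key] = value
            acks = [tm.set_config(key, value) for tm in self.task_managers.values()]
            if not all(acks):
                raise RuntimeError("a task manager rejected the setting")
        else:
            raise Exception("key not found in config")


qc = SketchContext()
qc.set_config("optimize_joins", False)   # touches sql_config only
qc.set_config("fault_tolerance", True)   # also reaches every task manager
```

The sketch raises an explicit RuntimeError where the source uses assert; that is a deliberate difference, since Python skips assert statements when run with the -O flag.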