DataStream.distinct
Return a new DataStream containing only the specified columns, with duplicate rows removed. This is like SELECT DISTINCT(KEYS) FROM ... in SQL.
Note that all other columns are dropped, since their behavior is unspecified. To deduplicate entire rows, use this operator with keys set to all the columns.
This could also be accomplished with groupby().agg(), but distinct is generally faster because, unlike a groupby, it is nonblocking. Quokka favors nonblocking operations because it can pipeline them with other operators.
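To illustrate what nonblocking means here, the following is a minimal pure-Python sketch, not Quokka's actual implementation. The function name `streaming_distinct`, the row-dictionary batch format, and the sample data are all assumptions made for illustration. The point is that each incoming batch can be filtered against a set of already-seen keys and emitted immediately, without waiting for the rest of the input.

```python
from typing import Dict, Iterable, Iterator, List

Row = Dict[str, object]


def streaming_distinct(batches: Iterable[List[Row]], keys: List[str]) -> Iterator[List[Row]]:
    """Hypothetical helper: emit unseen key combinations one batch at a time."""
    seen = set()  # key tuples already emitted downstream
    for batch in batches:
        fresh = []
        for row in batch:
            key = tuple(row[k] for k in keys)
            if key not in seen:
                seen.add(key)
                # Keep only the key columns; every other column is dropped.
                fresh.append(dict(zip(keys, key)))
        if fresh:
            # Output is produced before later batches have been read.
            yield fresh


# Made-up sample data shaped like a few lineitem rows.
batches = [
    [
        {"l_orderkey": 1, "l_orderdate": "1996-01-02", "l_comment": "a"},
        {"l_orderkey": 1, "l_orderdate": "1996-01-02", "l_comment": "b"},
    ],
    [
        {"l_orderkey": 2, "l_orderdate": "1996-12-01", "l_comment": "c"},
        {"l_orderkey": 1, "l_orderdate": "1996-01-02", "l_comment": "d"},
    ],
]

result = list(streaming_distinct(batches, ["l_orderdate", "l_orderkey"]))
```

A groupby-based aggregation, by contrast, generally cannot emit a final value for a group until it has consumed all of its input, which is why it cannot be pipelined in the same way.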
Parameters:
Name | Type | Description | Default |
---|---|---|---|
keys | list | A list of columns to select distinct on. | required |
Returns:
A transformed DataStream whose columns are in keys and whose rows are unique.
Examples:
>>> f = qc.read_csv("lineitem.csv")
Select only the l_orderdate and l_orderkey columns, return only unique rows.
>>> f = f.distinct(["l_orderdate", "l_orderkey"])
This will now fail, since l_comment is no longer in f's schema.
>>> f = f.select(["l_comment"])
Source code in pyquokka/datastream.py