Version 1.8.0
Koalas 1.8.0 is the last minor release because Koalas will be officially included in PySpark in the upcoming Apache Spark 3.2. In Apache Spark 3.2+, please use Apache Spark directly.
Categorical type and ExtensionDtype
We added support for pandas' categorical type (#2064, #2106):
>>> s = ks.Series(list("abbccc"), dtype="category")
>>> s
0    a
1    b
2    b
3    c
4    c
5    c
dtype: category
Categories (3, object): ['a', 'b', 'c']

>>> s.cat.categories
Index(['a', 'b', 'c'], dtype='object')

>>> s.cat.codes
0    0
1    1
2    1
3    2
4    2
5    2
dtype: int8

>>> idx = ks.CategoricalIndex(list("abbccc"))
>>> idx
CategoricalIndex(['a', 'b', 'b', 'c', 'c', 'c'],
                 categories=['a', 'b', 'c'], ordered=False, dtype='category')

>>> idx.codes
Int64Index([0, 1, 1, 2, 2, 2], dtype='int64')

>>> idx.categories
Index(['a', 'b', 'c'], dtype='object')
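Categorical data created in pandas can also be brought into Koalas. The snippet below is a minimal sketch, assuming the categorical dtype is preserved by ks.from_pandas; the variable names and data are made up for illustration:

import pandas as pd
import databricks.koalas as ks

# Build a categorical Series in pandas first.
pser = pd.Series(pd.Categorical(list("abbccc"), categories=["a", "b", "c"]))

# Convert to Koalas; the categorical dtype is expected to carry over.
kser = ks.from_pandas(pser)
print(kser.dtype)           # category
print(kser.cat.categories)  # Index(['a', 'b', 'c'], dtype='object')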
We also added support for ExtensionDtype as a type argument for annotating return types (#2120, #2123, #2132, #2127, #2126, #2125, #2124):
def func() -> ks.Series[pd.Int32Dtype()]:
    ...
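Such annotations are used when Koalas infers the schema of a user-defined function's output, for example in DataFrame.apply. The sketch below is illustrative only and assumes DataFrame.apply honors the annotated nullable integer dtype; the function name and data are made up:

import pandas as pd
import databricks.koalas as ks

kdf = ks.DataFrame({"a": [1.0, 2.0, None], "b": [4.0, None, 6.0]})

# The annotated return type tells Koalas to produce columns with pandas'
# nullable Int32 extension dtype instead of inferring the schema from data.
def to_nullable_int(col) -> ks.Series[pd.Int32Dtype()]:
    return col.astype("Int32")

kdf.apply(to_nullable_int)

Annotating the return type spares Koalas from computing a sample of the output to infer the schema and pins the exact extension dtype of the result.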
Other new features, improvements and bug fixes
We added the following new features:
DataFrame:
Series:
DatetimeIndex:
Along with the following fixes:
- Support passing a tuple to (DataFrame|Series).replace() (#2095; see the example after this list)
- Check index_dtype and data_dtypes more strictly (#2100)
- Return actual values via toPandas (#2077)
- Add lines and orient to read_json and to_json to improve error messages (#2110)
- Fix isin to accept a numpy array (#2103; also shown in the example below)
- Allow multi-index column names for inferring return type schema with names (#2117)
- Add a short JDBC user guide (#2148)
- Remove the upper bound of pandas 1.2 (#2141)
- Standardize exceptions of arithmetic operations on Datetime-like data (#2101)
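For instance, the replace() fix (#2095) and the isin() fix (#2103) above enable the following; a small sketch with made-up data, showing the expected results in comments:

import numpy as np
import databricks.koalas as ks

kser = ks.Series([1, 2, 3, 4])

# A tuple of values to replace is now handled like a list (#2095).
print(kser.replace((1, 2), 100).to_list())    # [100, 100, 3, 4]

# isin() now accepts a numpy array in addition to a list or set (#2103).
print(kser.isin(np.array([2, 4])).to_list())  # [False, True, False, True]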