Version 1.8.0

@HyukjinKwon HyukjinKwon released this 03 May 00:39
· 19 commits to master since this release

Koalas 1.8.0 is the last minor release because Koalas will be officially included in PySpark in the upcoming Apache Spark 3.2. In Apache Spark 3.2+, please use Apache Spark directly.

Categorical type and ExtensionDtype

We added support for pandas' categorical type (#2064, #2106):

>>> s = ks.Series(list("abbccc"), dtype="category")
>>> s
0    a
1    b
2    b
3    c
4    c
5    c
dtype: category
Categories (3, object): ['a', 'b', 'c']
>>> s.cat.categories
Index(['a', 'b', 'c'], dtype='object')
>>> s.cat.codes
0    0
1    1
2    1
3    2
4    2
5    2
dtype: int8
>>> idx = ks.CategoricalIndex(list("abbccc"))
>>> idx
CategoricalIndex(['a', 'b', 'b', 'c', 'c', 'c'],
                 categories=['a', 'b', 'c'], ordered=False, dtype='category')

>>> idx.codes
Int64Index([0, 1, 1, 2, 2, 2], dtype='int64')
>>> idx.categories
Index(['a', 'b', 'c'], dtype='object')

We also added support for ExtensionDtype as a type argument to annotate return types (#2120, #2123, #2132, #2127, #2126, #2125, #2124):

import pandas as pd
import databricks.koalas as ks

def func() -> ks.Series[pd.Int32Dtype()]:
    ...
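
For readers unfamiliar with extension dtypes, the following is a minimal sketch in plain pandas (not Koalas) of what `pd.Int32Dtype()` represents: a nullable integer type that keeps missing values without upcasting to float. The variable names are illustrative only, and the sketch does not show how Koalas itself uses the annotation.

```python
import pandas as pd

# Nullable integer series: the missing value is stored as <NA> and the
# dtype stays Int32 instead of being upcast to float64.
nullable = pd.Series([1, None, 3], dtype=pd.Int32Dtype())

# The same data with the default NumPy-backed dtype is upcast to float64.
default = pd.Series([1, None, 3])

print(nullable.dtype)
print(default.dtype)
```

Annotating a return type with an extension dtype presumably lets the declared schema keep the nullable integer type rather than falling back to a float column.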

Other new features, improvements and bug fixes

We added the following new features:

DataFrame:

Series:

DatetimeIndex:

  • indexer_between_time (#2104)
  • indexer_at_time (#2109)
  • between_time (#2111)
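
As a rough illustration of the two indexer methods listed above, here is a sketch using the pandas methods of the same names. It assumes the Koalas versions follow the pandas semantics; the return container may differ in Koalas, and the timestamps are made up for the example.

```python
import pandas as pd

# Four timestamps on the same day (illustrative data).
idx = pd.DatetimeIndex(
    [
        "2021-05-03 00:00",
        "2021-05-03 09:30",
        "2021-05-03 12:00",
        "2021-05-03 18:45",
    ]
)

# Integer positions of entries whose time of day lies between 09:00 and 13:00.
between = idx.indexer_between_time("09:00", "13:00")

# Integer positions of entries whose time of day is exactly 12:00.
at = idx.indexer_at_time("12:00")

print(list(between))  # [1, 2]
print(list(at))       # [2]
```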

Along with the following fixes:

  • Support tuple to (DataFrame|Series).replace() (#2095)
  • Check index_dtype and data_dtypes more strictly. (#2100)
  • Return actual values via toPandas. (#2077)
  • Add lines and orient to read_json and to_json to improve error message (#2110)
  • Fix isin to accept numpy array (#2103)
  • Allow multi-index column names for inferring return type schema with names. (#2117)
  • Add a short JDBC user guide (#2148)
  • Remove upper bound pandas 1.2 (#2141)
  • Standardize exceptions of arithmetic operations on Datetime-like data (#2101)