[python-package] support sub-classing scikit-learn estimators #6783

jameslamb · 2025-01-10T06:39:24Z

I recently saw a Stack Overflow post ("Why can't I wrap LGBM?") expressing the same concerns as #4426: it's difficult to sub-class lightgbm's scikit-learn estimators.

It doesn't have to be! Look how minimal the code is for XGBRFRegressor:

https://github.com/dmlc/xgboost/blob/45009413ce9f0d2bdfcd0c9ea8af1e71e3c0a191/python-package/xgboost/sklearn.py#L1869

This PR proposes borrowing some patterns I learned while working on xgboost's scikit-learn estimators to make it easier to sub-class lightgbm estimators. This also has the nice side effect of simplifying the lightgbm.dask code 😁
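
As a rough illustration of the kind of sub-class this is meant to enable, here is a minimal sketch in plain Python. `BaseEstimatorSketch`, `PlusRegressor`, and `_param_names()` are hypothetical stand-ins (not lightgbm or xgboost code), and walking the class hierarchy to find parameters is an assumption about one workable approach, not a copy of what this PR implements.

```python
import inspect


class BaseEstimatorSketch:
    """Hypothetical stand-in for an estimator with a keyword-only constructor."""

    def __init__(self, *, boosting_type="gbdt", n_estimators=100):
        self.boosting_type = boosting_type
        self.n_estimators = n_estimators

    @classmethod
    def _param_names(cls):
        # Walk the class hierarchy so that parameters declared by parent
        # classes are still found when a sub-class only forwards **kwargs.
        names = set()
        for klass in cls.__mro__:
            init = klass.__dict__.get("__init__")
            if klass is object or init is None:
                continue
            for param in inspect.signature(init).parameters.values():
                if param.kind is inspect.Parameter.KEYWORD_ONLY:
                    names.add(param.name)
        return sorted(names)

    def get_params(self):
        # Report every discovered parameter with its current value.
        return {name: getattr(self, name) for name in self._param_names()}


class PlusRegressor(BaseEstimatorSketch):
    """Hypothetical sub-class: one new parameter, everything else forwarded."""

    def __init__(self, *, extra=1, **kwargs):
        self.extra = extra
        super().__init__(**kwargs)


model = PlusRegressor(extra=3, n_estimators=10)
params = model.get_params()
```

The sub-class never repeats the parent's parameter list, which is the property that keeps wrappers short.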

Notes for Reviewers

Why make the breaking change of requiring keyword args?

As part of this PR, I'm proposing immediately switching the constructors for scikit-learn estimators here (including those in lightgbm.dask) to only supporting keyword arguments.

Why I'm proposing this instead of a deprecation cycle:

- scikit-learn itself does this (HistGradientBoostingClassifier example)
  - so all of its machinery passes parameters around as keyword arguments
  - keyword arguments are recommended throughout https://scikit-learn.org/stable/developers/develop.html
- I strongly suspect that using positional arguments for these constructors is rare
- anyone relying on positional arguments will get a loud and easy-to-diagnose-and-fix error, so the effort to adjust should be minimal

import lightgbm as lgb
lgb.LGBMClassifier("gbdt")
# Traceback (most recent call last):
#   File "<stdin>", line 1, in <module>
# TypeError: LGBMClassifier.__init__() takes 1 positional argument but 2 were given
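
The mechanism behind that error is the bare `*` in the constructor signature. The sketch below reproduces it without lightgbm installed; `KeywordOnlyEstimator` is a hypothetical class and its two parameters are only placeholders. It also shows the one-line fix for a caller that used positional arguments.

```python
class KeywordOnlyEstimator:
    """Hypothetical class; the bare * makes every parameter keyword-only."""

    def __init__(self, *, boosting_type="gbdt", num_leaves=31):
        self.boosting_type = boosting_type
        self.num_leaves = num_leaves


# A positional call fails immediately, before any training happens.
try:
    KeywordOnlyEstimator("gbdt")
except TypeError as err:
    message = str(err)
else:
    message = None

# The fix is to name the argument.
fixed = KeywordOnlyEstimator(boosting_type="gbdt")
```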

I posted a related answer to that Stack Overflow question

https://stackoverflow.com/a/79344862/3986677


jameslamb · 2025-01-13T05:32:33Z

tests/python_package_test/test_dask.py

-    assert dask_spec.args[:-1] == sklearn_spec.args
-    assert dask_spec.defaults[:-1] == sklearn_spec.defaults
-    assert dask_spec.args[-1] == "client"
+    assert dask_spec.kwonlyargs == [*sklearn_spec.kwonlyargs, "client"]


Made these changes based on this test failure:

>   assert dask_spec.kwonlyargs == sklearn_spec.kwonlyargs
E   AssertionError: assert ['client'] == []
E
E   Left contains one more item: 'client'
E   Use -v to get more diff

(build link)

But also... if the changes I'm proposing in dask.py are accepted, we wouldn't even need to have this test any more, in my opinion. It was just here to ensure the 2 lists of keyword args (one in LGBMModel and one in the Dask estimators) were consistent.

I'd like to discuss removing this test as part of the other review conversation on this PR.
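
To show why the assertions had to move from `args` and `defaults` to `kwonlyargs`, here is a small sketch using `inspect.getfullargspec()`. `SklearnLike` and `DaskLike` are hypothetical classes with made-up parameters, assuming both constructors are keyword-only and the Dask one adds a trailing `client` argument.

```python
import inspect


class SklearnLike:
    """Hypothetical estimator with a keyword-only constructor."""

    def __init__(self, *, boosting_type="gbdt", num_leaves=31):
        self.boosting_type = boosting_type
        self.num_leaves = num_leaves


class DaskLike(SklearnLike):
    """Hypothetical Dask variant: same parameters plus a trailing client."""

    def __init__(self, *, boosting_type="gbdt", num_leaves=31, client=None):
        super().__init__(boosting_type=boosting_type, num_leaves=num_leaves)
        self.client = client


sklearn_spec = inspect.getfullargspec(SklearnLike.__init__)
dask_spec = inspect.getfullargspec(DaskLike.__init__)

# With keyword-only constructors, the positional fields carry no information:
# args is just ["self"] and defaults is None, so slicing defaults would fail.
positional_fields = (sklearn_spec.args, sklearn_spec.defaults)

# The parameter names now live in kwonlyargs, in declaration order.
consistent = dask_spec.kwonlyargs == [*sklearn_spec.kwonlyargs, "client"]
```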

jameslamb added 3 commits January 4, 2025 01:59

[python-package] make sub-classing scikit-learn estimators easier

3b5f648

tests passing

02c48c3

add docs

7b720cb

jameslamb added in progress breaking labels Jan 10, 2025

jameslamb added 4 commits January 10, 2025 00:40

Update tests/python_package_test/test_sklearn.py

51b5e64

remove docs links

81178fd

Merge branch 'python/sklearn-subclassing' of github.com:microsoft/LightGBM into python/sklearn-subclassing

110b0e1

Merge branch 'master' into python/sklearn-subclassing

104471a

jameslamb changed the title ~~WIP: [python-package] support sub-classing scikit-learn estimators~~ [python-package] support sub-classing scikit-learn estimators Jan 11, 2025

jameslamb added awaiting review and removed in progress labels Jan 11, 2025

jameslamb marked this pull request as ready for review January 11, 2025 05:06

jameslamb requested review from guolinke, shiyu1994, jmoralez, borchero and StrikerRUS as code owners January 11, 2025 05:06

jameslamb added 2 commits January 12, 2025 23:24

fix Dask tests

d80b0df

Merge branch 'python/sklearn-subclassing' of github.com:microsoft/LightGBM into python/sklearn-subclassing

b7e041a
