
[python-package] Add feature_names_in_ attribute for scikit-learn estimators (fixes #6279) #6310

Merged

Commits (21; diff shown from 14 commits)
8209ffa expose feature_name_ via sklearn consistent attribute feature_names_in_ (nicklamiller, Feb 12, 2024)
52835d8 fix docstring (nicklamiller, Feb 13, 2024)
adc7683 raise error if estimator not fitted (nicklamiller, Feb 13, 2024)
08e67aa ensure exact feature match for feature_names_in_ attribute (nicklamiller, Mar 17, 2024)
0ecc337 add test for numpy input (nicklamiller, Mar 28, 2024)
c110c9d add test for pandas input with feature names (nicklamiller, Mar 28, 2024)
a8a5631 add documentation for when input data has no feature names (nicklamiller, Mar 28, 2024)
4e1f1dc pre-commit fixes (nicklamiller, Mar 28, 2024)
b826426 feature_names_in_ returns a 1D numpy array (nicklamiller, May 31, 2024)
fd1ce7c test LGBMModel, LGBMClassifier, LGBMRegressor, LGBMRanker (nicklamiller, May 31, 2024)
edd951a rearrange feature name property docstrings (nicklamiller, May 31, 2024)
25888c6 add get_feature_names_out method (nicklamiller, Jun 1, 2024)
574d9ce format reference to .feature_name_ with ticks (nicklamiller, Jun 1, 2024)
e55474f Merge branch 'master' into add-sklearn-feature-attributes (nicklamiller, Jun 6, 2024)
8ac21d3 remove get_feature_names_out method, tidy up tests (nicklamiller, Jun 11, 2024)
318c3a4 Merge branch 'master' into add-sklearn-feature-attributes (nicklamiller, Jun 13, 2024)
be2bed0 Merge branch 'master' into add-sklearn-feature-attributes (nicklamiller, Jun 14, 2024)
d34d48f Merge branch 'master' into add-sklearn-feature-attributes (nicklamiller, Jun 21, 2024)
346fb78 Merge branch 'master' into add-sklearn-feature-attributes (nicklamiller, Jun 21, 2024)
11c8334 Merge branch 'master' into add-sklearn-feature-attributes (nicklamiller, Jun 22, 2024)
a8ddc66 Merge branch 'master' into add-sklearn-feature-attributes (jameslamb, Jul 3, 2024)
20 changes: 19 additions & 1 deletion python-package/lightgbm/sklearn.py
@@ -1043,6 +1043,12 @@ def predict(
**predict_params,
)

def get_feature_names_out(self) -> np.ndarray:
""":obj:`array` of shape = [n_features]: Get output features of fitted model."""
if not self.__sklearn_is_fitted__():
raise LGBMNotFittedError("Output features cannot be determined. Need to call fit beforehand.")
return self.feature_names_in_

predict.__doc__ = _lgbmmodel_doc_predict.format(
description="Return the predicted value for each sample.",
X_shape="numpy array, pandas DataFrame, H2O DataTable's Frame , scipy.sparse, list of lists of int or float of shape = [n_samples, n_features]",
@@ -1144,11 +1150,23 @@ def feature_importances_(self) -> np.ndarray:

@property
def feature_name_(self) -> List[str]:
""":obj:`list` of shape = [n_features]: The names of features."""
""":obj:`list` of shape = [n_features]: The names of features.

.. note::

If input does not contain feature names, they will be added during fitting in the format ``Column_0``, ``Column_1``, ..., ``Column_N``.
"""
if not self.__sklearn_is_fitted__():
raise LGBMNotFittedError("No feature_name found. Need to call fit beforehand.")
return self._Booster.feature_name() # type: ignore[union-attr]

@property
def feature_names_in_(self) -> np.ndarray:
""":obj:`array` of shape = [n_features]: scikit-learn compatible version of ``.feature_name_``."""
if not self.__sklearn_is_fitted__():
raise LGBMNotFittedError("No feature_names_in_ found. Need to call fit beforehand.")
return np.array(self.feature_name_)


class LGBMRegressor(_LGBMRegressorBase, LGBMModel):
"""LightGBM regressor."""
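To illustrate the pattern the sklearn.py diff above adds, here is a minimal, self-contained sketch. The class and error names (`MiniModel`, `NotFittedError`) are stand-ins, not the real `LGBMModel` or `LGBMNotFittedError`: `feature_name_` is LightGBM's native list of names (auto-filled as `Column_0`, `Column_1`, ... when the input has none), and `feature_names_in_` exposes the same names as a 1D numpy array, matching scikit-learn's convention.

```python
import numpy as np


class NotFittedError(Exception):
    """Stand-in for lightgbm's LGBMNotFittedError in this sketch."""


class MiniModel:
    """Sketch of the property pattern this PR adds (not the real LGBMModel)."""

    def __init__(self):
        self._fitted = False
        self._names = None

    def fit(self, X, feature_names=None):
        X = np.asarray(X)
        if feature_names is None:
            # Mimics LightGBM filling in Column_0 ... Column_N for unnamed input
            feature_names = [f"Column_{i}" for i in range(X.shape[1])]
        self._names = list(feature_names)
        self._fitted = True
        return self

    @property
    def feature_name_(self):
        # list of str, raises if the model has not been fit
        if not self._fitted:
            raise NotFittedError("No feature_name found. Need to call fit beforehand.")
        return self._names

    @property
    def feature_names_in_(self):
        # scikit-learn compatible view: same names, as a 1D numpy array
        if not self._fitted:
            raise NotFittedError("No feature_names_in_ found. Need to call fit beforehand.")
        return np.array(self.feature_name_)
```

The real implementation delegates `feature_name_` to the underlying `Booster`, but the fitted-check plus `np.array(...)` wrapper is the same shape as the diff above.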
84 changes: 84 additions & 0 deletions tests/python_package_test/test_sklearn.py
@@ -1290,6 +1290,90 @@ def test_max_depth_warning_is_never_raised(capsys, estimator_class, max_depth):
assert "Provided parameters constrain tree depth" not in capsys.readouterr().out


def test_getting_feature_names_in_np_input():
Collaborator comment:

There is quite a lot of repetition in these tests. Generally I'm supportive of repetition in tests, in favor of making it easier to diagnose issues or selectively include / exclude tests... but for the cases of .feature_names_in_ and .get_feature_names_out(), I think the tests should be combined.

So can you please reduce these 4 tests down to 2? One for numpy input without feature names, one for pandas input with feature names?

Ending with assertions like this:

expected_col_names = np.array([f"Column_{i}" for i in range(X.shape[1])])
np.testing.assert_array_equal(model.feature_names_in_, expected_col_names)
np.testing.assert_array_equal(model.get_feature_names_out(), expected_col_names)

# input is a numpy array, which doesn't have feature names. LightGBM adds
# feature names to the fitted model, which is inconsistent with sklearn's behavior
X, y = load_digits(n_class=2, return_X_y=True)
est = lgb.LGBMModel(n_estimators=5, objective="binary")
clf = lgb.LGBMClassifier(n_estimators=5)
reg = lgb.LGBMRegressor(n_estimators=5)
rnk = lgb.LGBMRanker(n_estimators=5)
models = (est, clf, reg, rnk)
group = np.full(shape=(X.shape[0] // 2,), fill_value=2) # Just an example group

for model in models:
with pytest.raises(lgb.compat.LGBMNotFittedError):
check_is_fitted(model)
if isinstance(model, lgb.LGBMRanker):
model.fit(X, y, group=group)
else:
model.fit(X, y)
np.testing.assert_array_equal(model.feature_names_in_, np.array([f"Column_{i}" for i in range(X.shape[1])]))


def test_getting_feature_names_in_pd_input():
# as_frame=True means input has column names and these should propagate to fitted model
X, y = load_digits(n_class=2, return_X_y=True, as_frame=True)
Collaborator comment:

Instead of using a code comment, could you please test for this directly? That'd ensure that if load_digits() behavior around feature names ever changes, this test will fail and alert us instead of silently passing or maybe failing in some other hard-to-understand way.

Suggested change:

-    # as_frame=True means input has column names and these should propagate to fitted model
-    X, y = load_digits(n_class=2, return_X_y=True, as_frame=True)
+    X, y = load_digits(n_class=2, return_X_y=True, as_frame=True)
+    col_names = X.columns
+    assert isinstance(col_names, list) and all(isinstance(c, str) for c in col_names), "input data must have feature names for this test to cover the expected functionality"

est = lgb.LGBMModel(n_estimators=5, objective="binary")
Collaborator comment:

Can you please extend these tests to cover all 4 estimators (LGBMModel, LGBMClassifier, LGBMRegressor, LGBMRanker)? I know that those last 3 inherit from LGBMModel, but if someone were to make a change in how this attribute works for, say, LGBMClassifier only that breaks this behavior, we'd want a failing test to alert us to that.

Follow the same pattern used in the existing test right above these, test_check_is_fitted(), using the same data for all of the estimators.

clf = lgb.LGBMClassifier(n_estimators=5)
reg = lgb.LGBMRegressor(n_estimators=5)
rnk = lgb.LGBMRanker(n_estimators=5)
models = (est, clf, reg, rnk)
group = np.full(shape=(X.shape[0] // 2,), fill_value=2) # Just an example group

for model in models:
with pytest.raises(lgb.compat.LGBMNotFittedError):
check_is_fitted(model)
if isinstance(model, lgb.LGBMRanker):
model.fit(X, y, group=group)
else:
model.fit(X, y)
np.testing.assert_array_equal(est.feature_names_in_, X.columns)
Collaborator comment:

Suggested change:

-    np.testing.assert_array_equal(est.feature_names_in_, X.columns)
+    np.testing.assert_array_equal(model.feature_names_in_, X.columns)

Instead of doing this for loop approach, could you please change these tests to parameterize over classes, like this?

@pytest.mark.parametrize("estimator_class", [lgb.LGBMModel, lgb.LGBMClassifier, lgb.LGBMRegressor, lgb.LGBMRanker])

That'd reduce the risk of mistakes like this one (where only the LGBMModel instance, est, is being tested).



def test_get_feature_names_out_np_input():
# input is a numpy array, which doesn't have feature names. LightGBM adds
# feature names to the fitted model, which is inconsistent with sklearn's behavior
X, y = load_digits(n_class=2, return_X_y=True)
est = lgb.LGBMModel(n_estimators=5, objective="binary")
clf = lgb.LGBMClassifier(n_estimators=5)
reg = lgb.LGBMRegressor(n_estimators=5)
rnk = lgb.LGBMRanker(n_estimators=5)
Collaborator comment:

Since the thing being tested in this PR isn't really dependent on the content of the learned model, could you please use n_estimators=2 and num_leaves=7 in all the tests? That'd make the tests slightly faster and cheaper without reducing their effectiveness in detecting issues.

models = (est, clf, reg, rnk)
group = np.full(shape=(X.shape[0] // 2,), fill_value=2) # Just an example group
Collaborator comment:

For simplicity, please just treat all samples in X as part of a single query group. LightGBM supports that, and it won't materially change the effectiveness of these tests.

Suggested change:

-    group = np.full(shape=(X.shape[0] // 2,), fill_value=2)  # Just an example group
+    group = [X.shape[0]]
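As a sketch of the `group` semantics behind this suggestion: in LightGBM's ranker API, `group` lists query-group sizes, and those sizes must sum to the number of rows in X. The `n_samples = 360` constant below is an assumption for illustration (the row count of `load_digits(n_class=2)`), not something computed here.

```python
import numpy as np

# `group` lists query-group sizes; they must sum to the number of rows in X.
n_samples = 360  # illustrative stand-in for X.shape[0]

# The test's original choice: 180 groups of 2 samples each
group_many = np.full(shape=(n_samples // 2,), fill_value=2)
assert group_many.sum() == n_samples

# The reviewer's simpler suggestion: one group containing every sample
group_one = [n_samples]
assert sum(group_one) == n_samples
```

Either form is a valid `group` argument for `LGBMRanker.fit`; the single-group version is just less code and equally effective for a test that only inspects feature names.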


for model in models:
with pytest.raises(lgb.compat.LGBMNotFittedError):
check_is_fitted(model)
if isinstance(model, lgb.LGBMRanker):
model.fit(X, y, group=group)
else:
model.fit(X, y)
np.testing.assert_array_equal(
model.get_feature_names_out(), np.array([f"Column_{i}" for i in range(X.shape[1])])
)


def test_get_feature_names_out_pd_input():
# as_frame=True means input has column names and these should propagate to fitted model
X, y = load_digits(n_class=2, return_X_y=True, as_frame=True)
est = lgb.LGBMModel(n_estimators=5, objective="binary")
clf = lgb.LGBMClassifier(n_estimators=5)
reg = lgb.LGBMRegressor(n_estimators=5)
rnk = lgb.LGBMRanker(n_estimators=5)
models = (est, clf, reg, rnk)
group = np.full(shape=(X.shape[0] // 2,), fill_value=2) # Just an example group

for model in models:
with pytest.raises(lgb.compat.LGBMNotFittedError):
check_is_fitted(model)
if isinstance(model, lgb.LGBMRanker):
model.fit(X, y, group=group)
else:
model.fit(X, y)
np.testing.assert_array_equal(model.get_feature_names_out(), X.columns)


@parametrize_with_checks([lgb.LGBMClassifier(), lgb.LGBMRegressor()])
def test_sklearn_integration(estimator, check):
estimator.set_params(min_child_samples=1, min_data_in_bin=1)