
[python-package] Failed to train model with dataset built incrementally #6770

Open

RandiHBK opened this issue Jan 1, 2025 · 1 comment
RandiHBK commented Jan 1, 2025

Description

Since my dataset is too large to fit into memory,
I tried to save separate feature datasets to disk using Dataset.save_binary(),
then merge them back together with Dataset.add_features_from().
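For reference, the save/reload half of that workflow looks roughly like this (a minimal sketch; file and variable names are illustrative, not from the repro):

import lightgbm as lgb

# Build and bin one feature set, then persist the constructed Dataset.
feature_set_1 = lgb.Dataset(feature_set_1_table, label_array)
feature_set_1.construct()
feature_set_1.save_binary("feature_set_1_train.bin")

# Later, load the already-binned Dataset straight from the binary file.
feature_set_1_reloaded = lgb.Dataset("feature_set_1_train.bin")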

However, LightGBM raises a warning when building the merged dataset with add_features_from(),
and later throws an error about the validation dataset when training the model.

The error only occurs with our production data.
With artificially generated data the warning still appears, but LightGBM does not throw the error.

After some experimentation,
the issue does not appear to be related to the number of rows or columns.
Could the cause be that LightGBM builds its histogram bins differently when the input data is not normally distributed?
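
For what it's worth, my current understanding (an assumption on my part): LightGBM computes per-feature bin boundaries ("bin mappers") from the data a Dataset is constructed on, so the boundaries depend on the feature value distribution. A synthetic sketch of the pattern that normally keeps bin mappers aligned (names and data are illustrative):

import numpy as np
import lightgbm as lgb

rng = np.random.default_rng(0)
X_train = rng.exponential(size=(1000, 3))  # deliberately skewed features
y_train = rng.integers(0, 2, size=1000)
X_valid = rng.exponential(size=(200, 3))
y_valid = rng.integers(0, 2, size=200)

train_ds = lgb.Dataset(X_train, label=y_train)

# Supported pattern: reference= makes the validation set reuse the
# training set's bin mappers.
valid_ok = lgb.Dataset(X_valid, label=y_valid, reference=train_ds)

# Without reference=, bins are computed from the validation data alone
# and can disagree with the training bins, which is the situation the
# "different bin mappers" error guards against.
valid_standalone = lgb.Dataset(X_valid, label=y_valid)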

Issues that may be related:
#6151 (comment)
#2552

Reproducible example

Sample data and code for reproducing the issue are available here:
https://github.com/RandiHBK/lightgbm_issue_reproduction

Environment info

LightGBM version or commit hash:
LightGBM 4.5.0

Command(s) you used to install LightGBM

mamba env create -f environment.yml

Additional Comments

@jameslamb jameslamb changed the title Failed to train model with dataset built incrementally [python-package] Failed to train model with dataset built incrementally Jan 2, 2025
@jameslamb
Collaborator

Thanks for using LightGBM, and for taking the time to create a reproducible example.

Putting the error message here, so that this can be found easily from search when others have the same issue:

LightGBMError: Cannot add validation data, since it has different bin mappers with training data

And copying your code here into the issue:

import pathlib
import pyarrow.csv
import lightgbm as lgb

csv_dir_path = pathlib.Path("csvs")

label_train_array = pyarrow.csv.read_csv(csv_dir_path / "label_train.csv")["y"]
label_validation_array = pyarrow.csv.read_csv(csv_dir_path / "label_validation.csv")["y"]
feature_set_1_train_table = pyarrow.csv.read_csv(csv_dir_path / "feature_set_1_train.csv")
feature_set_1_validation_table = pyarrow.csv.read_csv(csv_dir_path / "feature_set_1_validation.csv")
feature_set_2_train_table = pyarrow.csv.read_csv(csv_dir_path / "feature_set_2_train.csv")
feature_set_2_validation_table = pyarrow.csv.read_csv(csv_dir_path / "feature_set_2_validation.csv")

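# Per-feature-set Datasets: each validation set references its own training
# set so that the pair shares bin mappers at this stage.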
feature_set_1_train_dataset = lgb.Dataset(
    feature_set_1_train_table, 
    label_train_array,
)
feature_set_1_train_dataset.construct()
feature_set_1_validation_dataset = lgb.Dataset(
    feature_set_1_validation_table, 
    label_validation_array,
    reference=feature_set_1_train_dataset,
)
feature_set_1_validation_dataset.construct()

feature_set_2_train_dataset = lgb.Dataset(
    feature_set_2_train_table, 
    label_train_array,
)
feature_set_2_train_dataset.construct()
feature_set_2_validation_dataset = lgb.Dataset(
    feature_set_2_validation_table, 
    label_validation_array,
    reference=feature_set_2_train_dataset,
)
feature_set_2_validation_dataset.construct()

# UserWarning: Cannot add features from NoneType type of raw data to NoneType type of raw data.
train_dataset = feature_set_1_train_dataset.add_features_from(feature_set_2_train_dataset)

validation_dataset = feature_set_1_validation_dataset.add_features_from(feature_set_2_validation_dataset)
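# Note: add_features_from() mutates and returns self, so train_dataset and
# validation_dataset are the same objects as the feature-set-1 Datasets.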

# LightGBMError: Cannot add validation data, since it has different bin mappers with training data
evals_result = {}
bst = lgb.train(
    {
        "force_col_wise": True,
    },
    train_dataset,
    valid_sets=[validation_dataset],
    valid_names=["validation"],
    callbacks=[
        lgb.early_stopping(stopping_rounds=10),
        lgb.record_evaluation(evals_result),
    ],
)

When someone has time, we'll help explain what's happening here.
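
One detail that may be relevant in the meantime (an assumption on my part, not a confirmed diagnosis): the UserWarning appears because Dataset.construct() frees the raw Python-side data by default (free_raw_data=True), so by the time add_features_from() runs there is no raw data left to merge and only the constructed Datasets are combined. Below is an untested sketch that keeps the raw data available, using only documented API; whether this also resolves the bin-mapper error is not verified:

# Untested sketch, not a confirmed fix: construct every Dataset with
# free_raw_data=False so add_features_from() has raw data to work with.
feature_set_1_train_dataset = lgb.Dataset(
    feature_set_1_train_table,
    label_train_array,
    free_raw_data=False,
)
feature_set_1_train_dataset.construct()
# ... build the other three Datasets the same way ...

train_dataset = feature_set_1_train_dataset.add_features_from(feature_set_2_train_dataset)
validation_dataset = feature_set_1_validation_dataset.add_features_from(feature_set_2_validation_dataset)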
