
[python-package] Failed to train model with dataset built incrementally #6770

Open

RandiHBK opened this issue Jan 1, 2025 · 1 comment
RandiHBK commented Jan 1, 2025

Description

Since my dataset is too large to fit into memory,
I tried to save separate feature datasets to disk using Dataset.save_binary(),
then merge them back together with Dataset.add_features_from().
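For reference, the save/reload half of that workflow looks roughly like this (a minimal sketch; file and variable names are illustrative, not from the repro):

import lightgbm as lgb

# Build and bin one feature set, then persist the constructed Dataset.
feature_set_1 = lgb.Dataset(feature_set_1_table, label_array)
feature_set_1.construct()
feature_set_1.save_binary("feature_set_1_train.bin")

# Later, load the already-binned Dataset straight from the binary file.
feature_set_1_reloaded = lgb.Dataset("feature_set_1_train.bin")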

However, LightGBM raises a warning when building the merged dataset with add_features_from(),
and later throws an error about the validation dataset when training the model.

The error only occurs with our production data.
With artificially generated data the warning still appears, but LightGBM does not throw the error.

After some experimentation,
the issue does not appear to be related to the number of rows or columns.
Could the cause be that LightGBM builds its histogram bins differently when the input data is not normally distributed?
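
For what it's worth, my current understanding (an assumption on my part): LightGBM computes per-feature bin boundaries ("bin mappers") from the data a Dataset is constructed on, so the boundaries depend on the feature value distribution. A synthetic sketch of the pattern that normally keeps bin mappers aligned (names and data are illustrative):

import numpy as np
import lightgbm as lgb

rng = np.random.default_rng(0)
X_train = rng.exponential(size=(1000, 3))  # deliberately skewed features
y_train = rng.integers(0, 2, size=1000)
X_valid = rng.exponential(size=(200, 3))
y_valid = rng.integers(0, 2, size=200)

train_ds = lgb.Dataset(X_train, label=y_train)

# Supported pattern: reference= makes the validation set reuse the
# training set's bin mappers.
valid_ok = lgb.Dataset(X_valid, label=y_valid, reference=train_ds)

# Without reference=, bins are computed from the validation data alone
# and can disagree with the training bins, which is the situation the
# "different bin mappers" error guards against.
valid_standalone = lgb.Dataset(X_valid, label=y_valid)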

Issues that may be related:
#6151 (comment)
#2552

Reproducible example

Sample data and code for reproducing the issue are available here:
https://github.com/RandiHBK/lightgbm_issue_reproduction

Environment info

LightGBM version or commit hash:
LightGBM 4.5.0

Command(s) you used to install LightGBM

mamba env create -f environment.yml

Additional Comments

@jameslamb jameslamb changed the title Failed to train model with dataset built incrementally [python-package] Failed to train model with dataset built incrementally Jan 2, 2025
@jameslamb
Collaborator

Thanks for using LightGBM, and for taking the time to create a reproducible example.

Putting the error message here, so that this can be found easily from search when others have the same issue:

LightGBMError: Cannot add validation data, since it has different bin mappers with training data

And copying your code here into the issue:

import pathlib
import pyarrow.csv
import lightgbm as lgb

csv_dir_path = pathlib.Path("csvs")

label_train_array = pyarrow.csv.read_csv(csv_dir_path / "label_train.csv")["y"]
label_validation_array = pyarrow.csv.read_csv(csv_dir_path / "label_validation.csv")["y"]
feature_set_1_train_table = pyarrow.csv.read_csv(csv_dir_path / "feature_set_1_train.csv")
feature_set_1_validation_table = pyarrow.csv.read_csv(csv_dir_path / "feature_set_1_validation.csv")
feature_set_2_train_table = pyarrow.csv.read_csv(csv_dir_path / "feature_set_2_train.csv")
feature_set_2_validation_table = pyarrow.csv.read_csv(csv_dir_path / "feature_set_2_validation.csv")

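# Per-feature-set Datasets: each validation set references its own training
# set so that the pair shares bin mappers at this stage.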
feature_set_1_train_dataset = lgb.Dataset(
    feature_set_1_train_table, 
    label_train_array,
)
feature_set_1_train_dataset.construct()
feature_set_1_validation_dataset = lgb.Dataset(
    feature_set_1_validation_table, 
    label_validation_array,
    reference=feature_set_1_train_dataset,
)
feature_set_1_validation_dataset.construct()

feature_set_2_train_dataset = lgb.Dataset(
    feature_set_2_train_table, 
    label_train_array,
)
feature_set_2_train_dataset.construct()
feature_set_2_validation_dataset = lgb.Dataset(
    feature_set_2_validation_table, 
    label_validation_array,
    reference=feature_set_2_train_dataset,
)
feature_set_2_validation_dataset.construct()

# UserWarning: Cannot add features from NoneType type of raw data to NoneType type of raw data.
train_dataset = feature_set_1_train_dataset.add_features_from(feature_set_2_train_dataset)

validation_dataset = feature_set_1_validation_dataset.add_features_from(feature_set_2_validation_dataset)
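# Note: add_features_from() mutates and returns self, so train_dataset and
# validation_dataset are the same objects as the feature-set-1 Datasets.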

# LightGBMError: Cannot add validation data, since it has different bin mappers with training data
evals_result = {}
bst = lgb.train(
    {
        "force_col_wise": True,
    },
    train_dataset,
    valid_sets=[validation_dataset],
    valid_names=["validation"],
    callbacks=[
        lgb.early_stopping(stopping_rounds=10),
        lgb.record_evaluation(evals_result),
    ],
)

When someone has time, we'll help explain what's happening here.
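
One detail that may be relevant in the meantime (an assumption on my part, not a confirmed diagnosis): the UserWarning appears because Dataset.construct() frees the raw Python-side data by default (free_raw_data=True), so by the time add_features_from() runs there is no raw data left to merge and only the constructed Datasets are combined. Below is an untested sketch that keeps the raw data available, using only documented API; whether this also resolves the bin-mapper error is not verified:

# Untested sketch, not a confirmed fix: construct every Dataset with
# free_raw_data=False so add_features_from() has raw data to work with.
feature_set_1_train_dataset = lgb.Dataset(
    feature_set_1_train_table,
    label_train_array,
    free_raw_data=False,
)
feature_set_1_train_dataset.construct()
# ... build the other three Datasets the same way ...

train_dataset = feature_set_1_train_dataset.add_features_from(feature_set_2_train_dataset)
validation_dataset = feature_set_1_validation_dataset.add_features_from(feature_set_2_validation_dataset)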
