jameslamb changed the title from "Failed to train model with dataset built incrementally" to "[python-package] Failed to train model with dataset built incrementally" on Jan 2, 2025.
Thanks for using LightGBM, and for taking the time to create a reproducible example.
Putting the error message here, so that this can be found easily from search when others have the same issue:
LightGBMError: Cannot add validation data, since it has different bin mappers with training data
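For context (not a full diagnosis): this error is raised when a validation Dataset ends up with feature binning that differs from the training Dataset's. In the Python package, the usual way to guarantee matching bin mappers is to pass reference= when constructing the validation Dataset, as in this minimal sketch with synthetic data:

import lightgbm as lgb
import numpy as np

rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(100, 5)), rng.integers(0, 2, size=100)
X_valid, y_valid = rng.normal(size=(50, 5)), rng.integers(0, 2, size=50)

# The training Dataset defines the bin mappers (histogram bin boundaries).
train_dataset = lgb.Dataset(X_train, label=y_train)

# reference= makes the validation Dataset reuse the training bin mappers
# instead of computing its own.
valid_dataset = lgb.Dataset(X_valid, label=y_valid, reference=train_dataset)

bst = lgb.train({"objective": "binary", "verbose": -1}, train_dataset, valid_sets=[valid_dataset])

Note that the code below already does this per feature set, so the open question is why the Datasets produced by add_features_from() end up with mismatched bin mappers.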
And copying your code here into the issue:
import pathlib

import pyarrow.csv
import lightgbm as lgb

csv_dir_path = pathlib.Path("csvs")

label_train_array = pyarrow.csv.read_csv(csv_dir_path / "label_train.csv")["y"]
label_validation_array = pyarrow.csv.read_csv(csv_dir_path / "label_validation.csv")["y"]
feature_set_1_train_table = pyarrow.csv.read_csv(csv_dir_path / "feature_set_1_train.csv")
feature_set_1_validation_table = pyarrow.csv.read_csv(csv_dir_path / "feature_set_1_validation.csv")
feature_set_2_train_table = pyarrow.csv.read_csv(csv_dir_path / "feature_set_2_train.csv")
feature_set_2_validation_table = pyarrow.csv.read_csv(csv_dir_path / "feature_set_2_validation.csv")
feature_set_1_train_dataset = lgb.Dataset(
    feature_set_1_train_table,
    label_train_array,
)
feature_set_1_train_dataset.construct()

feature_set_1_validation_dataset = lgb.Dataset(
    feature_set_1_validation_table,
    label_validation_array,
    reference=feature_set_1_train_dataset,
)
feature_set_1_validation_dataset.construct()

feature_set_2_train_dataset = lgb.Dataset(
    feature_set_2_train_table,
    label_train_array,
)
feature_set_2_train_dataset.construct()

feature_set_2_validation_dataset = lgb.Dataset(
    feature_set_2_validation_table,
    label_validation_array,
    reference=feature_set_2_train_dataset,
)
feature_set_2_validation_dataset.construct()

# UserWarning: Cannot add features from NoneType type of raw data to NoneType type of raw data.
train_dataset = feature_set_1_train_dataset.add_features_from(feature_set_2_train_dataset)
validation_dataset = feature_set_1_validation_dataset.add_features_from(feature_set_2_validation_dataset)

# LightGBMError: Cannot add validation data, since it has different bin mappers with training data
evals_result = {}
bst = lgb.train(
    {
        "force_col_wise": True,
    },
    train_dataset,
    valid_sets=[validation_dataset],
    valid_names=["validation"],
    callbacks=[
        lgb.early_stopping(stopping_rounds=10),
        lgb.record_evaluation(evals_result),
    ],
)
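A side note on the UserWarning shown in the snippet (my reading, not a confirmed diagnosis of the error): lgb.Dataset defaults to free_raw_data=True, so after construct() the Python-side raw data (the .data attribute) is released; add_features_from() can still merge the constructed binary Datasets, but it warns that it has only NoneType raw data left to concatenate. Keeping the raw data attached looks roughly like the sketch below; with data that genuinely does not fit in memory this is of limited help, and it is not established here whether it avoids the bin-mapper error.

import lightgbm as lgb
import numpy as np

rng = np.random.default_rng(0)
X_part_1 = rng.normal(size=(100, 3))
X_part_2 = rng.normal(size=(100, 2))
y = rng.integers(0, 2, size=100)

# free_raw_data=False keeps the original arrays attached to the Dataset objects,
# so add_features_from() can also concatenate the raw data instead of warning
# about NoneType raw data.
part_1 = lgb.Dataset(X_part_1, label=y, free_raw_data=False)
part_1.construct()
part_2 = lgb.Dataset(X_part_2, label=y, free_raw_data=False)
part_2.construct()

merged = part_1.add_features_from(part_2)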
When someone has time, we'll help explain what's happening here.
Description
Since my dataset is too large to fit into memory, I tried saving separate feature Datasets to disk with Dataset.save_binary() and then merging them back together with Dataset.add_features_from(). However, LightGBM raises a warning while building the merged Dataset with add_features_from(), and later throws an error about the validation Dataset when training the model. The error only occurs with my production data: with artificially generated data the warning still appears, but no error is thrown. After some experimentation, the issue does not seem to be related to the number of rows or columns. Could the cause be that LightGBM builds its histograms (bin boundaries) differently when the input data is not normally distributed?
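Since the snippet copied above does not include the save_binary() step mentioned here, this is roughly the save-to-disk / load-back pattern being described, as a minimal sketch with synthetic data and hypothetical file names:

import lightgbm as lgb
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = rng.integers(0, 2, size=100)

# Build (bin) the Dataset, then write LightGBM's binary representation to disk.
dataset = lgb.Dataset(X, label=y)
dataset.construct()
dataset.save_binary("feature_set_1_train.bin")

# Later, possibly in another process, pass the path of the binary file as the
# data argument to load it back, together with its stored labels and bin mappers.
reloaded = lgb.Dataset("feature_set_1_train.bin")
reloaded.construct()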
Issues that may be related:
#6151 (comment)
#2552
Reproducible example
Sample data and code for reproducing the issue are available here:
https://github.com/RandiHBK/lightgbm_issue_reproduction
Environment info
LightGBM version or commit hash: LightGBM 4.5.0
Command(s) you used to install LightGBM
Additional Comments