[CUDA] Categoric feature with 32k+1 n.o. categories causes fatal exception with device_type="cuda" #6784

Open
zansibal opened this issue Jan 11, 2025 · 0 comments

zansibal commented Jan 11, 2025

Description

A categorical feature with exactly 32k+1 distinct categories, for any positive integer k (33, 65, 97, ...), causes a fatal exception when training with device_type="cuda". The same data trains fine with device_type="cpu".

Reproducible example

import lightgbm as lgb
import numpy as np
X = np.random.randint(0, 97, (1000, 1)) # 97, or 32k+1 for any positive integer k triggers the bug
y = np.random.uniform(-1, 1, 1000)
lgb.train({"device_type": "cuda"}, lgb.Dataset(X, y, categorical_feature=[0]))

Output:

[LightGBM] [Warning] Using sparse features with CUDA is currently not supported.
[LightGBM] [Info] Total Bins 97
[LightGBM] [Info] Number of data points in the train set: 1000, number of used features: 1
[LightGBM] [Info] Start training from score -0.009523
[LightGBM] [Fatal] [CUDA] invalid argument .../LightGBM4.5.0/src/treelearner/cuda/cuda_single_gpu_tree_learner.cu 235

---------------------------------------------------------------------------
LightGBMError                             Traceback (most recent call last)
Cell In[2], line 5
      3 X = np.random.randint(0, 97, (1000, 1)) # 97, or 32k+1 for any positive integer k triggers the bug
      4 y = np.random.uniform(-1, 1, 1000)
----> 5 lgb.train({'device_type': 'cuda'}, lgb.Dataset(X, y, categorical_feature=[0]))

File .../lib/python3.12/site-packages/lightgbm/engine.py:307, in train(params, train_set, num_boost_round, valid_sets, valid_names, feval, init_model, feature_name, categorical_feature, keep_training_booster, callbacks)
    295 for cb in callbacks_before_iter:
    296     cb(
    297         callback.CallbackEnv(
    298             model=booster,
   (...)
    304         )
    305     )
--> 307 booster.update(fobj=fobj)
    309 evaluation_result_list: List[_LGBM_BoosterEvalMethodResultType] = []
    310 # check evaluation result.

File .../lib/python3.12/site-packages/lightgbm/basic.py:4135, in Booster.update(self, train_set, fobj)
   4133 if self.__set_objective_to_none:
   4134     raise LightGBMError("Cannot update due to null objective function.")
-> 4135 _safe_call(
   4136     _LIB.LGBM_BoosterUpdateOneIter(
   4137         self._handle,
   4138         ctypes.byref(is_finished),
   4139     )
   4140 )
   4141 self.__is_predicted_cur_iter = [False for _ in range(self.__num_dataset)]
   4142 return is_finished.value == 1

File .../lib/python3.12/site-packages/lightgbm/basic.py:296, in _safe_call(ret)
    288 """Check the return value from C API call.
    289 
    290 Parameters
   (...)
    293     The return value from C API calls.
    294 """
    295 if ret != 0:
--> 296     raise LightGBMError(_LIB.LGBM_GetLastError().decode("utf-8"))

LightGBMError: [CUDA] invalid argument .../LightGBM4.5.0/src/treelearner/cuda/cuda_single_gpu_tree_learner.cu 235

Environment info

LightGBM version or commit hash: 4.5.0

Command(s) you used to install LightGBM

sudo apt install --no-install-recommends git cmake build-essential libboost-dev libboost-system-dev libboost-filesystem-dev
sudo apt install nvidia-cuda-toolkit
git clone --recursive https://github.com/microsoft/LightGBM LightGBM-4.5.0
cd LightGBM-4.5.0
git reset --hard 3f7e6081275624edfca1f9b3096bea7a81a744ed # version 4.5.0
mkdir build
cd build
cmake -DUSE_GPU=1 -DUSE_CUDA=1 -DCMAKE_C_COMPILER=/usr/bin/gcc-12 -DCMAKE_CXX_COMPILER=/usr/bin/g++-12 ..
make -j$(nproc)
cd ..
sudo apt install python3-pip
pip install setuptools numpy scipy scikit-learn -U
sh ./build-python.sh install --precompile

Ubuntu 24.04 LTS
Nvidia RTX 4090 GPU

Additional Comments

Tried on multiple GPUs (all in the same machine). I happened to have 97 categories in my use case. I then ran a loop testing every number of categories up to 100 and found that 33 and 65 fail as well.
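
For anyone who wants to repeat the sweep, here is a minimal sketch of such a loop. It is not the exact script I ran: the helper names `find_failing_category_counts` and `train_on_cuda` are made up for illustration, and `verbose=-1` and `num_boost_round=1` are my own choices to keep the run short and quiet. It assumes the error comes back as a Python exception, as in the traceback above, so the loop can continue after a failure. If the GPU state looks suspect after a failure, running each count in a fresh process would be safer.

```python
from typing import Callable, Iterable, List


def find_failing_category_counts(
    try_train: Callable[[int], None], counts: Iterable[int]
) -> List[int]:
    """Return every count in `counts` for which `try_train(count)` raises."""
    failing = []
    for n_categories in counts:
        try:
            try_train(n_categories)
        except Exception as exc:  # the CUDA failure surfaces as LightGBMError
            print(f"{n_categories} categories: {exc}")
            failing.append(n_categories)
    return failing


def train_on_cuda(n_categories: int, n_rows: int = 1000) -> None:
    """Same setup as the reproducible example, with a variable category count."""
    # Imported lazily so the sweep helper itself does not depend on them.
    import lightgbm as lgb
    import numpy as np

    X = np.random.randint(0, n_categories, (n_rows, 1))
    y = np.random.uniform(-1, 1, n_rows)
    lgb.train(
        {"device_type": "cuda", "verbose": -1},  # verbose=-1 is an assumption
        lgb.Dataset(X, y, categorical_feature=[0]),
        num_boost_round=1,  # one iteration keeps the sweep short
    )


# Usage on a CUDA-enabled build:
#   find_failing_category_counts(train_on_cuda, range(2, 101))
```

Passing the training step in as a callable keeps the sweep independent of the device, so the same loop can be pointed at a "cpu" variant for comparison.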

jameslamb changed the title on Jan 12, 2025:
from: Categoric feature with 32k+1 n.o. categories causes fatal exception with device_type="cuda"
to: [CUDA] Categoric feature with 32k+1 n.o. categories causes fatal exception with device_type="cuda"