BUG: Fixed TypeError for Series.isin() when large series and values contains NA (#60678) #60736

Open
akj2018 wants to merge 4 commits into base: main

akj2018 commented Jan 19, 2025

Issue

Series.isin() raises "TypeError: boolean value of NA is ambiguous" when the Series is large (more than 1_000_000 elements) and values contains NA.
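A minimal reproduction sketch of the report above, assuming a pandas build without this fix is installed. The helper name try_isin is hypothetical and exists only for illustration; on a fixed build both calls are expected to succeed.

```python
import pandas as pd


def try_isin(n):
    """Call isin([pd.NA]) on a boolean Series of length n and report the outcome."""
    ser = pd.Series([True] * n, dtype="boolean")
    try:
        ser.isin([pd.NA])
    except TypeError as exc:
        return f"TypeError: {exc}"
    return "ok"


# Below the size threshold the lookup does not go through np.isin.
print(try_isin(10))
# Above the threshold, affected versions are reported to raise TypeError.
print(try_isin(1_000_001))
```

The only difference between the two calls is the Series length, which is what makes the behaviour inconsistent from the user's point of view.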

Reason

  • For a large Series with a small number of values, Series.isin() internally uses np.isin() for performance. The check below guards only against an object-dtype Series, so it does not handle the case where values has dtype=object and contains NA, and those values are passed to np.isin:

    # GH16012
    # Ensure np.isin doesn't get object types or it *may* throw an exception
    # Albeit hashmap has O(1) look-up (vs. O(logn) in sorted array),
    # isin is faster for small sizes
    if (
        len(comps_array) > _MINIMUM_COMP_ARR_LEN
        and len(values) <= 26
        and comps_array.dtype != object
    ):
        # If the values include nan we need to check for nan explicitly
        # since np.nan it not equal to np.nan
        if isna(values).any():
            def f(c, v):
                return np.logical_or(np.isin(c, v).ravel(), np.isnan(c))
        else:
            f = lambda a, b: np.isin(a, b).ravel()
    else:
        common = np_find_common_type(values.dtype, comps_array.dtype)
        values = values.astype(common, copy=False)
        comps_array = comps_array.astype(common, copy=False)
        f = htable.ismember
    return f(comps_array, values)

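  • For a small or object-dtype ar2, np.isin compares ar1 against each element of ar2 in a loop:

            mask = np.zeros(len(ar1), dtype=bool)
            for a in ar2:
                mask |= (ar1 == a)

  • When a is NA, the comparison ar1 == a raises a TypeError because the boolean value of pd.NA is ambiguous. Refer to the docs.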
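The failure mode can be illustrated without pandas or NumPy. The sketch below is a simplified pure-Python model, not the NumPy implementation: FakeNA is a hypothetical stand-in for pd.NA (comparisons return NA, truthiness raises), and loop_isin is a hypothetical imitation of the comparison loop.

```python
class FakeNA:
    """Hypothetical stand-in for pd.NA: comparisons propagate, truthiness is undefined."""

    def __eq__(self, other):
        # Like pd.NA, `x == NA` is itself NA rather than True or False.
        return self

    def __bool__(self):
        raise TypeError("boolean value of NA is ambiguous")


NA = FakeNA()


def loop_isin(ar1, ar2):
    """Imitate the fallback loop: OR together one equality mask per value in ar2."""
    mask = [False] * len(ar1)
    for a in ar2:
        # Each comparison result has to be coerced to a plain bool to fit the mask.
        mask = [m | bool(x == a) for m, x in zip(mask, ar1)]
    return mask


print(loop_isin([True, False, True], [True]))  # [True, False, True]

try:
    loop_isin([True, False, True], [True, NA])
except TypeError as exc:
    print("TypeError:", exc)  # TypeError: boolean value of NA is ambiguous
```

The error comes from forcing the result of an NA comparison into a boolean mask, which is why it only appears once the lookup is routed through this loop.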

Fix Implemented

In algorithms.py, explicitly check whether values contains NA when the Series is large and the number of values is small (<= 26), and avoid np.isin in that case.

from pandas._libs.missing import NA

def isin(comps: ListLike, values: ListLike) -> npt.NDArray[np.bool_]:

    # ... (existing code unchanged)

    # GH60678
    # Ensure values don't contain <NA>, otherwise np.in1d raises an exception
    values_contains_NA = False

    # Only scan values when the np.isin path could be taken at all
    if comps_array.dtype != object and len(values) <= 26:
        # Identity check: comparing against NA with == would itself return NA
        values_contains_NA = any(v is NA for v in values)

    if (
        len(comps_array) > _MINIMUM_COMP_ARR_LEN
        and len(values) <= 26
        and comps_array.dtype != object
        and not values_contains_NA
    ):
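To make the effect of the extra condition concrete, here is a small standalone sketch of the path selection. It is a model of the guard only, not the pandas code: choose_lookup, _NAType and the two constants are hypothetical names, with the constants mirroring the 1_000_000 and 26 thresholds mentioned in this description.

```python
MINIMUM_COMP_ARR_LEN = 1_000_000  # assumed to mirror the >1_000_000 threshold
MAX_VALUES_FOR_NP_ISIN = 26       # mirrors the `len(values) <= 26` check


class _NAType:
    """Hypothetical singleton standing in for pandas' NA."""


NA = _NAType()


def choose_lookup(n_comps, values, comps_is_object=False):
    """Return which lookup strategy the guarded condition would select."""
    # Identity check, so NA's comparison semantics never come into play.
    values_contains_na = any(v is NA for v in values)
    if (
        n_comps > MINIMUM_COMP_ARR_LEN
        and len(values) <= MAX_VALUES_FOR_NP_ISIN
        and not comps_is_object
        and not values_contains_na
    ):
        return "np.isin"
    return "hashtable"


print(choose_lookup(2_000_000, [True, False]))  # np.isin
print(choose_lookup(2_000_000, [True, NA]))     # hashtable
print(choose_lookup(10, [True, False]))         # hashtable
```

With the guard in place, a large Series whose values include NA takes the same hashtable path as a small Series, so the result no longer depends on the Series length.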

Testing

All existing test cases in test_isin.py pass. New tests were added for large Series with dtype boolean, Int64 and Float64, as follows:

  1. Series dtype==boolean and values contains pd.NA
  2. Series dtype==boolean and values contains mixed data with pd.NA
  3. Series dtype==boolean and values is empty
  4. Series dtype==Int64 and values contains pd.NA
  5. Series dtype==Float64 and values contains pd.NA

akj2018 force-pushed the bugfix/60678-pdNA-error branch from 7a3e501 to cb16826 on January 20, 2025 01:14

Successfully merging this pull request may close these issues.

BUG: boolean series .isin([pd.NA])] inconsistent for series length