
Confusion about dispersion entropy outcomes and probabilities #433

Open
rusandris opened this issue Jan 9, 2025 · 3 comments
Labels
bug Something isn't working

Comments

@rusandris
Contributor

Hi!
I came across a strange behaviour related to the Dispersion outcome space.
Running the example given in Rostaghi, M. and Azami, H. (2016), I get the same symbolic time series as presented in the paper.

using ComplexityMeasures
x = [9, 8, 1, 12, 5, -3, 1.5, 8.01, 2.99, 4, -1, 10]
d = Dispersion(; c = 3, m = 2, τ = 1)
codify(d, x) # [3, 3, 1, 3, 2, 1, 1, 3, 2, 2, 1, 3]

However, when I try to calculate the probabilities of the dispersion patterns with m=2,

probs, outc = probabilities_and_outcomes(d, x)
probs
 Probabilities{Float64,1} over 7 outcomes
 [1, 1]  0.09090909090909091
 [1, 2]  0.18181818181818182
 [1, 3]  0.09090909090909091
 [2, 2]  0.09090909090909091
 [2, 3]  0.18181818181818182
 [3, 1]  0.2727272727272727
 [3, 3]  0.09090909090909091

the output contains patterns that aren't even observed in the paper's example ([1, 2]), and others appear with incorrect probabilities ([1, 3]). Is this due to some difference in the definitions/implementation? What am I missing?
Thanks

@Datseris Datseris added the bug Something isn't working label Jan 9, 2025
@Datseris
Member

Datseris commented Jan 9, 2025

cc @kahaaga

(I haven't read the paper)

@kahaaga
Member

kahaaga commented Jan 10, 2025

@rusandris Hey there!

I did a bit of digging. The implementation here differs from the original implementation in step 2 of the paper. They use positive embedding lags; we use negative embedding lags. Otherwise the implementation is identical.
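To see concretely what the sign flip does, here is a minimal sketch in plain Julia (no package calls), using the codified symbols from the example above, for m = 2:

```julia
symbols = [3, 3, 1, 3, 2, 1, 1, 3, 2, 2, 1, 3]  # codify(d, x) from above
τ = 1
# Positive lag (Rostaghi & Azami): the pattern at time t is (s_t, s_{t+τ})
pos = [[symbols[t], symbols[t + τ]] for t in 1:length(symbols) - τ]
# Negative lag (this package): the pattern at time t is (s_t, s_{t-τ})
neg = [[symbols[t], symbols[t - τ]] for t in 1 + τ:length(symbols)]
# For m = 2, each negative-lag pattern is the elementwise reverse of the
# positive-lag pattern over the same window, e.g. the paper's [3, 1]
# shows up here as [1, 3].
@assert neg == reverse.(pos)
```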

Sanity check

The critical line is the definition of τs in counts_and_outcomes(o::Dispersion, ...), where we step by -τ instead of +τ. (I added a few @show statements below to track the codified symbols and dispersion patterns.)

function counts_and_outcomes(o::Dispersion, x::AbstractVector{<:Real})
    N = length(x)
    @show symbols = codify(o, x)
    # We must use genembed, not embed, to make sure the zero lag is included
    m, τ = o.m, o.τ
    τs = tuple((x for x in 0:-τ:-(m-1)*τ)...)  # Rostaghi uses 0:+τ:+(m-1)*τ 
    @show dispersion_patterns = genembed(symbols, τs, ones(m)).data 
    cts = fasthist!(dispersion_patterns) # This sorts `dispersion_patterns`
    outs = unique!(dispersion_patterns) # Therefore, outcomes are the sorted patterns.
    c = Counts(cts, (outs, ))
    return c, outcomes(c)
end

If you just put a minus sign in front of your desired lag, you get precisely what they do in the original paper.

julia> x=[9,8,1,12,5,-3,1.5,8.01,2.99,4,-1,10]; d = Dispersion(; c = 3, m = 2, τ = -1); # use -1 lag to get Rostaghi behavior

julia> probs,outc = allprobabilities_and_outcomes(d,x); probs
symbols = codify(o, x) = [3, 3, 1, 3, 2, 1, 1, 3, 2, 2, 1, 3]
dispersion_patterns = (genembed(symbols, τs, ones(m))).data = SVector{2, Int64}[[3, 3], [3, 1], [1, 3], [3, 2], [2, 1], [1, 1], [1, 3], [3, 2], [2, 2], [2, 1], [1, 3]]
 Probabilities{Float64,1} over 9 outcomes
 [1, 1]  0.09090909090909091
 [1, 2]  0.0
 [1, 3]  0.2727272727272727
 [2, 1]  0.18181818181818182
 [2, 2]  0.09090909090909091
 [2, 3]  0.0
 [3, 1]  0.09090909090909091
 [3, 2]  0.18181818181818182
 [3, 3]  0.09090909090909091

That'll also give you the dispersion entropy for their example.

julia> information(Shannon(base = ℯ), d, x)
symbols = [3, 3, 1, 3, 2, 1, 1, 3, 2, 2, 1, 3]
dispersion_patterns = SVector{2, Int64}[[3, 3], [3, 1], [1, 3], [3, 2], [2, 1], [1, 1], [1, 3], [3, 2], [2, 2], [2, 1], [1, 3]]
1.8462202193216335

Reasoning

The reason we use the negative sign on the lag here is compatibility with Associations.jl, where embedding vectors must be constructed with negative τ for transfer entropy computations and the like to be correct. Depending on your application, positive or negative embedding lags may work equally well.

For large enough real data sets, this implementation detail likely won't matter anyway if the goal is to compute the dispersion entropy, since this quantity only cares about the relative frequencies of dispersion patterns/embedding vectors.
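In fact, for the Shannon dispersion entropy specifically the lag sign should not matter even on short series: every negative-lag pattern is the reverse of a positive-lag pattern over the same window, so the two count distributions are permutations of each other, and Shannon entropy is invariant under relabeling outcomes. A quick sketch of this check (assuming ComplexityMeasures is loaded, using the API shown above):

```julia
using ComplexityMeasures
x = [9, 8, 1, 12, 5, -3, 1.5, 8.01, 2.99, 4, -1, 10]
# Package convention (negative internal lag) vs. the paper's convention
h_neg = information(Shannon(base = ℯ), Dispersion(; c = 3, m = 2, τ = 1), x)
h_pos = information(Shannon(base = ℯ), Dispersion(; c = 3, m = 2, τ = -1), x)
# The pattern distributions are reverses of each other, so the
# entropies coincide.
@assert h_neg ≈ h_pos
```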

Solution

This isn't strictly a bug, since the docstring says this outcome space is based on Rostaghi et al., not that it implements it precisely. The choice of sign for the embedding lag is somewhat arbitrary anyway; in fact, the original paper doesn't elaborate on its choice of a positive lag at all, as far as I can see from quickly skimming it again now.

I think the solution here is to add a documentation note explaining this implementation discrepancy. That should be enough to clear up any future confusion, I guess?

@rusandris
Contributor Author

Thank you very much for clearing this up for me! Yes, it seems this can be solved by simply adding a note in the docs.

@rusandris rusandris changed the title Incorrect dispersion entropy outcomes and probabilities Confusion about dispersion entropy outcomes and probabilities Jan 13, 2025