
Add Everest storage (port of seba_sqlite logic) #9763

Open · wants to merge 27 commits into base: main
Conversation

@yngve-sk (Contributor) commented on Jan 16, 2025

Release notes:

  • Stores constraint violation values
  • Replaces the seba_sqlite storage

(Storage PR reopened for review; the previous PR, #9161, was accidentally merged.)

Issue
Resolves #8811

Base idea/documentation:

Store datasets by [batch, realization, perturbation] x [controls, objectives, constraints, objective_gradient, constraint_gradient]:

Exhaustive list of data stored PER BATCH:

  • batch.json - info about the batch: batch_id and whether it is an improvement (aka the merit flag, but the concepts are now unified for Dakota and non-Dakota runs)
  • batch_constraints - constraint values (and violations), batch-wide
  • batch_objectives - objective values, batch-wide
  • realization_controls - control values per geo-realization, also includes simulation_id
  • realization_objectives - objective values per geo-realization
  • realization_constraints - constraint values per geo-realization
  • perturbation_objectives - objective and control values per perturbation
  • perturbation_constraints - constraint and control values per perturbation (Note/discussion point: control values could be pulled into a separate table to avoid redundancy)
  • batch_objective_gradient - partial derivatives of the objectives with respect to the controls. This dataset has one column per objective and one row per control value; the intersecting cell holds the partial derivative of that objective wrt that control value.
  • batch_constraint_gradient - partial derivatives of the constraints with respect to the controls. This dataset has one column per constraint and one row per control value; the intersecting cell holds the partial derivative of that constraint wrt that control value.

Example data from math_func/config_advanced.yml (json format): [screenshot, 2025-01-10 14:53]

Exhaustive list of data stored PER OPTIMIZATION:

  • controls.json - control values for this batch
  • realization_weights.json - realization weights
  • nonlinear_constraints - conditions the constraints must satisfy (on average over the batch)
  • objective_functions - objective function names, weights, and normalization

Example data from math_func/config_advanced.yml: [screenshot, 2025-01-10 15:00]

Potential simplifications

The everest_data_api is currently used only for plotting. It could also (probably with some expansion) be used to avoid the direct polars dataframe manipulations that are currently done elsewhere in the code.

@yngve-sk changed the title from "24.10.25.store everest opt results in ertstorage" to "Add Everest storage" on Jan 16, 2025

codspeed-hq bot commented Jan 16, 2025

CodSpeed Performance Report

Merging #9763 will not alter performance

Comparing yngve-sk:24.10.25.store-everest-opt-results-in-ertstorage (77d9534) with main (a2fd3e1)

Summary

✅ 24 untouched benchmarks

@yngve-sk yngve-sk self-assigned this Jan 16, 2025
@yngve-sk yngve-sk added release-notes:breaking-change Automatically categorise as breaking change in release notes enhancement labels Jan 16, 2025
@yngve-sk changed the title from "Add Everest storage" to "Add Everest storage (port of seba_sqlite logic)" on Jan 16, 2025
@DanSava (Contributor) left a comment:
This looks good 👍 🏅

src/everest/api/everest_data_api.py (review thread resolved)
src/everest/detached/__init__.py (review thread resolved)
@StephanDeHoop (Contributor) left a comment:

Looks great! Most of my comments are just asking to clarify things I don't understand :). Amazing work, well done :) !!!

perturbation_objectives: polars.DataFrame | None
batch_constraint_gradient: polars.DataFrame | None
perturbation_constraints: polars.DataFrame | None
is_improvement: bool | None = False
Comment (Contributor):
I might answer my own question at the end of the review, but why are we not sending the is_improvement when returning the existing dataframe as a dict in the method below?

Reply (Author):
Hmm, the .existing_dataframes() prop is used for looping over the dataframes that exist for a batch; these may be function results and/or gradient results, so it filters out non-existent dataframes to avoid errors when trying to write them. Other properties of the batch that are not dataframes go into the batch.json file.

realization_weights: polars.DataFrame | None = None

@property
def simulation_to_geo_realization_map(self) -> dict[int, int]:
Comment (Contributor):
I am not 100% following this: the mapping from ert "realization" to everest "simulation_id" depends on the batch, but here it seems like we always base it on the first realization_controls that is not None?

Reply (Author):
I think this is actually a logic error now, since the mapping varies per batch. Nice catch, I will correct the logic.

Reply (Author):
Logic fixed, although this code path is not covered by tests. I suggest deferring a test for it until we add those mappings explicitly to ERT storage, as that will be an easier setup than what is there now. There are probably some edge cases here I'm not quite familiar with.
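The spirit of the fix can be shown with a plain-Python sketch (made-up values; the real code works on a polars frame per batch): the map must be rebuilt from each batch's own realization_controls, since a given simulation_id can point at a different geo-realization in another batch.

```python
# Hypothetical rows of two batches' realization_controls datasets.
batch_0_rows = [
    {"simulation_id": 0, "realization": 2},
    {"simulation_id": 1, "realization": 5},
]
batch_1_rows = [
    {"simulation_id": 0, "realization": 5},  # same id, different realization
    {"simulation_id": 1, "realization": 2},
]


def simulation_to_geo_realization_map(rows: list[dict]) -> dict[int, int]:
    # Built per batch, never cached across batches.
    return {row["simulation_id"]: row["realization"] for row in rows}
```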

renames = {
"objective": "objective_name",
"weighted_objective": "total_objective_value",
"variable": "control_name",
Comment (Contributor):
"variable" vs "variables", where one is actually the variable name and the other the value, is interesting. I guess in the table/json it is very clear what is what, but maybe in the code it's not immediately obvious?

Reply (Author):
Hmm, I think this rename function should clarify it somewhat. I added a comment clarifying that this maps ropt columns to the columns we present to the user.

exp = _OptimizerOnlyExperiment(self._output_dir)

# csv writing mostly for dev/debugging/quick inspection
self.data.write_to_experiment(exp, write_csv=True)
Comment (Contributor):
We are already writing an ungodly amount of files to disk whenever we run EVEREST. Since you mention this is for dev/debug (or quick inspection), should we have a flag for it? We call this whenever the FINISHED_OPTIMIZER_STEP event is triggered, so after every batch? EDIT: I guess the "STEP" in this event name refers to the full optimization as a step in the workflow (not an optimization iteration), so only one file is generated, which wouldn't hurt haha.

Reply (Author):
This was initially for dev/debug yes, and we did not discuss/land on a final format, I think this should be a discussion point at some point before merging it. The data is not big but I don't think we need to 3x the data with 3 different formats.

Reply (Author):
Added a commit making it optional for now; the invocation now looks like this. Whether the json/csv output should be kept at all remains to be discussed.

        self.data.write_to_experiment(
            exp,
            write_json=bool(os.getenv("_EVEREST_STORAGE_WRITE_JSON")),
            write_csv=bool(os.getenv("_EVEREST_STORAGE_WRITE_CSV")),
        )

Reply (Author):
Edit: added another commit on top removing the functionality; I think we should default to not having it as a feature.

select=["objectives", "evaluation_ids"],
).reset_index(),
).select(
"batch_id",
Comment (Contributor):
Maybe a silly question, or not how it should work, but I see many hardcoded strings that occur in many different places (and potentially have the same meaning). Shouldn't we have enums for this, or use something like nameof (as in C#) when referring to properties/attributes?

Reply (Author):
Hmm, it would be possible to do an enum, but the column keys are still specified in the _rename_ropt_df_columns function; all other ropt columns are dropped, and we do not need to know them. We could translate this into an enum, but I'm not sure it would be good practice or much gain, beyond easier autocomplete when choosing a column. If you are doing a dataframe manipulation in the first place, you will at the very least look at a sample dataframe before making sense of what to do, so you'll learn the columns that way. If this becomes an issue we could do a future PR/issue for it.
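If this ever becomes a pain point, one lightweight option (a sketch only, not part of this PR) is a str-backed enum: each member compares equal to its raw string, so it can be passed to polars wherever a plain column name is expected.

```python
from enum import Enum


class Col(str, Enum):
    # Hypothetical subset of the user-facing column names.
    BATCH_ID = "batch_id"
    OBJECTIVE_NAME = "objective_name"
    CONTROL_NAME = "control_name"


# A str-mixin enum member *is* a str, so comparisons like
# Col.BATCH_ID == "batch_id" hold and df.select(Col.BATCH_ID.value) works.
```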

batch_data = next((b for b in storage.data.batches if b.batch_id == batch), None)

if batch_data:
# All geo-realizations should have the same unperturbed control values per batch
Comment (Contributor):
Only true if the batch contains a forward model run for unperturbed controls; if the batch is only perturbations, they will all be different for different <GEO_ID>s? Not sure how that affects config_branch_entry(), where this is called.

Reply (Author):
Hmm, I think this is OK in that case, because it is used for the config_branch functionality, where users supply a batch to use as a "prior" in the next config, basically copying its control values as initial values in the new config. If they supply a perturbation-only batch there will be some (probably not the best) error, but I think that is separate from this PR.

assert new_controls_initial_guesses == opt_control_val_for_batch_id
control_names = storage.data.controls["control_name"]
batch_1_info = next(b for b in storage.data.batches if b.batch_id == 1)
realization_control_vals_mean = batch_1_info.realization_controls.select(
Comment (Contributor):
Why do we call it mean? It seems to be either the unperturbed values or the first perturbation, or am I missing something?

Reply (Author):
Oh, I think it is because this previously ran on advanced, which had 2 geo-realizations, where it took the mean; on minimal there is only one realization, so the mean is no longer necessary.

storage.read_from_output_dir()
control_names = storage.data.controls["control_name"]
batch_1_info = next(b for b in storage.data.batches if b.batch_id == 1)
realization_control_vals_mean = batch_1_info.realization_controls.select(
Comment (Contributor):
Same as above

Reply (Author):
Same as above; renaming the variable so it no longer says mean.

snapshot = SebaSnapshot(config.optimization_output_dir).get_snapshot()
storage = EverestStorage(Path(config.optimization_output_dir))
storage.read_from_output_dir()
optimal = storage.get_optimal_result()
Comment (Contributor):
Really like this (implicit?) improvement regarding the name compared to snapshot.optimization_data[-1] :) !

"control": "point_y",
"function": "distance_p",
"value": 0.98866227
"control": "point_x",
Comment (Contributor):
Is this a weird git diff, or why did this control name change in the test?

Reply (Author):
Worth a double check, but I think it just means the order of the items changed; the diff goes line by line rather than by item order.

@yngve-sk yngve-sk force-pushed the 24.10.25.store-everest-opt-results-in-ertstorage branch 5 times, most recently from cba1184 to 058e314 Compare January 20, 2025 09:09
@yngve-sk yngve-sk force-pushed the 24.10.25.store-everest-opt-results-in-ertstorage branch from 058e314 to d7a931e Compare January 20, 2025 09:09
@yngve-sk yngve-sk force-pushed the 24.10.25.store-everest-opt-results-in-ertstorage branch from d7a931e to d707241 Compare January 20, 2025 09:11
Labels
enhancement · release-notes:breaking-change (automatically categorised as breaking change in release notes)
Projects
Status: Ready for Review
Development
Successfully merging this pull request may close this issue: Refactor communication and storage of optimization results
4 participants