Add Everest storage (port of seba_sqlite logic) #9763
Conversation
CodSpeed Performance Report: merging #9763 will not alter performance.
This looks good 👍 🏅
Looks great! Most of my comments are just asking to clarify things I don't understand :). Amazing work, well done :) !!!
perturbation_objectives: polars.DataFrame | None
batch_constraint_gradient: polars.DataFrame | None
perturbation_constraints: polars.DataFrame | None
is_improvement: bool | None = False
I might answer my own question at the end of the review, but why are we not including is_improvement when returning the existing dataframes as a dict in the method below?
Hmm, the .existing_dataframes property is used for looping over the dataframes that exist for a batch; these may be function results and/or gradient results, so it filters out non-existent dataframes to avoid errors when trying to write them. Other batch properties that are not dataframes go into the batch.json file.
src/everest/everest_storage.py
realization_weights: polars.DataFrame | None = None

@property
def simulation_to_geo_realization_map(self) -> dict[int, int]:
I am not 100% following this: the mapping from ert "realization" to everest "simulation_id" depends on the batch, but here it seems like we always base it on the first realization_controls that is not None?
I think this is actually a logic error now, since the mapping varies per batch. Nice catch, will correct the logic.
Logic fixed, although this code path is not covered by tests. I suggest deferring a test until we add those mappings explicitly to ERT storage, as that will make for an easier setup than what is there now. There are probably some edge cases here I'm not quite familiar with.
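A sketch of the corrected per-batch behavior (plain dicts stand in for dataframe rows; the names are illustrative, not the real storage API):

```python
# Sketch: each batch derives its own simulation_id -> geo-realization map
# from its own realization_controls rows, rather than reusing the first
# non-None one across all batches.
def simulation_to_geo_realization_map(
    realization_controls: list[dict],
) -> dict[int, int]:
    return {row["simulation_id"]: row["realization"] for row in realization_controls}


batch_0 = [
    {"simulation_id": 0, "realization": 2},
    {"simulation_id": 1, "realization": 5},
]
batch_1 = [{"simulation_id": 0, "realization": 5}]  # the mapping differs per batch
assert simulation_to_geo_realization_map(batch_0) == {0: 2, 1: 5}
assert simulation_to_geo_realization_map(batch_1) == {0: 5}
```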
renames = {
    "objective": "objective_name",
    "weighted_objective": "total_objective_value",
    "variable": "control_name",
It is interesting that "variable" and "variables" coexist, where one is actually the variable name and the other the value. I guess in the table/json it is very clear which is which, but maybe in the code it's not immediately obvious?
Hmm, I think this rename function should clarify it somewhat. I added a comment noting that this maps ropt columns to the columns we present to the user.
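As a sketch, the rename discussed here amounts to something like the following (the mapping mirrors the snippet in the diff; the helper function itself is illustrative, not the actual _rename_ropt_df_columns):

```python
# ropt-internal column names -> names presented to the user
ROPT_TO_USER_COLUMNS = {
    "objective": "objective_name",
    "weighted_objective": "total_objective_value",
    "variable": "control_name",
}


def rename_ropt_columns(columns: list[str]) -> list[str]:
    # Unknown columns pass through unchanged; ropt columns that are dropped
    # entirely never reach this point in the real code.
    return [ROPT_TO_USER_COLUMNS.get(c, c) for c in columns]


assert rename_ropt_columns(["batch_id", "variable", "objective"]) == [
    "batch_id",
    "control_name",
    "objective_name",
]
```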
src/everest/everest_storage.py
exp = _OptimizerOnlyExperiment(self._output_dir)

# csv writing mostly for dev/debugging/quick inspection
self.data.write_to_experiment(exp, write_csv=True)
We are already writing an ungodly number of files to disk whenever we run EVEREST; since you mention this is for dev or debug (or quick inspection), should we have a flag for it? We call this whenever the FINISHED_OPTIMIZER_STEP event is triggered, so after every batch, I guess? EDIT: I guess the "STEP" in this event name refers to the full optimization as a "STEP" in the workflow (not an optimization iteration/step). So only one file is generated, and I guess that wouldn't hurt haha.
This was initially for dev/debug, yes, and we did not discuss/land on a final format. I think this should be a discussion point before merging. The data is not big, but I don't think we need to 3x it with three different formats.
Added a commit making it optional for now; the invocation now looks like this, but the whole json/csv writing should maybe be removed, to be discussed:

self.data.write_to_experiment(
    exp,
    write_json=bool(os.getenv("_EVEREST_STORAGE_WRITE_JSON")),
    write_csv=bool(os.getenv("_EVEREST_STORAGE_WRITE_CSV")),
)
Edit: Added another commit on top removing the functionality, I think we should probably default to not having it as a feature.
    select=["objectives", "evaluation_ids"],
).reset_index(),
).select(
    "batch_id",
Maybe a silly question, or not how it should work, but I see many hardcoded strings that occur in many different places (and potentially have the same meaning). Shouldn't we have enums for this, or use something like nameof (as in C#) when referring to properties/attributes?
Hmm, it would be possible to do an enum, but the column keys are still specified in the _rename_ropt_df_columns function; all other ROPT columns are dropped, so we do not need to know them. We could translate this into an enum, but I'm not sure it would be good practice or much gain, other than easier autocompletion when choosing a column. If you are doing dataframe manipulation in the first place, you will at the very least look at a sample dataframe before making sense of what to do, so you will implicitly learn the columns that way. If this becomes an issue we could address it in a future PR/issue.
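If the enum route were ever taken, a str-backed enum would keep the values dataframe-friendly while giving call sites autocompletion. A sketch (the member names are assumptions based on the diff, not an agreed design):

```python
from enum import Enum


# Sketch: one place that names the user-facing columns, so call sites get
# autocompletion instead of repeating string literals.
class Column(str, Enum):
    BATCH_ID = "batch_id"
    CONTROL_NAME = "control_name"
    OBJECTIVE_NAME = "objective_name"
    TOTAL_OBJECTIVE_VALUE = "total_objective_value"


# Because of the str mixin, members compare equal to the raw strings, so they
# can often be passed where a plain column-name string is expected.
assert Column.BATCH_ID == "batch_id"
assert Column.CONTROL_NAME.value == "control_name"
```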
batch_data = next((b for b in storage.data.batches if b.batch_id == batch), None)

if batch_data:
    # All geo-realizations should have the same unperturbed control values per batch
This is only true if the batch contains a forward model run for unperturbed controls? If the batch is only perturbations, they will all be different for different <GEO_ID>s? Not sure how that affects config_branch_entry(), where this is called.
Hmm, I think this is OK in that case, because it is used for the config_branch functionality, where users supply a batch to use as a "prior" in the next config, basically copying its control values as initial values in the new config. If they supply a perturbation-only batch there will be some (probably not the best) error, but I think that is separate from this PR.
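Roughly, the config_branch copying described here amounts to the following (the row structure and function name are illustrative, not the actual implementation):

```python
# Sketch: take the unperturbed control values of a chosen batch and use them
# as initial guesses in a new config. All geo-realizations share the same
# unperturbed controls within a batch, so collecting name -> value pairs is
# enough; a perturbation-only batch would need an explicit error instead.
def controls_as_initial_guesses(realization_controls: list[dict]) -> dict[str, float]:
    return {row["control_name"]: row["value"] for row in realization_controls}


rows = [
    {"control_name": "point_x", "value": 0.5},
    {"control_name": "point_y", "value": 0.98866227},
]
assert controls_as_initial_guesses(rows) == {"point_x": 0.5, "point_y": 0.98866227}
```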
assert new_controls_initial_guesses == opt_control_val_for_batch_id
control_names = storage.data.controls["control_name"]
batch_1_info = next(b for b in storage.data.batches if b.batch_id == 1)
realization_control_vals_mean = batch_1_info.realization_controls.select(
Why do we call it mean? It seems to be either the unperturbed values or the first perturbation, or am I missing something?
Oh, I think it is because this was maybe running on advanced before, which had 2 geo-realizations, where it did the mean; on minimal there is only one realization, so the mean is no longer necessary.
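A small illustration of the point (made-up numbers):

```python
from statistics import mean

# With two geo-realizations (as in config_advanced) averaging control values
# across realizations is a genuine mean; with one (as in config_minimal) the
# "mean" is just that single realization's value, so the name is misleading.
advanced = [1.0, 3.0]  # two geo-realizations
minimal = [2.0]  # one geo-realization
assert mean(advanced) == 2.0
assert mean(minimal) == minimal[0]
```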
storage.read_from_output_dir()
control_names = storage.data.controls["control_name"]
batch_1_info = next(b for b in storage.data.batches if b.batch_id == 1)
realization_control_vals_mean = batch_1_info.realization_controls.select(
Same as above
Same as above, renaming the variable to not have "mean" in it.
snapshot = SebaSnapshot(config.optimization_output_dir).get_snapshot()
storage = EverestStorage(Path(config.optimization_output_dir))
storage.read_from_output_dir()
optimal = storage.get_optimal_result()
Really like this (implicit?) improvement regarding the name compared to snapshot.optimization_data[-1] :) !
"control": "point_y",
"function": "distance_p",
"value": 0.98866227
"control": "point_x",
Is this a weird git diff, or why did this control_name change in the test?
Worth a double check, but I think it just means the order of items changed; the diff goes line by line rather than by matching items.
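A quick way to double-check such a diff is to compare the records order-insensitively, e.g. by sorting both sides on a stable key (illustrative data):

```python
# If only the ordering changed, sorting both sides by a stable key shows the
# contents are identical even though a line-by-line git diff looks noisy.
old = [
    {"control": "point_y", "value": 0.98866227},
    {"control": "point_x", "value": 0.5},
]
new = [
    {"control": "point_x", "value": 0.5},
    {"control": "point_y", "value": 0.98866227},
]
assert sorted(old, key=lambda r: r["control"]) == sorted(new, key=lambda r: r["control"])
```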
Release notes:
(Storage PR back in, for reviews, previous PR (accidentally merged): #9161)
Issue
Resolves #8811
Base idea/documentation:
Store datasets by
[batch, realization, perturbation] x [controls, objectives, constraints, objective_gradient, constraint_gradient]
Exhaustive list of data stored PER BATCH:

- batch.json - info about the batch: the batch_id and whether it is an improvement (aka merit flag, but the concepts are now unified for dakota and non-dakota runs)
- batch_constraints - constraint values (and violations), batch-wide
- batch_objectives - objective values, batch-wide
- realization_controls - control values for geo-realizations, also includes simulation_id
- realization_objectives - objective values per geo-realization
- realization_constraints - constraint values per geo-realization
- perturbation_objectives - objective and control values per perturbation
- perturbation_constraints - constraint and control values per perturbation (note/discussion point: control values could be pulled into a separate table to avoid redundancy)
- batch_objective_gradient - partial derivatives of the objectives with respect to the controls. This dataset has one column per objective and one row per control value, and each intersecting cell is the partial derivative of that objective wrt that control value.
- batch_constraint_gradient - partial derivatives of the constraints with respect to the controls, with the same layout.

Example data from math_func/config_advanced.yml (json format)
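To make the gradient-table layout concrete, a column-oriented sketch (objective/control names echo the math_func example; the values are made up):

```python
# batch_objective_gradient layout: one row per control value, one column per
# objective; each cell is the partial derivative of that objective with
# respect to that control value. Numbers are illustrative only.
batch_objective_gradient = {
    "control_name": ["point_x", "point_y"],
    "distance_p": [0.12, -0.40],  # d(distance_p)/d(control)
    "distance_q": [0.03, 0.25],  # d(distance_q)/d(control)
}

# Every objective column has one entry per control row.
n_controls = len(batch_objective_gradient["control_name"])
assert len(batch_objective_gradient["distance_p"]) == n_controls
assert len(batch_objective_gradient["distance_q"]) == n_controls
```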
Exhaustive list of data stored PER OPTIMIZATION:

- controls.json - control values for this batch
- realization_weights.json - realization weights
- nonlinear_constraints - conditions for constraints to satisfy (on average over the batch)
- objective_functions - objective function names, weights, and normalization

Example data from math_func/config_advanced.yml
Potential simplifications

The everest_data_api is currently used for plotting, but it could (probably expanded a bit) also be used to avoid direct polars dataframe manipulations elsewhere in the code; currently those are done inline.