Add Everest storage (port of seba_sqlite logic) #9763
Conversation
CodSpeed Performance Report: merging #9763 will not alter performance.
This looks good 👍 🏅
Looks great! Most of my comments are just asking to clarify things I don't understand :). Amazing work, well done :) !!!
perturbation_objectives: polars.DataFrame | None
batch_constraint_gradient: polars.DataFrame | None
perturbation_constraints: polars.DataFrame | None
is_improvement: bool | None = False
I might answer my own question at the end of the review, but why are we not including is_improvement when returning the existing dataframes as a dict in the method below?
Hmm, the .existing_dataframes property is used for looping over the dataframes that exist for a batch; these may be function results and/or gradient results, so it filters out non-existent dataframes to avoid errors when trying to write them. Other batch properties that are not dataframes go into the batch.json file.
src/everest/everest_storage.py
realization_weights: polars.DataFrame | None = None

@property
def simulation_to_geo_realization_map(self) -> dict[int, int]:
I am not 100% following this: the mapping from ert "realization" to everest "simulation_id" depends on the batch, but here it seems like we always base it on the first realization_controls that is not None?
I think this is actually a logic error now, since the mapping varies per batch. Nice catch, will correct the logic.
Logic fixed, although this code path is not covered by tests. I suggest deferring a test until we add those mappings explicitly to ERT storage, as that will make for an easier setup than what is there now. There are probably some edge cases here I'm not quite familiar with.
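A sketch of the corrected per-batch behavior (plain dicts stand in for dataframe rows; the names are illustrative, not the real storage API):

```python
# Sketch: each batch derives its own simulation_id -> geo-realization map
# from its own realization_controls rows, rather than reusing the first
# non-None one across all batches.
def simulation_to_geo_realization_map(
    realization_controls: list[dict],
) -> dict[int, int]:
    return {row["simulation_id"]: row["realization"] for row in realization_controls}


batch_0 = [
    {"simulation_id": 0, "realization": 2},
    {"simulation_id": 1, "realization": 5},
]
batch_1 = [{"simulation_id": 0, "realization": 5}]  # the mapping differs per batch
assert simulation_to_geo_realization_map(batch_0) == {0: 2, 1: 5}
assert simulation_to_geo_realization_map(batch_1) == {0: 5}
```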
renames = {
    "objective": "objective_name",
    "weighted_objective": "total_objective_value",
    "variable": "control_name",
It is interesting that "variable" and "variables" coexist, where one is actually the variable name and the other the value. I guess in the table/json it is very clear which is which, but maybe in the code it's not immediately obvious?
Hmm, I think this rename function should clarify it somewhat. I added a comment noting that this maps ropt columns to the columns we present to the user.
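As a sketch, the rename discussed here amounts to something like the following (the mapping mirrors the snippet in the diff; the helper function itself is illustrative, not the actual _rename_ropt_df_columns):

```python
# ropt-internal column names -> names presented to the user
ROPT_TO_USER_COLUMNS = {
    "objective": "objective_name",
    "weighted_objective": "total_objective_value",
    "variable": "control_name",
}


def rename_ropt_columns(columns: list[str]) -> list[str]:
    # Unknown columns pass through unchanged; ropt columns that are dropped
    # entirely never reach this point in the real code.
    return [ROPT_TO_USER_COLUMNS.get(c, c) for c in columns]


assert rename_ropt_columns(["batch_id", "variable", "objective"]) == [
    "batch_id",
    "control_name",
    "objective_name",
]
```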
src/everest/everest_storage.py
exp = _OptimizerOnlyExperiment(self._output_dir)

# csv writing mostly for dev/debugging/quick inspection
self.data.write_to_experiment(exp, write_csv=True)
We are already writing an ungodly number of files to disk whenever we run EVEREST; since you mention this is for dev or debug (or quick inspection), should we have a flag for it? We call this whenever the FINISHED_OPTIMIZER_STEP event is triggered, so after every batch, I guess? EDIT: I guess the "STEP" in this event name refers to the full optimization as a "STEP" in the workflow (not an optimization iteration/step). So only one file is generated, and I guess that wouldn't hurt haha.
This was initially for dev/debug, yes, and we did not discuss/land on a final format. I think this should be a discussion point before merging. The data is not big, but I don't think we need to 3x it with three different formats.
Added a commit making it optional for now; the invocation now looks like this, but the whole json/csv writing should maybe be removed, to be discussed:

self.data.write_to_experiment(
    exp,
    write_json=bool(os.getenv("_EVEREST_STORAGE_WRITE_JSON")),
    write_csv=bool(os.getenv("_EVEREST_STORAGE_WRITE_CSV")),
)
Edit: Added another commit on top removing the functionality, I think we should probably default to not having it as a feature.
    select=["objectives", "evaluation_ids"],
).reset_index(),
).select(
    "batch_id",
Maybe a silly question, or not how it should work, but I see many hardcoded strings that occur in many different places (and potentially have the same meaning). Shouldn't we have enums for this, or use something like nameof (as in C#) when referring to properties/attributes?
Hmm, it would be possible to do an enum, but the column keys are still specified in the _rename_ropt_df_columns function; all other ROPT columns are dropped, so we do not need to know them. We could translate this into an enum, but I'm not sure it would be good practice or much gain, other than easier autocompletion when choosing a column. If you are doing dataframe manipulation in the first place, you will at the very least look at a sample dataframe before making sense of what to do, so you will implicitly learn the columns that way. If this becomes an issue we could address it in a future PR/issue.
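If the enum route were ever taken, a str-backed enum would keep the values dataframe-friendly while giving call sites autocompletion. A sketch (the member names are assumptions based on the diff, not an agreed design):

```python
from enum import Enum


# Sketch: one place that names the user-facing columns, so call sites get
# autocompletion instead of repeating string literals.
class Column(str, Enum):
    BATCH_ID = "batch_id"
    CONTROL_NAME = "control_name"
    OBJECTIVE_NAME = "objective_name"
    TOTAL_OBJECTIVE_VALUE = "total_objective_value"


# Because of the str mixin, members compare equal to the raw strings, so they
# can often be passed where a plain column-name string is expected.
assert Column.BATCH_ID == "batch_id"
assert Column.CONTROL_NAME.value == "control_name"
```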
batch_data = next((b for b in storage.data.batches if b.batch_id == batch), None)

if batch_data:
    # All geo-realizations should have the same unperturbed control values per batch
This is only true if the batch contains a forward model run for unperturbed controls? If the batch is only perturbations, they will all be different for different <GEO_ID>s? Not sure how that affects config_branch_entry(), where this is called.
Hmm, I think this is OK in that case, because it is used for the config_branch functionality, where users supply a batch to use as a "prior" in the next config, basically copying its control values as initial values in the new config. If they supply a perturbation-only batch there will be some (probably not the best) error, but I think that is separate from this PR.
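Roughly, the config_branch copying described here amounts to the following (the row structure and function name are illustrative, not the actual implementation):

```python
# Sketch: take the unperturbed control values of a chosen batch and use them
# as initial guesses in a new config. All geo-realizations share the same
# unperturbed controls within a batch, so collecting name -> value pairs is
# enough; a perturbation-only batch would need an explicit error instead.
def controls_as_initial_guesses(realization_controls: list[dict]) -> dict[str, float]:
    return {row["control_name"]: row["value"] for row in realization_controls}


rows = [
    {"control_name": "point_x", "value": 0.5},
    {"control_name": "point_y", "value": 0.98866227},
]
assert controls_as_initial_guesses(rows) == {"point_x": 0.5, "point_y": 0.98866227}
```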
assert new_controls_initial_guesses == opt_control_val_for_batch_id
control_names = storage.data.controls["control_name"]
batch_1_info = next(b for b in storage.data.batches if b.batch_id == 1)
realization_control_vals_mean = batch_1_info.realization_controls.select(
Why do we call it mean? It seems to be either the unperturbed values or the first perturbation, or am I missing something?
Oh, I think it is because this was maybe running on advanced before, which had 2 geo-realizations, where it did the mean; on minimal there is only one realization, so the mean is no longer necessary.
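A small illustration of the point (made-up numbers):

```python
from statistics import mean

# With two geo-realizations (as in config_advanced) averaging control values
# across realizations is a genuine mean; with one (as in config_minimal) the
# "mean" is just that single realization's value, so the name is misleading.
advanced = [1.0, 3.0]  # two geo-realizations
minimal = [2.0]  # one geo-realization
assert mean(advanced) == 2.0
assert mean(minimal) == minimal[0]
```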
storage.read_from_output_dir()
control_names = storage.data.controls["control_name"]
batch_1_info = next(b for b in storage.data.batches if b.batch_id == 1)
realization_control_vals_mean = batch_1_info.realization_controls.select(
Same as above
Same as above, renaming the variable to not have "mean" in it.
snapshot = SebaSnapshot(config.optimization_output_dir).get_snapshot()
storage = EverestStorage(Path(config.optimization_output_dir))
storage.read_from_output_dir()
optimal = storage.get_optimal_result()
Really like this (implicit?) improvement regarding the name compared to snapshot.optimization_data[-1] :) !
"control": "point_y",
"function": "distance_p",
"value": 0.98866227
"control": "point_x",
Is this a weird git diff, or why did this control_name change in the test?
Worth a double check, but I think it just means the order of items changed; the diff goes line by line rather than by matching items.
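A quick way to double-check such a diff is to compare the records order-insensitively, e.g. by sorting both sides on a stable key (illustrative data):

```python
# If only the ordering changed, sorting both sides by a stable key shows the
# contents are identical even though a line-by-line git diff looks noisy.
old = [
    {"control": "point_y", "value": 0.98866227},
    {"control": "point_x", "value": 0.5},
]
new = [
    {"control": "point_x", "value": 0.5},
    {"control": "point_y", "value": 0.98866227},
]
assert sorted(old, key=lambda r: r["control"]) == sorted(new, key=lambda r: r["control"])
```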
Release notes:
(Storage PR back in, for reviews, previous PR (accidentally merged): #9161)
Issue
Resolves #8811
Base idea/documentation:
Store datasets by
[batch, realization, perturbation] x [controls, objectives, constraints, objective_gradient, constraint_gradient]
Exhaustive list of data stored PER BATCH:

- batch.json - info about the batch: the batch_id and whether it is an improvement (aka merit flag, but the concepts are now unified for dakota and non-dakota runs)
- batch_constraints - constraint values (and violations), batch-wide
- batch_objectives - objective values, batch-wide
- realization_controls - control values for geo-realizations, also includes simulation_id
- realization_objectives - objective values per geo-realization
- realization_constraints - constraint values per geo-realization
- perturbation_objectives - objective and control values per perturbation
- perturbation_constraints - constraint and control values per perturbation (note/discussion point: control values could be pulled into a separate table to avoid redundancy)
- batch_objective_gradient - partial derivatives of the objectives with respect to the controls. This dataset has one column per objective and one row per control value, and each intersecting cell is the partial derivative of that objective wrt that control value.
- batch_constraint_gradient - partial derivatives of the constraints with respect to the controls, with the same layout.

Example data from math_func/config_advanced.yml (json format)
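To make the gradient-table layout concrete, a column-oriented sketch (objective/control names echo the math_func example; the values are made up):

```python
# batch_objective_gradient layout: one row per control value, one column per
# objective; each cell is the partial derivative of that objective with
# respect to that control value. Numbers are illustrative only.
batch_objective_gradient = {
    "control_name": ["point_x", "point_y"],
    "distance_p": [0.12, -0.40],  # d(distance_p)/d(control)
    "distance_q": [0.03, 0.25],  # d(distance_q)/d(control)
}

# Every objective column has one entry per control row.
n_controls = len(batch_objective_gradient["control_name"])
assert len(batch_objective_gradient["distance_p"]) == n_controls
assert len(batch_objective_gradient["distance_q"]) == n_controls
```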
Exhaustive list of data stored PER OPTIMIZATION:

- controls.json - control values for this batch
- realization_weights.json - realization weights
- nonlinear_constraints - conditions for constraints to satisfy (on average over the batch)
- objective_functions - objective function names, weights, and normalization

Example data from math_func/config_advanced.yml
Potential simplifications

The everest_data_api is currently used for plotting, but it could (probably expanded a bit) also be used to avoid direct polars dataframe manipulations elsewhere in the code; currently those are done inline.