Native support for incremental restore (#13239)
Summary:
With this change we add native library support for incremental restores. When designing the solution we decided to follow a 'tiered' approach where users pick one of three predefined and, for now, mutually exclusive restore modes (`kKeepLatestDbSessionIdFiles`, `kVerifyChecksum` and `kPurgeAllFiles` [default]), trading write IO / CPU for the degree of certainty that the existing destination db files match the selected backup files' contents. The new mode option is exposed via the existing `RestoreOptions` configuration, which by now is well established in our APIs. The restore engine consumes this configuration and infers which of the existing destination db files are 'in policy' to be retained during restore.
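
A minimal usage sketch of the new option is shown below. The paths are placeholders and the exact `Open` / `RestoreDBFromLatestBackup` overloads may vary slightly across RocksDB versions; the `RestoreOptions` constructor and `kVerifyChecksum` mode are as introduced in this change.

```cpp
#include <cassert>

#include "rocksdb/env.h"
#include "rocksdb/utilities/backup_engine.h"

using namespace ROCKSDB_NAMESPACE;

int main() {
  // Open an existing backup directory (placeholder path).
  BackupEngine* backup_engine = nullptr;
  auto open_status = BackupEngine::Open(
      Env::Default(), BackupEngineOptions("/tmp/rocksdb_backup"),
      &backup_engine);
  assert(open_status.ok());

  // Opt into incremental restore: retain existing destination files whose
  // checksums match the backup metadata and restore only the rest.
  RestoreOptions restore_options(/*_keep_log_files=*/false,
                                 RestoreOptions::kVerifyChecksum);

  auto restore_status = backup_engine->RestoreDBFromLatestBackup(
      /*db_dir=*/"/tmp/rocksdb_db", /*wal_dir=*/"/tmp/rocksdb_db",
      restore_options);
  assert(restore_status.ok());

  delete backup_engine;
  return 0;
}
```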

### Motivation

This work is motivated by an internal customer running a write-heavy, 1M+ QPS service who uses RocksDB restore functionality to scale up their fleet. Given the already high QPS on their end, the additional write IO from restores as they work today contributes to prolonged spikes that push the service into BLOB storage write quotas, which ultimately slows the pace of their scaling. See [T206217267](https://www.internalfb.com/intern/tasks/?t=206217267) for more.

### Impact
Enable faster service scaling by reducing the restore-driven write IO footprint on BLOB storage to the absolute minimum.

### Key technical nuances

1. According to prior investigations, the risk of collisions on the [file #, db session id, file size] metadata triplet is low enough that we can confidently use it to uniquely describe a file and its *perceived* contents, which is the rationale behind the `kKeepLatestDbSessionIdFiles` mode. To learn more about the risks / tradeoffs of using this mode, please check the related comment in `backup_engine.cc`. This mode is only supported for SSTs, where we persist the `db_session_id` information in the metadata footer.
2. `kVerifyChecksum` mode requires a full blob / SST file scan (assuming the backup file has its `checksum_hex` metadata set appropriately; if not, an additional scan of the backup file is needed). While it saves on write IOs (if the checksums match), it is still a fairly complex and _potentially_ CPU-intensive operation. See the sketch after this list for how the per-file decision plays out in both incremental modes.
3. We're extending the `WorkItemType` enum introduced in #13228 to accommodate a new simple `ComputeChecksum` request, which enables us to run 2) in parallel. This will become increasingly important as we move towards disaggregated storage, where holding up the sequence of checksum evaluations on a single lagging remote file scan would not be acceptable.
4. Note that it's necessary to compute the checksum of the restored file if the corresponding backup file and existing destination db file checksums didn't match.
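
The per-file retention decision can be illustrated roughly as follows. This is a simplified sketch, not the actual `backup_engine.cc` implementation: the struct and function names are hypothetical, the checksum of the existing file is assumed to have been computed by the caller, and the case where the backup checksum is missing (handled asynchronously by the real engine, see 2) and 3) above) is collapsed into a plain restore. It assumes the `RestoreOptions::Mode` enum from this change is available in the header.

```cpp
#include <cstdint>
#include <string>

#include "rocksdb/utilities/backup_engine.h"

using namespace ROCKSDB_NAMESPACE;

// Hypothetical per-file metadata; field names do not mirror internal types.
struct FileInfoSketch {
  uint64_t file_number = 0;
  std::string db_session_id;  // empty for files without a footer (non-SST)
  uint64_t file_size = 0;
  std::string checksum_hex;   // empty if not recorded in backup metadata
};

enum class FileAction { kRetain, kRestoreFromBackup };

// Decide whether an existing destination db file can be kept under the
// selected restore mode. `existing_checksum_hex` is assumed to have been
// produced by a prior full scan of the existing file (kVerifyChecksum only).
FileAction DecideForExistingFile(RestoreOptions::Mode mode,
                                 const FileInfoSketch& backup_file,
                                 const FileInfoSketch& existing_file,
                                 const std::string& existing_checksum_hex) {
  switch (mode) {
    case RestoreOptions::kKeepLatestDbSessionIdFiles:
      // Retain only when the [file #, db session id, file size] triplet
      // matches; only SSTs carry db_session_id in the metadata footer.
      if (!existing_file.db_session_id.empty() &&
          existing_file.db_session_id == backup_file.db_session_id &&
          existing_file.file_number == backup_file.file_number &&
          existing_file.file_size == backup_file.file_size) {
        return FileAction::kRetain;
      }
      return FileAction::kRestoreFromBackup;

    case RestoreOptions::kVerifyChecksum:
      // Retain only when the full-scan checksum of the existing file matches
      // the checksum recorded in the backup metadata. (Simplification: when
      // the backup checksum is missing, the real engine schedules an async
      // checksum computation rather than blindly restoring.)
      if (!backup_file.checksum_hex.empty() &&
          existing_checksum_hex == backup_file.checksum_hex) {
        return FileAction::kRetain;
      }
      return FileAction::kRestoreFromBackup;

    case RestoreOptions::kPurgeAllFiles:
    default:
      // Zero trust: everything is purged and restored from the backup.
      return FileAction::kRestoreFromBackup;
  }
}
```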

### Test plan  ✅

1. Manual testing using a debugger: ✅
2. Automated tests:
* `./backup_engine_test --gtest_filter=*IncrementalRestore*` covering the following scenarios: ✅
  * Full clean restore
  * Integration with `exclude files` feature (with proper writes counting)
  * User workflow simulation: happy path with a mix of newly added files and deleted original backup files
  * Existing db file corruptions and the difference in handling between `kVerifyChecksum` and `kKeepLatestDbSessionIdFiles` modes
* `./backup_engine_test --gtest_filter=*ExcludedFiles*`  ✅
  * Integrates existing test collateral with the newly introduced restore modes

Pull Request resolved: #13239

Reviewed By: pdillinger

Differential Revision: D67513875

Pulled By: mszeszko-meta

fbshipit-source-id: 273642accd7c97ea52e42f9dc1cc1479f86cf30e
mszeszko-meta authored and facebook-github-bot committed Jan 18, 2025
1 parent 602e19f commit 2257f4f
Showing 4 changed files with 851 additions and 197 deletions.
2 changes: 1 addition & 1 deletion include/rocksdb/env.h
@@ -627,7 +627,7 @@ class Env : public Customizable {
const EnvOptions& env_options,
const ImmutableDBOptions& immutable_ops) const;

-// OptimizeForCompactionTableWrite will create a new EnvOptions object that
+// OptimizeForCompactionTableRead will create a new EnvOptions object that
// is a copy of the EnvOptions in the parameters, but is optimized for reading
// table files.
virtual EnvOptions OptimizeForCompactionTableRead(
46 changes: 42 additions & 4 deletions include/rocksdb/utilities/backup_engine.h
@@ -126,8 +126,8 @@ struct BackupEngineOptions {
// Default: true
bool share_files_with_checksum;

-// Up to this many background threads will copy files for CreateNewBackup()
-// and RestoreDBFromBackup()
+// Up to this many background threads will be used to copy files & compute
+// checksums for CreateNewBackup() and RestoreDBFromBackup().
// Default: 1
int max_background_operations;

@@ -349,6 +349,39 @@ struct CreateBackupOptions {
};

struct RestoreOptions {
+// Enum reflecting the tiered approach to restores.
+//
+// Options `kKeepLatestDbSessionIdFiles` and `kVerifyChecksum` introduce
+// incremental restore capability and are intended to be used separately.
+enum Mode : uint32_t {
+// Most efficient way to restore a healthy / non-corrupted DB from
+// the backup(s). This mode can almost always successfully recover from
+// incomplete / missing files, as in an incomplete copy of a DB.
+// This mode is also integrated with the `exclude_files_callback` feature
+// and will opportunistically try to find excluded files in the existing db
+// filesystem if they are missing from all supplied backup directories.
+//
+// Effective on data files following modern share files naming schemes.
+kKeepLatestDbSessionIdFiles = 1U,
+
+// Recommended when the db is suspected to be unhealthy, e.g. when we want to
+// retain most of the files (therefore saving on write I/O) with the exception
+// of a few corrupted ones.
+//
+// When opted in, the restore engine will scan the db file, compute the
+// checksum and compare it against the checksum hardened in the backup file
+// metadata. If the checksums match, the existing file will be retained as-is.
+// Otherwise, it will be deleted and replaced with its restored backup
+// counterpart. If the backup file doesn't have a checksum hardened in the
+// metadata, we'll schedule an async task to compute it.
+kVerifyChecksum = 2U,
+
+// Zero trust. Least efficient.
+//
+// Purges all destination files and restores all files from the backup.
+kPurgeAllFiles = 0xffffU,
+};

// If true, restore won't overwrite the existing log files in wal_dir. It will
// also move all log files from archive directory to wal_dir. Use this option
// in combination with BackupEngineOptions::backup_log_files = false for
@@ -361,8 +394,13 @@ struct RestoreOptions {
// directories known to contain the required files.
std::forward_list<BackupEngineReadOnlyBase*> alternate_dirs;

-explicit RestoreOptions(bool _keep_log_files = false)
-    : keep_log_files(_keep_log_files) {}
+// Specifies the level of incremental restore. 'kPurgeAllFiles' by default.
+Mode mode;
+
+// FIXME(https://github.com/facebook/rocksdb/issues/13293)
+explicit RestoreOptions(bool _keep_log_files = false,
+                        Mode _mode = Mode::kPurgeAllFiles)
+    : keep_log_files(_keep_log_files), mode(_mode) {}
};

using BackupID = uint32_t;