Keep the latest n files based on version/timestamp and delete the rest (deltalake-python).
Deltalake provides the vacuum() method to delete older files based on time (vacuum).
Example:
deltaTable.vacuum() # vacuum files not required by versions more than 7 days old
deltaTable.vacuum(100) # vacuum files not required by versions more than 100 hours old
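A note on deltalake-python specifically: `DeltaTable.vacuum()` appears to default to `dry_run=True`, so it only returns the list of files that would be removed. It also seems to reject a `retention_hours` below the table's configured minimum unless `enforce_retention_duration=False` is passed. Below is a minimal sketch of a preview-then-delete call. `preview_then_vacuum` is a hypothetical helper name, and `table` is assumed to be a `deltalake.DeltaTable`.

```python
from typing import Any, List


def preview_then_vacuum(table: Any, retention_hours: int, delete: bool = False) -> List[str]:
    """List the files vacuum would remove and, only if `delete` is set, remove them."""
    # Dry run: nothing is deleted, only the candidate file list is returned.
    candidates = table.vacuum(
        retention_hours=retention_hours,
        dry_run=True,
        enforce_retention_duration=False,  # allow a retention below the configured minimum
    )
    if delete and candidates:
        # Second call performs the actual deletion with the same settings.
        table.vacuum(
            retention_hours=retention_hours,
            dry_run=False,
            enforce_retention_duration=False,
        )
    return candidates
```

Calling `preview_then_vacuum(delta_table, 100)` first and inspecting the returned list before passing `delete=True` keeps the destructive step explicit.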
### However, if the data has not been updated in the last 7 days and retention needs to be based on the latest available files, how to keep only the latest n files and delete the rest?
Use Case
One way to achieve this would be to delete the files based on the version/timestamp from history(), BUT...
#delta_table.load_as_version(version) is not pointing to version path
version_delete = delta_table.load_as_version(version)
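For context, `load_as_version()` in deltalake-python appears to return `None`: it repoints the existing `DeltaTable` object at the requested version instead of returning a path. The data files of that version can then be listed with `file_uris()`. The sketch below uses this to show why deleting "the files of version k" is risky: a file written in an old version may still be referenced by the retained versions. `files_safe_to_delete` and `collect_files_by_version` are hypothetical helper names. Loading old versions is assumed to still be possible, meaning their log entries have not been cleaned up.

```python
from typing import Dict, Iterable, List, Set


def files_safe_to_delete(files_by_version: Dict[int, Iterable[str]], keep: int) -> Set[str]:
    """Files referenced only by versions older than the newest `keep` versions."""
    versions = sorted(files_by_version)
    retained = versions[-keep:] if keep > 0 else []
    dropped = versions[:-keep] if keep > 0 else versions

    still_needed: Set[str] = set()
    for version in retained:
        still_needed.update(files_by_version[version])

    old: Set[str] = set()
    for version in dropped:
        old.update(files_by_version[version])

    # A file shared between an old and a retained version must stay.
    return old - still_needed


def collect_files_by_version(table_path: str) -> Dict[int, List[str]]:
    """Map every loadable version to the data files it references."""
    from deltalake import DeltaTable  # imported lazily; requires the deltalake package

    table = DeltaTable(table_path)
    latest = table.version()
    result: Dict[int, List[str]] = {}
    for version in range(latest + 1):
        table.load_as_version(version)  # returns None, mutates `table` in place
        result[version] = table.file_uris()
    return result
```

With `{0: ["a"], 1: ["a", "b"], 2: ["b", "c"], 3: ["c", "d"]}` and `keep=2`, only `"a"` is safe to delete, because `"b"` is still referenced by version 2.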
How to delete older files and reset the version once the older files are deleted?
import os

from deltalake import DeltaTable

delta_table_path = "/path/to/delta_table"
delta_table = DeltaTable(delta_table_path)

# history() returns one record per commit; the newest is assumed to come first
history = delta_table.history()
for record in history:
    print(record)

# Retain last 5 file_versions
files_retained = 5
record_latest_version = int(history[0]['version'])
# Versions below this number fall outside the last 5 (latest - 4 .. latest)
record_deleted_versions = record_latest_version - files_retained + 1
print(record_deleted_versions)

if len(history) > files_retained:
    for version in range(record_deleted_versions):
        # delta_table.load_as_version(version) not pointing to version path
        version_delete = delta_table.load_as_version(version)
        print(version_delete)
        file_path = os.path.join(delta_table_path, version_delete)
        if os.path.exists(file_path):
            print(f"Deleting file: {file_path}")
            os.remove(file_path)
file versions (reset version for latest 5 retained files)
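One possible direction, sketched under assumptions rather than as a confirmed answer: instead of deleting files per version by hand, derive a retention window from the commit timestamp of the oldest version to keep and hand that to `vacuum()`. The sketch assumes each `history()` record carries a `version` and a `timestamp` in milliseconds since the epoch. `retention_hours_for` and `vacuum_keeping_latest` are hypothetical helper names.

```python
import math
import time
from typing import Any, Dict, List, Optional

MS_PER_HOUR = 3_600_000


def retention_hours_for(history: List[Dict[str, Any]], keep: int,
                        now_ms: Optional[int] = None) -> int:
    """Hours between now and the commit of the oldest version to retain, rounded up."""
    if keep < 1:
        raise ValueError("keep must be at least 1")
    if not history:
        raise ValueError("history is empty")
    newest_first = sorted(history, key=lambda rec: rec["version"], reverse=True)
    # If fewer than `keep` versions exist, fall back to the oldest one available.
    oldest_kept = newest_first[min(keep, len(newest_first)) - 1]
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    age_ms = max(0, now_ms - oldest_kept["timestamp"])
    # Rounding up errs on the side of keeping more files.
    return math.ceil(age_ms / MS_PER_HOUR)


def vacuum_keeping_latest(table_path: str, keep: int, dry_run: bool = True) -> List[str]:
    """Vacuum files that the newest `keep` versions should no longer need."""
    from deltalake import DeltaTable  # imported lazily; requires the deltalake package

    table = DeltaTable(table_path)
    hours = retention_hours_for(table.history(), keep)
    return table.vacuum(
        retention_hours=hours,
        dry_run=dry_run,  # True only lists the files
        enforce_retention_duration=False,  # window may be shorter than the 7-day default
    )
```

Running `vacuum_keeping_latest(delta_table_path, 5)` with the default `dry_run=True` returns the candidate files without deleting anything. As for resetting the version: version numbers come from the transaction log, and vacuum does not appear to renumber them. Writing the current snapshot into a fresh table would be one way to start again at version 0.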