Delete older files based on VERSION #3143

Open
starzar opened this issue Jan 19, 2025 · 0 comments
Labels
enhancement New feature or request


starzar commented Jan 19, 2025

Description

Keep the latest n files based on version/timestamp and delete the rest (deltalake-python).

Delta Lake provides the vacuum() method to delete older files based on time.

Example:


deltaTable.vacuum()     # vacuum files not required by versions more than 7 days old

deltaTable.vacuum(100)  # vacuum files not required by versions more than 100 hours old
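
For deltalake-python specifically, a minimal sketch of a vacuum call could look like the following. It assumes the `vacuum()` keyword arguments `retention_hours`, `dry_run` and `enforce_retention_duration` as documented for recent releases; `preview_and_vacuum` is a hypothetical helper name, and `table` is expected to be a `DeltaTable`. With `dry_run=True` nothing is deleted and the candidate files are only listed.

```python
def preview_and_vacuum(table, retention_hours, delete=False):
    """List vacuum candidates and optionally delete them.

    `table` is assumed to be a deltalake.DeltaTable (hypothetical helper,
    not part of the library).
    """
    # Dry run: returns the files that would be removed, deletes nothing.
    candidates = table.vacuum(
        retention_hours=retention_hours,
        dry_run=True,
        # Assumption: needed when retention_hours is shorter than the
        # table's configured retention (7 days by default).
        enforce_retention_duration=False,
    )
    if delete:
        # Real run: physically removes the candidate files.
        table.vacuum(
            retention_hours=retention_hours,
            dry_run=False,
            enforce_retention_duration=False,
        )
    return candidates
```

Called as `preview_and_vacuum(DeltaTable(delta_table_path), retention_hours=100)`, this only lists candidates; passing `delete=True` removes them. Disabling `enforce_retention_duration` is what permits a window shorter than the configured retention, so it should be used with care.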

However, if the table has not been updated in the last 7 days and the data needs to be retained based on the latest available files, how can only the latest n files be kept and the rest deleted?

Use Case

One way to achieve this would be to delete the files based on the version/timestamp from history(), but:

# delta_table.load_as_version(version) does not point to the version path
version_delete = delta_table.load_as_version(version)

How can older files be deleted, and the version numbering reset once the older files are deleted?

.py

import os

from deltalake import DeltaTable

delta_table_path = "/path/to/delta_table"

delta_table = DeltaTable(delta_table_path)

# history() returns the commit records, newest first
history = delta_table.history()
for record in history:
    print(record)

# Retain the last 5 file versions
files_retained = 5
record_latest_version = int(history[0]['version'])
# Versions below this number are candidates for deletion
first_retained_version = record_latest_version - files_retained + 1
print(first_retained_version)
if len(history) > files_retained:
    for version in range(first_retained_version):
        # Problem: load_as_version() does not point to the version path.
        # It loads that version into delta_table in place and returns None,
        # so the os.path.join() call below cannot build a file path from it.
        version_delete = delta_table.load_as_version(version)
        print(version_delete)

        file_path = os.path.join(delta_table_path, version_delete)
        if os.path.exists(file_path):
            print(f"Deleting file: {file_path}")
            os.remove(file_path)
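
One possible direction, sketched under assumptions: since `load_as_version()` appears to switch the table object to the requested version in place, the data files of that version can then be read with `file_uris()`. The helper name `files_for_latest_versions` is hypothetical, `table` is assumed to be a `DeltaTable`, and every version from 0 up to the latest is assumed to still be loadable from the log. The sketch only computes the two sets and deletes nothing.

```python
def files_for_latest_versions(table, keep):
    """Return (needed, stale) sets of data-file URIs.

    needed: files referenced by any of the latest `keep` versions.
    stale:  files referenced only by older versions.
    `table` is assumed to be a deltalake.DeltaTable (hypothetical helper).
    """
    latest = table.version()
    first_retained = max(latest - keep + 1, 0)
    needed, stale = set(), set()
    try:
        for version in range(latest + 1):
            # Switches `table` to this version in place; returns None.
            table.load_as_version(version)
            uris = set(table.file_uris())
            if version >= first_retained:
                needed.update(uris)
            else:
                stale.update(uris)
    finally:
        # Put the table object back on the latest version.
        table.load_as_version(latest)
    # A file shared with a retained version must never be treated as stale.
    return needed, stale - needed
```

Removing the `stale` files by hand would break time travel to the older versions, and as far as I can tell the version numbers in the transaction log would keep counting up rather than reset, so `vacuum()` with a suitable retention window is likely the safer way to do the actual deletion.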

File versions to retain (the version numbering should be reset for the latest 5 retained files):

{'timestamp': 1724762119089, 'operation': 'WRITE', 'operationParameters': {'mode': 'Overwrite'}, 'clientVersion': 'delta-rs.0.18.1', 'version': 155}
{'timestamp': 1724761972623, 'operation': 'WRITE', 'operationParameters': {'mode': 'Overwrite'}, 'clientVersion': 'delta-rs.0.18.1', 'version': 154}
{'timestamp': 1724761906451, 'operation': 'WRITE', 'operationParameters': {'mode': 'Overwrite'}, 'clientVersion': 'delta-rs.0.18.1', 'version': 153}
{'timestamp': 1724761641391, 'operation': 'WRITE', 'operationParameters': {'mode': 'Overwrite'}, 'clientVersion': 'delta-rs.0.18.1', 'version': 152}
{'timestamp': 1724761204127, 'operation': 'WRITE', 'operationParameters': {'mode': 'Overwrite'}, 'clientVersion': 'delta-rs.0.18.1', 'version': 151}
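
Records like these carry enough information to derive a time-based retention window from a version count. The sketch below is a rough illustration, not an existing deltalake feature: `split_versions` and `retention_hours_for` are hypothetical helpers that work on a list of dicts with `version` and `timestamp` (milliseconds) keys, as printed above. The hour count is rounded up so the window errs on the side of keeping more files.

```python
HOUR_MS = 3_600_000  # milliseconds per hour


def split_versions(history, keep):
    """Return (retained, deletable) version numbers, newest first."""
    if keep < 1 or not history:
        raise ValueError("need a non-empty history and keep >= 1")
    versions = sorted((int(r["version"]) for r in history), reverse=True)
    return versions[:keep], versions[keep:]


def retention_hours_for(history, keep, now_ms):
    """Hours between `now_ms` and the oldest retained commit, rounded up."""
    retained, _ = split_versions(history, keep)
    timestamps = {int(r["version"]): int(r["timestamp"]) for r in history}
    delta_ms = now_ms - timestamps[retained[-1]]
    # Integer ceiling division; never return a negative window.
    return max((delta_ms + HOUR_MS - 1) // HOUR_MS, 0)
```

The resulting number could then be passed as the retention window to `vacuum()`, on the assumption that files dropped by commits newer than the oldest retained version fall inside that window.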
@starzar starzar added the enhancement New feature or request label Jan 19, 2025