
potential little fixes appendix-D4.ipynb #427

Merged
merged 2 commits into rasbt:main from casinca:Append-D4-fixes on Nov 3, 2024

Conversation

@casinca (Contributor) commented Nov 2, 2024

While toying around to compare with my own code, I noticed two things in the complete training function from D.4 that I'm not sure about. Let me know what you think.

You showed 2 ways of passing the peak_lr value to the optimizer: passing it directly as an argument to the train_model function, or retrieving it from the optimizer's parameters inside train_model with peak_lr = optimizer.param_groups[0]["lr"], which is the way implemented in the notebook and the book.

But in the code, the lr argument for optimizer = torch.optim.AdamW(model.parameters(), weight_decay=0.1) is never passed as lr=peak_lr, so it falls back to AdamW's default of 1e-3 instead of the intended peak_lr = 5e-4 when we retrieve it with peak_lr = optimizer.param_groups[0]["lr"]
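A minimal, self-contained sketch of the mismatch (a dummy nn.Linear stands in for the GPT model; peak_lr = 5e-4 as in the notebook):

```python
import torch

model = torch.nn.Linear(4, 4)  # dummy stand-in for the GPT model
peak_lr = 5e-4                 # the intended peak learning rate

# As currently written: no lr= argument, so AdamW falls back to its default of 1e-3
optimizer = torch.optim.AdamW(model.parameters(), weight_decay=0.1)
print(optimizer.param_groups[0]["lr"])  # 0.001

# Proposed fix: pass the peak learning rate explicitly
optimizer = torch.optim.AdamW(model.parameters(), lr=peak_lr, weight_decay=0.1)
print(optimizer.param_groups[0]["lr"])  # 0.0005
```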

There is also a gap in the gradient clipping: there is no clipping for the first step after the warmup ends, when global_step == warmup_steps, because the warmup branch is entered with if global_step < warmup_steps: while the clipping only kicks in with if global_step > warmup_steps:

I'm not sure it was intended to have no clipping at the step where the lr is at its maximum, because you also mentioned:

Applies gradient clipping after the warmup phase to avoid exploding gradients

Example of an output with warmup_steps = 18, printing yes/no under the above conditions:
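A toy version of that check (hypothetical, not the notebook's code) makes the one-step gap visible:

```python
warmup_steps = 18

for global_step in range(warmup_steps - 1, warmup_steps + 2):
    in_warmup = global_step < warmup_steps  # warmup branch condition
    clipped = global_step > warmup_steps    # clipping condition as currently written
    print(f"step {global_step}: warmup={in_warmup}, clipped={clipped}")

# step 17: warmup=True, clipped=False
# step 18: warmup=False, clipped=False   <- neither warmed up nor clipped
# step 19: warmup=False, clipped=True
#
# Changing the clipping check to `global_step >= warmup_steps` closes the gap.
```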

- lr missing argument for passing peak_lr to the optimizer
- filling 1 step gap for gradient clipping

@rasbt (Owner) commented Nov 3, 2024

Thanks a lot for the PR. Yes, you are right: since we retrieve the peak LR from the optimizer in the training function, we should initialize the optimizer with the peak LR.

I agree with your update, but I'll set the peak_lr to 0.001 so that the loss and plots afterwards don't change too much (to make it a bit less confusing for the readers)
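In code, that adjustment presumably amounts to something like this (same dummy-model setup as the sketch above):

```python
import torch

model = torch.nn.Linear(4, 4)  # dummy stand-in for the GPT model

# Initializing with 0.001 (AdamW's previous implicit default) keeps the retrieved
# peak_lr, and therefore the loss values and plots, essentially unchanged.
peak_lr = 0.001
optimizer = torch.optim.AdamW(model.parameters(), lr=peak_lr, weight_decay=0.1)
assert optimizer.param_groups[0]["lr"] == peak_lr
```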

Regarding the second point, that's a great catch as well. I just double-checked and you are correct!

rasbt merged commit 9ce0be3 into rasbt:main on Nov 3, 2024
8 checks passed
casinca deleted the Append-D4-fixes branch on November 4, 2024, 18:31