Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor: modernisation of proselint #1371

Draft
wants to merge 434 commits into
base: main
Choose a base branch
from
Draft

Conversation

Nytelife26
Copy link
Member

@Nytelife26 Nytelife26 commented May 13, 2024

this is a follow on from #1361. credit to @orgua for the initial work here.

following a request that no work from the initial refactoring effort should be used, preserved below, the oxidation of proselint begins.
image
it can be observed here that the time it now takes to launch proselint, parse CLI options, find config paths for both JSON and TOML, and deserialize them, is faster than it previously took just to load the CLI options. it is worth noting that the previous measurements were taken even after a highly optimised refactor that involved replacing click with a simplified parse function.
image

it may take some time to get this all up and running. i would like to thank any onlookers for bearing with me.

keeping this here so i don't forget:

  • moka for caching(? few libraries support TTL in an LRU, which is the style used in proselint)
  • postcard for cache persistence, due to its relatively good performance and efficient output size

@tecosaur
Copy link

tecosaur commented Oct 9, 2024

This looks fantastic @Nytelife26! I see the last commit here is from July, is there much more that needs to be done?

@Nytelife26
Copy link
Member Author

is there much more that needs to be done?

I am in the process of porting proselint in its entirety, which became a necessity after an unfortunate conflict arose with the original author of the refactor. I see this as the best possible path forward for proselint - a fresh start, which will come with performance benefits, long-overdue housekeeping, and a good chance to implement some features people have been asking for for a long time.

This effort has not been easy, and while I'm back at university, things have been a hard balance. However, I have many of the internals done already (configuration, core parts of the command line, specification structures), and I feel good progress is being made.

I will be making this effort more public once I have a solid foundation in place. For now, the latest commits to this pull request mark the last Python version of the project, unless some major breakthrough in communication happens.

Let me know if you have any furher questions. As always, I am incredibly grateful and happy to see that people are still interested in proselint. Things were rough, with stagnant development after communications ceased some years ago, but I am excited to finally have the chance to revive this project.

@Nytelife26
Copy link
Member Author

Small victories are showing - the command line, configuration parser, and check primitives are operational. As an additional bonus, all of the check specifications will be evaluated and stored at compile-time, entirely eliminating runtime discovery costs from the Python version.

Shown here is the first test with an actual regex specification from the original code.

image

Some things are not yet possible for reasons beyond my control, like consistent case-insensitive matching without using hacky mode modifiers (blocked by fancy-regex#132). I also have yet to implement parallelization, but that will not be a priority until much of the other major work is complete; although, it should be as trivial as adding Rayon and adjusting the dispatch iterators.

@Nytelife26
Copy link
Member Author

Nytelife26 commented Nov 7, 2024

I committed a preview version today so any onlookers can see how things are coming along. Be advised that at present, things are messy, quite inefficient, and there are still traces of Python-esque design patterns lying around.

However, results speak for themselves. With 51 of ~180 checks implemented, an uncached serial run in release mode is at least 10 times faster than the previous implementation of an uncached parallel run mode from my measurements. Assuming this performance will scale in a linear fashion with some pessimistic padding, I would expect no worse than 2 times faster.

nytelife26@[lilium-2] » proselint git:(dev) ‎‎‎± hyperfine
Benchmark 1: ./proselint-rs/target/debug/proselint check --demo
  Time (mean ± σ):     529.2 ms ±   6.0 ms    [User: 520.3 ms, System: 7.0 ms]
  Range (min … max):   520.4 ms … 542.0 ms    10 runs

Benchmark 2: ./proselint-rs/target/release/proselint check --demo
  Time (mean ± σ):      77.0 ms ±   4.8 ms    [User: 73.9 ms, System: 1.9 ms]
  Range (min … max):    70.3 ms …  85.2 ms    10 runs

Benchmark 3: pdm run proselint check --demo
  Time (mean ± σ):     939.7 ms ±  15.3 ms    [User: 1174.2 ms, System: 232.7 ms]
  Range (min … max):   920.6 ms … 972.2 ms    10 runs

  Warning: Ignoring non-zero exit code.

Summary
  ./proselint-rs/target/release/proselint check --demo ran
    6.88 ± 0.43 times faster than ./proselint-rs/target/debug/proselint check --demo
   12.21 ± 0.78 times faster than pdm run proselint check --demo

Updated results with a parallel iterator via rayon:

Benchmark 1: ./proselint-rs/target/debug/proselint check --demo
  Time (mean ± σ):     128.8 ms ±  19.7 ms    [User: 725.1 ms, System: 68.3 ms]
  Range (min … max):   103.3 ms … 162.5 ms    10 runs

Benchmark 2: ./proselint-rs/target/release/proselint check --demo
  Time (mean ± σ):      38.5 ms ±   3.2 ms    [User: 76.1 ms, System: 33.7 ms]
  Range (min … max):    33.2 ms …  43.4 ms    10 runs

Benchmark 3: pdm run proselint check --demo
  Time (mean ± σ):     838.4 ms ±  17.4 ms    [User: 1006.9 ms, System: 185.3 ms]
  Range (min … max):   813.6 ms … 865.9 ms    10 runs

  Warning: Ignoring non-zero exit code.

Summary
  ./proselint-rs/target/release/proselint check --demo ran
    3.34 ± 0.58 times faster than ./proselint-rs/target/debug/proselint check --demo
   21.76 ± 1.89 times faster than pdm run proselint check --demo

All check specifications are registered at compile-time. Things that remain to be done include message templating, output formats, deciding whether I'd like CheckType to remain as an enum or become a trait to emulate the flexibility of Python's unions, implementation of a new ExistenceFancy check type, and general housekeeping.

This affects all check types except for ExistenceSimple, which is
currently the only check type requiring backtracking or lookarounds.
Applications of the unicode flag for checks were previously consistent,
and having it default to true makes no difference for any check within
proselint. Further, this behaviour is consistent with the default for
the Rust regex crate.

The only check that previously had dotall enabled did not use dot
matching, so it was redundant.

Removing these unnecessary features results in a cleaner codebase.
This data is held statically in check specifications immutably.
Therefore, there is no reason to convert it to a heap buffer when it is
passed around. Equally, this removes the cost of cloning the heap
buffers when the conversion from a CheckResult to a LintResult is made.
The previous emulation was inefficient and inconsistent. Now, the
signature of all check functions is the same, and a layer of indirection
is removed. In the future, this may allow for combining several
specifications into one check.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants