There are two ways to set up the environment. The first simply installs all dependencies to get you up and running. The second installs the transformers fork required by the project in editable mode; this setup is recommended if you want to make changes to the transformers library while working on implementation details.
- Clone this repo
- Build and activate the environment:

  ```shell
  cd semantic_decoding
  # build env
  conda env create -f env/environment.yml
  # conda env create -f env/environment-gpu.yml # use instead for gpu support
  conda activate sem
  ```
- Clone this repo
- Clone the hf fork to a sibling directory

  ```shell
  # the repos should be in the same directory for the yml install to work; otherwise adapt the path in the yml file
  ls
  # my_folder/
  # semantic_decoding/     # this repo
  # transformers/          # the hf fork
  ```

- Comment out the remote source of transformers in the `environment*.yml` file and point to the local directory instead:
  ```diff
  name: sem
  channels:
    - ...
  dependencies:
    - ...
    - pip
    - pip:
  -    - git+https://github.com/philheller/transformers.git
  +    # - git+https://github.com/philheller/transformers.git
  -    # - -e ../transformers
  +    - -e ../transformers
  ```
- Install all dependencies (currently only `environment.yml` & `environment-gpu.yml` are up to date):

  ```shell
  # from the root of this repo
  conda env create -f env/environment.yml
  # conda env create -f env/environment-gpu.yml # use instead for gpu support
  # make sure the pip dependencies in the yml file have properly been installed
  ```
For usage, activate the environment and see Usage.

```shell
conda activate sem
```
Semantic decoding is provided through the `Generator` class. Here is a simple usage example:
```python
# generator
from semantic_decoding.generators.generator import Generator
# generation config for syntactic and semantic level
from transformers.generation.utils import GenerationConfig
from semantic_decoding.generators.semantic import SemanticGenerationConfig

# load the generator
generator = Generator(
    model_name,
    "en_core_web_sm",
    device,
    unique_key=args.aggregation_key
)

# generation configs
# syntactic
syntactic_generation_config: GenerationConfig = GenerationConfig(
    max_new_tokens=4,
    num_beams=200,
    num_return_sequences=200,
    access_token=access_token,
    # ...
)
# semantic
semantic_generation_config: SemanticGenerationConfig = SemanticGenerationConfig(
    num_beams=2,
    num_return_sequences=2,
    max_overall_tokens=1000,
    max_overall_generated_tokens=1000,
    nest_beam_search=True,
)

# generate
res = generator.generate(
    prompts=["Obama was born in"],
    syntactic_generation_config=syntactic_generation_config,
    semantic_generation_config=semantic_generation_config,
)
```

Note that `model_name`, `device`, `access_token`, and `args.aggregation_key` are placeholders that must be defined by the caller.
Syntactic models are the HF models. Available models for semantic generation can be viewed and implemented in `semantic_model.py`. Currently, some NER models and some spaCy models are supported. Adding a new one requires implementing the `SemanticDataModel` class and registering it in the `SemanticModelFactory`. The already implemented models serve as examples.
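As a rough sketch of that pattern, the following uses illustrative stand-ins: the base class, method names, and factory API below are assumptions for demonstration, not the real interface in `semantic_model.py`.

```python
from abc import ABC, abstractmethod

# Illustrative stand-ins only -- the real base class and factory live in
# semantic_model.py and their interfaces may differ.
class SemanticDataModel(ABC):
    @abstractmethod
    def extract(self, text: str) -> list:
        """Return the semantic units (e.g. entities) found in `text`."""

class SemanticModelFactory:
    _registry = {}

    @classmethod
    def register(cls, name: str, model_cls: type) -> None:
        cls._registry[name] = model_cls

    @classmethod
    def create(cls, name: str) -> "SemanticDataModel":
        return cls._registry[name]()

# A toy model: treat capitalized words as "entities"
class ToyNERModel(SemanticDataModel):
    def extract(self, text: str) -> list:
        return [w for w in text.split() if w[:1].isupper()]

SemanticModelFactory.register("toy_ner", ToyNERModel)

model = SemanticModelFactory.create("toy_ner")
print(model.extract("Obama was born in Hawaii"))  # ['Obama', 'Hawaii']
```

The real implementations in the repo follow the actual abstract interface; consult them before adding a model.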
The `Generator.generate` function is kept structurally analogous to the `transformers` library. Currently, these decoding modes are supported:
- Greedy decoding
- Beam Search decoding
- Nested Beam Search decoding
The appropriate mode is selected based on the semantic and syntactic generation configs. For more details, see the `SemanticGenerationConfig`.
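How that dispatch could look, as an assumption inferred from the config fields shown earlier (the actual selection happens inside `Generator.generate` and may use different criteria):

```python
# Hypothetical dispatch -- illustrates how a decoding mode could follow
# from the semantic config; not the project's actual logic.
def select_mode(semantic_num_beams: int, nest_beam_search: bool) -> str:
    if semantic_num_beams <= 1:
        return "greedy"            # single hypothesis: greedy decoding
    if nest_beam_search:
        return "nested_beam_search"  # beams on both levels
    return "beam_search"           # semantic-level beam search only

print(select_mode(1, False))  # greedy
print(select_mode(2, True))   # nested_beam_search
```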
Central to generation is the `Generator` class, which orchestrates the process. Two helper classes structure the code further, one each for the syntactic and the semantic side:
- the `SyntacticGenerator`
- the `SemanticGenerator`

The `SyntacticGenerator` contains the functions for manipulating syntactic hypotheses; the `SemanticGenerator` contains the functions for manipulating semantic hypotheses.
Both classes also contain the models and tokenizers used:

```python
# to decode syntactic tokens
syntactic_generator.tokenizer.batch_decode(syntactic_output)
# to decode semantic tokens
semantic_generator.tokenizer.batch_decode(semantic_output)
```
- Batching and scores: Scores do not resolve to the same values depending on batching and masking. This can change the results of beam search (more on that in the tests regarding differences in scores). To make a result reproducible (and thus easily accessible), avoid batching; unbatched computations can be reproduced.
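The underlying cause is ordinary floating-point behavior: padding and masking change the order and operands of the reductions behind the scores, and float addition is not associative, so batched and unbatched log-probabilities can drift apart by tiny amounts that beam search may amplify. A minimal plain-Python illustration (unrelated to the project's code):

```python
# Floating-point addition is not associative: summing the same numbers
# in a different order (as batching/masking effectively does) can give
# different results.
xs = [1e16, 1.0, -1e16]
left_to_right = (xs[0] + xs[1]) + xs[2]  # 1.0 is absorbed by 1e16 -> 0.0
reordered = (xs[0] + xs[2]) + xs[1]      # large terms cancel first -> 1.0
print(left_to_right, reordered)  # 0.0 1.0
```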