This repository contains sample notebooks that demonstrate how to evaluate an LLM-augmented system, along with tools and methods for running those evaluations locally.
- Ensure you've enabled Claude Sonnet and Claude Haiku in the Amazon Bedrock console
- Ensure you have adequate permissions to call Bedrock from the Python SDK (Boto3)

These notebooks were tested with Python 3.12; if you're running locally, make sure you're using 3.12. Also ensure that the AWS CLI is set up with the credentials you want on the default profile. These credentials need access to Amazon Bedrock models.
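Before starting the notebooks, you can sanity-check both pieces at once. A minimal sketch, assuming the default profile and an example Claude Haiku model ID (swap in any model you've actually enabled):

```python
# Sanity check: confirm the default profile resolves and can call Bedrock.
import boto3

session = boto3.Session()  # picks up the default AWS profile
print(session.client("sts").get_caller_identity()["Arn"])  # who am I?

bedrock = session.client("bedrock-runtime")
response = bedrock.converse(
    modelId="anthropic.claude-3-haiku-20240307-v1:0",  # example ID -- use a model you've enabled
    messages=[{"role": "user", "content": [{"text": "ping"}]}],
)
print(response["output"]["message"]["content"][0]["text"])
```

If both prints succeed, your credentials and Bedrock model access are in order.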
```
LLM-System-Validation/
├── data/               # RAG context and validation datasets
├── example-notebooks/  # Notebooks for evaluating various components
├── script/             # Various scripts for setting up the environment
└── .github/            # Example GitHub Actions
```
`data/`
: Contains the datasets used for Retrieval-Augmented Generation (RAG) context and validation.

`example-notebooks/`
: Jupyter notebooks demonstrating the evaluation of:
  - Embeddings and Chunking Strategy
  - Reranking with large chunk sizes
  - LLM-As-A-Judge Prompt Engineering (see the sketch after this list)
  - RAG Prompt Engineering
  - E2E RAG Testing
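To give a flavor of the LLM-as-a-judge pattern covered in the notebooks, here is a minimal sketch. The prompt, scoring scale, and model ID are illustrative assumptions, not the notebooks' actual code:

```python
# Illustrative LLM-as-a-judge call; the notebooks use their own prompts and rubrics.
import boto3

def judge_answer(question: str, answer: str) -> str:
    """Ask a Bedrock model to grade an answer's factual accuracy on a 1-5 scale."""
    prompt = (
        "Rate the following answer to the question for factual accuracy "
        "on a 1-5 scale. Reply with only the number.\n\n"
        f"Question: {question}\nAnswer: {answer}"
    )
    client = boto3.client("bedrock-runtime")
    response = client.converse(
        modelId="anthropic.claude-3-sonnet-20240229-v1:0",  # example model ID
        messages=[{"role": "user", "content": [{"text": prompt}]}],
    )
    return response["output"]["message"]["content"][0]["text"]
```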
- Clone the repository:

  ```bash
  git clone git@github.com:aws-samples/genai-system-evaluation.git
  cd genai-system-evaluation
  ```

- Set up a virtual environment:

  ```bash
  python -m venv venv
  source venv/bin/activate  # On Windows, use `venv\Scripts\activate`
  ```

- Install the required dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Download the OpenSearch docs for RAG context (a quick peek at the result is sketched after this list):

  ```bash
  cd data && mkdir opensearch-docs && cd opensearch-docs
  git clone https://github.com/opensearch-project/documentation-website.git
  ```

- Go to the example notebooks and start Jupyter:

  ```bash
  cd ../../example-notebooks
  jupyter notebook
  ```

- Start at notebook 1 and work your way through them!
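For a quick peek at what the docs clone gives you before opening the notebooks, here is a minimal sketch; the path and fixed chunk size are assumptions, and the notebooks implement their own chunking strategies:

```python
# Peek at the cloned OpenSearch docs; the notebooks do their own chunking.
from pathlib import Path

# Path follows the clone step above; adjust if you cloned elsewhere.
DOCS_DIR = Path("data/opensearch-docs/documentation-website")

def naive_chunks(text: str, size: int = 1000) -> list[str]:
    """Split text into fixed-size character chunks -- the simplest baseline strategy."""
    return [text[i : i + size] for i in range(0, len(text), size)]

md_files = list(DOCS_DIR.rglob("*.md"))
print(f"{len(md_files)} markdown files found")
if md_files:
    sample = md_files[0].read_text(encoding="utf-8", errors="ignore")
    print(f"first file splits into {len(naive_chunks(sample))} chunks of ~1000 chars")
```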
- Explore the example notebooks in the `example-notebooks/` directory to understand different evaluation techniques.
See CONTRIBUTING for more information.
This library is licensed under the MIT-0 License. See the LICENSE file.