Open Source LLM toolkit to build trustworthy LLM applications. TigerArmor (AI safety), TigerRAG (embedding, RAG), TigerTune (fine-tuning)
Updated Dec 2, 2023 - Jupyter Notebook
[NeurIPS 2024 Oral] Aligner: Efficient Alignment by Learning to Correct
[CoRL'23] Adversarial Training for Safe End-to-End Driving
Website to track people, organizations, and products (tools, websites, etc.) in AI safety
Materials for the course Principles of AI: LLMs at UPenn (Stat 9911, Spring 2025). LLM architectures, training paradigms (pre- and post-training, alignment), test-time computation, reasoning, safety and robustness (jailbreaking, oversight, uncertainty), representations, interpretability (circuits), etc.
Can Large Language Models Solve Security Challenges? We test LLMs' ability to interact with and break out of shell environments using the OverTheWire wargames, showing the models' surprising ability to carry out action-oriented cyber exploits in shell environments.
The go-to API for detecting and preventing prompt injection attacks.
A benchmark for evaluating hallucinations in large visual language models
Common repository for our readings and discussions
Safe Option Critic: Learning Safe Options in the A2OC Architecture
[NeurIPS 2024] SACPO (Stepwise Alignment for Constrained Policy Optimization)
A Python library for peer-to-peer communication over the Yggdrasil network
Explore techniques to use small models as jailbreaking judges
An organized repository of essential machine learning resources, including tutorials, papers, books, and tools, each with corresponding links for easy access.
Fine-tuning of Mistral Nemo 13B on the WildJailbreak dataset to produce a red-teaming model
This repository is dedicated to enhancing my skills in AI, specifically focusing on PyTorch and various technical aspects of artificial intelligence. It is designed to document my progress as I work through the comprehensive course provided by ARENA.
This project facilitates structured debates between two Language Model (LLM) instances on a given topic. It organises the debate into distinct phases: opening arguments, multiple rounds of rebuttals, and concluding statements.
This repository contains the code, data, and analysis used in the study "Religious-Based Manipulation and AI Alignment Risks," which explores the risks of large language models (LLMs) generating religious content that can encourage discriminatory or violent behavior.