- GPT-3 and GPT-4 (OpenAI): Designed for natural language processing tasks such as text generation, translation, summarization, and more.
- Gemini (Google): Successor to LaMDA and PaLM 2. Generates conversational responses for dialogue applications, with a focus on multimodal capabilities.
- Claude (Anthropic): Known for safety and alignment, designed to be helpful and harmless in generating text.
- PaLM (Google): Used in various Google applications, including the Bard chatbot before it was rebranded as Gemini; known for its extensive language understanding and generation capabilities.
- LLaMA (Meta): A family of models designed for research purposes, known for its efficiency and performance in various natural language processing tasks.
- BLOOM (BigScience, coordinated by Hugging Face): An open-source model known for its multilingual capabilities.
- NeMo (NVIDIA): A framework for building and customizing generative AI models, used for various applications, including natural language understanding and generation.
- XLM-RoBERTa (Facebook AI, distributed via Hugging Face): Designed for cross-lingual tasks, making it highly effective in multilingual contexts.
- Mistral (Mistral AI): A family of large language models, several released with open weights, known for strong benchmark performance relative to their size and for generating high-quality text.
- DALL-E (OpenAI): Generates images from textual descriptions, creating unique and creative visuals.
- Firefly (Adobe): Integrates with Adobe's creative tools to generate images and text effects from prompts.
- Midjourney (Midjourney, Inc.): Specializes in creating detailed and aesthetically pleasing images from text prompts.
- SDXL-Lightning-4step (ByteDance): Generates high-quality 1024x1024 pixel images from text prompts in just 4 inference steps. Offers a balance between speed and image quality, making it suitable for real-time applications. Capable of producing a wide variety of images, from realistic scenes to creative compositions.
- SDXL-Lightning-2step (ByteDance): Creates high-quality images in only 2 inference steps, prioritizing even faster generation times. Suitable for applications requiring near-instantaneous image creation while maintaining good quality.
- SDXL-Lightning-8step (ByteDance): Provides the highest image quality among the SDXL-Lightning variants, using 8 inference steps. Ideal for applications where image fidelity is crucial and a slightly longer generation time is acceptable.
- Demucs (Facebook Research): Performs music source separation, isolating individual instruments (vocals, drums, bass, other) from mixed audio tracks. Uses a hybrid architecture combining waveform and spectrogram processing for high-quality separation.
- Demucs v4 (Facebook Research): Latest version of Demucs, offering improved separation quality and faster processing. Introduces new model variants optimized for specific separation tasks and computational requirements.
- MusicGen (Meta): Generates high-quality music samples from text descriptions or audio prompts. Capable of creating various music styles, handling unconditional generation, and producing both mono and stereo outputs.
- OpenPose (Carnegie Mellon University): A real-time model for detecting human poses in videos, widely used for motion analysis, tracking, and generating animations based on detected actions.
- vid2vid (NVIDIA): A generative model that transforms input video frames into high-quality output video sequences, capable of generating videos from semantic maps or sketches.
- Deep Video Portraits (Max Planck Institute for Informatics and collaborators): Enables realistic re-animation of portrait videos by transferring head pose, facial expressions, and eye gaze from a source actor to a target video.
- MoCoGAN (Snap Research and NVIDIA): A generative adversarial network that creates coherent video sequences by modeling motion (temporal dynamics) separately from content (visual appearance).
- WaveNet (Google DeepMind): A generative model of raw audio waveforms that produces human-like speech, used in text-to-speech applications.
- Tacotron and Tacotron 2: Sequence-to-sequence models designed for text-to-speech synthesis that convert text input into mel-spectrograms, which are then turned into audio waveforms.
- FastSpeech and FastSpeech 2: Non-autoregressive models for text-to-speech synthesis that generate mel-spectrograms from text input, allowing for faster and more robust audio generation.
- Deep Voice (Baidu): A family of neural network models for text-to-speech synthesis that replaces each stage of a traditional text-to-speech pipeline with a neural network, using a modular architecture.
- Transformer TTS: Based on the Transformer architecture, this model captures long-range dependencies in text for improved naturalness in speech synthesis.
- AIVA (Artificial Intelligence Virtual Artist): While primarily known for music composition, AIVA also incorporates voice synthesis capabilities in its offerings.
- F5-TTS (Open Source): Generates highly natural and expressive speech from text input. Capable of zero-shot voice cloning, mimicking voices after hearing brief samples. Supports multilingual speech synthesis and code-switching.
- GPT-4o (OpenAI, available in ChatGPT): Accepts text, image, audio, and video input and responds with text, audio, and images, making it highly versatile.
- Gato (DeepMind): A generalist agent capable of performing multiple tasks across different modalities, including text, images, and control tasks.
- Gemini (Google): A powerful multimodal AI model that can understand and generate text, images, videos, and audio, excelling in complex tasks across various domains, including math, physics, and code generation.
- IBM Granite (IBM): Processes and generates text for various business tasks, including summarization, question-answering, and content creation. Specializes in enterprise-relevant domains such as finance, legal, and technical documentation.
- Granite-20b-multilingual (IBM): Handles tasks in multiple languages including English, French, German, Portuguese, and Spanish, enabling cross-lingual business communications and content generation.
- Granite-34b-code-instruct (IBM): Focuses on code-related tasks, capable of generating, explaining, and translating code from natural language prompts. Supports multiple programming languages and software development workflows.
Note: This list is a partial overview only. Some of these models may no longer be available, and many others are not included, as there are too many to list.