# AI Models

## Text Models

  1. GPT-3 and GPT-4 (OpenAI): Designed for natural language processing tasks such as text generation, translation, summarization, and more.
  2. Gemini (Google): Successor to LaMDA and PaLM 2, focused on multimodal capabilities and dialogue applications, generating conversational responses.
  3. Claude (Anthropic): Known for safety and alignment, designed to be helpful and harmless in generating text.
  4. PaLM (Google): Known for its extensive language understanding and generation capabilities; it powered Google products such as Bard before being succeeded by Gemini.
  5. LLaMA (Meta): A family of models released primarily for research, known for their efficiency and strong performance across a range of natural language processing tasks.
  6. BLOOM (BigScience): An open-source model developed by the BigScience workshop coordinated by Hugging Face, known for its multilingual capabilities.
  7. NeMo (NVIDIA): A framework for building, training, and fine-tuning conversational AI models, spanning natural language understanding and generation.
  8. XLM-RoBERTa (Meta): Designed for cross-lingual tasks, making it highly effective in multilingual contexts.
  9. Mistral (Mistral AI): A large language model known for its impressive performance on various benchmarks and its ability to generate high-quality text.
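At inference time, the autoregressive text models above all generate output one token at a time by sampling from a softmax distribution over the vocabulary. The sketch below is a toy illustration of that decoding step in plain Python; the vocabulary, logits, and helper names are invented for the example, not taken from any model's API:

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Convert raw logits to a probability distribution.
    Temperature < 1 sharpens the distribution; > 1 flattens it."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_next_token(vocab, logits, temperature=1.0, rng=None):
    """Sample one token from the softmax distribution (one decode step)."""
    rng = rng or random.Random(0)
    probs = softmax(logits, temperature)
    return rng.choices(vocab, weights=probs, k=1)[0]

# Toy vocabulary and logits, standing in for a real model's output head.
vocab = ["the", "cat", "sat"]
logits = [2.0, 1.0, 0.1]
probs = softmax(logits)
print(round(sum(probs), 6))  # 1.0 — a valid probability distribution
```

Real systems repeat this step in a loop, feeding each sampled token back into the model; greedy decoding, top-k, and nucleus sampling are all variations on how the next token is chosen from these probabilities.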

## Image Models

  1. DALL-E (OpenAI): Generates images from textual descriptions, creating unique and creative visuals.
  2. Firefly (Adobe): Integrates with Adobe's creative tools to generate images and text effects from prompts.
  3. Midjourney (Midjourney, Inc.): Specializes in creating detailed and aesthetically pleasing images from text prompts.
  4. SDXL-Lightning-4step (ByteDance): Generates high-quality 1024x1024 pixel images from text prompts in just 4 inference steps. Offers a balance between speed and image quality, making it suitable for real-time applications. Capable of producing a wide variety of images, from realistic scenes to creative compositions.
  5. SDXL-Lightning-2step (ByteDance): Creates high-quality images in only 2 inference steps, prioritizing even faster generation times. Suitable for applications requiring near-instantaneous image creation while maintaining good quality.
  6. SDXL-Lightning-8step (ByteDance): Provides the highest image quality among the SDXL-Lightning variants, using 8 inference steps. Ideal for applications where image fidelity is crucial and a slightly longer generation time is acceptable.
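The 2/4/8-step trade-off in the SDXL-Lightning variants mirrors a general principle: a diffusion sampler is a numerical integrator, and fewer steps mean larger discretization error per step (distillation is what lets Lightning retain quality anyway). The toy sketch below illustrates the step-count/accuracy trade-off on a simple ODE; it is an analogy only, not the actual distilled sampler:

```python
import math

def euler_decay(x0=1.0, steps=4):
    """Integrate dx/dt = -x from t=0 to t=1 with a fixed number of Euler steps.
    Stands in for a diffusion sampler taking `steps` denoising steps."""
    dt = 1.0 / steps
    x = x0
    for _ in range(steps):
        x += dt * (-x)
    return x

exact = math.exp(-1.0)  # the true solution at t=1
for steps in (2, 4, 8):
    err = abs(euler_decay(steps=steps) - exact)
    print(steps, round(err, 4))  # error shrinks as the step count grows
```

Running this shows the error dropping monotonically from 2 to 4 to 8 steps, the same direction of trade-off the Lightning variants expose as a user-facing choice.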

## Audio Processing Models

  1. Demucs (Facebook Research): Performs music source separation, isolating individual instruments (vocals, drums, bass, other) from mixed audio tracks. Uses a hybrid architecture combining waveform and spectrogram processing for high-quality separation.
  2. Demucs v4 (Facebook Research): Latest version of Demucs, offering improved separation quality and faster processing. Introduces new model variants optimized for specific separation tasks and computational requirements.
  3. MusicGen (Meta): Generates high-quality music samples from text descriptions or audio prompts. Capable of creating various music styles, handling unconditional generation, and producing both mono and stereo outputs.
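Spectrogram-domain separation of the kind Demucs's hybrid architecture uses ultimately relies on different sources occupying different frequency content. As a heavily simplified, self-contained illustration, the sketch below mixes two sinusoidal "sources" and recovers each one's amplitude by correlating against a reference frequency (effectively a single DFT bin); real separation models instead learn masks over full spectrograms:

```python
import math

N = 1000
sr = 1000.0            # samples per second (1 second of audio)
f_bass, f_vox = 5.0, 50.0  # stand-ins for two sources at different frequencies

# Mixture of two "sources" with amplitudes 0.8 and 0.3.
mix = [0.8 * math.sin(2 * math.pi * f_bass * n / sr)
       + 0.3 * math.sin(2 * math.pi * f_vox * n / sr)
       for n in range(N)]

def amplitude_at(signal, freq, sr):
    """Estimate the amplitude of a sinusoidal component by correlating
    the signal with cosine and sine references (one DFT bin)."""
    n_samples = len(signal)
    re = sum(s * math.cos(2 * math.pi * freq * n / sr) for n, s in enumerate(signal))
    im = sum(s * math.sin(2 * math.pi * freq * n / sr) for n, s in enumerate(signal))
    return 2.0 * math.hypot(re, im) / n_samples

print(round(amplitude_at(mix, f_bass, sr), 3))  # ≈ 0.8
print(round(amplitude_at(mix, f_vox, sr), 3))   # ≈ 0.3
```

Because the two frequencies fall on exact DFT bins over this window, the correlation cleanly isolates each component; real music sources overlap in time and frequency, which is why learned models like Demucs are needed.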

## Video Models

  1. OpenPose (Carnegie Mellon University): A real-time model for detecting human poses in videos, widely used for motion analysis, tracking, and generating animations based on detected actions.
  2. vid2vid (NVIDIA): A generative model that transforms input video frames into high-quality output video sequences, capable of generating videos from semantic maps or sketches.
  3. Deep Video Portraits (Max Planck Institute for Informatics / Stanford University): Enables realistic manipulation of portrait video by transferring head pose and facial expressions from a source actor to a target video.
  4. MoCoGAN (Snap Research / NVIDIA): A generative adversarial network that creates coherent video sequences by decomposing generation into separate content and motion components.

## Text-to-Speech Models

  1. WaveNet (Google): An AI model that generates human-like speech from text, used in text-to-speech applications.
  2. Tacotron and Tacotron 2 (Google): Sequence-to-sequence models designed for text-to-speech synthesis that convert text input into mel-spectrograms, which are then turned into audio waveforms.
  3. FastSpeech and FastSpeech 2: Non-autoregressive models for text-to-speech synthesis that generate mel-spectrograms from text input, allowing for faster and more robust audio generation.
  4. Deep Voice (Baidu): A family of neural network models for text-to-speech synthesis that replicates human speech synthesis techniques using modular architecture.
  5. Transformer TTS: Based on the Transformer architecture, this model captures long-range dependencies in text for improved naturalness in speech synthesis.
  6. AIVA (Artificial Intelligence Virtual Artist): While primarily known for music composition, AIVA also incorporates voice synthesis capabilities in its offerings.
  7. F5-TTS (Open Source): Generates highly natural and expressive speech from text input. Capable of zero-shot voice cloning, mimicking voices after hearing brief samples. Supports multilingual speech synthesis and code-switching.
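Tacotron, FastSpeech, and Transformer TTS all predict mel-spectrograms as an intermediate representation before a vocoder produces the waveform. The mel scale itself is a simple perceptual warping of frequency; the sketch below implements the standard HTK-style conversion formulas and shows how equal mel steps cover progressively wider Hz ranges:

```python
import math

def hz_to_mel(f):
    """HTK-style mel scale: a perceptually motivated warping of frequency."""
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

# Split 0-8000 Hz into 4 equal bands *in mel space*, then map back to Hz.
lo, hi = hz_to_mel(0.0), hz_to_mel(8000.0)
edges_mel = [lo + i * (hi - lo) / 4 for i in range(5)]
edges_hz = [round(mel_to_hz(m), 1) for m in edges_mel]
print(edges_hz)  # band widths in Hz grow toward high frequencies
```

The widening bands are why mel-spectrograms devote more resolution to low frequencies, matching human hearing; a mel filterbank built from edges like these is what turns a linear spectrogram into the mel-spectrogram these models predict.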

## Multimodal Models

  1. GPT-4 (OpenAI): Processes text and image inputs, with variants such as GPT-4o extending to audio, making it highly versatile.
  2. Gato (DeepMind): A generalist agent capable of performing multiple tasks across different modalities, including text, images, and control tasks.
  3. Gemini (Google): A powerful multimodal AI model that can understand and generate text, images, videos, and audio, excelling in complex tasks across various domains, including math, physics, and code generation.

## Enterprise-Focused Foundation Models

  1. IBM Granite (IBM): Processes and generates text for various business tasks, including summarization, question-answering, and content creation. Specializes in enterprise-relevant domains such as finance, legal, and technical documentation.
  2. Granite-20b-multilingual (IBM): Handles tasks in multiple languages including English, French, German, Portuguese, and Spanish, enabling cross-lingual business communications and content generation.
  3. Granite-34b-code-instruct (IBM): Focuses on code-related tasks, capable of generating, explaining, and translating code from natural language prompts. Supports multiple programming languages and software development workflows.

Note: this is only a partial overview. There is no guarantee that every model listed is still current, or that all relevant models are included (there are far too many to cover).