CVPR2024 | A Roundup of AIGC Papers (Image Generation, Video Generation, 3D Generation, etc.) with Paper Links, Open-Source Code, and Analyses [Continuously Updated]
A roundup of this year's CVPR AIGC-related papers and code.
CVPR2024 | AIGC Paper Roundup (if you find it helpful, please upvote and bookmark)
- Awesome-CVPR2024-AIGC
- 1. Image Generation / Image Synthesis
- Accelerating Diffusion Sampling with Optimized Time Steps
- Adversarial Text to Continuous Image Generation
- Amodal Completion via Progressive Mixed Context Diffusion
- Arbitrary-Scale Image Generation and Upsampling using Latent Diffusion Model and Implicit Neural Decoder
- Atlantis: Enabling Underwater Depth Estimation with Stable Diffusion
- Attention Calibration for Disentangled Text-to-Image Personalization
- Attention-Driven Training-Free Efficiency Enhancement of Diffusion Models
- CapHuman: Capture Your Moments in Parallel Universes
- CHAIN: Enhancing Generalization in Data-Efficient GANs via lipsCHitz continuity constrAIned Normalization
- Check, Locate, Rectify: A Training-Free Layout Calibration System for Text-to-Image Generation
- Coarse-to-Fine Latent Diffusion for Pose-Guided Person Image Synthesis
- CoDi: Conditional Diffusion Distillation for Higher-Fidelity and Faster Image Generation
- Condition-Aware Neural Network for Controlled Image Generation
- CosmicMan: A Text-to-Image Foundation Model for Humans
- Countering Personalized Text-to-Image Generation with Influence Watermarks
- Cross Initialization for Face Personalization of Text-to-Image Models
- Customization Assistant for Text-to-image Generation
- DeepCache: Accelerating Diffusion Models for Free
- DemoFusion: Democratising High-Resolution Image Generation With No $$$
- Desigen: A Pipeline for Controllable Design Template Generation
- DiffAgent: Fast and Accurate Text-to-Image API Selection with Large Language Model
- Diffusion-driven GAN Inversion for Multi-Modal Face Image Generation
- DistriFusion: Distributed Parallel Inference for High-Resolution Diffusion Models
- Diversity-aware Channel Pruning for StyleGAN Compression
- Discriminative Probing and Tuning for Text-to-Image Generation
- Don’t drop your samples! Coherence-aware training benefits Conditional diffusion
- Drag Your Noise: Interactive Point-based Editing via Diffusion Semantic Propagation
- DreamMatcher: Appearance Matching Self-Attention for Semantically-Consistent Text-to-Image Personalization
- Dynamic Prompt Optimizing for Text-to-Image Generation
- ECLIPSE: A Resource-Efficient Text-to-Image Prior for Image Generations
- Efficient Dataset Distillation via Minimax Diffusion
- ElasticDiffusion: Training-free Arbitrary Size Image Generation
- EmoGen: Emotional Image Content Generation with Text-to-Image Diffusion Models
- Enabling Multi-Concept Fusion in Text-to-Image Models
- Exact Fusion via Feature Distribution Matching for Few-shot Image Generation
- FaceChain-SuDe: Building Derived Class to Inherit Category Attributes for One-shot Subject-Driven Generation
- Fast ODE-based Sampling for Diffusion Models in Around 5 Steps
- FreeControl: Training-Free Spatial Control of Any Text-to-Image Diffusion Model with Any Condition
- FreeCustom: Tuning-Free Customized Image Generation for Multi-Concept Composition
- Generalizable Tumor Synthesis
- Generating Daylight-driven Architectural Design via Diffusion Models
- Generative Unlearning for Any Identity
- HanDiffuser: Text-to-Image Generation With Realistic Hand Appearances
- High-fidelity Person-centric Subject-to-Image Synthesis
- InitNO: Boosting Text-to-Image Diffusion Models via Initial Noise Optimization
- InstantBooth: Personalized Text-to-Image Generation without Test-Time Finetuning
- InstanceDiffusion: Instance-level Control for Image Generation
- Instruct-Imagen: Image Generation with Multi-modal Instruction
- Intelligent Grimm - Open-ended Visual Storytelling via Latent Diffusion Models
- InteractDiffusion: Interaction-Control for Text-to-Image Diffusion Model
- Intriguing Properties of Diffusion Models: An Empirical Study of the Natural Attack Capability in Text-to-Image Generative Models
- Inversion-Free Image Editing with Natural Language
- JeDi: Joint-Image Diffusion Models for Finetuning-Free Personalized Text-to-Image Generation
- LAKE-RED: Camouflaged Images Generation by Latent Background Knowledge Retrieval-Augmented Diffusion
- Learned representation-guided diffusion models for large-image generation
- Learning Continuous 3D Words for Text-to-Image Generation
- Learning Disentangled Identifiers for Action-Customized Text-to-Image Generation
- Learning Multi-dimensional Human Preference for Text-to-Image Generation
- LeftRefill: Filling Right Canvas based on Left Reference through Generalized Text-to-Image Diffusion Model
- MACE: Mass Concept Erasure in Diffusion Models
- MarkovGen: Structured Prediction for Efficient Text-to-Image Generation
- MedM2G: Unifying Medical Multi-Modal Generation via Cross-Guided Diffusion with Visual Invariant
- MIGC: Multi-Instance Generation Controller for Text-to-Image Synthesis
- MindBridge: A Cross-Subject Brain Decoding Framework
- MULAN: A Multi Layer Annotated Dataset for Controllable Text-to-Image Generation
- On the Scalability of Diffusion-based Text-to-Image Generation
- OpenBias: Open-set Bias Detection in Text-to-Image Generative Models
- Personalized Residuals for Concept-Driven Text-to-Image Generation
- Perturbing Attention Gives You More Bang for the Buck: Subtle Imaging Perturbations That Efficiently Fool Customized Diffusion Models
- PhotoMaker: Customizing Realistic Human Photos via Stacked ID Embedding
- PLACE: Adaptive Layout-Semantic Fusion for Semantic Image Synthesis
- Plug-and-Play Diffusion Distillation
- Prompt-Free Diffusion: Taking "Text" out of Text-to-Image Diffusion Models
- Ranni: Taming Text-to-Image Diffusion for Accurate Instruction Following
- Readout Guidance: Learning Control from Diffusion Features
- Relation Rectification in Diffusion Model
- Residual Denoising Diffusion Models
- Rethinking FID: Towards a Better Evaluation Metric for Image Generation
- Rethinking the Spatial Inconsistency in Classifier-Free Diffusion Guidance
- Retrieval-Augmented Layout Transformer for Content-Aware Layout Generation
- Rich Human Feedback for Text-to-Image Generation
- SCoFT: Self-Contrastive Fine-Tuning for Equitable Image Generation
- Self-correcting LLM-controlled Diffusion Models
- Self-Discovering Interpretable Diffusion Latent Directions for Responsible Text-to-Image Generation
- Shadow Generation for Composite Image Using Diffusion Model
- Smooth Diffusion: Crafting Smooth Latent Spaces in Diffusion Models
- SSR-Encoder: Encoding Selective Subject Representation for Subject-Driven Generation
- StableVITON: Learning Semantic Correspondence with Latent Diffusion Model for Virtual Try-On
- Structure-Guided Adversarial Training of Diffusion Models
- Style Aligned Image Generation via Shared Attention
- SVGDreamer: Text Guided SVG Generation with Diffusion Model
- SwiftBrush: One-Step Text-to-Image Diffusion Model with Variational Score Distillation
- Tailored Visions: Enhancing Text-to-Image Generation with Personalized Prompt Rewriting
- Tackling the Singularities at the Endpoints of Time Intervals in Diffusion Models
- Taming Stable Diffusion for Text to 360° Panorama Image Generation
- TextCraftor: Your Text Encoder Can be Image Quality Controller
- Text-Guided Variational Image Generation for Industrial Anomaly Detection and Segmentation
- TFMQ-DM: Temporal Feature Maintenance Quantization for Diffusion Models
- TokenCompose: Grounding Diffusion with Token-level Supervision
- Towards Accurate Post-training Quantization for Diffusion Models
- Towards Effective Usage of Human-Centric Priors in Diffusion Models for Text-based Human Image Generation
- Towards Memorization-Free Diffusion Models
- Training Diffusion Models Towards Diverse Image Generation with Reinforcement Learning
- UFOGen: You Forward Once Large Scale Text-to-Image Generation via Diffusion GANs
- UniGS: Unified Representation for Image Generation and Segmentation
- Using Human Feedback to Fine-tune Diffusion Models without Any Reward Model
- U-VAP: User-specified Visual Appearance Personalization via Decoupled Self Augmentation
- ViewDiff: 3D-Consistent Image Generation with Text-To-Image Models
- When StyleGAN Meets Stable Diffusion: a 𝒲+ Adapter for Personalized Image Generation
- X-Adapter: Adding Universal Compatibility of Plugins for Upgraded Diffusion Model
- 2. Image Editing
- An Edit Friendly DDPM Noise Space: Inversion and Manipulations
- Choose What You Need: Disentangled Representation Learning for Scene Text Recognition, Removal and Editing
- Content-Style Decoupling for Unsupervised Makeup Transfer without Generating Pseudo Ground Truth
- Contrastive Denoising Score for Text-guided Latent Diffusion Image Editing
- DEADiff: An Efficient Stylization Diffusion Model with Disentangled Representations
- Deformable One-shot Face Stylization via DINO Semantic Guidance
- DemoCaricature: Democratising Caricature Generation with a Rough Sketch
- DiffAM: Diffusion-based Adversarial Makeup Transfer for Facial Privacy Protection
- DiffMorpher: Unleashing the Capability of Diffusion Models for Image Morphing
- Diffusion Handles: Enabling 3D Edits for Diffusion Models by Lifting Activations to 3D
- DiffusionLight: Light Probes for Free by Painting a Chrome Ball
- Diffusion Models Without Attention
- Doubly Abductive Counterfactual Inference for Text-based Image Editing
- Edit One for All: Interactive Batch Image Editing
- Face2Diffusion for Fast and Editable Face Personalization
- Focus on Your Instruction: Fine-grained and Multi-instruction Image Editing by Attention Modulation
- FreeDrag: Feature Dragging for Reliable Point-based Image Editing
- Holo-Relighting: Controllable Volumetric Portrait Relighting from a Single Image
- Image Sculpting: Precise Object Editing with 3D Geometry Control
- Inversion-Free Image Editing with Natural Language
- PAIR-Diffusion: Object-Level Image Editing with Structure-and-Appearance Paired Diffusion Models
- Person in Place: Generating Associative Skeleton-Guidance Maps for Human-Object Interaction Image Editing
- Puff-Net: Efficient Style Transfer with Pure Content and Style Feature Fusion Network
- PIA: Your Personalized Image Animator via Plug-and-Play Modules in Text-to-Image Models
- RealCustom: Narrowing Real Text Word for Real-Time Open-Domain Text-to-Image Customization
- SmartEdit: Exploring Complex Instruction-based Image Editing with Multimodal Large Language Models
- Style Injection in Diffusion: A Training-free Approach for Adapting Large-scale Diffusion Models for Style Transfer
- SwitchLight: Co-design of Physics-driven Architecture and Pre-training Framework for Human Portrait Relighting
- Text-Driven Image Editing via Learnable Regions
- Texture-Preserving Diffusion Models for High-Fidelity Virtual Try-On
- TiNO-Edit: Timestep and Noise Optimization for Robust Diffusion-Based Image Editing
- UniHuman: A Unified Model For Editing Human Images in the Wild
- ZONE: Zero-Shot Instruction-Guided Local Editing
- 3. Video Generation / Video Synthesis
- 360DVD: Controllable Panorama Video Generation with 360-Degree Video Diffusion Model
- A Recipe for Scaling up Text-to-Video Generation with Text-free Videos
- BIVDiff: A Training-Free Framework for General-Purpose Video Synthesis via Bridging Image and Video Diffusion Models
- ConvoFusion: Multi-Modal Conversational Diffusion for Co-Speech Gesture Synthesis
- Co-Speech Gesture Video Generation via Motion-Decoupled Diffusion Model
- DiffPerformer: Iterative Learning of Consistent Latent Guidance for Diffusion-based Human Video Generation
- DisCo: Disentangled Control for Realistic Human Dance Generation
- FaceChain-ImagineID: Freely Crafting High-Fidelity Diverse Talking Faces from Disentangled Audio
- Faces that Speak: Jointly Synthesising Talking Face and Speech from Text
- FlowVid: Taming Imperfect Optical Flows for Consistent Video-to-Video Synthesis
- Generative Rendering: Controllable 4D-Guided Video Generation with 2D Diffusion Models
- GenTron: Diffusion Transformers for Image and Video Generation
- Grid Diffusion Models for Text-to-Video Generation
- Hierarchical Patch Diffusion Models for High-Resolution Video Generation
- Hierarchical Spatio-temporal Decoupling for Text-to-Video Generation
- LAMP: Learn A Motion Pattern for Few-Shot Video Generation
- Learning Dynamic Tetrahedra for High-Quality Talking Head Synthesis
- Lodge: A Coarse to Fine Diffusion Network for Long Dance Generation guided by the Characteristic Dance Primitives
- MagicAnimate: Temporally Consistent Human Image Animation using Diffusion Model
- Make-Your-Anchor: A Diffusion-based 2D Avatar Generation Framework
- Make Your Dream A Vlog
- Make Pixels Dance: High-Dynamic Video Generation
- MicroCinema: A Divide-and-Conquer Approach for Text-to-Video Generation
- Panacea: Panoramic and Controllable Video Generation for Autonomous Driving
- PEEKABOO: Interactive Video Generation via Masked-Diffusion
- Seeing and Hearing: Open-domain Visual-Audio Generation with Diffusion Latent Aligners
- SimDA: Simple Diffusion Adapter for Efficient Video Generation
- StyleCineGAN: Landscape Cinemagraph Generation using a Pre-trained StyleGAN
- SyncTalk: The Devil is in the Synchronization for Talking Head Synthesis
- TI2V-Zero: Zero-Shot Image Conditioning for Text-to-Video Diffusion Models
- Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation
- VideoBooth: Diffusion-based Video Generation with Image Prompts
- VideoCrafter2: Overcoming Data Limitations for High-Quality Video Diffusion Models
- Video-P2P: Video Editing with Cross-attention Control
- 4. Video Editing
- A Video is Worth 256 Bases: Spatial-Temporal Expectation-Maximization Inversion for Zero-Shot Video Editing
- CAMEL: CAusal Motion Enhancement Tailored for Lifting Text-driven Video Editing
- CCEdit: Creative and Controllable Video Editing via Diffusion Models
- CoDeF: Content Deformation Fields for Temporally Consistent Video Processing
- FRESCO: Spatial-Temporal Correspondence for Zero-Shot Video Translation
- RAVE: Randomized Noise Shuffling for Fast and Consistent Video Editing with Diffusion Models
- VidToMe: Video Token Merging for Zero-Shot Video Editing
- VMC: Video Motion Customization using Temporal Attention Adaption for Text-to-Video Diffusion Models
- 5. 3D Generation / 3D Synthesis
- 4D Gaussian Splatting for Real-Time Dynamic Scene Rendering
- Animatable Gaussians: Learning Pose-dependent Gaussian Maps for High-fidelity Human Avatar Modeling
- A Unified Approach for Text- and Image-guided 4D Scene Generation
- BEHAVIOR Vision Suite: Customizable Dataset Generation via Simulation
- BerfScene: Bev-conditioned Equivariant Radiance Fields for Infinite 3D Scene Generation
- CAD: Photorealistic 3D Generation via Adversarial Distillation
- CAGE: Controllable Articulation GEneration
- CityDreamer: Compositional Generative Model of Unbounded 3D Cities
- Consistent3D: Towards Consistent High-Fidelity Text-to-3D Generation with Deterministic Sampling Prior
- ConTex-Human: Free-View Rendering of Human from a Single Image with Texture-Consistent Synthesis
- ControlRoom3D: Room Generation using Semantic Proxy Rooms
- DanceCamera3D: 3D Camera Movement Synthesis with Music and Dance
- DiffPortrait3D: Controllable Diffusion for Zero-Shot Portrait View Synthesis
- DiffSHEG: A Diffusion-Based Approach for Real-Time Speech-driven Holistic 3D Expression and Gesture Generation
- DiffuScene: Denoising Diffusion Models for Generative Indoor Scene Synthesis
- Diffusion 3D Features (Diff3F): Decorating Untextured Shapes with Distilled Semantic Features
- Diffusion Time-step Curriculum for One Image to 3D Generation
- DreamAvatar: Text-and-Shape Guided 3D Human Avatar Generation via Diffusion Models
- DreamComposer: Controllable 3D Object Generation via Multi-View Conditions
- DreamControl: Control-Based Text-to-3D Generation with 3D Self-Prior
- Emotional Speech-driven 3D Body Animation via Disentangled Latent Diffusion
- EscherNet: A Generative Model for Scalable View Synthesis
- GaussianDreamer: Fast Generation from Text to 3D Gaussians by Bridging 2D and 3D Diffusion Models
- GPT-4V(ision) is a Human-Aligned Evaluator for Text-to-3D Generation
- Gaussian Shell Maps for Efficient 3D Human Generation
- HarmonyView: Harmonizing Consistency and Diversity in One-Image-to-3D
- HIG: Hierarchical Interlacement Graph Approach to Scene Graph Generation in Video Understanding
- Holodeck: Language Guided Generation of 3D Embodied AI Environments
- HumanNorm: Learning Normal Diffusion Model for High-quality and Realistic 3D Human Generation
- Interactive3D: Create What You Want by Interactive 3D Generation
- InterHandGen: Two-Hand Interaction Generation via Cascaded Reverse Diffusion
- Intrinsic Image Diffusion for Single-view Material Estimation
- Make-It-Vivid: Dressing Your Animatable Biped Cartoon Characters from Text
- MoMask: Generative Masked Modeling of 3D Human Motions
- Editable Scene Simulation for Autonomous Driving via LLM-Agent Collaboration
- EpiDiff: Enhancing Multi-View Synthesis via Localized Epipolar-Constrained Diffusion
- OED: Towards One-stage End-to-End Dynamic Scene Graph Generation
- One-2-3-45++: Fast Single Image to 3D Objects with Consistent Multi-View Generation and 3D Diffusion
- Paint-it: Text-to-Texture Synthesis via Deep Convolutional Texture Map Optimization and Physically-Based Rendering
- PEGASUS: Personalized Generative 3D Avatars with Composable Attributes
- PhysGaussian: Physics-Integrated 3D Gaussians for Generative Dynamics
- RichDreamer: A Generalizable Normal-Depth Diffusion Model for Detail Richness in Text-to-3D
- SceneTex: High-Quality Texture Synthesis for Indoor Scenes via Diffusion Priors
- SceneWiz3D: Towards Text-guided 3D Scene Composition
- SemCity: Semantic Scene Generation with Triplane Diffusion
- Sherpa3D: Boosting High-Fidelity Text-to-3D Generation via Coarse 3D Prior
- SIGNeRF: Scene Integrated Generation for Neural Radiance Fields
- Single Mesh Diffusion Models with Field Latents for Texture Generation
- SiTH: Single-view Textured Human Reconstruction with Image-Conditioned Diffusion
- SPAD: Spatially Aware Multiview Diffusers
- Text-to-3D Generation with Bidirectional Diffusion using both 2D and 3D priors
- Text-to-3D using Gaussian Splatting
- The More You See in 2D, the More You Perceive in 3D
- Tiger: Time-Varying Denoising Model for 3D Point Cloud Generation with Diffusion Process
- Towards Realistic Scene Generation with LiDAR Diffusion Models
- UDiFF: Generating Conditional Unsigned Distance Fields with Optimal Wavelet Diffusion
- ViVid-1-to-3: Novel View Synthesis with Video Diffusion Models
- 6. 3D Editing
- 7. Multi-Modal Large Language Models
- Alpha-CLIP: A CLIP Model Focusing on Wherever You Want
- Anchor-based Robust Finetuning of Vision-Language Models
- Boosting Continual Learning of Vision-Language Models via Mixture-of-Experts Adapters
- Can Language Beat Numerical Regression? Language-Based Multimodal Trajectory Prediction
- Can't make an Omelette without Breaking some Eggs: Plausible Action Anticipation using Large Video-Language Models
- Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding
- Compositional Chain-of-Thought Prompting for Large Multimodal Models
- Describing Differences in Image Sets with Natural Language
- Dual Memory Networks: A Versatile Adaptation Approach for Vision-Language Models
- Efficient Stitchable Task Adaptation
- Efficient Test-Time Adaptation of Vision-Language Models
- Exploring the Transferability of Visual Prompting for Multimodal Large Language Models
- FairCLIP: Harnessing Fairness in Vision-Language Learning
- FairDeDup: Detecting and Mitigating Vision-Language Fairness Disparities in Semantic Dataset Deduplication
- FFF: Fixing Flawed Foundations in contrastive pre-training results in very strong Vision-Language models
- Generative Multimodal Models are In-Context Learners
- GLaMM: Pixel Grounding Large Multimodal Model
- GPT4Point: A Unified Framework for Point-Language Understanding and Generation
- InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks
- Learning by Correction: Efficient Tuning Task for Zero-Shot Generative Vision-Language Reasoning
- Let's Think Outside the Box: Exploring Leap-of-Thought in Large Language Models with Creative Humor Generation
- LION : Empowering Multimodal Large Language Model with Dual-Level Visual Knowledge
- LL3DA: Visual Interactive Instruction Tuning for Omni-3D Understanding, Reasoning, and Planning
- Mitigating Object Hallucinations in Large Vision-Language Models through Visual Contrastive Decoding
- MobileCLIP: Fast Image-Text Models through Multi-Modal Reinforced Training
- MoPE-CLIP: Structured Pruning for Efficient Vision-Language Models with Module-wise Pruning Error Metric
- Narrative Action Evaluation with Prompt-Guided Multimodal Interaction
- OneLLM: One Framework to Align All Modalities with Language
- One Prompt Word is Enough to Boost Adversarial Robustness for Pre-trained Vision-Language Models
- OPERA: Alleviating Hallucination in Multi-Modal Large Language Models via Over-Trust Penalty and Retrospection-Allocation
- Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers
- PixelLM: Pixel Reasoning with Large Multimodal Model
- PracticalDG: Perturbation Distillation on Vision-Language Models for Hybrid Domain Generalization
- Prompt Highlighter: Interactive Control for Multi-Modal LLMs
- PromptKD: Unsupervised Prompt Distillation for Vision-Language Models
- Q-Instruct: Improving Low-level Visual Abilities for Multi-modality Foundation Models
- SC-Tune: Unleashing Self-Consistent Referential Comprehension in Large Vision Language Models
- SEED-Bench: Benchmarking Multimodal Large Language Models
- SyncMask: Synchronized Attentional Masking for Fashion-centric Vision-Language Pretraining
- The Manga Whisperer: Automatically Generating Transcriptions for Comics
- UniBind: LLM-Augmented Unified and Balanced Representation Space to Bind Them All
- VBench: Comprehensive Benchmark Suite for Video Generative Models
- VideoChat: Chat-Centric Video Understanding
- ViP-LLaVA: Making Large Multimodal Models Understand Arbitrary Visual Prompts
- ViTamin: Designing Scalable Vision Models in the Vision-language Era
- ViT-Lens: Towards Omni-modal Representations
- 8. Others
- AEROBLADE: Training-Free Detection of Latent Diffusion Images Using Autoencoder Reconstruction Error
- Diff-BGM: A Diffusion Model for Video Background Music Generation
- EvalCrafter: Benchmarking and Evaluating Large Video Generation Models
- On the Content Bias in Fréchet Video Distance
- TexTile: A Differentiable Metric for Texture Tileability
- References
- Related Collections
Awesome-CVPR2024-AIGC
A Collection of Papers and Codes for CVPR2024 AIGC
A roundup of this year's CVPR AIGC-related papers and code, organized as follows.
Stars, forks, and PRs are welcome~
Updated first on GitHub: Awesome-CVPR2024-AIGC. Stars are welcome~
Zhihu: https://zhuanlan.zhihu.com/p/684325134
Please credit the source when referencing or reposting.
CVPR 2024 website: https://cvpr.thecvf.com/Conferences/2024
CVPR 2024 accepted paper list: https://cvpr.thecvf.com/Conferences/2024/AcceptedPapers
CVPR 2024 open access proceedings: https://openaccess.thecvf.com/CVPR2024
Conference dates: June 17-21, 2024
Acceptance announcement: February 27, 2024
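For readers who want to cross-check this list against the official proceedings, here is a minimal sketch that pulls paper titles from the open access page. It assumes the page keeps the `<dt class="ptitle">` markup used by past openaccess proceedings and that the `?day=all` query aggregates the per-day listings; verify both before relying on it.

```python
# Minimal sketch: list paper titles from the CVPR 2024 open access proceedings.
# The <dt class="ptitle"> selector and the "?day=all" query follow the markup of
# past openaccess pages and are assumptions to verify against the live page.
import requests
from bs4 import BeautifulSoup

URL = "https://openaccess.thecvf.com/CVPR2024?day=all"

resp = requests.get(URL, timeout=30)
resp.raise_for_status()
soup = BeautifulSoup(resp.text, "html.parser")

titles = [dt.get_text(strip=True) for dt in soup.select("dt.ptitle")]
print(f"{len(titles)} papers found")
for title in titles[:10]:  # preview the first few entries
    print("-", title)
```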
Contents
- 1. Image Generation / Image Synthesis
- 2. Image Editing
- 3. Video Generation / Video Synthesis
- 4. Video Editing
- 5. 3D Generation / 3D Synthesis
- 6. 3D Editing
- 7. Multi-Modal Large Language Models
- 8. Others
1. Image Generation / Image Synthesis
Accelerating Diffusion Sampling with Optimized Time Steps
- Paper: https://arxiv.org/abs/2402.17376
- Code: https://github.com/scxue/DM-NonUniform
Adversarial Text to Continuous Image Generation
- Paper: https://openreview.net/forum?id=9X3UZJSGIg9
- Code:
Amodal Completion via Progressive Mixed Context Diffusion
- Paper: https://arxiv.org/abs/2312.15540
- Code: https://github.com/k8xu/amodal
Arbitrary-Scale Image Generation and Upsampling using Latent Diffusion Model and Implicit Neural Decoder
- Paper: https://arxiv.org/abs/2403.10255
- Code: https://github.com/zhenshij/arbitrary-scale-diffusion
Atlantis: Enabling Underwater Depth Estimation with Stable Diffusion
- Paper: https://arxiv.org/abs/2312.12471
- Code: https://github.com/zkawfanx/Atlantis
Attention Calibration for Disentangled Text-to-Image Personalization
- Paper: https://arxiv.org/abs/2403.18551
- Code: https://github.com/Monalissaa/DisenDiff
Attention-Driven Training-Free Efficiency Enhancement of Diffusion Models
- Paper: https://arxiv.org/abs/2405.05252
- Code:
CapHuman: Capture Your Moments in Parallel Universes
- Paper: https://arxiv.org/abs/2402.18078
- Code: https://github.com/VamosC/CapHuman
CHAIN: Enhancing Generalization in Data-Efficient GANs via lipsCHitz continuity constrAIned Normalization
- Paper: https://arxiv.org/abs/2404.00521
- Code:
Check, Locate, Rectify: A Training-Free Layout Calibration System for Text-to-Image Generation
- Paper: https://arxiv.org/abs/2311.15773
- Code:
Coarse-to-Fine Latent Diffusion for Pose-Guided Person Image Synthesis
- Paper: https://arxiv.org/abs/2402.00627
- Code: https://github.com/YanzuoLu/CFLD
CoDi: Conditional Diffusion Distillation for Higher-Fidelity and Faster Image Generation
- Paper: https://arxiv.org/abs/2310.01407
- Code: https://github.com/fast-codi/CoDi
Condition-Aware Neural Network for Controlled Image Generation
- Paper: https://arxiv.org/abs/2404.01143v1
- Code:
CosmicMan: A Text-to-Image Foundation Model for Humans
- Paper: https://arxiv.org/abs/2404.01294
- Code: https://github.com/cosmicman-cvpr2024/CosmicMan
Countering Personalized Text-to-Image Generation with Influence Watermarks
- Paper: https://openaccess.thecvf.com/content/CVPR2024/html/Liu_Countering_Personalized_Text-to-Image_Generation_with_Influence_Watermarks_CVPR_2024_paper.html
- Code:
Cross Initialization for Face Personalization of Text-to-Image Models
- Paper: https://arxiv.org/abs/2312.15905
- Code: https://github.com/lyuPang/CrossInitialization
Customization Assistant for Text-to-image Generation
- Paper: https://arxiv.org/abs/2312.03045
- Code:
DeepCache: Accelerating Diffusion Models for Free
- Paper: https://arxiv.org/abs/2312.00858
- Code: https://github.com/horseee/DeepCache
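DeepCache accelerates sampling by caching the U-Net's deep features and reusing them across adjacent denoising steps, recomputing only the shallow branch. Below is a usage sketch following the project's README; the import path and `set_params` arguments are assumptions that may differ between releases.

```python
# Sketch: enabling DeepCache on a diffusers Stable Diffusion pipeline.
# API names (DeepCacheSDHelper, cache_interval, cache_branch_id) follow the
# project README at the time of writing and may change across versions.
import torch
from diffusers import StableDiffusionPipeline
from DeepCache import DeepCacheSDHelper

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

helper = DeepCacheSDHelper(pipe=pipe)
helper.set_params(cache_interval=3, cache_branch_id=0)  # reuse deep features every 3 steps
helper.enable()

image = pipe("a photo of an astronaut riding a horse").images[0]
helper.disable()  # restore the original forward pass
```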
DemoFusion: Democratising High-Resolution Image Generation With No $$$
- Paper: https://arxiv.org/abs/2311.16973
- Code: https://github.com/PRIS-CV/DemoFusion
Desigen: A Pipeline for Controllable Design Template Generation
- Paper: https://arxiv.org/abs/2403.09093
- Code: https://github.com/whaohan/desigen
DiffAgent: Fast and Accurate Text-to-Image API Selection with Large Language Model
- Paper: https://arxiv.org/abs/2404.01342
- Code: https://github.com/OpenGVLab/DiffAgent
Diffusion-driven GAN Inversion for Multi-Modal Face Image Generation
- Paper: https://arxiv.org/abs/2405.04356v1
- Code:
DistriFusion: Distributed Parallel Inference for High-Resolution Diffusion Models
- Paper: https://arxiv.org/abs/2402.19481
- Code: https://github.com/mit-han-lab/distrifuser
Diversity-aware Channel Pruning for StyleGAN Compression
- Paper: https://arxiv.org/abs/2403.13548
- Code: https://github.com/jiwoogit/DCP-GAN
Discriminative Probing and Tuning for Text-to-Image Generation
- Paper: https://www.arxiv.org/abs/2403.04321
- Code: https://github.com/LgQu/DPT-T2I
Don’t drop your samples! Coherence-aware training benefits Conditional diffusion
- Paper: https://arxiv.org/abs/2405.20324
- Code: https://github.com/nicolas-dufour/CAD
Drag Your Noise: Interactive Point-based Editing via Diffusion Semantic Propagation
- Paper: https://arxiv.org/abs/2404.01050
- Code: https://github.com/haofengl/DragNoise
DreamMatcher: Appearance Matching Self-Attention for Semantically-Consistent Text-to-Image Personalization
- Paper: https://arxiv.org/abs/2402.09812
- Code: https://github.com/KU-CVLAB/DreamMatcher
Dynamic Prompt Optimizing for Text-to-Image Generation
- Paper: https://arxiv.org/abs/2404.04095
- Code: https://github.com/Mowenyii/PAE
ECLIPSE: A Resource-Efficient Text-to-Image Prior for Image Generations
- Paper: https://arxiv.org/abs/2312.04655
- Code: https://github.com/eclipse-t2i/eclipse-inference
Efficient Dataset Distillation via Minimax Diffusion
- Paper: https://arxiv.org/abs/2311.15529
- Code: https://github.com/vimar-gu/MinimaxDiffusion
ElasticDiffusion: Training-free Arbitrary Size Image Generation
- Paper: https://arxiv.org/abs/2311.18822
- Code: https://github.com/MoayedHajiAli/ElasticDiffusion-official
EmoGen: Emotional Image Content Generation with Text-to-Image Diffusion Models
- Paper: https://arxiv.org/abs/2401.04608
- Code: https://github.com/JingyuanYY/EmoGen
Enabling Multi-Concept Fusion in Text-to-Image Models
- Paper: https://arxiv.org/abs/2404.03913v1
- Code:
Exact Fusion via Feature Distribution Matching for Few-shot Image Generation
- Paper: https://openaccess.thecvf.com/content/CVPR2024/html/Zhou_Exact_Fusion_via_Feature_Distribution_Matching_for_Few-shot_Image_Generation_CVPR_2024_paper.html
- Code:
FaceChain-SuDe: Building Derived Class to Inherit Category Attributes for One-shot Subject-Driven Generation
- Paper: https://arxiv.org/abs/2403.06775
- Code:
Fast ODE-based Sampling for Diffusion Models in Around 5 Steps
- Paper: https://arxiv.org/abs/2312.00094
- Code: https://github.com/zju-pi/diff-sampler
FreeControl: Training-Free Spatial Control of Any Text-to-Image Diffusion Model with Any Condition
- Paper: https://arxiv.org/abs/2312.07536
- Code: https://github.com/genforce/freecontrol
FreeCustom: Tuning-Free Customized Image Generation for Multi-Concept Composition
- Paper: https://arxiv.org/abs/2405.13870
- Code: https://github.com/aim-uofa/FreeCustom
Generalizable Tumor Synthesis
- Paper: https://www.cs.jhu.edu/~alanlab/Pubs24/chen2024towards.pdf
- Code: https://github.com/MrGiovanni/DiffTumor
Generating Daylight-driven Architectural Design via Diffusion Models
- Paper: https://arxiv.org/abs/2404.13353
- Code: https://github.com/unlimitedli/DDADesign
Generative Unlearning for Any Identity
- Paper: https://arxiv.org/abs/2405.09879
- Code: https://github.com/JJuOn/GUIDE
HanDiffuser: Text-to-Image Generation With Realistic Hand Appearances
- Paper: https://arxiv.org/abs/2403.01693
- Code:
High-fidelity Person-centric Subject-to-Image Synthesis
- Paper: https://arxiv.org/abs/2311.10329
- Code: https://github.com/CodeGoat24/Face-diffuser
InitNO: Boosting Text-to-Image Diffusion Models via Initial Noise Optimization
- Paper: https://arxiv.org/abs/2404.04650
- Code: https://github.com/xiefan-guo/initno
InstantBooth: Personalized Text-to-Image Generation without Test-Time Finetuning
- Paper: https://arxiv.org/abs/2304.03411
- Code:
InstanceDiffusion: Instance-level Control for Image Generation
- Paper: https://arxiv.org/abs/2402.03290
- Code: https://github.com/frank-xwang/InstanceDiffusion
Instruct-Imagen: Image Generation with Multi-modal Instruction
- Paper: https://arxiv.org/abs/2401.01952
- Code:
Intelligent Grimm - Open-ended Visual Storytelling via Latent Diffusion Models
- Paper: https://arxiv.org/abs/2306.00973
- Code: https://github.com/haoningwu3639/StoryGen
InteractDiffusion: Interaction-Control for Text-to-Image Diffusion Model
- Paper: https://arxiv.org/abs/2312.05849
- Code: https://github.com/jiuntian/interactdiffusion
Intriguing Properties of Diffusion Models: An Empirical Study of the Natural Attack Capability in Text-to-Image Generative Models
- Paper: https://arxiv.org/abs/2308.15692
- Code:
Inversion-Free Image Editing with Natural Language
- Paper: https://arxiv.org/abs/2312.04965
- Code: https://github.com/sled-group/InfEdit
JeDi: Joint-Image Diffusion Models for Finetuning-Free Personalized Text-to-Image Generation
- Paper: https://openaccess.thecvf.com/content/CVPR2024/html/Zeng_JeDi_Joint-Image_Diffusion_Models_for_Finetuning-Free_Personalized_Text-to-Image_Generation_CVPR_2024_paper.html
- Code:
LAKE-RED: Camouflaged Images Generation by Latent Background Knowledge Retrieval-Augmented Diffusion
- Paper: https://arxiv.org/abs/2404.00292
- Code: https://github.com/PanchengZhao/LAKE-RED
Learned representation-guided diffusion models for large-image generation
- Paper: https://arxiv.org/abs/2312.07330
- Code: https://github.com/cvlab-stonybrook/Large-Image-Diffusion
Learning Continuous 3D Words for Text-to-Image Generation
- Paper: https://arxiv.org/abs/2402.08654
- Code: https://github.com/ttchengab/continuous_3d_words_code/
Learning Disentangled Identifiers for Action-Customized Text-to-Image Generation
- Paper: https://arxiv.org/abs/2311.15841
- Code:
Learning Multi-dimensional Human Preference for Text-to-Image Generation
- Paper:
- Code:
LeftRefill: Filling Right Canvas based on Left Reference through Generalized Text-to-Image Diffusion Model
- Paper: https://arxiv.org/abs/2305.11577
- Code: https://github.com/ewrfcas/LeftRefill
MACE: Mass Concept Erasure in Diffusion Models
- Paper: https://arxiv.org/abs/2403.06135
- Code: https://github.com/Shilin-LU/MACE
MarkovGen: Structured Prediction for Efficient Text-to-Image Generation
- Paper: https://arxiv.org/abs/2308.10997
- Code:
MedM2G: Unifying Medical Multi-Modal Generation via Cross-Guided Diffusion with Visual Invariant
- Paper: https://arxiv.org/abs/2403.04290
- Code:
MIGC: Multi-Instance Generation Controller for Text-to-Image Synthesis
- Paper: https://arxiv.org/abs/2402.05408
- Code: https://github.com/limuloo/MIGC
MindBridge: A Cross-Subject Brain Decoding Framework
- Paper: https://arxiv.org/abs/2404.07850
- Code: https://github.com/littlepure2333/MindBridge
MULAN: A Multi Layer Annotated Dataset for Controllable Text-to-Image Generation
- Paper: https://arxiv.org/abs/2404.02790
- Code: https://huggingface.co/datasets/mulan-dataset/v1.0
On the Scalability of Diffusion-based Text-to-Image Generation
- Paper: https://arxiv.org/abs/2404.02883
- Code:
OpenBias: Open-set Bias Detection in Text-to-Image Generative Models
- Paper: https://arxiv.org/abs/2404.07990
- Code: https://github.com/Picsart-AI-Research/OpenBias
Personalized Residuals for Concept-Driven Text-to-Image Generation
- Paper: https://arxiv.org/abs/2405.12978
- Code:
Perturbing Attention Gives You More Bang for the Buck: Subtle Imaging Perturbations That Efficiently Fool Customized Diffusion Models
- Paper: https://arxiv.org/abs/2404.15081
- Code:
PhotoMaker: Customizing Realistic Human Photos via Stacked ID Embedding
- Paper: https://arxiv.org/abs/2312.04461
- Code: https://github.com/TencentARC/PhotoMaker
PLACE: Adaptive Layout-Semantic Fusion for Semantic Image Synthesis
- Paper: https://arxiv.org/abs/2403.01852
- Code: https://github.com/cszy98/PLACE
Plug-and-Play Diffusion Distillation
- Paper: https://arxiv.org/abs/2406.01954
- Code:
Prompt-Free Diffusion: Taking “Text” out of Text-to-Image Diffusion Models
- Paper: https://arxiv.org/abs/2305.16223
- Code: https://github.com/SHI-Labs/Prompt-Free-Diffusion
Ranni: Taming Text-to-Image Diffusion for Accurate Instruction Following
- Paper: https://arxiv.org/abs/2311.17002
- Code: https://github.com/ali-vilab/Ranni
Readout Guidance: Learning Control from Diffusion Features
- Paper: https://arxiv.org/abs/2312.02150
- Code: https://github.com/google-research/readout_guidance
Relation Rectification in Diffusion Model
- Paper: https://arxiv.org/abs/2403.20249
- Code: https://github.com/WUyinwei-hah/RRNet
Residual Denoising Diffusion Models
- Paper: https://arxiv.org/abs/2308.13712
- Code: https://github.com/nachifur/RDDM
Rethinking FID: Towards a Better Evaluation Metric for Image Generation
- Paper: https://arxiv.org/abs/2401.09603
- Code: https://github.com/google-research/google-research/tree/master/cmmd
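The paper's proposed replacement for FID, CMMD, drops FID's Gaussian assumption by computing an MMD distance with a Gaussian RBF kernel over CLIP embeddings. The sketch below computes the underlying squared-MMD quantity; the bandwidth and the biased estimator are illustrative choices, not the paper's calibrated constants.

```python
# Sketch: squared MMD with a Gaussian RBF kernel between two sets of image
# embeddings (stand-ins for CLIP features), the quantity underlying CMMD.
import numpy as np

def gaussian_kernel(x: np.ndarray, y: np.ndarray, sigma: float) -> np.ndarray:
    # Pairwise k(a, b) = exp(-||a - b||^2 / (2 * sigma^2)).
    sq_dists = (x**2).sum(1)[:, None] + (y**2).sum(1)[None, :] - 2.0 * x @ y.T
    return np.exp(-sq_dists / (2.0 * sigma**2))

def mmd2(x: np.ndarray, y: np.ndarray, sigma: float = 10.0) -> float:
    # Biased estimate of MMD^2: E[k(x,x')] + E[k(y,y')] - 2 E[k(x,y)].
    return (
        gaussian_kernel(x, x, sigma).mean()
        + gaussian_kernel(y, y, sigma).mean()
        - 2.0 * gaussian_kernel(x, y, sigma).mean()
    )

rng = np.random.default_rng(0)
real = rng.normal(size=(256, 512))            # placeholder "real" embeddings
fake = rng.normal(loc=0.1, size=(256, 512))   # placeholder "generated" embeddings
print(mmd2(real, fake))
```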
Rethinking the Spatial Inconsistency in Classifier-Free Diffusion Guidance
- Paper: https://arxiv.org/abs/2404.05384
- Code: https://github.com/SmilesDZgk/S-CFG
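For context, standard classifier-free guidance applies one global scale to the whole noise prediction, eps_hat = eps_uncond + w * (eps_cond - eps_uncond); the paper's observation is that this uniform scale treats semantically different spatial regions identically. A sketch of the baseline combination step (S-CFG's per-region rescaling is not reproduced here):

```python
# Sketch: the standard classifier-free guidance combination step.
import torch

def cfg_noise(eps_uncond: torch.Tensor, eps_cond: torch.Tensor, w: float = 7.5) -> torch.Tensor:
    # w = 1 recovers the conditional model; larger w trades diversity
    # for prompt adherence, uniformly over all spatial locations.
    return eps_uncond + w * (eps_cond - eps_uncond)
```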
Retrieval-Augmented Layout Transformer for Content-Aware Layout Generation
- Paper: https://arxiv.org/abs/2311.13602
- Code: https://github.com/CyberAgentAILab/RALF
Rich Human Feedback for Text-to-Image Generation
- Paper: https://arxiv.org/abs/2312.10240
- Code:
SCoFT: Self-Contrastive Fine-Tuning for Equitable Image Generation
- Paper: https://arxiv.org/abs/2401.08053
- Code:
Self-correcting LLM-controlled Diffusion Models
- Paper: https://arxiv.org/abs/2311.16090
- Code: https://github.com/tsunghan-wu/SLD
Self-Discovering Interpretable Diffusion Latent Directions for Responsible Text-to-Image Generation
- Paper: https://arxiv.org/abs/2311.17216
- Code: https://github.com/hangligit/InterpretDiffusion
Shadow Generation for Composite Image Using Diffusion Model
- Paper: https://arxiv.org/abs/2308.09972
- Code: https://github.com/bcmi/Object-Shadow-Generation-Dataset-DESOBAv2
Smooth Diffusion: Crafting Smooth Latent Spaces in Diffusion Models
- Paper: https://arxiv.org/abs/2312.04410
- Code: https://github.com/SHI-Labs/Smooth-Diffusion
SSR-Encoder: Encoding Selective Subject Representation for Subject-Driven Generation
- Paper: https://arxiv.org/abs/2312.16272
- Code: https://github.com/Xiaojiu-z/SSR_Encoder
StableVITON: Learning Semantic Correspondence with Latent Diffusion Model for Virtual Try-On
- Paper: https://arxiv.org/abs/2312.01725
- Code: https://github.com/rlawjdghek/StableVITON
Structure-Guided Adversarial Training of Diffusion Models
- Paper: https://arxiv.org/abs/2402.17563
- Code:
Style Aligned Image Generation via Shared Attention
- Paper: https://arxiv.org/abs/2312.02133
- Code: https://github.com/google/style-aligned/
SVGDreamer: Text Guided SVG Generation with Diffusion Model
- Paper: https://arxiv.org/abs/2312.16476
- Code: https://github.com/ximinng/SVGDreamer
SwiftBrush: One-Step Text-to-Image Diffusion Model with Variational Score Distillation
- Paper: https://arxiv.org/abs/2312.05239
- Code: https://github.com/VinAIResearch/SwiftBrush
Tailored Visions: Enhancing Text-to-Image Generation with Personalized Prompt Rewriting
- Paper: https://arxiv.org/abs/2310.08129
- Code: https://github.com/zzjchen/Tailored-Visions
Tackling the Singularities at the Endpoints of Time Intervals in Diffusion Models
- Paper: https://arxiv.org/abs/2403.08381
- Code: https://github.com/PangzeCheung/SingDiffusion
Taming Stable Diffusion for Text to 360° Panorama Image Generation
- Paper: https://arxiv.org/abs/2404.07949
- Code: https://github.com/chengzhag/PanFusion
TextCraftor: Your Text Encoder Can be Image Quality Controller
- Paper: https://arxiv.org/abs/2403.18978
- Code:
Text-Guided Variational Image Generation for Industrial Anomaly Detection and Segmentation
- Paper: https://arxiv.org/abs/2403.06247
- Code: https://github.com/MingyuLee82/TGI_AD_v1
TFMQ-DM: Temporal Feature Maintenance Quantization for Diffusion Models
- Paper: https://arxiv.org/abs/2311.16503
- Code: https://github.com/ModelTC/TFMQ-DM
TokenCompose: Grounding Diffusion with Token-level Supervision
- Paper: https://arxiv.org/abs/2312.03626
- Code: https://github.com/mlpc-ucsd/TokenCompose
Towards Accurate Post-training Quantization for Diffusion Models
- Paper: https://arxiv.org/abs/2305.18723
- Code: https://github.com/ChangyuanWang17/APQ-DM
Towards Effective Usage of Human-Centric Priors in Diffusion Models for Text-based Human Image Generation
- Paper: https://arxiv.org/abs/2403.05239
- Code:
Towards Memorization-Free Diffusion Models
- Paper: https://arxiv.org/abs/2404.00922
- Code: https://github.com/chenchen-usyd/AMG
Training Diffusion Models Towards Diverse Image Generation with Reinforcement Learning
- Paper: https://openaccess.thecvf.com/content/CVPR2024/html/Miao_Training_Diffusion_Models_Towards_Diverse_Image_Generation_with_Reinforcement_Learning_CVPR_2024_paper.html
- Code:
UFOGen: You Forward Once Large Scale Text-to-Image Generation via Diffusion GANs
- Paper: https://arxiv.org/abs/2311.09257
- Code:
UniGS: Unified Representation for Image Generation and Segmentation
- Paper: https://arxiv.org/abs/2312.01985
- Code: https://github.com/qqlu/Entity
Using Human Feedback to Fine-tune Diffusion Models without Any Reward Model
- Paper: https://arxiv.org/abs/2311.13231
- Code: https://github.com/yk7333/d3po
U-VAP: User-specified Visual Appearance Personalization via Decoupled Self Augmentation
- Paper: https://arxiv.org/abs/2403.20231
- Code: https://github.com/ICTMCG/U-VAP
ViewDiff: 3D-Consistent Image Generation with Text-To-Image Models
- Paper: https://arxiv.org/abs/2403.01807
- Code: https://github.com/facebookresearch/ViewDiff
When StyleGAN Meets Stable Diffusion: a 𝒲+ Adapter for Personalized Image Generation
- Paper: https://arxiv.org/abs/2311.17461
- Code: https://github.com/csxmli2016/w-plus-adapter
X-Adapter: Adding Universal Compatibility of Plugins for Upgraded Diffusion Model
- Paper: https://arxiv.org/abs/2312.02238
- Code: https://github.com/showlab/X-Adapter
2. Image Editing
An Edit Friendly DDPM Noise Space: Inversion and Manipulations
- Paper: https://arxiv.org/abs/2304.06140
- Code: https://github.com/inbarhub/DDPM_inversion
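The paper constructs an alternative, edit-friendly noise space for the DDPM sampler; the common baseline it improves on is deterministic DDIM inversion, which runs the DDIM update backwards to recover a latent that reconstructs the input image. A minimal sketch of one inversion step, assuming a hypothetical `unet(x, t, cond)` noise predictor and the scheduler's cumulative alpha-bar values:

```python
# Sketch: one step of deterministic DDIM inversion, x_t -> x_{t_next} (t_next > t).
# `unet` is a hypothetical eps-prediction network; `alphas_cumprod` holds the
# scheduler's cumulative product of alphas, indexable by timestep.
import torch

def ddim_invert_step(x_t, t, t_next, unet, cond, alphas_cumprod):
    a_t, a_next = alphas_cumprod[t], alphas_cumprod[t_next]
    eps = unet(x_t, t, cond)                            # predicted noise at step t
    x0 = (x_t - (1 - a_t).sqrt() * eps) / a_t.sqrt()    # implied clean latent
    return a_next.sqrt() * x0 + (1 - a_next).sqrt() * eps
```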
Choose What You Need: Disentangled Representation Learning for Scene Text Recognition, Removal and Editing
- Paper: https://arxiv.org/abs/2405.04377
- Code:
Content-Style Decoupling for Unsupervised Makeup Transfer without Generating Pseudo Ground Truth
- Paper: https://openaccess.thecvf.com/content/CVPR2024/html/Sun_Content-Style_Decoupling_for_Unsupervised_Makeup_Transfer_without_Generating_Pseudo_Ground_CVPR_2024_paper.html
- Code: https://github.com/Snowfallingplum/CSD-MT
Contrastive Denoising Score for Text-guided Latent Diffusion Image Editing
- Paper: https://arxiv.org/abs/2311.18608
- Code: https://github.com/HyelinNAM/ContrastiveDenoisingScore
DEADiff: An Efficient Stylization Diffusion Model with Disentangled Representations
- Paper: https://arxiv.org/abs/2403.06951
- Code: https://github.com/Tianhao-Qi/DEADiff_code
Deformable One-shot Face Stylization via DINO Semantic Guidance
- Paper: https://arxiv.org/abs/2403.00459
- Code: https://github.com/zichongc/DoesFS
DemoCaricature: Democratising Caricature Generation with a Rough Sketch
- Paper: https://arxiv.org/abs/2312.04364
- Code: https://github.com/ChenDarYen/DemoCaricature
DiffAM: Diffusion-based Adversarial Makeup Transfer for Facial Privacy Protection
- Paper: https://openaccess.thecvf.com/content/CVPR2024/html/Sun_DiffAM_Diffusion-based_Adversarial_Makeup_Transfer_for_Facial_Privacy_Protection_CVPR_2024_paper.html
- Code: https://github.com/HansSunY/DiffAM
DiffMorpher: Unleashing the Capability of Diffusion Models for Image Morphing
- Paper: https://arxiv.org/abs/2312.07409
- Code: https://github.com/Kevin-thu/DiffMorpher
Diffusion Handles: Enabling 3D Edits for Diffusion Models by Lifting Activations to 3D
- Paper: https://arxiv.org/abs/2312.02190
- Code: https://github.com/adobe-research/DiffusionHandles
DiffusionLight: Light Probes for Free by Painting a Chrome Ball
- Paper: https://arxiv.org/abs/2312.09168
- Code: https://github.com/DiffusionLight/DiffusionLight
Diffusion Models Without Attention
- Paper: https://arxiv.org/abs/2311.18257
- Code:
Doubly Abductive Counterfactual Inference for Text-based Image Editing
- Paper: https://arxiv.org/abs/2403.02981
- Code: https://github.com/xuesong39/DAC
Edit One for All: Interactive Batch Image Editing
- Paper: https://arxiv.org/abs/2401.10219
- Code: https://github.com/thaoshibe/edit-one-for-all
Face2Diffusion for Fast and Editable Face Personalization
- Paper: https://arxiv.org/abs/2403.05094
- Code: https://github.com/mapooon/Face2Diffusion
Focus on Your Instruction: Fine-grained and Multi-instruction Image Editing by Attention Modulation
- Paper: https://arxiv.org/abs/2312.10113
- Code: https://github.com/guoqincode/Focus-on-Your-Instruction
FreeDrag: Feature Dragging for Reliable Point-based Image Editing
- Paper: https://arxiv.org/abs/2307.04684
- Code: https://github.com/LPengYang/FreeDrag
Holo-Relighting: Controllable Volumetric Portrait Relighting from a Single Image
- Paper: https://arxiv.org/abs/2403.09632
- Code:
Image Sculpting: Precise Object Editing with 3D Geometry Control
- Paper: https://arxiv.org/abs/2401.01702
- Code: https://github.com/vision-x-nyu/image-sculpting
Inversion-Free Image Editing with Natural Language
- Paper: https://arxiv.org/abs/2312.04965
- Code: https://github.com/sled-group/InfEdit
PAIR-Diffusion: Object-Level Image Editing with Structure-and-Appearance Paired Diffusion Models
- Paper: https://arxiv.org/abs/2303.17546
- Code: https://github.com/Picsart-AI-Research/PAIR-Diffusion
Person in Place: Generating Associative Skeleton-Guidance Maps for Human-Object Interaction Image Editing
- Paper:
- Code: https://github.com/YangChangHee/CVPR2024_Person-In-Place_RELEASE
Puff-Net: Efficient Style Transfer with Pure Content and Style Feature Fusion Network
- Paper: https://arxiv.org/abs/2405.19775
- Code:
PIA: Your Personalized Image Animator via Plug-and-Play Modules in Text-to-Image Models
- Paper: https://arxiv.org/abs/2312.13964
- Code: https://github.com/open-mmlab/PIA
RealCustom: Narrowing Real Text Word for Real-Time Open-Domain Text-to-Image Customization
- Paper: https://arxiv.org/abs/2403.00483
- Code:
SmartEdit: Exploring Complex Instruction-based Image Editing with Multimodal Large Language Models
- Paper: https://arxiv.org/abs/2312.06739
- Code: https://github.com/TencentARC/SmartEdit
Style Injection in Diffusion: A Training-free Approach for Adapting Large-scale Diffusion Models for Style Transfer
- Paper: https://arxiv.org/abs/2312.09008
- Code: https://github.com/jiwoogit/StyleID
SwitchLight: Co-design of Physics-driven Architecture and Pre-training Framework for Human Portrait Relighting
- Paper: https://arxiv.org/abs/2402.18848
- Code:
Text-Driven Image Editing via Learnable Regions
- Paper: https://arxiv.org/abs/2311.16432
- Code: https://github.com/yuanze-lin/Learnable_Regions
Texture-Preserving Diffusion Models for High-Fidelity Virtual Try-On
- Paper: https://arxiv.org/abs/2404.01089
- Code: https://github.com/Gal4way/TPD
TiNO-Edit: Timestep and Noise Optimization for Robust Diffusion-Based Image Editing
- Paper: https://arxiv.org/abs/2404.11120
- Code: https://github.com/SherryXTChen/TiNO-Edit
UniHuman: A Unified Model For Editing Human Images in the Wild
- Paper: https://arxiv.org/abs/2312.14985
- Code: https://github.com/NannanLi999/UniHuman
ZONE: Zero-Shot Instruction-Guided Local Editing
- Paper: https://arxiv.org/abs/2312.16794
- Code: https://github.com/lsl001006/ZONE
3. Video Generation / Video Synthesis
360DVD: Controllable Panorama Video Generation with 360-Degree Video Diffusion Model
- Paper: https://arxiv.org/abs/2401.06578
- Code: https://github.com/Akaneqwq/360DVD
A Recipe for Scaling up Text-to-Video Generation with Text-free Videos
- Paper: https://arxiv.org/abs/2312.15770
- Code: https://github.com/ali-vilab/VGen
BIVDiff: A Training-Free Framework for General-Purpose Video Synthesis via Bridging Image and Video Diffusion Models
- Paper: https://arxiv.org/abs/2312.02813
- Code: https://github.com/MCG-NJU/BIVDiff
ConvoFusion: Multi-Modal Conversational Diffusion for Co-Speech Gesture Synthesis
- Paper: https://arxiv.org/abs/2403.17936
- Code: https://github.com/m-hamza-mughal/convofusion
Co-Speech Gesture Video Generation via Motion-Decoupled Diffusion Model
- Paper: https://arxiv.org/abs/2404.01862
- Code: https://github.com/thuhcsi/S2G-MDDiffusion
DiffPerformer: Iterative Learning of Consistent Latent Guidance for Diffusion-based Human Video Generation
- Paper:
- Code:
DisCo: Disentangled Control for Realistic Human Dance Generation
- Paper: https://arxiv.org/abs/2307.00040
- Code: https://github.com/Wangt-CN/DisCo
FaceChain-ImagineID: Freely Crafting High-Fidelity Diverse Talking Faces from Disentangled Audio
- Paper: https://arxiv.org/abs/2403.01901
- Code:
Faces that Speak: Jointly Synthesising Talking Face and Speech from Text
- Paper: https://arxiv.org/abs/2405.10272
- Code:
FlowVid: Taming Imperfect Optical Flows for Consistent Video-to-Video Synthesis
- Paper: https://openaccess.thecvf.com/content/CVPR2024/html/Liang_FlowVid_Taming_Imperfect_Optical_Flows_for_Consistent_Video-to-Video_Synthesis_CVPR_2024_paper.html
- Code:
Generative Rendering: Controllable 4D-Guided Video Generation with 2D Diffusion Models
- Paper: https://openaccess.thecvf.com/content/CVPR2024/html/Cai_Generative_Rendering_Controllable_4D-Guided_Video_Generation_with_2D_Diffusion_Models_CVPR_2024_paper.html
- Code:
GenTron: Diffusion Transformers for Image and Video Generation
- Paper: https://openaccess.thecvf.com/content/CVPR2024/html/Chen_GenTron_Diffusion_Transformers_for_Image_and_Video_Generation_CVPR_2024_paper.html
- Code:
Grid Diffusion Models for Text-to-Video Generation
- Paper: https://arxiv.org/abs/2404.00234
- Code: https://github.com/taegyeong-lee/Grid-Diffusion-Models-for-Text-to-Video-Generation
Hierarchical Patch Diffusion Models for High-Resolution Video Generation
- Paper: https://openaccess.thecvf.com/content/CVPR2024/html/Skorokhodov_Hierarchical_Patch_Diffusion_Models_for_High-Resolution_Video_Generation_CVPR_2024_paper.html
- Code:
Hierarchical Spatio-temporal Decoupling for Text-to-Video Generation
- Paper: https://openaccess.thecvf.com/content/CVPR2024/html/Qing_Hierarchical_Spatio-temporal_Decoupling_for_Text-to-Video_Generation_CVPR_2024_paper.html
- Code:
LAMP: Learn A Motion Pattern for Few-Shot Video Generation
- Paper: https://openaccess.thecvf.com/content/CVPR2024/html/Wu_LAMP_Learn_A_Motion_Pattern_for_Few-Shot_Video_Generation_CVPR_2024_paper.html
- Code: https://github.com/RQ-Wu/LAMP
Learning Dynamic Tetrahedra for High-Quality Talking Head Synthesis
- Paper: https://arxiv.org/abs/2402.17364
- Code: https://github.com/zhangzc21/DynTet
Lodge: A Coarse to Fine Diffusion Network for Long Dance Generation guided by the Characteristic Dance Primitives
- Paper: https://arxiv.org/abs/2403.10518
- Code: https://github.com/li-ronghui/LODGE
MagicAnimate: Temporally Consistent Human Image Animation using Diffusion Model
- Paper: https://arxiv.org/abs/2311.16498
- Code: https://github.com/magic-research/magic-animate
Make-Your-Anchor: A Diffusion-based 2D Avatar Generation Framework
- Paper: https://arxiv.org/abs/2403.16510
- Code: https://github.com/ICTMCG/Make-Your-Anchor
Make Your Dream A Vlog
- Paper: https://arxiv.org/abs/2401.09414
- Code: https://github.com/Vchitect/Vlogger
Make Pixels Dance: High-Dynamic Video Generation
- Paper: https://openaccess.thecvf.com/content/CVPR2024/html/Zeng_Make_Pixels_Dance_High-Dynamic_Video_Generation_CVPR_2024_paper.html
- Code:
MicroCinema: A Divide-and-Conquer Approach for Text-to-Video Generation
- Paper: https://openaccess.thecvf.com/content/CVPR2024/html/Wang_MicroCinema_A_Divide-and-Conquer_Approach_for_Text-to-Video_Generation_CVPR_2024_paper.html
- Code:
Panacea: Panoramic and Controllable Video Generation for Autonomous Driving
- Paper: https://arxiv.org/abs/2311.16813
- Code: https://github.com/wenyuqing/panacea
PEEKABOO: Interactive Video Generation via Masked-Diffusion
- Paper: https://openaccess.thecvf.com/content/CVPR2024/html/Jain_PEEKABOO_Interactive_Video_Generation_via_Masked-Diffusion_CVPR_2024_paper.html
- Code:
Seeing and Hearing: Open-domain Visual-Audio Generation with Diffusion Latent Aligners
- Paper:
- Code: https://github.com/yzxing87/Seeing-and-Hearing
SimDA: Simple Diffusion Adapter for Efficient Video Generation
- Paper: https://arxiv.org/abs/2308.09710
- Code: https://github.com/ChenHsing/SimDA
StyleCineGAN: Landscape Cinemagraph Generation using a Pre-trained StyleGAN
- Paper: https://arxiv.org/abs/2403.14186
- Code: https://github.com/jeolpyeoni/StyleCineGAN
SyncTalk: The Devil is in the Synchronization for Talking Head Synthesis
- Paper: https://arxiv.org/abs/2311.17590
- Code: https://github.com/ZiqiaoPeng/SyncTalk
TI2V-Zero: Zero-Shot Image Conditioning for Text-to-Video Diffusion Models
- Paper: https://arxiv.org/abs/2404.16306
- Code:
Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation
- Paper: https://arxiv.org/abs/2212.11565
- Code: https://github.com/showlab/Tune-A-Video
VideoBooth: Diffusion-based Video Generation with Image Prompts
- Paper: https://arxiv.org/abs/2312.00777
- Code: https://github.com/Vchitect/VideoBooth
VideoCrafter2: Overcoming Data Limitations for High-Quality Video Diffusion Models
- Paper: https://arxiv.org/abs/2401.09047
- Code: https://github.com/AILab-CVC/VideoCrafter
Video-P2P: Video Editing with Cross-attention Control
- Paper: https://arxiv.org/abs/2303.04761
- Code: https://github.com/dvlab-research/Video-P2P
4. Video Editing
A Video is Worth 256 Bases: Spatial-Temporal Expectation-Maximization Inversion for Zero-Shot Video Editing
- Paper: https://arxiv.org/abs/2312.05856
- Code: https://github.com/STEM-Inv/stem-inv
CAMEL: CAusal Motion Enhancement Tailored for Lifting Text-driven Video Editing
- Paper: https://openaccess.thecvf.com/content/CVPR2024/html/Zhang_CAMEL_CAusal_Motion_Enhancement_Tailored_for_Lifting_Text-driven_Video_Editing_CVPR_2024_paper.html
- Code: https://github.com/zhangguiwei610/CAMEL
CCEdit: Creative and Controllable Video Editing via Diffusion Models
- Paper: https://arxiv.org/abs/2309.16496
- Code: https://github.com/RuoyuFeng/CCEdit
CoDeF: Content Deformation Fields for Temporally Consistent Video Processing
- Paper: https://arxiv.org/abs/2308.07926
- Code: https://github.com/qiuyu96/CoDeF
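CoDeF represents a video as one canonical content image plus per-frame deformation fields, so an edit made once on the canonical image propagates to every frame through the warp. A minimal sketch of that propagation step; `deform_grid` is a hypothetical deformation field already expressed in `grid_sample` coordinates, standing in for what the trained model predicts.

```python
# Sketch: propagating an edited canonical image to one frame via a deformation
# field. `deform_grid` is a hypothetical (1, H, W, 2) tensor of sampling
# coordinates in [-1, 1], in the (x, y) order grid_sample expects.
import torch
import torch.nn.functional as F

def warp_canonical(canonical: torch.Tensor, deform_grid: torch.Tensor) -> torch.Tensor:
    # canonical: (1, 3, H, W) edited canonical image; returns the warped frame.
    return F.grid_sample(canonical, deform_grid, mode="bilinear", align_corners=True)

canonical = torch.rand(1, 3, 256, 256)
# Identity deformation as a placeholder; a trained CoDeF model would predict this.
ys, xs = torch.meshgrid(
    torch.linspace(-1, 1, 256), torch.linspace(-1, 1, 256), indexing="ij"
)
deform_grid = torch.stack((xs, ys), dim=-1).unsqueeze(0)
frame = warp_canonical(canonical, deform_grid)
```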
FRESCO: Spatial-Temporal Correspondence for Zero-Shot Video Translation
- Paper: https://arxiv.org/abs/2403.12962
- Code: https://github.com/williamyang1991/FRESCO
RAVE: Randomized Noise Shuffling for Fast and Consistent Video Editing with Diffusion Models
- Paper: https://arxiv.org/abs/2312.04524
- Code: https://github.com/rehg-lab/RAVE
VidToMe: Video Token Merging for Zero-Shot Video Editing
- Paper: https://arxiv.org/abs/2312.10656
- Code: https://github.com/lixirui142/VidToMe
VMC: Video Motion Customization using Temporal Attention Adaption for Text-to-Video Diffusion Models
- Paper: https://arxiv.org/abs/2312.00845
- Code: https://github.com/HyeonHo99/Video-Motion-Customization
5. 3D Generation / 3D Synthesis
4D Gaussian Splatting for Real-Time Dynamic Scene Rendering
- Paper: https://arxiv.org/abs/2310.08528
- Code: https://github.com/hustvl/4DGaussians
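4D Gaussian Splatting extends 3D Gaussian Splatting over time with a deformation field while keeping the same differentiable rasterizer, which alpha-composites depth-sorted Gaussians per pixel: C = sum_i c_i * alpha_i * prod_{j<i} (1 - alpha_j). A tiny sketch of that compositing rule, with placeholder per-Gaussian colors and opacities:

```python
# Sketch: front-to-back alpha compositing as used by Gaussian splatting
# rasterizers. colors/alphas are one pixel's per-Gaussian contributions,
# already sorted by depth.
import numpy as np

def composite(colors: np.ndarray, alphas: np.ndarray) -> np.ndarray:
    out = np.zeros(3)
    transmittance = 1.0
    for color, alpha in zip(colors, alphas):
        out += transmittance * alpha * color  # C += T_i * alpha_i * c_i
        transmittance *= 1.0 - alpha          # T_{i+1} = T_i * (1 - alpha_i)
    return out

print(composite(np.array([[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]), np.array([0.6, 0.9])))
```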
Animatable Gaussians: Learning Pose-dependent Gaussian Maps for High-fidelity Human Avatar Modeling
- Paper: https://arxiv.org/abs/2311.16096
- Code: https://github.com/lizhe00/AnimatableGaussians
A Unified Approach for Text- and Image-guided 4D Scene Generation
- Paper: https://arxiv.org/abs/2311.16854
- Code: https://github.com/NVlabs/dream-in-4d
BEHAVIOR Vision Suite: Customizable Dataset Generation via Simulation
- Paper: https://arxiv.org/abs/2405.09546
- Code: https://github.com/behavior-vision-suite/behavior-vision-suite.github.io
BerfScene: Bev-conditioned Equivariant Radiance Fields for Infinite 3D Scene Generation
- Paper: https://arxiv.org/abs/2312.02136
- Code: https://github.com/zqh0253/BerfScene
CAD: Photorealistic 3D Generation via Adversarial Distillation
- Paper: https://arxiv.org/abs/2312.06663
- Code: https://github.com/raywzy/CAD
CAGE: Controllable Articulation GEneration
- Paper: https://arxiv.org/abs/2312.09570
- Code: https://github.com/3dlg-hcvc/cage
CityDreamer: Compositional Generative Model of Unbounded 3D Cities
- Paper: https://arxiv.org/abs/2309.00610
- Code: https://github.com/hzxie/CityDreamer
Consistent3D: Towards Consistent High-Fidelity Text-to-3D Generation with Deterministic Sampling Prior
- Paper: https://arxiv.org/abs/2401.09050
- Code: https://github.com/sail-sg/Consistent3D
ConTex-Human: Free-View Rendering of Human from a Single Image with Texture-Consistent Synthesis
- Paper: https://arxiv.org/abs/2311.17123
- Code: https://github.com/gaoxiangjun/ConTex-Human
ControlRoom3D: Room Generation using Semantic Proxy Rooms
- Paper: https://arxiv.org/abs/2312.05208
- Code:
DanceCamera3D: 3D Camera Movement Synthesis with Music and Dance
- Paper: https://arxiv.org/abs/2403.13667
- Code: https://github.com/Carmenw1203/DanceCamera3D-Official
DiffPortrait3D: Controllable Diffusion for Zero-Shot Portrait View Synthesis
- Paper: https://arxiv.org/abs/2312.13016
- Code: https://github.com/FreedomGu/DiffPortrait3D
DiffSHEG: A Diffusion-Based Approach for Real-Time Speech-driven Holistic 3D Expression and Gesture Generation
- Paper: https://arxiv.org/abs/2401.04747
- Code: https://github.com/JeremyCJM/DiffSHEG
DiffuScene: Denoising Diffusion Models for Generative Indoor Scene Synthesis
- Paper: https://arxiv.org/abs/2303.14207
- Code: https://github.com/tangjiapeng/DiffuScene
Diffusion 3D Features (Diff3F): Decorating Untextured Shapes with Distilled Semantic Features
- Paper: https://arxiv.org/abs/2311.17024
- Code: https://github.com/niladridutt/Diffusion-3D-Features
Diffusion Time-step Curriculum for One Image to 3D Generation
- Paper: https://paperswithcode.com/paper/diffusion-time-step-curriculum-for-one-image
- Code: https://github.com/yxymessi/DTC123
DreamAvatar: Text-and-Shape Guided 3D Human Avatar Generation via Diffusion Models
- Paper: https://arxiv.org/abs/2304.00916
- Code: https://github.com/yukangcao/DreamAvatar
DreamComposer: Controllable 3D Object Generation via Multi-View Conditions
- Paper: https://arxiv.org/abs/2312.03611
- Code: https://github.com/yhyang-myron/DreamComposer
DreamControl: Control-Based Text-to-3D Generation with 3D Self-Prior
- Paper: https://arxiv.org/abs/2312.06439
- Code: https://github.com/tyhuang0428/DreamControl
Emotional Speech-driven 3D Body Animation via Disentangled Latent Diffusion
- Paper: https://arxiv.org/abs/2312.04466
- Code: https://github.com/kiranchhatre/amuse
EscherNet: A Generative Model for Scalable View Synthesis
- Paper: https://arxiv.org/abs/2402.03908
- Code: https://github.com/kxhit/EscherNet
GaussianDreamer: Fast Generation from Text to 3D Gaussians by Bridging 2D and 3D Diffusion Models
- Paper: https://arxiv.org/abs/2310.08529
- Code: https://github.com/hustvl/GaussianDreamer
GPT-4V(ision) is a Human-Aligned Evaluator for Text-to-3D Generation
- Paper: https://arxiv.org/abs/2401.04092
- Code: https://github.com/3DTopia/GPTEval3D
Gaussian Shell Maps for Efficient 3D Human Generation
- Paper: https://arxiv.org/abs/2311.17857
- Code: https://github.com/computational-imaging/GSM
HarmonyView: Harmonizing Consistency and Diversity in One-Image-to-3D
- Paper: https://arxiv.org/abs/2312.15980
- Code: https://github.com/byeongjun-park/HarmonyView
HIG: Hierarchical Interlacement Graph Approach to Scene Graph Generation in Video Understanding
- Paper: https://arxiv.org/abs/2312.03050
- Code:
Holodeck: Language Guided Generation of 3D Embodied AI Environments
- Paper: https://arxiv.org/abs/2312.09067
- Code: https://github.com/allenai/Holodeck
HumanNorm: Learning Normal Diffusion Model for High-quality and Realistic 3D Human Generation
- Paper: https://arxiv.org/abs/2310.01406
- Code:
Interactive3D: Create What You Want by Interactive 3D Generation
- Paper: https://hub.baai.ac.cn/paper/494efc8d-f4ed-4ca4-8469-b882f9489f5e
- Code:
InterHandGen: Two-Hand Interaction Generation via Cascaded Reverse Diffusion
- Paper: https://arxiv.org/abs/2403.17422
- Code: https://github.com/jyunlee/InterHandGen
Intrinsic Image Diffusion for Single-view Material Estimation
- Paper: https://arxiv.org/abs/2312.12274
- Code: https://github.com/Peter-Kocsis/IntrinsicImageDiffusion
Make-It-Vivid: Dressing Your Animatable Biped Cartoon Characters from Text
- Paper: https://arxiv.org/abs/2403.16897
- Code: https://github.com/junshutang/Make-It-Vivid
MoMask: Generative Masked Modeling of 3D Human Motions
- Paper: https://arxiv.org/abs/2312.00063
- Code: https://github.com/EricGuo5513/momask-codes
Editable Scene Simulation for Autonomous Driving via LLM-Agent Collaboration
- Paper: https://arxiv.org/abs/2402.05746
- Code: https://github.com/yifanlu0227/ChatSim
EpiDiff: Enhancing Multi-View Synthesis via Localized Epipolar-Constrained Diffusion
- Paper: https://arxiv.org/abs/2312.06725
- Code: https://github.com/huanngzh/EpiDiff
OED: Towards One-stage End-to-End Dynamic Scene Graph Generation
- Paper: https://arxiv.org/abs/2405.16925
- Code:
One-2-3-45++: Fast Single Image to 3D Objects with Consistent Multi-View Generation and 3D Diffusion
- Paper: https://arxiv.org/abs/2311.07885
- Code: https://github.com/SUDO-AI-3D/One2345plus
Paint-it: Text-to-Texture Synthesis via Deep Convolutional Texture Map Optimization and Physically-Based Rendering
- Paper: https://arxiv.org/abs/2312.11360
- Code: https://github.com/postech-ami/Paint-it
PEGASUS: Personalized Generative 3D Avatars with Composable Attributes
- Paper: https://arxiv.org/abs/2402.10636
- Code: https://github.com/snuvclab/pegasus
PhysGaussian: Physics-Integrated 3D Gaussians for Generative Dynamics
- Paper: https://arxiv.org/abs/2311.12198
- Code: https://github.com/XPandora/PhysGaussian
RichDreamer: A Generalizable Normal-Depth Diffusion Model for Detail Richness in Text-to-3D
- Paper: https://arxiv.org/abs/2311.16918
- Code: https://github.com/modelscope/richdreamer
SceneTex: High-Quality Texture Synthesis for Indoor Scenes via Diffusion Priors
- Paper: https://arxiv.org/abs/2311.17261
- Code: https://github.com/daveredrum/SceneTex
SceneWiz3D: Towards Text-guided 3D Scene Composition
- Paper: https://arxiv.org/abs/2312.08885
- Code: https://github.com/zqh0253/SceneWiz3D
SemCity: Semantic Scene Generation with Triplane Diffusion
- Paper: https://arxiv.org/abs/2403.07773
- Code: https://github.com/zoomin-lee/SemCity
Sherpa3D: Boosting High-Fidelity Text-to-3D Generation via Coarse 3D Prior
- Paper: https://arxiv.org/abs/2312.06655
- Code: https://github.com/liuff19/Sherpa3D
SIGNeRF: Scene Integrated Generation for Neural Radiance Fields
- Paper: https://arxiv.org/abs/2401.01647
- Code: https://github.com/cgtuebingen/SIGNeRF
Single Mesh Diffusion Models with Field Latents for Texture Generation
- Paper: https://arxiv.org/abs/2312.09250
- Code: https://github.com/google-research/google-research/tree/master/mesh_diffusion
SiTH: Single-view Textured Human Reconstruction with Image-Conditioned Diffusion
- Paper: https://arxiv.org/abs/2311.15855
- Code: https://github.com/SiTH-Diffusion/SiTH
SPAD: Spatially Aware Multiview Diffusers
- Paper: https://arxiv.org/abs/2402.05235
- Code: https://github.com/yashkant/spad
Text-to-3D Generation with Bidirectional Diffusion using both 2D and 3D priors
- Paper: https://arxiv.org/abs/2312.04963
- Code: https://github.com/BiDiff/bidiff
Text-to-3D using Gaussian Splatting
- Paper: https://arxiv.org/abs/2309.16585
- Code: https://github.com/gsgen3d/gsgen
The More You See in 2D, the More You Perceive in 3D
- Paper: https://arxiv.org/abs/2404.03652
- Code: https://github.com/sap3d/sap3d
Tiger: Time-Varying Denoising Model for 3D Point Cloud Generation with Diffusion Process
- Paper: https://cvlab.cse.msu.edu/pdfs/Ren_Kim_Liu_Liu_TIGER_supp.pdf
- Code: https://github.com/Zhiyuan-R/Tiger-Diffusion
Towards Realistic Scene Generation with LiDAR Diffusion Models
- Paper: https://arxiv.org/abs/2404.00815
- Code: https://github.com/hancyran/LiDAR-Diffusion
UDiFF: Generating Conditional Unsigned Distance Fields with Optimal Wavelet Diffusion
- Paper: https://arxiv.org/abs/2404.06851
- Code: https://github.com/weiqi-zhang/UDiFF
ViVid-1-to-3: Novel View Synthesis with Video Diffusion Models
- Paper: https://arxiv.org/abs/2312.01305
- Code: https://github.com/ubc-vision/vivid123
6. 3D Editing
GaussianEditor: Swift and Controllable 3D Editing with Gaussian Splatting
- Paper: https://arxiv.org/abs/2311.14521
- Code: https://github.com/buaacyw/GaussianEditor
GenN2N: Generative NeRF2NeRF Translation
- Paper: https://arxiv.org/abs/2404.02788
- Code: https://github.com/Lxiangyue/GenN2N
Makeup Prior Models for 3D Facial Makeup Estimation and Applications
- Paper: https://arxiv.org/abs/2403.17761
- Code: https://github.com/YangXingchao/makeup-priors
7. Multi-Modal Large Language Models
Alpha-CLIP: A CLIP Model Focusing on Wherever You Want
- Paper: https://arxiv.org/abs/2312.03818
- Code: https://github.com/SunzeY/AlphaCLIP
Anchor-based Robust Finetuning of Vision-Language Models
- Paper: https://arxiv.org/abs/2404.06244
- Code: https://github.com/LixDemon/ARF
Boosting Continual Learning of Vision-Language Models via Mixture-of-Experts Adapters
- Paper: https://arxiv.org/abs/2403.11549
- Code: https://github.com/JiazuoYu/MoE-Adapters4CL
Can Language Beat Numerical Regression? Language-Based Multimodal Trajectory Prediction
- Paper: https://arxiv.org/abs/2403.18447
- Code: https://github.com/InhwanBae/LMTrajectory
Can’t make an Omelette without Breaking some Eggs: Plausible Action Anticipation using Large Video-Language Models
- Paper: https://arxiv.org/abs/2405.20305
- Code:
Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding
- Paper: https://arxiv.org/abs/2311.08046
- Code: https://github.com/PKU-YuanGroup/Chat-UniVi
Compositional Chain-of-Thought Prompting for Large Multimodal Models
- Paper: https://arxiv.org/abs/2311.17076
- Code: https://github.com/chancharikmitra/CCoT
Describing Differences in Image Sets with Natural Language
- Paper: https://arxiv.org/abs/2312.02974
- Code: https://github.com/Understanding-Visual-Datasets/VisDiff
Dual Memory Networks: A Versatile Adaptation Approach for Vision-Language Models
- Paper: https://arxiv.org/abs/2403.17589
- Code: https://github.com/YBZh/DMN
Efficient Stitchable Task Adaptation
- Paper: https://arxiv.org/abs/2311.17352
- Code: https://github.com/ziplab/Stitched_LLaMA
Efficient Test-Time Adaptation of Vision-Language Models
- Paper: https://arxiv.org/abs/2403.18293
- Code: https://github.com/kdiAAA/TDA
Exploring the Transferability of Visual Prompting for Multimodal Large Language Models
- Paper: https://arxiv.org/abs/2404.11207
- Code: https://github.com/zycheiheihei/transferable-visual-prompting
FairCLIP: Harnessing Fairness in Vision-Language Learning
- Paper: https://arxiv.org/abs/2403.19949
- Code: https://github.com/Harvard-Ophthalmology-AI-Lab/FairCLIP
FairDeDup: Detecting and Mitigating Vision-Language Fairness Disparities in Semantic Dataset Deduplication
- Paper: https://arxiv.org/abs/2404.16123
- Code:
FFF: Fixing Flawed Foundations in contrastive pre-training results in very strong Vision-Language models
- Paper: https://arxiv.org/abs/2404.16123
- Code:
Generative Multimodal Models are In-Context Learners
- Paper: https://arxiv.org/abs/2312.13286
- Code: https://github.com/baaivision/Emu/tree/main/Emu2
GLaMM: Pixel Grounding Large Multimodal Model
- Paper: https://arxiv.org/abs/2311.03356
- Code: https://github.com/mbzuai-oryx/groundingLMM
GPT4Point: A Unified Framework for Point-Language Understanding and Generation
- Paper: https://arxiv.org/abs/2312.02980
- Code: https://github.com/Pointcept/GPT4Point
InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks
- Paper: https://arxiv.org/abs/2312.14238
- Code: https://github.com/OpenGVLab/InternVL
Learning by Correction: Efficient Tuning Task for Zero-Shot Generative Vision-Language Reasoning
- Paper: https://arxiv.org/abs/2404.00909
- Code:
Let’s Think Outside the Box: Exploring Leap-of-Thought in Large Language Models with Creative Humor Generation
- Paper: https://arxiv.org/abs/2312.02439
- Code: https://github.com/sail-sg/CLoT
LION: Empowering Multimodal Large Language Model with Dual-Level Visual Knowledge
- Paper: https://arxiv.org/abs/2311.11860
- Code: https://github.com/rshaojimmy/JiuTian
LL3DA: Visual Interactive Instruction Tuning for Omni-3D Understanding, Reasoning, and Planning
- Paper: https://arxiv.org/abs/2311.18651
- Code: https://github.com/Open3DA/LL3DA
Mitigating Object Hallucinations in Large Vision-Language Models through Visual Contrastive Decoding
- Paper: https://arxiv.org/abs/2311.16922
- Code: https://github.com/DAMO-NLP-SG/VCD
MobileCLIP: Fast Image-Text Models through Multi-Modal Reinforced Training
- Paper: https://arxiv.org/abs/2311.17049
- Code: https://github.com/apple/ml-mobileclip
MoPE-CLIP: Structured Pruning for Efficient Vision-Language Models with Module-wise Pruning Error Metric
- Paper: https://arxiv.org/abs/2403.07839
- Code:
Narrative Action Evaluation with Prompt-Guided Multimodal Interaction
- Paper: https://arxiv.org/abs/2404.14471
- Code: https://github.com/shiyi-zh0408/NAE_CVPR2024
OneLLM: One Framework to Align All Modalities with Language
- Paper: https://arxiv.org/abs/2312.03700
- Code: https://github.com/csuhan/OneLLM
One Prompt Word is Enough to Boost Adversarial Robustness for Pre-trained Vision-Language Models
- Paper: https://arxiv.org/abs/2403.01849
- Code: https://github.com/TreeLLi/APT
OPERA: Alleviating Hallucination in Multi-Modal Large Language Models via Over-Trust Penalty and Retrospection-Allocation
- Paper: https://arxiv.org/abs/2311.17911
- Code: https://github.com/shikiw/OPERA
Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers
- Paper: https://arxiv.org/abs/2402.19479
- Code: https://github.com/snap-research/Panda-70M
PixelLM: Pixel Reasoning with Large Multimodal Model
- Paper: https://arxiv.org/abs/2312.02228
- Code: https://github.com/MaverickRen/PixelLM
PracticalDG: Perturbation Distillation on Vision-Language Models for Hybrid Domain Generalization
- Paper: https://arxiv.org/abs/2404.09011
- Code:
Prompt Highlighter: Interactive Control for Multi-Modal LLMs
- Paper: https://arxiv.org/abs/2312.04302
- Code: https://github.com/dvlab-research/Prompt-Highlighter
PromptKD: Unsupervised Prompt Distillation for Vision-Language Models
- Paper: https://arxiv.org/abs/2403.02781
- Code: https://github.com/zhengli97/PromptKD
Q-Instruct: Improving Low-level Visual Abilities for Multi-modality Foundation Models
- Paper: https://arxiv.org/abs/2311.06783
- Code: https://github.com/Q-Future/Q-Instruct
SC-Tune: Unleashing Self-Consistent Referential Comprehension in Large Vision Language Models
- Paper: https://arxiv.org/abs/2403.13263
- Code: https://github.com/ivattyue/SC-Tune
SEED-Bench: Benchmarking Multimodal Large Language Models
- Paper: https://arxiv.org/abs/2311.17092
- Code: https://github.com/AILab-CVC/SEED-Bench
SyncMask: Synchronized Attentional Masking for Fashion-centric Vision-Language Pretraining
- Paper: https://arxiv.org/abs/2404.01156
- Code:
The Manga Whisperer: Automatically Generating Transcriptions for Comics
- Paper: https://arxiv.org/abs/2401.10224
- Code: https://github.com/ragavsachdeva/magi
UniBind: LLM-Augmented Unified and Balanced Representation Space to Bind Them All
- Paper: https://arxiv.org/abs/2403.12532
- Code:
VBench: Comprehensive Benchmark Suite for Video Generative Models
- Paper: https://arxiv.org/abs/2311.17982
- Code: https://github.com/Vchitect/VBench
VideoChat: Chat-Centric Video Understanding
- Paper: https://arxiv.org/abs/2305.06355
- Code: https://github.com/OpenGVLab/Ask-Anything
ViP-LLaVA: Making Large Multimodal Models Understand Arbitrary Visual Prompts
- Paper: https://arxiv.org/abs/2312.00784
- Code: https://github.com/mu-cai/ViP-LLaVA
ViTamin: Designing Scalable Vision Models in the Vision-language Era
- Paper: https://arxiv.org/abs/2404.02132
- Code: https://github.com/Beckschen/ViTamin
ViT-Lens: Towards Omni-modal Representations
- Paper: https://arxiv.org/abs/2308.10185
- Code: https://github.com/TencentARC/ViT-Lens
8. Others
AEROBLADE: Training-Free Detection of Latent Diffusion Images Using Autoencoder Reconstruction Error
- Paper: https://arxiv.org/abs/2401.17879
- Code: https://github.com/jonasricker/aeroblade
Diff-BGM: A Diffusion Model for Video Background Music Generation
- Paper: https://openaccess.thecvf.com/content/CVPR2024/html/Li_Diff-BGM_A_Diffusion_Model_for_Video_Background_Music_Generation_CVPR_2024_paper.html
- Code: https://github.com/sizhelee/Diff-BGM
EvalCrafter: Benchmarking and Evaluating Large Video Generation Models
- Paper: https://arxiv.org/abs/2310.11440
- Code: https://github.com/evalcrafter/EvalCrafter
On the Content Bias in Fréchet Video Distance
- Paper: https://arxiv.org/abs/2404.12391
- Code: https://github.com/songweige/content-debiased-fvd
TexTile: A Differentiable Metric for Texture Tileability
- Paper: https://arxiv.org/abs/2403.12961
- Code: https://github.com/crp94/textile
Continuously updated~
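Since this list is continuously updated and consists entirely of plain Paper/Code URL pairs, link rot is the main maintenance hazard. Below is a minimal link-liveness sketch for checking the entries in bulk; it is an editorial aid, not code from any listed paper, and the filename `cvpr2024_aigc.md` is a hypothetical placeholder for a local copy of this page.

```python
# Minimal link-liveness check for the "- Paper:" / "- Code:" entries above.
# Editorial maintenance sketch only; uses just the Python standard library.
import re
import urllib.request

# Matches lines of the form "- Paper: <url>" or "- Code: <url>".
LINK_RE = re.compile(r"-\s*(Paper|Code):\s*(https?://\S+)")

def check_links(path: str) -> None:
    """Send a HEAD request to every Paper/Code URL in the file and report failures."""
    with open(path, encoding="utf-8") as f:
        text = f.read()
    for kind, url in LINK_RE.findall(text):
        req = urllib.request.Request(
            url, method="HEAD", headers={"User-Agent": "link-check"}
        )
        try:
            with urllib.request.urlopen(req, timeout=10) as resp:
                status = resp.status
        except Exception as exc:  # HTTP 4xx/5xx, DNS errors, timeouts
            print(f"[BROKEN] {kind}: {url} ({exc})")
        else:
            if status >= 400:
                print(f"[BROKEN] {kind}: {url} (HTTP {status})")

if __name__ == "__main__":
    check_links("cvpr2024_aigc.md")  # hypothetical local copy of this list
```

Note that some hosts (e.g. GitHub) may rate-limit or reject bare HEAD requests, so a failure reported here is a prompt to re-check the entry by hand rather than proof the link is dead.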
References
CVPR 2024 Papers and Open-Source Projects Collection (Papers with Code)