CVPR2023图像处理相关论文速览63篇
CVPR2023图像处理论文速览
Paper1 Revisiting Self-Similarity: Structural Embedding for Image Retrieval
摘要原文: Despite advances in global image representation, existing image retrieval approaches rarely consider geometric structure during the global retrieval stage. In this work, we revisit the conventional self-similarity descriptor from a convolutional perspective, to encode both the visual and structural cues of the image to global image representation. Our proposed network, named Structural Embedding Network (SENet), captures the internal structure of the images and gradually compresses them into dense self-similarity descriptors while learning diverse structures from various images. These self-similarity descriptors and original image features are fused and then pooled into global embedding, so that global embedding can represent both geometric and visual cues of the image. Along with this novel structural embedding, our proposed network sets new state-of-the-art performances on several image retrieval benchmarks, convincing its robustness to look-alike distractors. The code and models are available: https://github.com/sungonce/SENet.
中文总结: 尽管全局图像表示技术取得了进展,但现有的图像检索方法在全局检索阶段很少考虑几何结构。在这项工作中,我们从卷积的角度重新审视传统的自相似描述符,将图像的视觉线索和结构线索同时编码到全局图像表示中。我们提出的网络名为结构嵌入网络(SENet),它捕捉图像的内部结构,并逐渐将其压缩为密集的自相似描述符,同时从各种图像中学习多样的结构。这些自相似描述符与原始图像特征融合后再池化为全局嵌入,使全局嵌入能够同时表示图像的几何线索和视觉线索。借助这种新颖的结构嵌入,我们提出的网络在多个图像检索基准上取得了新的最先进性能,证明了其对外观相似的干扰物的稳健性。代码和模型可在以下链接找到:https://github.com/sungonce/SENet。
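下面给出一个极简的自相似描述符计算草图,用来直观说明"以特征图上每个位置与其邻域位置的相似度作为结构线索"的思路;其中邻域窗口大小、归一化方式均为假设,并非SENet论文的官方实现。

```python
import torch
import torch.nn.functional as F

def self_similarity(feat: torch.Tensor, k: int = 5) -> torch.Tensor:
    """feat: (B, C, H, W) 卷积特征图;返回 (B, k*k, H, W) 的自相似描述符,
    每个通道是中心位置特征与某个邻域偏移位置特征的余弦相似度(简化的假设版本)。"""
    b, c, h, w = feat.shape
    feat = F.normalize(feat, dim=1)                # 通道维 L2 归一化,内积即余弦相似度
    pad = k // 2
    padded = F.pad(feat, (pad, pad, pad, pad))     # 零填充以便取邻域
    sims = []
    for dy in range(k):
        for dx in range(k):
            neighbor = padded[:, :, dy:dy + h, dx:dx + w]
            sims.append((feat * neighbor).sum(dim=1, keepdim=True))  # 逐位置内积
    return torch.cat(sims, dim=1)                  # 可再与原特征拼接后池化成全局嵌入

# 用法示例
desc = self_similarity(torch.randn(2, 256, 32, 32), k=5)
print(desc.shape)  # torch.Size([2, 25, 32, 32])
```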
Paper2 Decoupling-and-Aggregating for Image Exposure Correction
摘要原文: The images captured under improper exposure conditions often suffer from contrast degradation and detail distortion. Contrast degradation will destroy the statistical properties of low-frequency components, while detail distortion will disturb the structural properties of high-frequency components, leading to the low-frequency and high-frequency components being mixed and inseparable. This will limit the statistical and structural modeling capacity for exposure correction. To address this issue, this paper proposes to decouple the contrast enhancement and detail restoration within each convolution process. It is based on the observation that, in the local regions covered by convolution kernels, the feature response of low-/high-frequency can be decoupled by addition/difference operation. To this end, we inject the addition/difference operation into the convolution process and devise a Contrast Aware (CA) unit and a Detail Aware (DA) unit to facilitate the statistical and structural regularities modeling. The proposed CA and DA can be plugged into existing CNN-based exposure correction networks to substitute the Traditional Convolution (TConv) to improve the performance. Furthermore, to maintain the computational costs of the network without changing, we aggregate two units into a single TConv kernel using structural re-parameterization. Evaluations of nine methods and five benchmark datasets demonstrate that our proposed method can comprehensively improve the performance of existing methods without introducing extra computational costs compared with the original networks. The codes will be publicly available.
中文总结: 这段话主要讨论了在曝光条件不当下捕捉的图像往往会受到对比度降低和细节失真的影响。对比度降低会破坏低频成分的统计特性,而细节失真会扰乱高频成分的结构特性,导致低频和高频成分混合在一起且难以分离,这会限制曝光校正的统计和结构建模能力。为了解决这个问题,该论文提出在每个卷积过程中将对比度增强和细节恢复解耦。其依据的观察是:在卷积核覆盖的局部区域中,低频/高频特征响应可以通过加法/差分操作解耦。为此,作者将加法/差分操作注入到卷积过程中,并设计了对比度感知(CA)单元和细节感知(DA)单元,以促进统计和结构规律的建模。提出的CA和DA可以插入现有基于CNN的曝光校正网络中,替代传统卷积(TConv)以提高性能。此外,为了保持网络的计算成本不变,作者通过结构重参数化将两个单元聚合成一个单一的TConv核。对九种方法和五个基准数据集的评估表明,所提出的方法可以全面提高现有方法的性能,且与原始网络相比不引入额外的计算成本。代码将公开发布。
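摘要提到"通过结构重参数化把CA/DA两个单元聚合为一个TConv核"。下面用一个与论文无关的通用小例子说明这类重参数化的基本原理:两个并联、输出相加的同尺寸卷积在推理时可以等价合并为一个卷积,卷积核与偏置逐元素相加即可;具体的CA/DA结构属于论文内容,这里不做还原。

```python
import torch
import torch.nn as nn

# 两个并联的 3x3 卷积,输出相加(训练时的结构)
conv_a = nn.Conv2d(16, 32, kernel_size=3, padding=1)
conv_b = nn.Conv2d(16, 32, kernel_size=3, padding=1)

# 重参数化:把两个分支合并为一个等价的 3x3 卷积(推理时的结构)
fused = nn.Conv2d(16, 32, kernel_size=3, padding=1)
with torch.no_grad():
    fused.weight.copy_(conv_a.weight + conv_b.weight)
    fused.bias.copy_(conv_a.bias + conv_b.bias)

x = torch.randn(1, 16, 64, 64)
y_two_branch = conv_a(x) + conv_b(x)
y_fused = fused(x)
print(torch.allclose(y_two_branch, y_fused, atol=1e-5))  # True:两种结构输出一致
```

由于卷积是线性运算,合并后推理只需一次卷积,计算量与单个TConv相同,这正是"不增加额外计算成本"的来源。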
Paper3 LANIT: Language-Driven Image-to-Image Translation for Unlabeled Data
摘要原文: Existing techniques for image-to-image translation commonly have suffered from two critical problems: heavy reliance on per-sample domain annotation and/or inability to handle multiple attributes per image. Recent truly-unsupervised methods adopt clustering approaches to easily provide per-sample one-hot domain labels. However, they cannot account for the real-world setting: one sample may have multiple attributes. In addition, the semantics of the clusters are not easily coupled to human understanding. To overcome these, we present LANguage-driven Image-to-image Translation model, dubbed LANIT. We leverage easy-to-obtain candidate attributes given in texts for a dataset: the similarity between images and attributes indicates per-sample domain labels. This formulation naturally enables multi-hot labels so that users can specify the target domain with a set of attributes in language. To account for the case that the initial prompts are inaccurate, we also present prompt learning. We further present domain regularization loss that enforces translated images to be mapped to the corresponding domain. Experiments on several standard benchmarks demonstrate that LANIT achieves comparable or superior performance to existing models. The code is available at github.com/KU-CVLAB/LANIT.
中文总结: 这段话主要讨论了图像到图像翻译的现有技术普遍存在的两个关键问题:严重依赖逐样本的领域标注,以及无法处理每张图像具有多个属性的情况。最近的真正无监督方法采用聚类来方便地为每个样本提供one-hot领域标签,但它们无法应对真实世界的情况:一个样本可能同时具有多个属性。此外,聚类的语义也不容易与人类理解对应。为了克服这些问题,作者提出了语言驱动的图像到图像翻译模型LANIT。他们利用为数据集给定的、易于从文本获取的候选属性:图像与属性之间的相似度即可指示每个样本的领域标签。这种表述自然地支持多热(multi-hot)标签,使用户可以用一组语言属性来指定目标领域。为了应对初始提示不准确的情况,他们还提出了提示学习,并进一步提出了领域正则化损失,强制将翻译后的图像映射到相应的领域。在几个标准基准上的实验表明,LANIT的性能与现有模型相当或更优。代码可在github.com/KU-CVLAB/LANIT上找到。
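下面是"用图像-属性文本相似度得到多热(multi-hot)领域标签"这一思路的示意性草图;嵌入的来源和0.5的阈值均为假设,LANIT中的提示学习与领域正则化损失此处不涉及。

```python
import torch
import torch.nn.functional as F

def multi_hot_domain_labels(img_emb: torch.Tensor,
                            attr_emb: torch.Tensor,
                            threshold: float = 0.5) -> torch.Tensor:
    """img_emb: (N, D) 图像嵌入;attr_emb: (K, D) 候选属性文本嵌入(假设来自同一套
    视觉-语言模型)。返回 (N, K) 的 0/1 多热标签:相似度超过阈值的属性记为 1。"""
    img_emb = F.normalize(img_emb, dim=-1)
    attr_emb = F.normalize(attr_emb, dim=-1)
    sim = img_emb @ attr_emb.t()            # 余弦相似度矩阵 (N, K)
    return (sim > threshold).float()        # 阈值化得到多热标签,一张图可同时属于多个属性

# 用法示例(随机向量仅作演示)
labels = multi_hot_domain_labels(torch.randn(4, 512), torch.randn(10, 512))
print(labels.shape)  # torch.Size([4, 10])
```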
Paper4 Bitstream-Corrupted JPEG Images Are Restorable: Two-Stage Compensation and Alignment Framework for Image Restoration
摘要原文: In this paper, we study a real-world JPEG image restoration problem with bit errors on the encrypted bitstream. The bit errors bring unpredictable color casts and block shifts on decoded image contents, which cannot be trivially resolved by existing image restoration methods mainly relying on pre-defined degradation models in the pixel domain. To address these challenges, we propose a robust JPEG decoder, followed by a two-stage compensation and alignment framework to restore bitstream-corrupted JPEG images. Specifically, the robust JPEG decoder adopts an error-resilient mechanism to decode the corrupted JPEG bitstream. The two-stage framework is composed of the self-compensation and alignment (SCA) stage and the guided-compensation and alignment (GCA) stage. The SCA adaptively performs block-wise image color compensation and alignment based on the estimated color and block offsets via image content similarity. The GCA leverages the extracted low-resolution thumbnail from the JPEG header to guide full-resolution pixel-wise image restoration in a coarse-to-fine manner. It is achieved by a coarse-guided pix2pix network and a refine-guided bi-directional Laplacian pyramid fusion network. We conduct experiments on three benchmarks with varying degrees of bit error rates. Experimental results and ablation studies demonstrate the superiority of our proposed method. The code will be released at https://github.com/wenyang001/Two-ACIR.
中文总结: 在这篇论文中,我们研究了一个真实世界中的JPEG图像恢复问题,即加密比特流上存在比特错误。比特错误会给解码后的图像内容带来无法预测的色偏和块偏移,这无法简单地用现有主要依赖像素域中预定义退化模型的图像恢复方法来解决。为了应对这些挑战,我们提出了一个鲁棒的JPEG解码器,并在其后接一个两阶段补偿和对齐框架,用于恢复比特流损坏的JPEG图像。具体来说,鲁棒JPEG解码器采用容错机制来解码损坏的JPEG比特流;两阶段框架由自补偿和对齐(SCA)阶段与引导补偿和对齐(GCA)阶段组成。SCA根据通过图像内容相似性估计出的颜色和块偏移,自适应地执行逐块的图像颜色补偿和对齐;GCA则利用从JPEG文件头提取的低分辨率缩略图,以由粗到细的方式引导全分辨率的像素级图像恢复,具体由一个粗引导的pix2pix网络和一个精细引导的双向拉普拉斯金字塔融合网络实现。我们在三个具有不同比特错误率的基准上进行了实验,实验结果和消融研究证明了所提方法的优越性。代码将发布在https://github.com/wenyang001/Two-ACIR。
Paper5 WeatherStream: Light Transport Automation of Single Image Deweathering
摘要原文: Today single image deweathering is arguably more sensitive to the dataset type, rather than the model. We introduce WeatherStream, an automatic pipeline capturing all real-world weather effects (rain, snow, and rain fog degradations), along with their clean image pairs. Previous state-of-the-art methods that have attempted the all-weather removal task train on synthetic pairs, and are thus limited by the Sim2Real domain gap. Recent work has attempted to manually collect time multiplexed pairs, but the use of human labor limits the scale of such a dataset. We introduce a pipeline that uses the power of light-transport physics and a model trained on a small, initial seed dataset to reject approximately 99.6% of unwanted scenes. The pipeline is able to generalize to new scenes and degradations that can, in turn, be used to train existing models just like fully human-labeled data. Training on a dataset collected through this procedure leads to significant improvements on multiple existing weather removal methods on a carefully human-collected test set of real-world weather effects. The dataset and code can be found in the following website: http://visual.ee.ucla.edu/wstream.htm/.
中文总结: 如今,单幅图像去天气退化(deweathering)的效果可以说更多取决于数据集类型,而不是模型本身。我们介绍了WeatherStream,这是一个自动流水线,可以捕捉所有真实世界的天气退化效应(雨、雪和雨雾),以及与之对应的清晰图像对。先前尝试全天气去除任务的最先进方法都是在合成图像对上训练,因此受到Sim2Real域差距的限制。最近的工作尝试手动收集时间复用的图像对,但人力的使用限制了这类数据集的规模。我们介绍的流水线利用光传输物理以及在一个小的初始种子数据集上训练的模型,可以剔除约99.6%的不合要求的场景。该流水线能够推广到新的场景和退化类型,所得数据可以像完全由人工标注的数据一样用来训练现有模型。在通过这一流程收集的数据集上训练,可以在一个精心人工收集的真实天气效应测试集上显著改进多个现有的天气去除方法。数据集和代码可以在以下网站找到:http://visual.ee.ucla.edu/wstream.htm/。
Paper6 You Do Not Need Additional Priors or Regularizers in Retinex-Based Low-Light Image Enhancement
摘要原文: Images captured in low-light conditions often suffer from significant quality degradation. Recent works have built a large variety of deep Retinex-based networks to enhance low-light images. The Retinex-based methods require decomposing the image into reflectance and illumination components, which is a highly ill-posed problem and there is no available ground truth. Previous works addressed this problem by imposing some additional priors or regularizers. However, finding an effective prior or regularizer that can be applied in various scenes is challenging, and the performance of the model suffers from too many additional constraints. We propose a contrastive learning method and a self-knowledge distillation method that allow training our Retinex-based model for Retinex decomposition without elaborate hand-crafted regularization functions. Rather than estimating reflectance and illuminance images and representing the final images as their element-wise products as in previous works, our regularizer-free Retinex decomposition and synthesis network (RFR) extracts reflectance and illuminance features and synthesizes them end-to-end. In addition, we propose a loss function for contrastive learning and a progressive learning strategy for self-knowledge distillation. Extensive experimental results demonstrate that our proposed methods can achieve superior performance compared with state-of-the-art approaches.
中文总结: 在低光条件下拍摄的图像往往会出现明显的质量下降。最近的研究构建了大量基于深度学习的Retinex网络来增强低光图像。基于Retinex的方法需要将图像分解为反射分量和照明分量,这是一个高度不适定的问题,并且没有可用的真值(ground truth)。先前的研究通过引入一些额外的先验或正则化项来解决这个问题。然而,找到一个适用于各种场景的有效先验或正则化项非常具有挑战性,而且过多的额外约束会损害模型的性能。我们提出了一种对比学习方法和一种自我知识蒸馏方法,使得无需精心设计的手工正则化函数即可训练基于Retinex的模型进行Retinex分解。与以前的工作中先估计反射图和照明图、再将最终图像表示为二者的逐元素乘积不同,我们的无正则化项Retinex分解与合成网络(RFR)提取反射特征和照明特征,并以端到端的方式进行合成。此外,我们提出了一个用于对比学习的损失函数和一个用于自我知识蒸馏的渐进学习策略。大量实验结果表明,我们提出的方法可以取得优于最先进方法的性能。
Paper7 PIP-Net: Patch-Based Intuitive Prototypes for Interpretable Image Classification
摘要原文: Interpretable methods based on prototypical patches recognize various components in an image in order to explain their reasoning to humans. However, existing prototype-based methods can learn prototypes that are not in line with human visual perception, i.e., the same prototype can refer to different concepts in the real world, making interpretation not intuitive. Driven by the principle of explainability-by-design, we introduce PIP-Net (Patch-based Intuitive Prototypes Network): an interpretable image classification model that learns prototypical parts in a self-supervised fashion which correlate better with human vision. PIP-Net can be interpreted as a sparse scoring sheet where the presence of a prototypical part in an image adds evidence for a class. The model can also abstain from a decision for out-of-distribution data by saying “I haven’t seen this before”. We only use image-level labels and do not rely on any part annotations. PIP-Net is globally interpretable since the set of learned prototypes shows the entire reasoning of the model. A smaller local explanation locates the relevant prototypes in one image. We show that our prototypes correlate with ground-truth object parts, indicating that PIP-Net closes the “semantic gap” between latent space and pixel space. Hence, our PIP-Net with interpretable prototypes enables users to interpret the decision making process in an intuitive, faithful and semantically meaningful way. Code is available at https://github.com/M-Nauta/PIPNet.
中文总结: 这段话主要介绍了基于原型补丁的可解释方法,这类方法通过识别图像中的各种组成部分来向人类解释其推理过程。然而,现有基于原型的方法可能学习到与人类视觉感知不一致的原型,即同一原型可能在现实世界中指代不同的概念,使解释变得不直观。在可解释性设计(explainability-by-design)原则的驱动下,作者介绍了PIP-Net(基于补丁的直观原型网络):一种可解释的图像分类模型,以自监督方式学习与人类视觉更相符的原型部件。PIP-Net可以被解释为一个稀疏的评分表,图像中某个原型部件的出现即为某个类别提供证据。该模型在遇到分布外数据时还可以选择放弃决策,并表示"我以前没有见过这个"。作者仅使用图像级别标签,不依赖任何部件标注。PIP-Net在全局上是可解释的,因为学习到的原型集合展示了模型的完整推理过程;较小的局部解释则可以在单幅图像中定位相关的原型。作者展示了学习到的原型与真实标注的物体部件(ground-truth object parts)相对应,表明PIP-Net弥合了潜在空间与像素空间之间的"语义鸿沟"。因此,具有可解释原型的PIP-Net使用户能够以直观、忠实且语义上有意义的方式理解决策过程。源代码可在 https://github.com/M-Nauta/PIPNet 找到。
Paper8 Spectral Bayesian Uncertainty for Image Super-Resolution
摘要原文: Recently deep learning techniques have significantly advanced image super-resolution (SR). Due to the black-box nature, quantifying reconstruction uncertainty is crucial when employing these deep SR networks. Previous approaches for SR uncertainty estimation mostly focus on capturing pixel-wise uncertainty in the spatial domain. SR uncertainty in the frequency domain which is highly related to image SR is seldom explored. In this paper, we propose to quantify spectral Bayesian uncertainty in image SR. To achieve this, a Dual-Domain Learning (DDL) framework is first proposed. Combined with Bayesian approaches, the DDL model is able to estimate spectral uncertainty accurately, enabling a reliability assessment for high frequencies reasoning from the frequency domain perspective. Extensive experiments under non-ideal premises are conducted and demonstrate the effectiveness of the proposed spectral uncertainty. Furthermore, we propose a novel Spectral Uncertainty based Decoupled Frequency (SUDF) training scheme for perceptual SR. Experimental results show the proposed SUDF can evidently boost perceptual quality of SR results without sacrificing much pixel accuracy.
中文总结: 最近,深度学习技术显著推进了图像超分辨率(SR)领域。由于其黑盒特性,当使用这些深度SR网络时,量化重建不确定性至关重要。先前的SR不确定性估计方法主要集中在捕捉空间域中的像素级不确定性。与图像SR密切相关的频域中的SR不确定性很少被探索。本文提出了在图像SR中量化频谱贝叶斯不确定性的方法。为了实现这一目标,首先提出了一个双域学习(DDL)框架。结合贝叶斯方法,DDL模型能够准确估计频谱不确定性,从频域角度对高频率进行可靠性评估。在非理想前提下进行了大量实验,证明了所提出的频谱不确定性的有效性。此外,我们提出了一种新颖的基于频谱不确定性的解耦频率(SUDF)训练方案用于感知SR。实验结果表明,所提出的SUDF方案可以显著提升SR结果的感知质量,而不牺牲太多像素精度。
Paper9 CLIP for All Things Zero-Shot Sketch-Based Image Retrieval, Fine-Grained or Not
摘要原文: In this paper, we leverage CLIP for zero-shot sketch based image retrieval (ZS-SBIR). We are largely inspired by recent advances on foundation models and the unparalleled generalisation ability they seem to offer, but for the first time tailor it to benefit the sketch community. We put forward novel designs on how best to achieve this synergy, for both the category setting and the fine-grained setting (“all”). At the very core of our solution is a prompt learning setup. First we show just via factoring in sketch-specific prompts, we already have a category-level ZS-SBIR system that overshoots all prior arts, by a large margin (24.8%) - a great testimony on studying the CLIP and ZS-SBIR synergy. Moving onto the fine-grained setup is however trickier, and requires a deeper dive into this synergy. For that, we come up with two specific designs to tackle the fine-grained matching nature of the problem: (i) an additional regularisation loss to ensure the relative separation between sketches and photos is uniform across categories, which is not the case for the gold standard standalone triplet loss, and (ii) a clever patch shuffling technique to help establishing instance-level structural correspondences between sketch-photo pairs. With these designs, we again observe significant performance gains in the region of 26.9% over previous state-of-the-art. The take-home message, if any, is the proposed CLIP and prompt learning paradigm carries great promise in tackling other sketch-related tasks (not limited to ZS-SBIR) where data scarcity remains a great challenge. Project page: https://aneeshan95.github.io/Sketch_LVM/
中文总结: 本文利用CLIP来做零样本的基于素描的图像检索(ZS-SBIR)。我们在很大程度上受到基础模型最新进展及其展现出的无与伦比的泛化能力的启发,并首次将其加以定制以造福素描社区。我们针对类别设置和细粒度设置("all")分别提出了实现这种协同作用的新颖设计,而解决方案的核心是一个提示学习(prompt learning)设置。首先,我们展示了仅通过引入素描特定的提示,就已经得到一个大幅超越所有先前方法(24.8%)的类别级ZS-SBIR系统,这有力地证明了研究CLIP与ZS-SBIR协同作用的价值。然而,细粒度设置更为棘手,需要更深入地挖掘这种协同作用。为此,我们提出了两种针对细粒度匹配特性的设计:(i)一个额外的正则化损失,确保素描与照片之间的相对间隔在各个类别间保持一致,而作为黄金标准的单独三元组损失并不具备这一性质;(ii)一种巧妙的块洗牌(patch shuffling)技术,帮助建立素描-照片对之间实例级的结构对应关系。借助这些设计,我们再次观察到相对于先前最先进方法约26.9%的显著性能提升。如果说有什么启示,那就是所提出的CLIP加提示学习范式在解决其他数据稀缺仍是巨大挑战的素描相关任务(不限于ZS-SBIR)方面具有巨大潜力。项目页面:https://aneeshan95.github.io/Sketch_LVM/
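摘要中的"块洗牌(patch shuffling)"可以用下面这个极简函数来直观理解:把图像切成规则网格块并按同一随机排列打乱,从而促使网络关注素描-照片对之间块级的结构对应;块数与排列方式均为假设,仅作示意。

```python
import torch

def patch_shuffle(img: torch.Tensor, grid: int = 4, perm=None):
    """img: (B, C, H, W),H、W 需能被 grid 整除;perm 为块的排列(None 时随机生成)。
    返回打乱后的图像与所用排列,便于对素描-照片对施加同一排列。"""
    b, c, h, w = img.shape
    ph, pw = h // grid, w // grid
    # 切块: (B, C, grid, ph, grid, pw) -> (B, grid*grid, C, ph, pw)
    patches = img.reshape(b, c, grid, ph, grid, pw).permute(0, 2, 4, 1, 3, 5)
    patches = patches.reshape(b, grid * grid, c, ph, pw)
    if perm is None:
        perm = torch.randperm(grid * grid)
    patches = patches[:, perm]
    # 还原回图像形状
    out = patches.reshape(b, grid, grid, c, ph, pw).permute(0, 3, 1, 4, 2, 5)
    return out.reshape(b, c, h, w), perm

sketch, photo = torch.randn(1, 3, 224, 224), torch.randn(1, 3, 224, 224)
sketch_s, perm = patch_shuffle(sketch)
photo_s, _ = patch_shuffle(photo, perm=perm)  # 对配对图像使用相同排列
```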
Paper10 Learning To Generate Image Embeddings With User-Level Differential Privacy
摘要原文: Small on-device models have been successfully trained with user-level differential privacy (DP) for next word prediction and image classification tasks in the past. However, existing methods can fail when directly applied to learn embedding models using supervised training data with a large class space. To achieve user-level DP for large image-to-embedding feature extractors, we propose DP-FedEmb, a variant of federated learning algorithms with per-user sensitivity control and noise addition, to train from user-partitioned data centralized in datacenter. DP-FedEmb combines virtual clients, partial aggregation, private local fine-tuning, and public pretraining to achieve strong privacy utility trade-offs. We apply DP-FedEmb to train image embedding models for faces, landmarks and natural species, and demonstrate its superior utility under same privacy budget on benchmark datasets DigiFace, GLD and iNaturalist. We further illustrate it is possible to achieve strong user-level DP guarantees of epsilon < 2 while controlling the utility drop within 5%, when millions of users can participate in training.
中文总结: 这段话主要内容是介绍了过去已成功使用用户级差分隐私(DP)训练小型设备上的模型,用于下一个单词预测和图像分类任务。然而,现有方法在直接应用于使用具有大类空间的监督训练数据来学习嵌入模型时可能会失败。为了实现大型图像到嵌入特征提取器的用户级DP,作者提出了DP-FedEmb,这是一种带有每个用户灵敏度控制和噪声添加的联邦学习算法变体,用于从数据中心集中的用户分区数据进行训练。DP-FedEmb结合了虚拟客户端、部分聚合、私人本地微调和公共预训练,以实现强隐私效用权衡。作者将DP-FedEmb应用于训练人脸、地标和自然物种的图像嵌入模型,并在基准数据集DigiFace、GLD和iNaturalist上展示了其在相同隐私预算下的优越效用。他们进一步说明,当数百万用户参与训练时,可以实现强用户级DP保证,使epsilon小于2,同时将效用下降控制在5%以内。
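下面是"逐用户敏感度控制+加噪聚合"这一用户级差分隐私思路的极简示意(DP-FedAvg/DP-SGD风格):对每个用户的模型更新按L2范数裁剪,再对求和结果加入高斯噪声。裁剪阈值和噪声系数均为假设,DP-FedEmb中的虚拟客户端、部分聚合、公共预训练等设计此处未涉及。

```python
import torch

def dp_aggregate(user_updates, clip_norm: float = 1.0, noise_multiplier: float = 1.0):
    """user_updates: 列表,每个元素是一个用户的模型更新向量 (D,)。
    返回加噪后的平均更新,单个用户的影响由逐用户裁剪限制。"""
    clipped = []
    for u in user_updates:
        scale = torch.clamp(clip_norm / (u.norm() + 1e-12), max=1.0)  # 逐用户 L2 裁剪
        clipped.append(u * scale)
    total = torch.stack(clipped).sum(dim=0)
    noise = torch.randn_like(total) * noise_multiplier * clip_norm    # 高斯机制加噪
    return (total + noise) / len(user_updates)

# 用法示例:64 个用户的更新(随机数据仅作演示)
updates = [torch.randn(1000) for _ in range(64)]
avg_update = dp_aggregate(updates, clip_norm=1.0, noise_multiplier=0.8)
```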
Paper11 Learning a Simple Low-Light Image Enhancer From Paired Low-Light Instances
摘要原文: Low-light Image Enhancement (LIE) aims at improving contrast and restoring details for images captured in low-light conditions. Most of the previous LIE algorithms adjust illumination using a single input image with several handcrafted priors. Those solutions, however, often fail in revealing image details due to the limited information in a single image and the poor adaptability of handcrafted priors. To this end, we propose PairLIE, an unsupervised approach that learns adaptive priors from low-light image pairs. First, the network is expected to generate the same clean images as the two inputs share the same image content. To achieve this, we impose the network with the Retinex theory and make the two reflectance components consistent. Second, to assist the Retinex decomposition, we propose to remove inappropriate features in the raw image with a simple self-supervised mechanism. Extensive experiments on public datasets show that the proposed PairLIE achieves comparable performance against the state-of-the-art approaches with a simpler network and fewer handcrafted priors. Code is available at: https://github.com/zhenqifu/PairLIE.
中文总结: 这段话主要讨论了低光照图像增强(LIE)的目标是改善低光条件下拍摄的图像的对比度并恢复细节。大多数先前的LIE算法使用单个输入图像和几个手工制作的先验来调整照明。然而,这些解决方案通常在揭示图像细节方面失败,因为单个图像中的信息有限,手工制作的先验的适应性较差。为此,作者提出了PairLIE,这是一种无监督方法,从低光照图像对中学习自适应的先验。首先,网络被期望生成相同的干净图像,因为两个输入共享相同的图像内容。为了实现这一点,他们将网络与Retinex理论相结合,并使两个反射分量保持一致。其次,为了辅助Retinex分解,他们提出使用简单的自监督机制去除原始图像中的不适当特征。在公共数据集上进行的大量实验证明,提出的PairLIE方法在使用更简单的网络和更少手工制作的先验的情况下,实现了与最先进方法相媲美的性能。源代码可在以下链接找到:https://github.com/zhenqifu/PairLIE。
Paper12 Data-Free Sketch-Based Image Retrieval
摘要原文: Rising concerns about privacy and anonymity preservation of deep learning models have facilitated research in data-free learning. Primarily based on data-free knowledge distillation, models developed in this area so far have only been able to operate in a single modality, performing the same kind of task as that of the teacher. For the first time, we propose Data-Free Sketch-Based Image Retrieval (DF-SBIR), a cross-modal data-free learning setting, where teachers trained for classification in a single modality have to be leveraged by students to learn a cross-modal metric-space for retrieval. The widespread availability of pre-trained classification models, along with the difficulty in acquiring paired photo-sketch datasets for SBIR justify the practicality of this setting. We present a methodology for DF-SBIR, which can leverage knowledge from models independently trained to perform classification on photos and sketches. We evaluate our model on the Sketchy, TU-Berlin, and QuickDraw benchmarks, designing a variety of baselines based on existing data-free learning literature, and observe that our method surpasses all of them by significant margins. Our method also achieves mAPs competitive with data-dependent approaches, all the while requiring no training data. Implementation is available at https://github.com/abhrac/data-free-sbir.
中文总结: 这段话主要讨论了对深度学习模型隐私和匿名性保护的关注日益增加,推动了无数据(data-free)学习领域的研究。该领域目前开发的模型主要基于无数据知识蒸馏,迄今只能在单一模态下运行,执行与教师模型相同类型的任务。作者首次提出了无数据的基于素描的图像检索(DF-SBIR),这是一个跨模态的无数据学习设置:学生需要利用分别在单一模态上为分类任务训练的教师模型,来学习用于检索的跨模态度量空间。预训练分类模型的广泛可得,以及为SBIR获取成对照片-素描数据集的困难,证明了这一设置的实用性。作者提出了一种DF-SBIR方法,可以利用分别在照片和素描上独立训练的分类模型的知识。作者在Sketchy、TU-Berlin和QuickDraw基准上评估了他们的模型,并基于现有无数据学习文献设计了多种基线,观察到他们的方法以显著优势超越了所有基线。该方法还取得了可与依赖数据的方法相竞争的mAP,同时完全不需要训练数据。实现代码可在https://github.com/abhrac/data-free-sbir找到。
Paper13 Multi-Realism Image Compression With a Conditional Generator
摘要原文: By optimizing the rate-distortion-realism trade-off, generative compression approaches produce detailed, realistic images, even at low bit rates, instead of the blurry reconstructions produced by rate-distortion optimized models. However, previous methods do not explicitly control how much detail is synthesized, which results in a common criticism of these methods: users might be worried that a misleading reconstruction far from the input image is generated. In this work, we alleviate these concerns by training a decoder that can bridge the two regimes and navigate the distortion-realism trade-off. From a single compressed representation, the receiver can decide to either reconstruct a low mean squared error reconstruction that is close to the input, a realistic reconstruction with high perceptual quality, or anything in between. With our method, we set a new state-of-the-art in distortion-realism, pushing the frontier of achievable distortion-realism pairs, i.e., our method achieves better distortions at high realism and better realism at low distortion than ever before.
中文总结: 通过优化速率-失真-真实性的权衡,生成式压缩方法可以在低比特率下生成详细、逼真的图像,而不是速率-失真优化模型产生的模糊重建图像。然而,先前的方法并未明确控制合成的细节量,这导致这些方法常见的批评是用户可能担心生成了与输入图像相去甚远的误导性重建图像。在这项工作中,我们通过训练一个解码器来缓解这些担忧,该解码器可以在两种模式之间建立桥梁并在失真-真实性权衡中导航。接收方可以根据单个压缩表示来决定是重建接近输入的低均方误差重建,还是具有高感知质量的逼真重建,或者介于两者之间的任何内容。通过我们的方法,我们在失真-真实性方面树立了新的技术水平,推动了可实现的失真-真实性对的前沿,即我们的方法在高真实性时实现更好的失真,而在低失真时实现更好的真实性,比以往任何时候都更好。
Paper14 OmniCity: Omnipotent City Understanding With Multi-Level and Multi-View Images
摘要原文: This paper presents OmniCity, a new dataset for omnipotent city understanding from multi-level and multi-view images. More precisely, OmniCity contains multi-view satellite images as well as street-level panorama and mono-view images, constituting over 100K pixel-wise annotated images that are well-aligned and collected from 25K geo-locations in New York City. To alleviate the substantial pixel-wise annotation efforts, we propose an efficient street-view image annotation pipeline that leverages the existing label maps of satellite view and the transformation relations between different views (satellite, panorama, and mono-view). With the new OmniCity dataset, we provide benchmarks for a variety of tasks including building footprint extraction, height estimation, and building plane/instance/fine-grained segmentation. Compared with existing multi-level and multi-view benchmarks, OmniCity contains a larger number of images with richer annotation types and more views, provides more benchmark results of state-of-the-art models, and introduces a new task for fine-grained building instance segmentation on street-level panorama images. Moreover, OmniCity provides new problem settings for existing tasks, such as cross-view image matching, synthesis, segmentation, detection, etc., and facilitates the developing of new methods for large-scale city understanding, reconstruction, and simulation. The OmniCity dataset as well as the benchmarks will be released at https://city-super.github.io/omnicity/.
中文总结: 这篇论文介绍了OmniCity,这是一个用于多层次和多视角图像的全能城市理解的新数据集。具体来说,OmniCity包含多视角卫星图像以及街道级全景和单视角图像,总共包括来自纽约市25,000个地理位置的超过100,000个像素级标注图像。为了减轻大量的像素级标注工作,我们提出了一种高效的街景图像标注流程,利用了卫星视图的现有标签地图和不同视图之间的转换关系(卫星、全景和单视角)。通过新的OmniCity数据集,我们为各种任务提供了基准,包括建筑物轮廓提取、高度估计以及建筑物平面/实例/细粒度分割。与现有的多级别和多视角基准相比,OmniCity包含更多种类丰富的标注类型和更多视图的图像,提供了更多最先进模型的基准结果,并在街道级全景图像上引入了新的细粒度建筑实例分割任务。此外,OmniCity为现有任务提供了新的问题设置,例如跨视图图像匹配、合成、分割、检测等,并促进了针对大规模城市理解、重建和模拟的新方法的开发。OmniCity数据集以及基准将在https://city-super.github.io/omnicity/发布。
Paper15 Learning To Exploit the Sequence-Specific Prior Knowledge for Image Processing Pipelines Optimization
摘要原文: The hardware image signal processing (ISP) pipeline is the intermediate layer between the imaging sensor and the downstream application, processing the sensor signal into an RGB image. The ISP is less programmable and consists of a series of processing modules. Each processing module handles a subtask and contains a set of tunable hyperparameters. A large number of hyperparameters form a complex mapping with the ISP output. The industry typically relies on manual and time-consuming hyperparameter tuning by image experts, biased towards human perception. Recently, several automatic ISP hyperparameter optimization methods using downstream evaluation metrics come into sight. However, existing methods for ISP tuning treat the high-dimensional parameter space as a global space for optimization and prediction all at once without inducing the structure knowledge of ISP. To this end, we propose a sequential ISP hyperparameter prediction framework that utilizes the sequential relationship within ISP modules and the similarity among parameters to guide the model sequence process. We validate the proposed method on object detection, image segmentation, and image quality tasks.
中文总结: 这段话主要讨论了硬件图像信号处理(ISP)管线是成像传感器和下游应用之间的中间层,负责把传感器信号处理成RGB图像。ISP的可编程性较低,由一系列处理模块组成,每个处理模块处理一个子任务,并包含一组可调超参数。大量的超参数与ISP输出之间构成了复杂的映射关系。业界通常依赖图像专家进行手动且耗时的超参数调整,且偏向于人类感知。最近,一些利用下游评估指标的自动ISP超参数优化方法开始出现。然而,现有的ISP调优方法把高维参数空间当作一个整体空间一次性地进行优化和预测,而没有引入ISP的结构知识。因此,我们提出了一个顺序式的ISP超参数预测框架,利用ISP模块之间的顺序关系以及参数之间的相似性来指导模型的序列化预测过程。我们在目标检测、图像分割和图像质量任务上验证了所提出的方法。
Paper16 Light Source Separation and Intrinsic Image Decomposition Under AC Illumination
摘要原文: Artificial light sources are often powered by an electric grid, and then their intensities rapidly oscillate in response to the grid’s alternating current (AC). Interestingly, the flickers of scene radiance values due to AC illumination are useful for extracting rich information on a scene of interest. In this paper, we show that the flickers due to AC illumination is useful for intrinsic image decomposition (IID). Our proposed method conducts the light source separation (LSS) followed by the IID under AC illumination. In particular, we reveal the ambiguity in the blind LSS via matrix factorization and the ambiguity in the IID assuming the Lambert model, and then show why and how those ambiguities can be resolved. We experimentally confirmed that our method can recover the colors of the light sources, the diffuse reflectance values, and the diffuse and specular intensities (shadings) under each of the light sources, and that the IID under AC illumination is effective for application to auto white balancing.
中文总结: 这段话主要讨论了人工光源通常由电网供电,其强度会随电网的交流电(AC)快速振荡。有趣的是,交流照明引起的场景辐射值闪烁可以用于提取感兴趣场景的丰富信息。在这篇论文中,作者展示了交流照明引起的闪烁对本征图像分解(IID)是有用的。他们提出的方法先进行光源分离(LSS),再在交流照明下进行IID。特别地,作者揭示了基于矩阵分解的盲光源分离中的歧义性,以及假设朗伯(Lambert)模型的IID中的歧义性,并说明了这些歧义性为什么以及如何可以被消除。实验证实该方法可以恢复各光源的颜色、漫反射率,以及每个光源下的漫反射与镜面反射强度(明暗),并且交流照明下的IID对于自动白平衡应用是有效的。
Paper17 NeRFInvertor: High Fidelity NeRF-GAN Inversion for Single-Shot Real Image Animation
摘要原文: Nerf-based Generative models have shown impressive capacity in generating high-quality images with consistent 3D geometry. Despite successful synthesis of fake identity images randomly sampled from latent space, adopting these models for generating face images of real subjects is still a challenging task due to its so-called inversion issue. In this paper, we propose a universal method to surgically fine-tune these NeRF-GAN models in order to achieve high-fidelity animation of real subjects only by a single image. Given the optimized latent code for an out-of-domain real image, we employ 2D loss functions on the rendered image to reduce the identity gap. Furthermore, our method leverages explicit and implicit 3D regularizations using the in-domain neighborhood samples around the optimized latent code to remove geometrical and visual artifacts. Our experiments confirm the effectiveness of our method in realistic, high-fidelity, and 3D consistent animation of real faces on multiple NeRF-GAN models across different datasets.
中文总结: 这段话主要讨论了基于NeRF的生成模型在生成具有一致3D几何的高质量图像方面表现出的出色能力。尽管这些模型能够成功合成从潜在空间随机采样的虚构身份图像,但由于所谓的反演(inversion)问题,用它们来生成真实人物的人脸图像仍然具有挑战性。在本文中,我们提出了一种通用方法,对这些NeRF-GAN模型进行精准的微调,从而仅凭单张图像就能实现对真实人物的高保真动画。给定针对域外真实图像优化得到的潜在编码,我们在渲染图像上使用2D损失函数来缩小身份差距。此外,我们的方法利用优化后潜在编码周围的域内邻域样本,施加显式和隐式的3D正则化,以消除几何和视觉伪影。我们的实验证实了该方法在多个数据集、多个NeRF-GAN模型上都能实现真实、高保真且3D一致的真实人脸动画。
Paper18 Progressive Transformation Learning for Leveraging Virtual Images in Training
摘要原文: To effectively interrogate UAV-based images for detecting objects of interest, such as humans, it is essential to acquire large-scale UAV-based datasets that include human instances with various poses captured from widely varying viewing angles. As a viable alternative to laborious and costly data curation, we introduce Progressive Transformation Learning (PTL), which gradually augments a training dataset by adding transformed virtual images with enhanced realism. Generally, a virtual2real transformation generator in the conditional GAN framework suffers from quality degradation when a large domain gap exists between real and virtual images. To deal with the domain gap, PTL takes a novel approach that progressively iterates the following three steps: 1) select a subset from a pool of virtual images according to the domain gap, 2) transform the selected virtual images to enhance realism, and 3) add the transformed virtual images to the training set while removing them from the pool. In PTL, accurately quantifying the domain gap is critical. To do that, we theoretically demonstrate that the feature representation space of a given object detector can be modeled as a multivariate Gaussian distribution from which the Mahalanobis distance between a virtual object and the Gaussian distribution of each object category in the representation space can be readily computed. Experiments show that PTL results in a substantial performance increase over the baseline, especially in the small data and the cross-domain regime.
中文总结: 为了有效地分析基于无人机的图像以检测感兴趣的目标(如人),必须获取大规模的无人机数据集,其中包含从差异很大的视角拍摄的、姿态各异的人体实例。作为费时费力的数据整理的可行替代方案,我们引入了渐进式变换学习(PTL),通过不断加入真实感增强后的变换虚拟图像来逐步扩充训练数据集。一般来说,当真实图像与虚拟图像之间存在较大领域差距时,条件GAN框架中的虚拟到真实变换生成器会出现质量下降。为了处理领域差距,PTL采用了一种新方法,渐进地迭代以下三个步骤:1)根据领域差距从虚拟图像池中选择一个子集;2)对所选虚拟图像进行变换以增强真实感;3)将变换后的虚拟图像加入训练集,并将其从图像池中移除。在PTL中,准确量化领域差距至关重要。为此,我们从理论上证明了给定目标检测器的特征表示空间可以建模为多元高斯分布,从而可以方便地计算虚拟目标与表示空间中每个目标类别高斯分布之间的马氏距离。实验表明,PTL相对于基线带来了显著的性能提升,特别是在小数据和跨领域的情况下。
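摘要提到"在检测器特征空间里用多元高斯分布建模各类别,并用马氏距离度量虚拟样本的领域差距"。下面用numpy给出马氏距离计算的示意草图;特征维度和协方差正则化项均为假设。

```python
import numpy as np

def fit_gaussian(features: np.ndarray):
    """features: (N, D) 某一类别真实样本的检测器特征。返回均值与协方差的逆矩阵。"""
    mu = features.mean(axis=0)
    cov = np.cov(features, rowvar=False) + 1e-6 * np.eye(features.shape[1])  # 加小量保证可逆
    return mu, np.linalg.inv(cov)

def mahalanobis(x: np.ndarray, mu: np.ndarray, cov_inv: np.ndarray) -> float:
    """单个虚拟样本特征 x 到类别高斯分布的马氏距离,可作为领域差距的度量。"""
    d = x - mu
    return float(np.sqrt(d @ cov_inv @ d))

# 用法示例(随机特征仅作演示)
real_feats = np.random.randn(500, 64)          # 某类别真实样本特征
virtual_feat = np.random.randn(64)             # 一个虚拟样本特征
mu, cov_inv = fit_gaussian(real_feats)
print(mahalanobis(virtual_feat, mu, cov_inv))  # 距离越大,领域差距越大
```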
Paper19 Learning Bottleneck Concepts in Image Classification
摘要原文: Interpreting and explaining the behavior of deep neural networks is critical for many tasks. Explainable AI provides a way to address this challenge, mostly by providing per-pixel relevance to the decision. Yet, interpreting such explanations may require expert knowledge. Some recent attempts toward interpretability adopt a concept-based framework, giving a higher-level relationship between some concepts and model decisions. This paper proposes Bottleneck Concept Learner (BotCL), which represents an image solely by the presence/absence of concepts learned through training over the target task without explicit supervision over the concepts. It uses self-supervision and tailored regularizers so that learned concepts can be human-understandable. Using some image classification tasks as our testbed, we demonstrate BotCL’s potential to rebuild neural networks for better interpretability.
中文总结: 解读和解释深度神经网络的行为对许多任务都至关重要。可解释人工智能提供了一种应对这一挑战的途径,主要是为决策提供逐像素的相关性。然而,解读这类解释可能需要专家知识。最近一些面向可解释性的尝试采用了基于概念的框架,在一些概念与模型决策之间建立更高层次的关联。本文提出了瓶颈概念学习器(BotCL),它仅通过在目标任务训练中学得的概念的存在与否来表示图像,而无需对概念进行显式监督。它使用自监督和定制的正则化器,使学习到的概念可以被人类理解。以若干图像分类任务作为实验平台,我们展示了BotCL在重建神经网络以获得更好可解释性方面的潜力。
Paper20 StyleRes: Transforming the Residuals for Real Image Editing With StyleGAN
摘要原文: We present a novel image inversion framework and a training pipeline to achieve high-fidelity image inversion with high-quality attribute editing. Inverting real images into StyleGAN’s latent space is an extensively studied problem, yet the trade-off between the image reconstruction fidelity and image editing quality remains an open challenge. The low-rate latent spaces are limited in their expressiveness power for high-fidelity reconstruction. On the other hand, high-rate latent spaces result in degradation in editing quality. In this work, to achieve high-fidelity inversion, we learn residual features in higher latent codes that lower latent codes were not able to encode. This enables preserving image details in reconstruction. To achieve high-quality editing, we learn how to transform the residual features for adapting to manipulations in latent codes. We train the framework to extract residual features and transform them via a novel architecture pipeline and cycle consistency losses. We run extensive experiments and compare our method with state-of-the-art inversion methods. Qualitative metrics and visual comparisons show significant improvements.
中文总结: 这段话主要介绍了一个新颖的图像反演框架和训练流程,以实现高保真度的图像反演和高质量的属性编辑。作者指出,将真实图像反演到StyleGAN的潜在空间是一个广泛研究的问题,但在图像重建保真度和编辑质量之间的权衡仍然是一个挑战。低维潜在空间在高保真度重建方面受限,而高维潜在空间会导致编辑质量下降。为了实现高保真度的反演,作者学习了高维潜在代码中的残差特征,以保留图像细节。为了实现高质量的编辑,作者学习了如何转换这些残差特征以适应潜在代码的操作。他们通过新颖的架构流程和循环一致性损失来训练框架,提取和转换残差特征。通过广泛实验和与最先进的反演方法的比较,作者展示了方法的显著改进。
Paper21 Edge-Aware Regional Message Passing Controller for Image Forgery Localization
摘要原文: Digital image authenticity has promoted research on image forgery localization. Although deep learning-based methods achieve remarkable progress, most of them usually suffer from severe feature coupling between the forged and authentic regions. In this work, we propose a two-step Edge-aware Regional Message Passing Controlling strategy to address the above issue. Specifically, the first step is to account for fully exploiting the edge information. It consists of two core designs: context-enhanced graph construction and threshold-adaptive differentiable binarization edge algorithm. The former assembles the global semantic information to distinguish the features between the forged and authentic regions, while the latter stands on the output of the former to provide the learnable edges. In the second step, guided by the learnable edges, a region message passing controller is devised to weaken the message passing between the forged and authentic regions. In this way, our ERMPC is capable of explicitly modeling the inconsistency between the forged and authentic regions and enabling it to perform well on refined forged images. Extensive experiments on several challenging benchmarks show that our method is superior to state-of-the-art image forgery localization methods qualitatively and quantitatively.
中文总结: 数字图像真实性促进了对图像伪造定位的研究。尽管基于深度学习的方法取得了显著进展,但大多数方法通常受到伪造和真实区域之间特征耦合严重的困扰。在这项工作中,我们提出了一个两步边缘感知区域消息传递控制策略来解决上述问题。具体而言,第一步是充分利用边缘信息。它包括两个核心设计:上下文增强图构建和阈值自适应可微分二值化边缘算法。前者汇集全局语义信息以区分伪造和真实区域之间的特征,而后者则依赖于前者的输出提供可学习的边缘。在第二步中,受可学习边缘的指导,设计了一个区域消息传递控制器来减弱伪造和真实区域之间的消息传递。通过这种方式,我们的ERMPC能够明确建模伪造和真实区域之间的不一致性,并使其在精细伪造图像上表现良好。在几个具有挑战性的基准测试上进行的大量实验表明,我们的方法在定性和定量上优于最先进的图像伪造定位方法。
Paper22 Unpaired Image-to-Image Translation With Shortest Path Regularization
摘要原文: Unpaired image-to-image translation aims to learn proper mappings that can map images from one domain to another domain while preserving the content of the input image. However, with large enough capacities, the network can learn to map the inputs to any random permutation of images in another domain. Existing methods treat two domains as discrete and propose different assumptions to address this problem. In this paper, we start from a different perspective and consider the paths connecting the two domains. We assume that the optimal path length between the input and output image should be the shortest among all possible paths. Based on this assumption, we propose a new method to allow generating images along the path and present a simple way to encourage the network to find the shortest path without pair information. Extensive experiments on various tasks demonstrate the superiority of our approach.
中文总结: 这段话主要讲述了未配对的图像到图像的转换旨在学习适当的映射,可以将一个域中的图像映射到另一个域,同时保留输入图像的内容。然而,具有足够大容量的网络可以学习将输入映射到另一个域中的任意随机排列的图像。现有方法将两个域视为离散的,并提出不同的假设来解决这个问题。在本文中,我们从不同的角度出发,考虑连接两个域的路径。我们假设输入和输出图像之间的最佳路径长度应该是所有可能路径中最短的。基于这一假设,我们提出了一种新方法,允许沿着路径生成图像,并提出了一种简单的方法来鼓励网络找到最短路径而无需配对信息。在各种任务上进行的大量实验证明了我们方法的优越性。
Paper23 CR-FIQA: Face Image Quality Assessment by Learning Sample Relative Classifiability
摘要原文: Face image quality assessment (FIQA) estimates the utility of the captured image in achieving reliable and accurate recognition performance. This work proposes a novel FIQA method, CR-FIQA, that estimates the face image quality of a sample by learning to predict its relative classifiability. This classifiability is measured based on the allocation of the training sample feature representation in angular space with respect to its class center and the nearest negative class center. We experimentally illustrate the correlation between the face image quality and the sample relative classifiability. As such property is only observable for the training dataset, we propose to learn this property by probing internal network observations during the training process and utilizing it to predict the quality of unseen samples. Through extensive evaluation experiments on eight benchmarks and four face recognition models, we demonstrate the superiority of our proposed CR-FIQA over state-of-the-art (SOTA) FIQA algorithms.
中文总结: 这段话主要讨论了面部图像质量评估(FIQA)的重要性,以及提出的一种新的FIQA方法CR-FIQA。该方法通过学习预测样本的相对可分类性来估计面部图像的质量。可分类性是基于训练样本特征表示在角度空间中相对于其类中心和最近的负类中心的分配来衡量的。实验证明了面部图像质量与样本相对可分类性之间的相关性。由于这种属性只能在训练数据集中观察到,因此提出在训练过程中通过探测内部网络观察来学习这种属性,并利用它来预测未见样本的质量。通过对八个基准数据集和四个人脸识别模型的广泛评估实验,我们展示了我们提出的CR-FIQA相对于最先进的FIQA算法的优越性。
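"样本相对可分类性"的核心是比较样本特征与其类中心、最近负类中心在角度空间中的接近程度。下面给出一个示意性的打分函数(比值形式与分母的+1偏移均为假设),仅帮助理解,并不代表CR-FIQA的确切定义。

```python
import torch
import torch.nn.functional as F

def relative_classifiability(feat: torch.Tensor, centers: torch.Tensor, label: int) -> float:
    """feat: (D,) 样本特征;centers: (K, D) 各类中心;label: 样本所属类别索引。
    返回"与本类中心的余弦相似度 / 与最近负类中心的余弦相似度"的示意性比值。"""
    feat = F.normalize(feat, dim=0)
    centers = F.normalize(centers, dim=1)
    cos = centers @ feat                       # (K,) 与各类中心的余弦相似度
    pos = cos[label]                           # 与本类中心的接近程度
    neg = cos.clone()
    neg[label] = -float("inf")                 # 屏蔽本类
    nearest_neg = neg.max()                    # 最近负类中心
    return (pos / (nearest_neg + 1.0 + 1e-9)).item()  # +1 把分母移到正区间(假设的处理方式)

# 用法示例(随机特征与类中心仅作演示)
score = relative_classifiability(torch.randn(512), torch.randn(1000, 512), label=3)
print(score)  # 分数越高,样本越"好分",可作为图像质量的代理
```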
Paper24 GeneCIS: A Benchmark for General Conditional Image Similarity
摘要原文: We argue that there are many notions of ‘similarity’ and that models, like humans, should be able to adapt to these dynamically. This contrasts with most representation learning methods, supervised or self-supervised, which learn a fixed embedding function and hence implicitly assume a single notion of similarity. For instance, models trained on ImageNet are biased towards object categories, while a user might prefer the model to focus on colors, textures or specific elements in the scene. In this paper, we propose the GeneCIS (‘genesis’) benchmark, which measures models’ ability to adapt to a range of similarity conditions. Extending prior work, our benchmark is designed for zero-shot evaluation only, and hence considers an open-set of similarity conditions. We find that baselines from powerful CLIP models struggle on GeneCIS and that performance on the benchmark is only weakly correlated with ImageNet accuracy, suggesting that simply scaling existing methods is not fruitful. We further propose a simple, scalable solution based on automatically mining information from existing image-caption datasets. We find our method offers a substantial boost over the baselines on GeneCIS, and further improves zero-shot performance on related image retrieval benchmarks. In fact, though evaluated zero-shot, our model surpasses state-of-the-art supervised models on MIT-States.
中文总结: 本段主要论证了"相似性"存在多种概念,模型应当像人类一样能够动态地适应这些概念。这与大多数(有监督或自监督的)表示学习方法形成对比:后者学习一个固定的嵌入函数,因此隐含地假设了单一的相似性概念。例如,在ImageNet上训练的模型偏向于物体类别,而用户可能更希望模型关注颜色、纹理或场景中的特定元素。作者提出了GeneCIS("genesis")基准,用于衡量模型适应一系列相似性条件的能力。在先前工作的基础上,该基准仅用于零样本评估,因此考虑的是开放的相似性条件集合。作者发现,基于强大CLIP模型的基线在GeneCIS上表现不佳,且基准上的表现与ImageNet准确率仅呈弱相关,这表明简单地扩展现有方法并不奏效。作者进一步提出了一种简单、可扩展的解决方案,基于从现有图像-标题数据集中自动挖掘信息。该方法在GeneCIS上比基线有大幅提升,并进一步改善了相关图像检索基准上的零样本性能。事实上,尽管是零样本评估,作者的模型在MIT-States上超过了最先进的监督模型。
Paper25 Toward Accurate Post-Training Quantization for Image Super Resolution
摘要原文: Model quantization is a crucial step for deploying super resolution (SR) networks on mobile devices. However, existing works focus on quantization-aware training, which requires complete dataset and expensive computational overhead. In this paper, we study post-training quantization(PTQ) for image super resolution using only a few unlabeled calibration images. As the SR model aims to maintain the texture and color information of input images, the distribution of activations are long-tailed, asymmetric and highly dynamic compared with classification models. To this end, we introduce the density-based dual clipping to cut off the outliers based on analyzing the asymmetric bounds of activations. Moreover, we present a novel pixel aware calibration method with the supervision of the full-precision model to accommodate the highly dynamic range of different samples. Extensive experiments demonstrate that the proposed method significantly outperforms existing PTQ algorithms on various models and datasets. For instance, we get a 2.091 dB increase on Urban100 benchmark when quantizing EDSRx4 to 4-bit with 100 unlabeled images. Our code is available at both https://github.com/huawei-noah/Efficient-Computing/tree/master/Quantization/PTQ4SR and https://gitee.com/mindspore/models/tree/master/research/cv/PTQ4SR.
中文总结: 这段话主要讨论了在移动设备上部署超分辨率(SR)网络时,模型量化是一个至关重要的步骤。然而,现有的工作主要集中在量化感知训练上,这需要完整的数据集和昂贵的计算开销。本文研究了使用少量未标记的校准图像进行图像超分辨率的后训练量化(PTQ)。由于SR模型旨在保持输入图像的纹理和颜色信息,激活的分布呈长尾状、不对称且与分类模型相比高度动态。为此,作者引入了基于密度的双剪切方法,通过分析激活的不对称边界来截断异常值。此外,作者提出了一种新颖的像素感知校准方法,通过全精度模型的监督来适应不同样本的高度动态范围。大量实验证明,所提出的方法在各种模型和数据集上明显优于现有的PTQ算法。例如,当将EDSRx4量化为4位并使用100张未标记图像时,我们在Urban100基准上获得了2.091 dB的增益。我们的代码可以在以下链接找到:https://github.com/huawei-noah/Efficient-Computing/tree/master/Quantization/PTQ4SR 和 https://gitee.com/mindspore/models/tree/master/research/cv/PTQ4SR。
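摘要中"非对称边界裁剪+低比特量化"可以用下面这个通用的PTQ小例子来理解:先用校准数据的分位数近似激活的非对称上下界(这一做法是假设),再做4-bit非对称均匀量化。论文中"基于密度的双裁剪"与"像素感知校准"并非如此简单,此处仅为示意。

```python
import torch

def calibrate_bounds(activations: torch.Tensor, lo_q: float = 0.001, hi_q: float = 0.999):
    """用分位数近似激活的非对称上下界,截掉长尾中的离群值。"""
    return torch.quantile(activations, lo_q), torch.quantile(activations, hi_q)

def fake_quant(x: torch.Tensor, lo: torch.Tensor, hi: torch.Tensor, bits: int = 4):
    """非对称均匀量化再反量化(fake quantization),用于评估量化误差。"""
    qmax = 2 ** bits - 1
    scale = (hi - lo) / qmax
    zero_point = torch.round(-lo / scale)
    q = torch.clamp(torch.round(x / scale) + zero_point, 0, qmax)
    return (q - zero_point) * scale

# 用法示例:模拟长尾、不对称的激活分布(随机数据仅作演示)
acts = torch.randn(10000) * 2 + 0.5
lo, hi = calibrate_bounds(acts)
deq = fake_quant(acts, lo, hi, bits=4)
print((acts - deq).abs().mean())  # 平均量化误差
```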
Paper26 CoralStyleCLIP: Co-Optimized Region and Layer Selection for Image Editing
摘要原文: Edit fidelity is a significant issue in open-world controllable generative image editing. Recently, CLIP-based approaches have traded off simplicity to alleviate these problems by introducing spatial attention in a handpicked layer of a StyleGAN. In this paper, we propose CoralStyleCLIP, which incorporates a multi-layer attention-guided blending strategy in the feature space of StyleGAN2 for obtaining high-fidelity edits. We propose multiple forms of our co-optimized region and layer selection strategy to demonstrate the variation of time complexity with the quality of edits over different architectural intricacies while preserving simplicity. We conduct extensive experimental analysis and benchmark our method against state-of-the-art CLIP-based methods. Our findings suggest that CoralStyleCLIP results in high-quality edits while preserving the ease of use.
中文总结: 这段话主要讨论的是在开放世界可控生成图像编辑中,编辑保真度是一个重要问题。最近,基于CLIP的方法通过在StyleGAN的手动选择的层中引入空间注意力来缓解这些问题,但也牺牲了简单性。本文提出了CoralStyleCLIP,该方法在StyleGAN2的特征空间中结合了多层注意力引导的混合策略,以获得高保真度的编辑效果。我们提出了多种形式的区域和层选择策略,以展示在不同架构复杂性下编辑质量与时间复杂度变化的关系,同时保持简单性。我们进行了大量实验分析,并将我们的方法与最先进的基于CLIP的方法进行了基准测试。研究结果表明,CoralStyleCLIP能够产生高质量的编辑效果,同时保持易用性。
Paper27 Initialization Noise in Image Gradients and Saliency Maps
摘要原文: In this paper, we examine gradients of logits of image classification CNNs by input pixel values. We observe that these fluctuate considerably with training randomness, such as the random initialization of the networks. We extend our study to gradients of intermediate layers, obtained via GradCAM, as well as popular network saliency estimators such as DeepLIFT, SHAP, LIME, Integrated Gradients, and SmoothGrad. While empirical noise levels vary, qualitatively different attributions to image features are still possible with all of these, which comes with implications for interpreting such attributions, in particular when seeking data-driven explanations of the phenomenon generating the data. Finally, we demonstrate that the observed artefacts can be removed by marginalization over the initialization distribution by simple stochastic integration.
中文总结: 在这篇论文中,我们研究图像分类CNN的logits对输入像素值的梯度。我们观察到,这些梯度会随训练中的随机因素(例如网络的随机初始化)产生相当大的波动。我们将研究扩展到通过GradCAM获得的中间层梯度,以及DeepLIFT、SHAP、LIME、Integrated Gradients和SmoothGrad等流行的网络显著性估计方法。虽然各方法的经验噪声水平不同,但它们都可能对图像特征给出定性上不同的归因,这对解读这些归因具有重要影响,尤其是在寻求对数据生成现象的数据驱动解释时。最后,我们证明通过简单的随机积分对初始化分布进行边际化,可以消除观察到的伪影。
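论文的结论之一是"对初始化分布做边际化(对多次随机初始化取平均)可以去除显著性图中的初始化噪声"。下面是该思路的极简示意:对若干独立初始化的模型分别计算输入梯度显著性再取平均;模型结构、数量均为假设,且为简洁起见未对模型做训练。

```python
import torch
import torch.nn as nn

def make_model():
    """假设的小分类模型;论文场景中是不同随机种子下训练得到的模型,这里未做训练。"""
    return nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))

def input_gradient_saliency(model: nn.Module, x: torch.Tensor, target: int) -> torch.Tensor:
    """返回目标类 logit 对输入像素的梯度绝对值(最朴素的显著性图)。"""
    x = x.clone().requires_grad_(True)
    logit = model(x)[0, target]
    logit.backward()
    return x.grad.abs().squeeze(0)

def marginalized_saliency(x: torch.Tensor, target: int, n_models: int = 8) -> torch.Tensor:
    """对 n 个独立初始化的模型求显著性并取平均,以边际化初始化带来的随机性。"""
    maps = [input_gradient_saliency(make_model(), x, target) for _ in range(n_models)]
    return torch.stack(maps).mean(dim=0)

sal = marginalized_saliency(torch.randn(1, 3, 32, 32), target=0)
print(sal.shape)  # torch.Size([3, 32, 32])
```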
Paper28 Single Image Depth Prediction Made Better: A Multivariate Gaussian Take
摘要原文: Neural-network-based single image depth prediction (SIDP) is a challenging task where the goal is to predict the scene’s per-pixel depth at test time. Since the problem, by definition, is ill-posed, the fundamental goal is to come up with an approach that can reliably model the scene depth from a set of training examples. In the pursuit of perfect depth estimation, most existing state-of-the-art learning techniques predict a single scalar depth value per-pixel. Yet, it is well-known that the trained model has accuracy limits and can predict imprecise depth. Therefore, an SIDP approach must be mindful of the expected depth variations in the model’s prediction at test time. Accordingly, we introduce an approach that performs continuous modeling of per-pixel depth, where we can predict and reason about the per-pixel depth and its distribution. To this end, we model per-pixel scene depth using a multivariate Gaussian distribution. Moreover, contrary to the existing uncertainty modeling methods—in the same spirit, where per-pixel depth is assumed to be independent, we introduce per-pixel covariance modeling that encodes its depth dependency w.r.t. all the scene points. Unfortunately, per-pixel depth covariance modeling leads to a computationally expensive continuous loss function, which we solve efficiently using the learned low-rank approximation of the overall covariance matrix. Notably, when tested on benchmark datasets such as KITTI, NYU, and SUN-RGB-D, the SIDP model obtained by optimizing our loss function shows state-of-the-art results. Our method’s accuracy (named MG) is among the top on the KITTI depth-prediction benchmark leaderboard.
中文总结: 这段话主要讨论了基于神经网络的单图像深度预测(SIDP)是一项具有挑战性的任务,其目标是在测试时预测场景中每个像素的深度。由于这个问题在定义上是不适定的,基本目标是提出一种能够从一组训练样例中可靠地建模场景深度的方法。为了追求完美的深度估计,大多数现有的最先进学习方法为每个像素预测单个标量深度值。然而,众所周知,训练好的模型存在精度上限,可能给出不精确的深度。因此,SIDP方法必须考虑模型在测试时预测中可能出现的深度变化。为此,我们引入了一种对逐像素深度进行连续建模的方法,能够预测并推理每个像素的深度及其分布。具体而言,我们使用多元高斯分布来建模逐像素的场景深度。此外,与现有的不确定性建模方法(它们假设各像素深度相互独立)不同,我们引入了逐像素协方差建模,编码每个像素深度相对于场景中所有点的依赖关系。然而,逐像素的深度协方差建模会导致计算代价很高的连续损失函数,我们通过学习整体协方差矩阵的低秩近似来高效地解决这一问题。值得注意的是,在KITTI、NYU和SUN-RGB-D等基准数据集上测试时,通过优化我们的损失函数得到的SIDP模型达到了最先进的结果。我们的方法(名为MG)的精度在KITTI深度预测基准排行榜上名列前茅。
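摘要提到"用整体协方差矩阵的低秩近似使逐像素协方差建模的连续损失可以高效计算"。PyTorch自带的LowRankMultivariateNormal恰好可以表达"低秩因子+对角项"形式的协方差,下面给出一个与论文实现无关的极简负对数似然示意;像素数与秩均为假设。

```python
import torch
from torch.distributions import LowRankMultivariateNormal

n_pixels, rank = 1024, 16                      # 假设:把一幅小深度图展平成 1024 维
mean = torch.randn(n_pixels)                   # 网络预测的逐像素深度均值(此处用随机值代替)
cov_factor = torch.randn(n_pixels, rank) * 0.1 # 低秩因子 U,协方差 = U U^T + diag(d)
cov_diag = torch.rand(n_pixels) + 0.1          # 对角项 d,保证协方差正定

dist = LowRankMultivariateNormal(mean, cov_factor, cov_diag)
gt_depth = torch.randn(n_pixels)               # 真值深度(随机数据仅作演示)
nll = -dist.log_prob(gt_depth)                 # 负对数似然,可作为训练损失
print(nll)
```

低秩形式把完整协方差的 O(N^2) 存储和求逆代价降到与秩成正比,这正是摘要中"低秩近似使损失可计算"的含义。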
Paper29 NUWA-LIP: Language-Guided Image Inpainting With Defect-Free VQGAN
摘要原文: Language-guided image inpainting aims to fill the defective regions of an image under the guidance of text while keeping the non-defective regions unchanged. However, directly encoding the defective images is prone to have an adverse effect on the non-defective regions, giving rise to distorted structures on non-defective parts. To better adapt the text guidance to the inpainting task, this paper proposes NUWA-LIP, which involves defect-free VQGAN (DF-VQGAN) and a multi-perspective sequence-to-sequence module (MP-S2S). To be specific, DF-VQGAN introduces relative estimation to carefully control the receptive spreading, as well as symmetrical connections to protect structure details unchanged. For harmoniously embedding text guidance into the locally defective regions, MP-S2S is employed by aggregating the complementary perspectives from low-level pixels, high-level tokens as well as the text description. Experiments show that our DF-VQGAN effectively aids the inpainting process while avoiding unexpected changes in non-defective regions. Results on three open-domain benchmarks demonstrate the superior performance of our method against state-of-the-arts. Our code, datasets, and model will be made publicly available.
中文总结: 这段话主要讨论了语言引导的图像修复技术,旨在在文本的指导下填补图像中的缺陷区域,同时保持非缺陷区域不变。然而,直接对缺陷图像进行编码往往会对非缺陷区域产生不良影响,导致非缺陷部分出现扭曲的结构。为了更好地使文本指导适应修复任务,本文提出了NUWA-LIP方法,其中包括无缺陷VQGAN(DF-VQGAN)和多角度序列到序列模块(MP-S2S)。具体而言,DF-VQGAN引入了相对估计来精细控制感受野扩展,以及对称连接来保护结构细节不变。为了将文本指导和局部缺陷区域融合得更加和谐,MP-S2S通过聚合来自低级像素、高级标记以及文本描述的互补视角。实验表明,我们的DF-VQGAN有效地辅助了修复过程,同时避免了非缺陷区域的意外变化。在三个开放领域基准测试上的结果表明,我们的方法在性能上优于现有技术。我们的代码、数据集和模型将公开发布。
Paper30 ImageBind: One Embedding Space To Bind Them All
摘要原文: We present ImageBind, an approach to learn a joint embedding across six different modalities - images, text, audio, depth, thermal, and IMU data. We show that all combinations of paired data are not necessary to train such a joint embedding, and only image-paired data is sufficient to bind the modalities together. ImageBind can leverage recent large scale vision-language models, and extends their zero-shot capabilities to new modalities just by using their natural pairing with images. It enables novel emergent applications ‘out-of-the-box’ including cross-modal retrieval, composing modalities with arithmetic, cross-modal detection and generation. The emergent capabilities improve with the strength of the image encoder and we set a new state-of-the-art on emergent zero-shot recognition tasks across modalities, outperforming specialist supervised models. Finally, we show strong few-shot recognition results outperforming prior work, and that ImageBind serves as a new way to evaluate vision models for visual and non-visual tasks.
中文总结: 我们提出了ImageBind,一种学习跨六种不同模态(图像、文本、音频、深度、热成像和IMU数据)联合嵌入的方法。我们展示了训练这样一个联合嵌入并不需要所有模态两两配对的数据,仅靠与图像配对的数据就足以把各个模态绑定在一起。ImageBind可以利用最近的大规模视觉-语言模型,并仅通过各模态与图像的天然配对,把这些模型的零样本能力扩展到新的模态。它开箱即用地支持多种新的涌现应用,包括跨模态检索、用算术组合不同模态、跨模态检测与生成。这些涌现能力会随图像编码器的增强而提升,我们在跨模态的涌现零样本识别任务上取得了新的最先进水平,胜过了专门的监督模型。最后,我们展示了强于先前工作的少样本识别结果,并说明ImageBind提供了一种评估视觉模型在视觉与非视觉任务上表现的新途径。
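ImageBind的关键在于"只用与图像成对的数据就能把各模态绑定到同一嵌入空间"。下面以图像-音频这一对模态为例,给出标准InfoNCE(CLIP式对比)损失的草图来说明这种成对对齐;编码器与温度系数均为假设。

```python
import torch
import torch.nn.functional as F

def pairwise_infonce(img_emb: torch.Tensor, other_emb: torch.Tensor, temperature: float = 0.07):
    """img_emb / other_emb: (N, D),第 i 行互为正样本对(例如同一段视频的画面与声音)。
    对称的 InfoNCE 损失把另一模态对齐到图像嵌入空间。"""
    img_emb = F.normalize(img_emb, dim=-1)
    other_emb = F.normalize(other_emb, dim=-1)
    logits = img_emb @ other_emb.t() / temperature
    targets = torch.arange(img_emb.size(0))
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

# 用法示例(随机嵌入仅作演示)
loss = pairwise_infonce(torch.randn(32, 1024), torch.randn(32, 1024))
# 对深度、热成像、IMU 等其他模态,同样只需各自与图像成对地计算这个损失;
# 它们之间无需配对数据,就会通过图像这一共同"锚"被间接对齐。
```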
Paper31 UMat: Uncertainty-Aware Single Image High Resolution Material Capture
摘要原文: We propose a learning-based method to recover normals, specularity, and roughness from a single diffuse image of a material, using microgeometry appearance as our primary cue. Previous methods that work on single images tend to produce over-smooth outputs with artifacts, operate at limited resolution, or train one model per class with little room for generalization. In contrast, in this work, we propose a novel capture approach that leverages a generative network with attention and a U-Net discriminator, which shows outstanding performance integrating global information at reduced computational complexity. We showcase the performance of our method with a real dataset of digitized textile materials and show that a commodity flatbed scanner can produce the type of diffuse illumination required as input to our method. Additionally, because the problem might be ill-posed --more than a single diffuse image might be needed to disambiguate the specular reflection-- or because the training dataset is not representative enough of the real distribution, we propose a novel framework to quantify the model’s confidence about its prediction at test time. Our method is the first one to deal with the problem of modeling uncertainty in material digitization, increasing the trustworthiness of the process and enabling more intelligent strategies for dataset creation, as we demonstrate with an active learning experiment.
中文总结: 我们提出了一种基于学习的方法,以微观几何外观作为主要线索,从材质的单张漫反射图像中恢复法线、镜面反射和粗糙度。以往针对单张图像的方法往往会产生带伪影的过度平滑结果、只能在有限分辨率下工作,或者为每个材质类别训练一个模型而缺乏泛化空间。相比之下,我们提出了一种新的捕获方案,利用带注意力机制的生成网络和U-Net判别器,在较低的计算复杂度下很好地整合了全局信息。我们在一个真实的数字化纺织材质数据集上展示了方法的性能,并说明一台普通的平板扫描仪就能提供我们方法所需的漫反射照明输入。此外,由于该问题可能是不适定的(可能需要不止一张漫反射图像才能消除镜面反射的歧义),或者训练数据集不足以代表真实分布,我们提出了一个新的框架,在测试时量化模型对自身预测的置信度。我们的方法是第一个处理材质数字化中不确定性建模问题的方法,提高了整个流程的可信度,并为数据集构建带来了更智能的策略,正如我们在主动学习实验中展示的那样。
Paper32 Rethinking Image Super Resolution From Long-Tailed Distribution Learning Perspective
摘要原文: Existing studies have empirically observed that the resolution of the low-frequency region is easier to enhance than that of the high-frequency one. Although plentiful works have been devoted to alleviating this problem, little understanding is given to explain it. In this paper, we try to give a feasible answer from a machine learning perspective, i.e., the twin fitting problem caused by the long-tailed pixel distribution in natural images. With this explanation, we reformulate image super resolution (SR) as a long-tailed distribution learning problem and solve it by bridging the gaps of the problem between in low- and high-level vision tasks. As a result, we design a long-tailed distribution learning solution, that rebalances the gradients from the pixels in the low- and high-frequency region, by introducing a static and a learnable structure prior. The learned SR model achieves better balance on the fitting of the low- and high-frequency region so that the overall performance is improved. In the experiments, we evaluate the solution on four CNN- and one Transformer-based SR models w.r.t. six datasets and three tasks, and experimental results demonstrate its superiority.
中文总结: 这段话主要讨论了现有研究观察到低频区域的分辨率比高频区域更容易增强,但对于这一问题的解释仍较少。作者从机器学习的角度尝试给出一个可行的解释,即自然图像中长尾像素分布引起的双重拟合问题。作者将图像超分辨率(SR)重新表述为长尾分布学习问题,并通过解决低级和高级视觉任务之间的问题差距来解决它。作者设计了一个长尾分布学习解决方案,通过引入静态和可学习的结构先验,重新平衡低频和高频区域像素的梯度,从而提高整体性能。实验中,作者评估了这个解决方案在四个基于CNN和一个基于Transformer的SR模型上,涉及六个数据集和三个任务,实验结果表明其优越性。
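"重新平衡低频与高频区域像素的梯度"最直观的做法之一是给高频区域更大的损失权重。下面用拉普拉斯算子估计高频区域并构造加权L1损失,作为这一思想的示意;权重形式与系数均为假设,并非论文中"静态+可学习结构先验"的实现。

```python
import torch
import torch.nn.functional as F

_LAPLACIAN = torch.tensor([[0., 1., 0.], [1., -4., 1.], [0., 1., 0.]]).view(1, 1, 3, 3)

def high_freq_weight(img: torch.Tensor, alpha: float = 4.0) -> torch.Tensor:
    """img: (B, C, H, W) 真值高分辨率图。用拉普拉斯响应幅值构造逐像素权重:
    高频(边缘/纹理)区域权重更大,低频平坦区域权重接近 1。"""
    gray = img.mean(dim=1, keepdim=True)
    hf = F.conv2d(gray, _LAPLACIAN.to(img), padding=1).abs()
    return 1.0 + alpha * hf / (hf.amax(dim=(2, 3), keepdim=True) + 1e-6)

def reweighted_l1(pred: torch.Tensor, gt: torch.Tensor) -> torch.Tensor:
    """加权 L1 损失:高频像素贡献更大的梯度,从而缓解长尾带来的拟合不平衡。"""
    w = high_freq_weight(gt)
    return (w * (pred - gt).abs()).mean()

# 用法示例(随机图像仅作演示)
loss = reweighted_l1(torch.rand(2, 3, 64, 64), torch.rand(2, 3, 64, 64))
```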
Paper33 Semi-Supervised Parametric Real-World Image Harmonization
摘要原文: Learning-based image harmonization techniques are usually trained to undo synthetic global transformations, applied to a masked foreground in a single ground truth photo. This simulated data does not model many important appearance mismatches (illumination, object boundaries, etc.) between foreground and background in real composites, leading to models that do not generalize well and cannot model complex local changes. We propose a new semi-supervised training strategy that addresses this problem and lets us learn complex local appearance harmonization from unpaired real composites, where foreground and background come from different images. Our model is fully parametric. It uses RGB curves to correct the global colors and tone and a shading map to model local variations. Our approach outperforms previous work on established benchmarks and real composites, as shown in a user study, and processes high-resolution images interactively. The code and project page is available at https://kewang0622.github.io/sprih/.
中文总结: 这段话主要讨论了基于学习的图像和谐化技术通常被训练用于撤销施加在单张真值照片中被遮罩前景上的合成全局变换。这类模拟数据无法建模真实合成图像中前景与背景之间许多重要的外观不匹配(如光照、物体边界等),导致模型泛化能力不强,也无法建模复杂的局部变化。作者提出了一种新的半监督训练策略来解决这一问题,使模型能够从前景和背景来自不同图像的未配对真实合成图上学习复杂的局部外观和谐化。他们的模型是完全参数化的:使用RGB曲线来校正全局颜色和色调,使用阴影图来建模局部变化。用户研究表明,作者的方法在已有基准和真实合成图上优于先前工作,并且可以交互式地处理高分辨率图像。代码和项目页面可在https://kewang0622.github.io/sprih/上找到。
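摘要中的参数化模型由"全局RGB曲线+局部阴影图"组成。下面给出一个示意性的合成函数:用分段线性曲线逐通道调整前景颜色,再乘以上采样后的低分辨率阴影图,仅修改前景区域;控制点数量、插值与上采样方式均为假设。

```python
import torch
import torch.nn.functional as F

def apply_rgb_curve(img: torch.Tensor, curve: torch.Tensor) -> torch.Tensor:
    """img: (B, 3, H, W),取值 [0,1];curve: (B, 3, K) 每通道 K 个控制点的分段线性曲线。"""
    b, c, h, w = img.shape
    k = curve.shape[-1]
    idx = img * (k - 1)                              # 像素值映射到控制点坐标
    lo = idx.floor().long().clamp(0, k - 2)          # 左侧控制点下标
    frac = idx - lo.float()                          # 线性插值权重
    flat = curve.unsqueeze(-1).expand(b, c, k, h * w)
    c_lo = torch.gather(flat, 2, lo.reshape(b, c, 1, h * w)).reshape(b, c, h, w)
    c_hi = torch.gather(flat, 2, (lo + 1).reshape(b, c, 1, h * w)).reshape(b, c, h, w)
    return c_lo + frac * (c_hi - c_lo)               # 查表 + 线性插值

def harmonize(fg, mask, curve, shading_lowres):
    """fg: 前景合成图;mask: (B,1,H,W) 前景掩码;shading_lowres: (B,1,h,w) 低分辨率阴影图。"""
    shading = F.interpolate(shading_lowres, size=fg.shape[-2:], mode="bilinear", align_corners=False)
    adjusted = apply_rgb_curve(fg, curve) * shading  # 全局曲线调色 + 局部阴影调制
    return mask * adjusted + (1 - mask) * fg         # 只修改前景区域,背景保持不变

# 用法示例(随机参数仅作演示)
out = harmonize(torch.rand(1, 3, 256, 256), torch.ones(1, 1, 256, 256),
                torch.rand(1, 3, 16), torch.rand(1, 1, 32, 32))
```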
Paper34 High-Res Facial Appearance Capture From Polarized Smartphone Images
摘要原文: We propose a novel method for high-quality facial texture reconstruction from RGB images using a novel capturing routine based on a single smartphone which we equip with an inexpensive polarization foil. Specifically, we turn the flashlight into a polarized light source and add a polarization filter on top of the camera. Leveraging this setup, we capture the face of a subject with cross-polarized and parallel-polarized light. For each subject, we record two short sequences in a dark environment under flash illumination with different light polarization using the modified smartphone. Based on these observations, we reconstruct an explicit surface mesh of the face using structure from motion. We then exploit the camera and light co-location within a differentiable renderer to optimize the facial textures using an analysis-by-synthesis approach. Our method optimizes for high-resolution normal textures, diffuse albedo, and specular albedo using a coarse-to-fine optimization scheme. We show that the optimized textures can be used in a standard rendering pipeline to synthesize high-quality photo-realistic 3D digital humans in novel environments.
中文总结: 本文提出了一种新颖的方法,通过为一部智能手机配备廉价的偏振薄膜并设计新的拍摄流程,从RGB图像中重建高质量的面部纹理。具体而言,作者将手机闪光灯变为偏振光源,并在相机镜头前加装偏振滤光片。利用这一装置,可以分别在交叉偏振光和平行偏振光下拍摄被试者的面部:对每个被试者,在黑暗环境中用改装后的手机在闪光照明下记录两段偏振方向不同的短序列。基于这些观测,作者利用运动恢复结构(structure from motion)重建面部的显式表面网格,然后借助可微分渲染器中相机与光源共位的特性,以分析-合成(analysis-by-synthesis)的方式优化面部纹理。该方法采用由粗到细的优化方案,求解高分辨率的法线纹理、漫反射反照率和镜面反照率。结果表明,优化得到的纹理可用于标准渲染管线,在新环境中合成高质量、逼真的3D数字人。
Paper35 ConZIC: Controllable Zero-Shot Image Captioning by Sampling-Based Polishing
摘要原文: Zero-shot capability has been considered as a new revolution of deep learning, letting machines work on tasks without curated training data. As a good start and the only existing outcome of zero-shot image captioning (IC), ZeroCap abandons supervised training and sequentially searching every word in the caption using the knowledge of large-scale pre-trained models. Though effective, its autoregressive generation and gradient-directed searching mechanism limit the diversity of captions and inference speed, respectively. Moreover, ZeroCap does not consider the controllability issue of zero-shot IC. To move forward, we propose a framework for Controllable Zero-shot IC, named ConZIC. The core of ConZIC is a novel sampling-based non-autoregressive language model named GibbsBERT, which can generate and continuously polish every word. Extensive quantitative and qualitative results demonstrate the superior performance of our proposed ConZIC for both zero-shot IC and controllable zero-shot IC. Especially, ConZIC achieves about 5x faster generation speed than ZeroCap, and about 1.5x higher diversity scores, with accurate generation given different control signals.
中文总结: 这段话主要讨论了零样本能力被视为深度学习的新革命,使得机器能够在没有经过策划的训练数据的情况下处理任务。其中提到了零样本图像描述(IC)的唯一现有成果ZeroCap,它放弃了监督训练,使用大规模预训练模型的知识顺序搜索每个单词来生成描述。然而,ZeroCap的自回归生成和梯度导向搜索机制限制了描述的多样性和推理速度。此外,ZeroCap没有考虑零样本IC的可控性问题。为了进一步发展,提出了一个名为ConZIC的可控零样本IC框架,其核心是一种名为GibbsBERT的新型基于采样的非自回归语言模型,可以生成并不断优化每个单词。大量定量和定性结果显示了我们提出的ConZIC在零样本IC和可控零样本IC方面的卓越性能。特别是,ConZIC的生成速度比ZeroCap快大约5倍,多样性得分高出约1.5倍,并且在给定不同控制信号的情况下生成准确。
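下面用一段示意代码说明“基于采样、逐位置持续打磨”的非自回归生成思路(非GibbsBERT官方实现;lm_score、clip_score等接口与打磨策略均为基于摘要的自拟假设):

```python
import random

def gibbs_polish_caption(length, vocab, lm_score, clip_score, iters=20, topk=5):
    """基于采样的非自回归生成/打磨示意:每轮随机选一个位置,
    由语言模型给出候选词,再按与图像的匹配分数重新排序后替换该位置。

    lm_score(tokens, pos) -> [(word, score), ...]  # 假设的掩码语言模型接口
    clip_score(tokens)    -> float                 # 假设的图文匹配打分接口
    """
    tokens = [random.choice(vocab) for _ in range(length)]   # 随机初始化整句
    for _ in range(iters):
        pos = random.randrange(length)
        candidates = sorted(lm_score(tokens, pos), key=lambda x: -x[1])[:topk]
        # 在语言模型的top-k候选中,选图文匹配分数最高者,持续“打磨”该位置
        best = max(candidates,
                   key=lambda c: clip_score(tokens[:pos] + [c[0]] + tokens[pos + 1:]))
        tokens[pos] = best[0]
    return " ".join(tokens)
```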
Paper36 EXIF As Language: Learning Cross-Modal Associations Between Images and Camera Metadata
摘要原文: We learn a visual representation that captures information about the camera that recorded a given photo. To do this, we train a multimodal embedding between image patches and the EXIF metadata that cameras automatically insert into image files. Our model represents this metadata by simply converting it to text and then processing it with a transformer. The features that we learn significantly outperform other self-supervised and supervised features on downstream image forensics and calibration tasks. In particular, we successfully localize spliced image regions “zero shot” by clustering the visual embeddings for all of the patches within an image.
中文总结: 这段话的主要内容是学习一种能够捕捉“拍摄这张照片的相机信息”的视觉表示。作者在图像补丁与相机自动写入图像文件的EXIF元数据之间训练多模态嵌入:模型把元数据直接转换为文本,再用Transformer进行处理。所学到的特征在下游的图像取证和相机标定任务上明显优于其他自监督和有监督特征。特别地,通过对一张图像内所有补丁的视觉嵌入进行聚类,他们成功地以“零样本”方式定位了拼接篡改区域。
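下面的示意代码展示两件事:把EXIF元数据直接序列化为文本以便像“语言”一样编码,以及按摘要所述通过对图内patch嵌入做二类聚类来“零样本”定位拼接区域(嵌入如何得到、聚类细节均为假设,此处仅演示流程):

```python
import numpy as np
from sklearn.cluster import KMeans

def exif_to_text(exif: dict) -> str:
    """把相机EXIF元数据直接序列化为文本,供文本Transformer编码(示意)。"""
    return " ".join(f"{k}: {v}" for k, v in exif.items())

def localize_splice(patch_embeddings: np.ndarray, grid_hw: tuple) -> np.ndarray:
    """“零样本”拼接定位示意:对一张图内所有patch的视觉嵌入做二类聚类,
    数量较少的那个簇通常对应来自另一台相机的拼接区域(此为假设约定)。"""
    labels = KMeans(n_clusters=2, n_init=10).fit_predict(patch_embeddings)
    minority = 0 if (labels == 0).sum() < (labels == 1).sum() else 1
    return (labels == minority).astype(np.uint8).reshape(grid_hw)

# 用法示意(EXIF字段与嵌入均为随机/自拟内容)
print(exif_to_text({"Make": "Canon", "FNumber": 1.8, "ISOSpeedRatings": 200}))
emb = np.random.randn(64, 128)          # 8x8网格patch的嵌入(随机示意)
mask = localize_splice(emb, (8, 8))
```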
Paper37 HAAV: Hierarchical Aggregation of Augmented Views for Image Captioning
摘要原文: A great deal of progress has been made in image captioning, driven by research into how to encode the image using pre-trained models. This includes visual encodings (e.g. image grid features or detected objects) and more recently textual encodings (e.g. image tags or text descriptions of image regions). As more advanced encodings are available and incorporated, it is natural to ask: how to efficiently and effectively leverage the heterogeneous set of encodings? In this paper, we propose to regard the encodings as augmented views of the input image. The image captioning model encodes each view independently with a shared encoder efficiently, and a contrastive loss is incorporated across the encoded views in a novel way to improve their representation quality and the model’s data efficiency. Our proposed hierarchical decoder then adaptively weighs the encoded views according to their effectiveness for caption generation by first aggregating within each view at the token level, and then across views at the view level. We demonstrate significant performance improvements of +5.6% CIDEr on MS-COCO and +12.9% CIDEr on Flickr30k compared to the state of the art.
中文总结: 这段话主要讨论了在图像字幕生成方面取得的巨大进展,这得益于对如何使用预训练模型对图像进行编码的研究。这包括视觉编码(例如图像网格特征或检测到的对象)以及最近的文本编码(例如图像标签或图像区域的文本描述)。随着更先进的编码方法的出现和应用,人们自然会问:如何高效有效地利用这种异构编码集?在本文中,我们建议将这些编码视为输入图像的增强视图。图像字幕生成模型使用共享编码器独立地对每个视图进行编码,并以一种新颖的方式跨编码视图引入对比损失,以提高它们的表示质量和模型的数据效率。我们提出的分层解码器根据它们对生成字幕的有效性自适应地对编码视图进行加权,首先在标记级别内部聚合,然后在视图级别跨视图聚合。我们在MS-COCO上实现了+5.6%的CIDEr和在Flickr30k上实现了+12.9%的CIDEr的显著性能改进,相比于现有技术水平。
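下面是“先在视图内按token聚合、再在视图之间加权求和”的分层聚合示意(非官方实现;打分用的线性层作为参数传入,形状与接口均为自拟):

```python
import torch
import torch.nn.functional as F

def hierarchical_aggregate(views, token_scorer, view_scorer):
    """分层聚合示意:先在每个视图内部对token加权汇聚,再在视图之间加权求和。

    views: 长度为V的列表,每项为(B, N_i, D)的编码结果(各视图token数可不同)
    token_scorer / view_scorer: 输出维度为1的打分线性层(假设的参数)
    """
    pooled = []
    for v in views:
        w = F.softmax(token_scorer(v), dim=1)        # (B, N_i, 1) token级权重
        pooled.append((w * v).sum(dim=1))            # (B, D)
    stacked = torch.stack(pooled, dim=1)             # (B, V, D)
    vw = F.softmax(view_scorer(stacked), dim=1)      # (B, V, 1) 视图级权重
    return (vw * stacked).sum(dim=1)                 # (B, D)

# 用法示意:三个token数不同的“增强视图”
B, D = 2, 16
views = [torch.randn(B, n, D) for n in (5, 7, 3)]
token_scorer = torch.nn.Linear(D, 1)
view_scorer = torch.nn.Linear(D, 1)
fused = hierarchical_aggregate(views, token_scorer, view_scorer)   # (B, D)
```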
Paper38 Zero-Shot Everything Sketch-Based Image Retrieval, and in Explainable Style
摘要原文: This paper studies the problem of zero-shot sketch-based image retrieval (ZS-SBIR), however with two significant differentiators to prior art: (i) we tackle all variants (inter-category, intra-category, and cross datasets) of ZS-SBIR with just one network (“everything”), and (ii) we would really like to understand how this sketch-photo matching operates (“explainable”). Our key innovation lies with the realization that such a cross-modal matching problem could be reduced to comparisons of groups of key local patches – akin to the seasoned “bag-of-words” paradigm. Just with this change, we are able to achieve both of the aforementioned goals, with the added benefit of no longer requiring external semantic knowledge. Technically, ours is a transformer-based cross-modal network, with three novel components (i) a self-attention module with a learnable tokenizer to produce visual tokens that correspond to the most informative local regions, (ii) a cross-attention module to compute local correspondences between the visual tokens across two modalities, and finally (iii) a kernel-based relation network to assemble local putative matches and produce an overall similarity metric for a sketch-photo pair. Experiments show ours indeed delivers superior performances across all ZS-SBIR settings. The all important explainable goal is elegantly achieved by visualizing cross-modal token correspondences, and for the first time, via sketch to photo synthesis by universal replacement of all matched photo patches.
中文总结: 这篇论文研究零样本的基于草图的图像检索(ZS-SBIR)问题,但与先前工作有两个显著不同之处:(i)我们用同一个网络解决ZS-SBIR的所有变体(跨类别、类内以及跨数据集),即“一切”;(ii)我们希望真正理解这种草图-照片匹配是如何运作的,即“可解释”。我们的关键创新在于意识到,这种跨模态匹配问题可以化归为若干组关键局部补丁之间的比较,类似于经典的“词袋”范式。仅凭这一转变,我们就能同时实现上述两个目标,还带来一个额外好处:不再需要外部语义知识。技术上,我们的方法是一个基于Transformer的跨模态网络,包含三个新组件:(i)带可学习分词器的自注意力模块,用于产生对应最具信息量局部区域的视觉token;(ii)交叉注意力模块,用于计算两种模态的视觉token之间的局部对应关系;(iii)基于核的关系网络,用于汇集局部候选匹配并为草图-照片对给出整体相似度。实验表明,我们的方法在所有ZS-SBIR设定下都取得了更优的性能。至关重要的可解释性目标也得以优雅实现:一方面可视化跨模态token对应关系,另一方面首次通过对所有匹配到的照片补丁做统一替换,实现了从草图到照片的合成。
Paper39 Context-Based Trit-Plane Coding for Progressive Image Compression
摘要原文: Trit-plane coding enables deep progressive image compression, but it cannot use autoregressive context models. In this paper, we propose the context-based trit-plane coding (CTC) algorithm to achieve progressive compression more compactly. First, we develop the context-based rate reduction module to estimate trit probabilities of latent elements accurately and thus encode the trit-planes compactly. Second, we develop the context-based distortion reduction module to refine partial latent tensors from the trit-planes and improve the reconstructed image quality. Third, we propose a retraining scheme for the decoder to attain better rate-distortion tradeoffs. Extensive experiments show that CTC outperforms the baseline trit-plane codec significantly, e.g. by -14.84% in BD-rate on the Kodak lossless dataset, while increasing the time complexity only marginally. The source codes are available at https://github.com/seungminjeon-github/CTC.
中文总结: 这篇论文提出了基于上下文的trit平面(三进制位平面)编码(CTC)算法,以更紧凑地实现渐进式图像压缩。首先,作者设计了基于上下文的码率降低模块,准确估计潜变量元素的trit概率,从而更紧凑地编码各个trit平面。其次,作者设计了基于上下文的失真降低模块,利用已解码的trit平面细化部分潜变量张量,提高重建图像质量。第三,作者提出了针对解码器的再训练方案,以获得更好的率失真折衷。大量实验表明,CTC显著优于基线trit平面编解码器,例如在Kodak数据集上将BD-rate降低了14.84%,而时间复杂度仅略有增加。源代码可在 https://github.com/seungminjeon-github/CTC 获取。
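为帮助理解“trit平面”的渐进式含义,下面给出一个与上下文模型无关的极简示意:把非负整数潜变量按三进制逐位拆成平面,先传输高位平面即可得到粗略重建(实际编解码器中潜变量的取值范围、熵编码与上下文建模均未涉及,数值仅为演示):

```python
import numpy as np

def to_trit_planes(latent: np.ndarray, num_planes: int):
    """把非负整数潜变量分解为若干个三进制位平面(高位在前),示意实现。"""
    planes = []
    for p in reversed(range(num_planes)):
        planes.append((latent // (3 ** p)) % 3)
    return planes

def progressive_reconstruct(planes):
    """依次叠加已收到的位平面,得到渐进式的重建值(未收到的低位按0处理)。"""
    num_planes = len(planes)
    value = np.zeros_like(planes[0])
    for i, plane in enumerate(planes):
        value += plane * (3 ** (num_planes - 1 - i))
    return value

# 用法示意:只解码前2个(最高位)平面即可得到粗略重建
latent = np.array([[7, 20], [0, 26]])
planes = to_trit_planes(latent, num_planes=3)
coarse = progressive_reconstruct(planes[:2] + [np.zeros_like(latent)])  # 粗略值
full = progressive_reconstruct(planes)                                  # 等于原latent
```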
Paper40 Structure Aggregation for Cross-Spectral Stereo Image Guided Denoising
摘要原文: To obtain clean images with salient structures from noisy observations, a growing trend in current denoising studies is to seek the help of additional guidance images with high signal-to-noise ratios, which are often acquired in different spectral bands such as near infrared. Although previous guided denoising methods basically require the input images to be well-aligned, a more common way to capture the paired noisy target and guidance images is to exploit a stereo camera system. However, current studies on cross-spectral stereo matching cannot fully guarantee the pixel-level registration accuracy, and rarely consider the case of noise contamination. In this work, for the first time, we propose a guided denoising framework for cross-spectral stereo images. Instead of aligning the input images via conventional stereo matching, we aggregate structures from the guidance image to estimate a clean structure map for the noisy target image, which is then used to regress the final denoising result with a spatially variant linear representation model. Based on this, we design a neural network, called as SANet, to complete the entire guided denoising process. Experimental results show that, our SANet can effectively transfer structures from an unaligned guidance image to the restoration result, and outperforms state-of-the-art denoisers on various stereo image datasets. Besides, our structure aggregation strategy also shows its potential to handle other unaligned guided restoration tasks such as super-resolution and deblurring. The source code is available at https://github.com/lustrouselixir/SANet.
中文总结: 这段话主要介绍了当前去噪研究中的一个新趋势,即利用具有高信噪比的额外引导图像来获取具有显著结构的清晰图像,这些引导图像通常是在不同光谱波段(如近红外)中获取的。尽管先前的引导去噪方法基本上要求输入图像对齐良好,但捕获配对的噪声目标和引导图像的更常见方式是利用立体摄像机系统。然而,目前关于跨光谱立体匹配的研究不能完全保证像素级的注册精度,并且很少考虑噪声污染的情况。在这项工作中,我们首次提出了一种用于跨光谱立体图像的引导去噪框架。我们不是通过传统的立体匹配来对齐输入图像,而是从引导图像中聚合结构,以估计噪声目标图像的清晰结构图,然后利用这个结构图通过空间变异线性表示模型回归最终的去噪结果。基于此,我们设计了一个名为SANet的神经网络,来完成整个引导去噪过程。实验结果表明,我们的SANet可以有效地将结构从不对齐的引导图像传输到恢复结果,并在各种立体图像数据集上优于最先进的去噪器。此外,我们的结构聚合策略还显示出处理其他不对齐引导恢复任务(如超分辨率和去模糊)的潜力。源代码可在https://github.com/lustrouselixir/SANet找到。
Paper41 Imagen Editor and EditBench: Advancing and Evaluating Text-Guided Image Inpainting
摘要原文: Text-guided image editing can have a transformative impact in supporting creative applications. A key challenge is to generate edits that are faithful to the input text prompt, while consistent with the input image. We present Imagen Editor, a cascaded diffusion model, built by fine-tuning Imagen on text-guided image inpainting. Imagen Editor’s edits are faithful to the text prompts, which is accomplished by incorporating object detectors for proposing inpainting masks during training. In addition, text-guided image inpainting captures fine details in the input image by conditioning the cascaded pipeline on the original high resolution image. To improve qualitative and quantitative evaluation, we introduce EditBench, a systematic benchmark for text-guided image inpainting. EditBench evaluates inpainting edits on natural and generated images exploring objects, attributes, and scenes. Through extensive human evaluation on EditBench, we find that object-masking during training leads to across-the-board improvements in text-image alignment – such that Imagen Editor is preferred over DALL-E 2 and Stable Diffusion – and, as a cohort, these models are better at object-rendering than text-rendering, and handle material/color/size attributes better than count/shape attributes.
中文总结: 这段话主要讨论了文本引导的图像编辑在支持创意应用方面具有转变性影响。其中一个关键挑战是生成与输入文本提示保持一致且与输入图像一致的编辑。作者提出了Imagen Editor,这是一个级联扩散模型,通过在文本引导的图像修补上微调Imagen构建而成。Imagen Editor的编辑结果忠实于文本提示,这是通过在训练过程中结合对象检测器提出修补掩模来实现的。此外,文本引导的图像修补通过在级联管道中对原始高分辨率图像进行条件化,捕捉了输入图像中的细节。为了改进定性和定量评估,作者引入了EditBench,这是一个系统性的用于文本引导的图像修补的基准。EditBench在自然和生成图像上评估修补编辑,探索对象、属性和场景。通过对EditBench进行广泛的人类评估,作者发现在训练过程中进行对象遮罩会在文本-图像对齐方面带来全面的改进,使得Imagen Editor优于DALL-E 2和Stable Diffusion。总的来说,这些模型在对象渲染方面优于文本渲染,并且比较擅长处理材料/颜色/大小属性,而不太擅长处理数量/形状属性。
Paper42 SmallCap: Lightweight Image Captioning Prompted With Retrieval Augmentation
摘要原文: Recent advances in image captioning have focused on scaling the data and model size, substantially increasing the cost of pre-training and finetuning. As an alternative to large models, we present SmallCap, which generates a caption conditioned on an input image and related captions retrieved from a datastore. Our model is lightweight and fast to train as the only learned parameters are in newly introduced cross-attention layers between a pre-trained CLIP encoder and GPT-2 decoder. SmallCap can transfer to new domains without additional finetuning and can exploit large-scale data in a training-free fashion since the contents of the datastore can be readily replaced. Our experiments show that SmallCap, trained only on COCO, has competitive performance on this benchmark, and also transfers to other domains without retraining, solely through retrieval from target-domain data. Further improvement is achieved through the training-free exploitation of diverse human-labeled and web data, which proves effective for a range of domains, including the nocaps benchmark, designed to test generalization to unseen visual concepts.
中文总结: 这段话的主要内容是介绍了图像字幕生成领域的最新进展。近期的研究主要集中在扩展数据和模型规模,大幅增加了预训练和微调的成本。作为大型模型的替代方案,作者提出了SmallCap模型,该模型在生成图像字幕时受到检索自数据存储中的相关图像标注的影响。SmallCap模型轻量且训练速度快,因为其唯一的学习参数位于预训练的CLIP编码器和GPT-2解码器之间新引入的交叉注意力层中。SmallCap模型可以在不需要额外微调的情况下迁移到新领域,并且可以在训练过程中利用大规模数据,因为数据存储中的内容可以方便地替换。实验证明,仅在COCO数据集上训练的SmallCap模型在该基准测试上具有竞争力的性能,并且可以在不重新训练的情况下迁移到其他领域,仅通过从目标领域数据中检索。通过训练无关的利用各种人工标注和网络数据,进一步提高了模型的性能,这对包括用于测试对未见视觉概念的泛化能力的nocaps基准测试在内的各种领域都证明有效。
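下面用一段示意代码展示“检索增强提示”的构造流程:按与图像嵌入的相似度从数据库检索最相关的若干条描述并拼成提示(提示模板、嵌入维度与数据库内容均为自拟假设,解码器部分省略):

```python
import numpy as np

def build_ra_prompt(image_emb, datastore_embs, datastore_caps, k=4):
    """检索增强提示构造示意:按余弦相似度取出k条相关描述拼成提示,
    再交给带交叉注意力的解码器生成最终字幕(解码部分此处省略)。"""
    sims = datastore_embs @ image_emb / (
        np.linalg.norm(datastore_embs, axis=1) * np.linalg.norm(image_emb) + 1e-8)
    topk = np.argsort(-sims)[:k]
    retrieved = [datastore_caps[i] for i in topk]
    return "Similar images show " + "; ".join(retrieved) + ". This image shows "

# 用法示意(嵌入与数据库均为随机/自拟内容)
caps = ["a dog on the grass", "a cat on a sofa", "a man riding a bike"]
embs = np.random.randn(3, 512)
img = np.random.randn(512)
print(build_ra_prompt(img, embs, caps, k=2))
```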
Paper43 Learning Generative Structure Prior for Blind Text Image Super-Resolution
摘要原文: Blind text image super-resolution (SR) is challenging as one needs to cope with diverse font styles and unknown degradation. To address the problem, existing methods perform character recognition in parallel to regularize the SR task, either through a loss constraint or intermediate feature condition. Nonetheless, the high-level prior could still fail when encountering severe degradation. The problem is further compounded given characters of complex structures, e.g., Chinese characters that combine multiple pictographic or ideographic symbols into a single character. In this work, we present a novel prior that focuses more on the character structure. In particular, we learn to encapsulate rich and diverse structures in a StyleGAN and exploit such generative structure priors for restoration. To restrict the generative space of StyleGAN so that it obeys the structure of characters yet remains flexible in handling different font styles, we store the discrete features for each character in a codebook . The code subsequently drives the StyleGAN to generate high-resolution structural details to aid text SR. Compared to priors based on character recognition, the proposed structure prior exerts stronger character-specific guidance to restore faithful and precise strokes of a designated character. Extensive experiments on synthetic and real datasets demonstrate the compelling performance of the proposed generative structure prior in facilitating robust text SR. Our code is available at https://github.com/csxmli2016/MARCONet.
中文总结: 这段话主要讨论了文本图像盲超分辨率(blind text image SR)这一挑战性问题,其中需要应对各种字体风格和未知的降质。为了解决这个问题,现有方法在进行SR任务时会同时进行字符识别,通过损失约束或中间特征条件来规范化SR任务。然而,当遇到严重降质时,高级先验仍然可能失败。问题进一步复杂化的原因在于复杂结构的字符,例如将多个象形或表意符号组合成一个字符的中文字符。在这项工作中,提出了一种更专注于字符结构的新型先验。具体而言,通过学习将丰富多样的结构封装到StyleGAN中,并利用这种生成结构先验进行恢复。为了限制StyleGAN的生成空间,使其遵循字符的结构但同时在处理不同字体风格时保持灵活性,我们将每个字符的离散特征存储在一个代码本中。该代码随后驱动StyleGAN生成高分辨率的结构细节,以帮助文本SR。与基于字符识别的先验相比,所提出的结构先验对于恢复指定字符的准确和精确笔画提供了更强的字符特定指导。对合成和真实数据集进行的大量实验表明,所提出的生成结构先验在促进鲁棒的文本SR方面表现出引人注目的性能。我们的代码可在https://github.com/csxmli2016/MARCONet 上找到。
Paper44 Soft Augmentation for Image Classification
摘要原文: Modern neural networks are over-parameterized and thus rely on strong regularization such as data augmentation and weight decay to reduce overfitting and improve generalization. The dominant form of data augmentation applies invariant transforms, where the learning target of a sample is invariant to the transform applied to that sample. We draw inspiration from human visual classification studies and propose generalizing augmentation with invariant transforms to soft augmentation where the learning target softens non-linearly as a function of the degree of the transform applied to the sample: e.g., more aggressive image crop augmentations produce less confident learning targets. We demonstrate that soft targets allow for more aggressive data augmentation, offer more robust performance boosts, work with other augmentation policies, and interestingly, produce better calibrated models (since they are trained to be less confident on aggressively cropped/occluded examples). Combined with existing aggressive augmentation strategies, soft targets 1) double the top-1 accuracy boost across Cifar-10, Cifar-100, ImageNet-1K, and ImageNet-V2, 2) improve model occlusion performance by up to 4x, and 3) half the expected calibration error (ECE). Finally, we show that soft augmentation generalizes to self-supervised classification tasks.
中文总结: 这段话主要内容是关于现代神经网络过度参数化的问题以及为了减少过拟合并提高泛化能力而采用强力正则化方法,如数据增强和权重衰减。数据增强的主要形式是应用不变变换,其中样本的学习目标对于应用于该样本的变换是不变的。作者从人类视觉分类研究中得到启发,提出了将不变变换推广到软增强的概念,其中学习目标会随着应用于样本的变换程度而非线性地软化:例如,更激进的图像裁剪增强会产生更不确定的学习目标。作者展示了软目标允许更激进的数据增强,提供更稳健的性能提升,适用于其他增强策略,并且产生更好校准的模型(因为它们在激进裁剪/遮挡示例上训练时更不自信)。结合现有的激进增强策略,软目标在Cifar-10、Cifar-100、ImageNet-1K和ImageNet-V2上可以将top-1准确率提升翻倍,将模型遮挡性能提升多达4倍,并将期望校准误差(ECE)减半。最后,作者展示软增强可以推广到自监督分类任务中。
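下面给出“随变换强度非线性软化学习目标”的一个极简示意(软化函数的具体形式以及power、min_conf等参数均为自拟假设,并非论文给定):

```python
import torch

def soft_target(num_classes, label, visibility, power=2.0, min_conf=None):
    """软增强的目标软化示意:裁剪越激进(可见比例visibility越低),
    真实类别的置信度按非线性函数衰减,剩余概率均匀分给其他类别。"""
    if min_conf is None:
        min_conf = 1.0 / num_classes                  # 完全不可见时退化为均匀分布
    conf = min_conf + (1.0 - min_conf) * (visibility ** power)   # 非线性软化
    target = torch.full((num_classes,), (1.0 - conf) / (num_classes - 1))
    target[label] = conf
    return target

# 用法示意:只保留30%面积的激进裁剪 -> 学习目标的置信度明显下降
print(soft_target(10, label=3, visibility=1.0))
print(soft_target(10, label=3, visibility=0.3))
```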
Paper45 OPE-SR: Orthogonal Position Encoding for Designing a Parameter-Free Upsampling Module in Arbitrary-Scale Image Super-Resolution
摘要原文: Arbitrary-scale image super-resolution (SR) is often tackled using the implicit neural representation (INR) approach, which relies on a position encoding scheme to improve its representation ability. In this paper, we introduce orthogonal position encoding (OPE), an extension of position encoding, and an OPE-Upscale module to replace the INR-based upsampling module for arbitrary-scale image super-resolution. Our OPE-Upscale module takes 2D coordinates and latent code as inputs, just like INR, but does not require any training parameters. This parameter-free feature allows the OPE-Upscale module to directly perform linear combination operations, resulting in continuous image reconstruction and achieving arbitrary-scale image reconstruction. As a concise SR framework, our method is computationally efficient and consumes less memory than state-of-the-art methods, as confirmed by extensive experiments and evaluations. In addition, our method achieves comparable results with state-of-the-art methods in arbitrary-scale image super-resolution. Lastly, we show that OPE corresponds to a set of orthogonal basis, validating our design principle.
中文总结: 这段话主要内容是介绍了在任意尺度图像超分辨率(SR)中使用隐式神经表示(INR)方法来处理问题,该方法依赖于一种位置编码方案来提高其表示能力。作者引入了正交位置编码(OPE)作为位置编码的扩展,并提出了一个OPE-Upscale模块来替代基于INR的上采样模块,用于任意尺度图像超分辨率。OPE-Upscale模块接受2D坐标和潜在编码作为输入,类似于INR,但不需要任何训练参数。这种无参数特性使得OPE-Upscale模块能够直接执行线性组合操作,实现连续图像重建并实现任意尺度图像重建。作为简洁的SR框架,该方法计算效率高,消耗的内存比最先进的方法少,经过广泛实验和评估得到验证。此外,该方法在任意尺度图像超分辨率中取得了与最先进方法可比较的结果。最后,作者表明OPE对应于一组正交基,验证了他们的设计原则。
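下面的示意代码演示“系数×正交基、无训练参数的上采样”这一思路:把潜码视为一组二维可分正交基的系数,在任意连续坐标处对基函数求值并做线性组合(此处用余弦基做演示,真实OPE所用基函数的具体形式以论文为准):

```python
import torch

def ope_upscale(latent, coords):
    """无参数上采样示意:在任意坐标处求值正交基并与潜码系数线性组合。

    latent: (C, n, n)  每个输出通道对应 n*n 个基函数系数
    coords: (M, 2)     归一化到[-1,1]的查询坐标
    """
    C, n, _ = latent.shape
    freqs = torch.arange(n, dtype=torch.float32)               # 基函数频率
    bx = torch.cos(torch.pi * coords[:, :1] * freqs)           # (M, n) 一维余弦基
    by = torch.cos(torch.pi * coords[:, 1:] * freqs)           # (M, n)
    basis = bx.unsqueeze(2) * by.unsqueeze(1)                  # (M, n, n) 二维可分基
    return torch.einsum("mij,cij->mc", basis, latent)          # (M, C)

# 用法示意:在更密的网格坐标上查询即可实现任意倍率重建
latent = torch.randn(3, 8, 8)
ys, xs = torch.meshgrid(torch.linspace(-1, 1, 32),
                        torch.linspace(-1, 1, 32), indexing="ij")
coords = torch.stack([xs.reshape(-1), ys.reshape(-1)], dim=1)
pixels = ope_upscale(latent, coords).reshape(32, 32, 3)
```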
Paper46 Generalized Decoding for Pixel, Image, and Language
摘要原文: We present X-Decoder, a generalized decoding model that can predict pixel-level segmentation and language tokens seamlessly. X-Decoder takes as input two types of queries: (i) generic non-semantic queries and (ii) semantic queries induced from text inputs, to decode different pixel-level and token-level outputs in the same semantic space. With such a novel design, X-Decoder is the first work that provides a unified way to support all types of image segmentation and a variety of vision-language (VL) tasks. Further, our design enables seamless interactions across tasks at different granularities and brings mutual benefits by learning a common and rich pixel-level visual-semantic understanding space, without any pseudo-labeling. After pretraining on a mixed set of a limited amount of segmentation data and millions of image-text pairs, X-Decoder exhibits strong transferability to a wide range of downstream tasks in both zero-shot and finetuning settings. Notably, it achieves (1) state-of-the-art results on open-vocabulary segmentation and referring segmentation on eight datasets; (2) better or competitive finetuned performance to other generalist and specialist models on segmentation and VL tasks; and (3) flexibility for efficient finetuning and novel task composition. Code, demo, video and visualization are available at: https://x-decoder-vl.github.io.
中文总结: 本文介绍了一种名为X-Decoder的通用解码模型,可以无缝预测像素级分割和语言标记。X-Decoder接受两种类型的查询作为输入:(i)通用非语义查询和(ii)从文本输入中诱导的语义查询,以在相同的语义空间中解码不同的像素级和标记级输出。通过这种新颖的设计,X-Decoder是第一个提供统一支持所有类型图像分割和各种视觉-语言(VL)任务的工作。此外,我们的设计实现了在不同粒度的任务之间无缝交互,并通过学习一个共同丰富的像素级视觉-语义理解空间,而无需任何伪标记,带来相互的益处。在对有限数量的分割数据和数百万个图像-文本对进行预训练后,X-Decoder在零样本和微调设置下展现出强大的可迁移性,值得注意的是,它在八个数据集上实现了(1)开放词汇分割和指代分割的最新结果;(2)在分割和VL任务上,与其他通用和专业模型相比具有更好或竞争性的微调性能;(3)具有高效微调和新任务组合的灵活性。代码、演示、视频和可视化可在以下链接找到:https://x-decoder-vl.github.io。
Paper47 Document Image Shadow Removal Guided by Color-Aware Background
摘要原文: Existing works on document image shadow removal mostly depend on learning and leveraging a constant background (the color of the paper) from the image. However, the constant background is less representative and frequently ignores other background colors, such as the printed colors, resulting in distorted results. In this paper, we present a color-aware background extraction network (CBENet) for extracting a spatially varying background image that accurately depicts the background colors of the document. Furthermore, we propose a background-guided document images shadow removal network (BGShadowNet) using the predicted spatially varying background as auxiliary information, which consists of two stages. At Stage I, a background-constrained decoder is designed to promote a coarse result. Then, the coarse result is refined with a background-based attention module (BAModule) to maintain a consistent appearance and a detail improvement module (DEModule) to enhance the texture details at Stage II. Experiments on two benchmark datasets qualitatively and quantitatively validate the superiority of the proposed approach over state-of-the-arts.
中文总结: 这段话主要讨论了关于文档图像阴影去除的现有研究。现有的文档图像阴影去除方法主要依赖于学习和利用图像中的恒定背景(纸张的颜色)。然而,恒定背景往往不够代表性,经常忽略其他背景颜色,如印刷颜色,导致结果失真。本文提出了一种颜色感知背景提取网络(CBENet),用于提取准确反映文档背景颜色的空间变化背景图像。此外,我们提出了一种基于背景的文档图像阴影去除网络(BGShadowNet),利用预测的空间变化背景作为辅助信息,包括两个阶段。在第一阶段,设计了一个受背景约束的解码器来促进粗糙结果。然后,在第二阶段,通过一个基于背景的注意力模块(BAModule)对粗糙结果进行细化,以保持一致的外观和增强纹理细节的细化模块(DEModule)。在两个基准数据集上的实验定性和定量地验证了所提方法优于现有技术的优越性。
Paper48 Hi-LASSIE: High-Fidelity Articulated Shape and Skeleton Discovery From Sparse Image Ensemble
摘要原文: Automatically estimating 3D skeleton, shape, camera viewpoints, and part articulation from sparse in-the-wild image ensembles is a severely under-constrained and challenging problem. Most prior methods rely on large-scale image datasets, dense temporal correspondence, or human annotations like camera pose, 2D keypoints, and shape templates. We propose Hi-LASSIE, which performs 3D articulated reconstruction from only 20-30 online images in the wild without any user-defined shape or skeleton templates. We follow the recent work of LASSIE that tackles a similar problem setting and make two significant advances. First, instead of relying on a manually annotated 3D skeleton, we automatically estimate a class-specific skeleton from the selected reference image. Second, we improve the shape reconstructions with novel instance-specific optimization strategies that allow reconstructions to faithful fit on each instance while preserving the class-specific priors learned across all images. Experiments on in-the-wild image ensembles show that Hi-LASSIE obtains higher fidelity state-of-the-art 3D reconstructions despite requiring minimum user input. Project page: chhankyao.github.io/hi-lassie/
中文总结: 这段话主要讨论了从稀疏的野外图像集合中自动估计3D骨架、形状、相机视角和部件关节是一个严重不受约束且具有挑战性的问题。大多数先前的方法依赖于大规模图像数据集、密集的时间对应关系,或者人类注释,如相机姿势、2D关键点和形状模板。作者提出了Hi-LASSIE方法,可以仅从野外的20-30张在线图像中执行3D关节重建,而无需用户定义的形状或骨架模板。该方法通过自动从选定的参考图像中估计类别特定的骨架,并利用新颖的实例特定优化策略改进形状重建,使重建能够在每个实例上忠实地拟合,同时保留在所有图像中学习到的类别特定先验知识。实验结果表明,Hi-LASSIE在野外图像集合上获得了更高质量的3D重建结果,尽管需要最少的用户输入。
Paper49 Pix2map: Cross-Modal Retrieval for Inferring Street Maps From Images
摘要原文: Self-driving vehicles rely on urban street maps for autonomous navigation. In this paper, we introduce Pix2Map, a method for inferring urban street map topology directly from ego-view images, as needed to continually update and expand existing maps. This is a challenging task, as we need to infer a complex urban road topology directly from raw image data. The main insight of this paper is that this problem can be posed as cross-modal retrieval by learning a joint, cross-modal embedding space for images and existing maps, represented as discrete graphs that encode the topological layout of the visual surroundings. We conduct our experimental evaluation using the Argoverse dataset and show that it is indeed possible to accurately retrieve street maps corresponding to both seen and unseen roads solely from image data. Moreover, we show that our retrieved maps can be used to update or expand existing maps and even show proof-of-concept results for visual localization and image retrieval from spatial graphs.
中文总结: 这篇论文介绍了一种名为Pix2Map的方法,用于从自车视角图像中直接推断城市街道地图拓扑结构,以便不断更新和扩展现有地图。这是一项具有挑战性的任务,因为我们需要直接从原始图像数据中推断复杂的城市道路拓扑结构。该论文的主要观点是,这个问题可以被看作是跨模态检索,通过学习一个联合的、跨模态的嵌入空间,将图像和现有地图表示为编码视觉环境拓扑布局的离散图形。我们使用Argoverse数据集进行实验评估,并展示了确实可以仅从图像数据中准确检索对应于已知和未知道路的街道地图。此外,我们展示了我们检索到的地图可以用于更新或扩展现有地图,甚至展示了基于空间图形的视觉定位和图像检索的概念验证结果。
Paper50 Cross-Modal Implicit Relation Reasoning and Aligning for Text-to-Image Person Retrieval
摘要原文: Text-to-image person retrieval aims to identify the target person based on a given textual description query. The primary challenge is to learn the mapping of visual and textual modalities into a common latent space. Prior works have attempted to address this challenge by leveraging separately pre-trained unimodal models to extract visual and textual features. However, these approaches lack the necessary underlying alignment capabilities required to match multimodal data effectively. Besides, these works use prior information to explore explicit part alignments, which may lead to the distortion of intra-modality information. To alleviate these issues, we present IRRA: a cross-modal Implicit Relation Reasoning and Aligning framework that learns relations between local visual-textual tokens and enhances global image-text matching without requiring additional prior supervision. Specifically, we first design an Implicit Relation Reasoning module in a masked language modeling paradigm. This achieves cross-modal interaction by integrating the visual cues into the textual tokens with a cross-modal multimodal interaction encoder. Secondly, to globally align the visual and textual embeddings, Similarity Distribution Matching is proposed to minimize the KL divergence between image-text similarity distributions and the normalized label matching distributions. The proposed method achieves new state-of-the-art results on all three public datasets, with a notable margin of about 3%-9% for Rank-1 accuracy compared to prior methods.
中文总结: 这段话主要介绍了文本到图像人物检索的研究,旨在根据给定的文本描述查询来识别目标人物。主要挑战是学习将视觉和文本模态映射到一个共同的潜在空间。先前的研究尝试通过利用分别预训练的单模态模型来提取视觉和文本特征来解决这一挑战。然而,这些方法缺乏有效匹配多模态数据所需的基本对齐能力。此外,这些工作使用先验信息来探索显式部分对齐,这可能导致模态内信息的扭曲。为了缓解这些问题,作者提出了IRRA:一个跨模态的隐式关系推理和对齐框架,该框架学习局部视觉-文本令牌之间的关系,并增强全局图像-文本匹配,而无需额外的先验监督。具体地,首先设计了一个采用掩码语言建模范式的隐式关系推理模块,通过跨模态多模态交互编码器将视觉线索融入文本令牌,实现跨模态交互。其次,为了全局对齐视觉和文本嵌入,提出了相似性分布匹配方法,以最小化图像-文本相似性分布和归一化标签匹配分布之间的KL散度。该方法在所有三个公共数据集上取得了新的最先进结果,与先前方法相比,Rank-1准确率有约3%-9%的显著提升。
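下面给出相似度分布匹配(SDM)损失的一个示意实现:对批内图文相似度做softmax得到预测分布,再与按身份归一化的标签匹配分布计算KL散度(温度tau等超参与具体形式为基于摘要的假设):

```python
import torch
import torch.nn.functional as F

def sdm_loss(img_emb, txt_emb, pids, tau=0.02):
    """相似度分布匹配(SDM)示意:最小化图文相似度分布与
    归一化标签匹配分布之间的KL散度。"""
    img = F.normalize(img_emb, dim=1)
    txt = F.normalize(txt_emb, dim=1)
    sim = img @ txt.t() / tau                                  # (B, B) 相似度
    pred = F.log_softmax(sim, dim=1)                           # 预测分布(对数)
    match = (pids.unsqueeze(1) == pids.unsqueeze(0)).float()   # 同身份即视为匹配
    target = match / match.sum(dim=1, keepdim=True)            # 归一化标签匹配分布
    return F.kl_div(pred, target, reduction="batchmean")

# 用法示意
img_emb, txt_emb = torch.randn(8, 256), torch.randn(8, 256)
pids = torch.tensor([0, 0, 1, 2, 2, 3, 4, 4])
loss = sdm_loss(img_emb, txt_emb, pids)
```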
Paper51 CLIPPO: Image-and-Language Understanding From Pixels Only
摘要原文: Multimodal models are becoming increasingly effective, in part due to unified components, such as the Transformer architecture. However, multimodal models still often consist of many task- and modality-specific pieces and training procedures. For example, CLIP (Radford et al., 2021) trains independent text and image towers via a contrastive loss. We explore an additional unification: the use of a pure pixel-based model to perform image, text, and multimodal tasks. Our model is trained with contrastive loss alone, so we call it CLIP-Pixels Only (CLIPPO). CLIPPO uses a single encoder that processes both regular images and text rendered as images. CLIPPO performs image-based tasks such as retrieval and zero-shot image classification almost as well as CLIP-style models, with half the number of parameters and no text-specific tower or embedding. When trained jointly via image-text contrastive learning and next-sentence contrastive learning, CLIPPO can perform well on natural language understanding tasks, without any word-level loss (language modelling or masked language modelling), outperforming pixel-based prior work. Surprisingly, CLIPPO can obtain good accuracy in visual question answering, simply by rendering the question and image together. Finally, we exploit the fact that CLIPPO does not require a tokenizer to show that it can achieve strong performance on multilingual multimodal retrieval without modifications. Code and pretrained models are available at https://github.com/google-research/big_vision.
中文总结: 这段话主要讨论了多模态模型的发展趋势以及一个新的模型CLIPPO。多模态模型越来越有效,部分得益于Transformer架构等统一组件。然而,多模态模型通常仍由许多任务特定、模态特定的部件和训练程序组成。作者提出了进一步的统一方式:使用纯基于像素的模型执行图像、文本和多模态任务。他们的模型仅使用对比损失进行训练,因此称之为CLIP-Pixels Only (CLIPPO)。CLIPPO使用单一编码器,同时处理常规图像和被渲染成图像的文本。通过图像-文本对比学习和下一句对比学习的联合训练,CLIPPO在自然语言理解任务上表现良好,无需任何词级别的损失,优于先前基于像素的工作。最后,作者展示了CLIPPO无需任何修改即可在多语言多模态检索上取得强劲表现。
Paper52 Real-Time 6K Image Rescaling With Rate-Distortion Optimization
摘要原文: The task of image rescaling aims at embedding an high-resolution (HR) image into a low-resolution (LR) one that can contain embedded information for HR image reconstruction. Existing image rescaling methods do not optimize the LR image file size and recent flow-based rescaling methods are not real-time yet for HR image reconstruction (e.g., 6K). To address these two challenges, we propose a novel framework (HyperThumbnail) for real-time 6K rate-distortion-aware image rescaling. Our HyperThumbnail first embeds an HR image into a JPEG LR image (thumbnail) by an encoder with our proposed learnable JPEG quantization module, which optimizes the file size of the embedding LR JPEG image. Then, an efficient decoder reconstructs a high-fidelity HR (6K) image from the LR one in real time. Extensive experiments demonstrate that our framework outperforms previous image rescaling baselines in both rate-distortion performance and is much faster than prior work in HR image reconstruction speed.
中文总结: 这段话主要讨论了图像重缩放的任务,旨在将高分辨率(HR)图像嵌入到可以包含用于HR图像重建的信息的低分辨率(LR)图像中。现有的图像重缩放方法并未优化LR图像文件大小,而最近基于流的重缩放方法尚未实现对HR图像(如6K)的实时重建。为解决这两个挑战,提出了一个新颖的框架(HyperThumbnail)用于实时6K速率失真感知图像重缩放。HyperThumbnail首先通过一个具有我们提出的可学习JPEG量化模块的编码器将HR图像嵌入到JPEG LR图像(缩略图)中,优化嵌入LR JPEG图像的文件大小。然后,一个高效的解码器实时从LR图像中重建出高保真度的HR(6K)图像。大量实验证明,我们的框架在速率失真性能方面优于以前的图像重缩放基线,并且在HR图像重建速度方面比以前的工作要快得多。
Paper53 Ingredient-Oriented Multi-Degradation Learning for Image Restoration
摘要原文: Learning to leverage the relationship among diverse image restoration tasks is quite beneficial for unraveling the intrinsic ingredients behind the degradation. Recent years have witnessed the flourish of various All-in-one methods, which handle multiple image degradations within a single model. In practice, however, few attempts have been made to excavate task correlations in that exploring the underlying fundamental ingredients of various image degradations, resulting in poor scalability as more tasks are involved. In this paper, we propose a novel perspective to delve into the degradation via an ingredients-oriented rather than previous task-oriented manner for scalable learning. Specifically, our method, named Ingredients-oriented Degradation Reformulation framework (IDR), consists of two stages, namely task-oriented knowledge collection and ingredients-oriented knowledge integration. In the first stage, we conduct ad hoc operations on different degradations according to the underlying physics principles, and establish the corresponding prior hubs for each type of degradation. While the second stage progressively reformulates the preceding task-oriented hubs into single ingredients-oriented hub via learnable Principal Component Analysis (PCA), and employs a dynamic routing mechanism for probabilistic unknown degradation removal. Extensive experiments on various image restoration tasks demonstrate the effectiveness and scalability of our method. More importantly, our IDR exhibits the favorable generalization ability to unknown downstream tasks.
中文总结: 这段话主要内容是介绍了学习如何利用不同图像恢复任务之间的关系,有助于揭示图像退化背后的内在因素。近年来,各种一体化方法蓬勃发展,可以在单个模型内处理多种图像退化问题。然而,在实践中,很少有尝试去挖掘任务之间的相关性,从而探索各种图像退化的基本成分,导致随着涉及的任务越多,可扩展性较差。本文提出了一种新的视角,通过基于成分而不是以往的任务为导向的方式来探究退化,以实现可扩展的学习。具体而言,我们的方法名为基于成分的退化重构框架(IDR),包括两个阶段,即任务为导向的知识收集和基于成分的知识整合。在第一阶段,我们根据潜在的物理原理对不同的退化进行临时操作,并为每种类型的退化建立相应的先验中心。而第二阶段则通过可学习的主成分分析(PCA)逐渐将之前任务为导向的中心重构为基于单一成分的中心,并采用动态路由机制用于概率性未知退化的去除。对各种图像恢复任务的广泛实验表明了我们方法的有效性和可扩展性。更重要的是,我们的IDR展现了对未知下游任务的良好泛化能力。
Paper54 Weakly-Supervised Single-View Image Relighting
摘要原文: We present a learning-based approach to relight a single image of Lambertian and low-frequency specular objects. Our method enables inserting objects from photographs into new scenes and relighting them under the new environment lighting, which is essential for AR applications. To relight the object, we solve both inverse rendering and re-rendering. To resolve the ill-posed inverse rendering, we propose a weakly-supervised method by a low-rank constraint. To facilitate the weakly-supervised training, we contribute Relit, a large-scale (750K images) dataset of videos with aligned objects under changing illuminations. For re-rendering, we propose a differentiable specular rendering layer to render low-frequency non-Lambertian materials under various illuminations of spherical harmonics. The whole pipeline is end-to-end and efficient, allowing for a mobile app implementation of AR object insertion. Extensive evaluations demonstrate that our method achieves state-of-the-art performance. Project page: https://renjiaoyi.github.io/relighting/.
中文总结: 这段话主要介绍了一种基于学习的方法来重新照明具有Lambertian和低频镜面对象的单个图像。该方法使得能够将照片中的对象插入到新场景中,并在新环境光照下重新照明,这对增强现实应用至关重要。为了重新照明对象,他们解决了逆渲染和重新渲染两个问题。为了解决逆渲染的不适定性,他们提出了一种通过低秩约束的弱监督方法。为了促进弱监督训练,他们贡献了一个名为Relit的大规模(750K张图片)视频数据集,其中包含了在不同照明条件下对齐的对象。对于重新渲染,他们提出了一个可微的镜面渲染层,以在各种球谐光照下渲染低频非Lambertian材料。整个流程端到端且高效,可用于在移动应用中实现增强现实对象插入。广泛的评估表明,他们的方法实现了最先进的性能。项目页面:https://renjiaoyi.github.io/relighting/。
Paper55 Interventional Bag Multi-Instance Learning on Whole-Slide Pathological Images
摘要原文: Multi-instance learning (MIL) is an effective paradigm for whole-slide pathological images (WSIs) classification to handle the gigapixel resolution and slide-level label. Prevailing MIL methods primarily focus on improving the feature extractor and aggregator. However, one deficiency of these methods is that the bag contextual prior may trick the model into capturing spurious correlations between bags and labels. This deficiency is a confounder that limits the performance of existing MIL methods. In this paper, we propose a novel scheme, Interventional Bag Multi-Instance Learning (IBMIL), to achieve deconfounded bag-level prediction. Unlike traditional likelihood-based strategies, the proposed scheme is based on the backdoor adjustment to achieve the interventional training, thus is capable of suppressing the bias caused by the bag contextual prior. Note that the principle of IBMIL is orthogonal to existing bag MIL methods. Therefore, IBMIL is able to bring consistent performance boosting to existing schemes, achieving new state-of-the-art performance. Code is available at https://github.com/HHHedo/IBMIL.
中文总结: 多实例学习(MIL)是一种有效的范式,用于整张病理切片图像(WSIs)分类,以处理千兆像素分辨率和切片级别(slide-level)标签。现有的MIL方法主要集中在改进特征提取器和聚合器上。然而,这些方法的一个不足之处是包上下文先验可能会使模型捕捉到包和标签之间的虚假相关性。这种不足是一种混杂因素,限制了现有MIL方法的性能。在本文中,我们提出了一种新颖的方案,干预包多实例学习(IBMIL),以实现去混杂的包级别预测。与传统的基于似然的策略不同,所提出的方案基于后门调整(backdoor adjustment)来实现干预式训练,因此能够抑制由包上下文先验引起的偏差。值得注意的是,IBMIL的原则与现有的包MIL方法是正交的。因此,IBMIL能够为现有方案带来一致的性能提升,实现新的最先进性能。代码可在https://github.com/HHHedo/IBMIL 找到。
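“后门调整”对应的干预式预测可以理解为对混杂因子分层后按先验加权平均,即 P(Y|do(X)) = Σ_c P(Y|X,c)P(c)。下面是一个示意函数(model的接口、混杂因子字典的构造方式均为自拟假设,仅说明分层加权这一步):

```python
import torch

def backdoor_adjusted_prediction(model, bag_feat, confounder_dict, prior=None):
    """后门调整示意:对混杂因子(此处用一组“包上下文”原型字典表示)分层预测,
    再按先验加权平均,近似 P(Y|do(X)) = Σ_c P(Y|X,c)P(c)。"""
    K = confounder_dict.shape[0]
    if prior is None:
        prior = torch.full((K,), 1.0 / K)            # 缺省取均匀先验
    probs = 0.0
    for k in range(K):
        logits = model(bag_feat, confounder_dict[k])  # 假设的接口: 特征+第k层混杂因子
        probs = probs + prior[k] * torch.softmax(logits, dim=-1)
    return probs
```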
Paper56 Realistic Saliency Guided Image Enhancement
摘要原文: Common editing operations performed by professional photographers include the cleanup operations: de-emphasizing distracting elements and enhancing subjects. These edits are challenging, requiring a delicate balance between manipulating the viewer’s attention while maintaining photo realism. While recent approaches can boast successful examples of attention attenuation or amplification, most of them also suffer from frequent unrealistic edits. We propose a realism loss for saliency-guided image enhancement to maintain high realism across varying image types, while attenuating distractors and amplifying objects of interest. Evaluations with professional photographers confirm that we achieve the dual objective of realism and effectiveness, and outperform the recent approaches on their own datasets, while requiring a smaller memory footprint and runtime. We thus offer a viable solution for automating image enhancement and photo cleanup operations.
中文总结: 这段话主要讲述了专业摄影师常进行的常见编辑操作,包括清理操作:减弱分散注意力的元素并增强主体。这些编辑是具有挑战性的,需要在操纵观众注意力的同时保持照片逼真度之间取得微妙平衡。虽然最近的方法可以夸耀成功的注意力减弱或增强示例,但大多数也受到频繁不真实编辑的困扰。我们提出了一种针对显著性引导图像增强的逼真损失方法,以在各种图像类型上保持高逼真度,同时减弱干扰因素并增强感兴趣的对象。与专业摄影师的评估证实,我们实现了逼真度和有效性的双重目标,并在自己的数据集上胜过最近的方法,同时需要更小的内存占用和运行时间。因此,我们提供了一个可行的解决方案,用于自动化图像增强和照片清理操作。
Paper57 RankMix: Data Augmentation for Weakly Supervised Learning of Classifying Whole Slide Images With Diverse Sizes and Imbalanced Categories
摘要原文: Whole Slide Images (WSIs) are usually gigapixel in size and lack pixel-level annotations. The WSI datasets are also imbalanced in categories. These unique characteristics, significantly different from the ones in natural images, pose the challenge of classifying WSI images as a kind of weakly supervise learning problems. In this study, we propose, RankMix, a data augmentation method of mixing ranked features in a pair of WSIs. RankMix introduces the concepts of pseudo labeling and ranking in order to extract key WSI regions in contributing to the WSI classification task. A two-stage training is further proposed to boost stable training and model performance. To our knowledge, the study of weakly supervised learning from the perspective of data augmentation to deal with the WSI classification problem that suffers from lack of training data and imbalance of categories is relatively unexplored.
中文总结: 这段话主要内容是关于全切片图像(WSIs)通常具有数十亿像素大小并缺乏像素级注释。WSI数据集在类别上也存在不平衡。这些与自然图像中的特征显著不同的独特特点,给WSI图像分类带来了挑战,使其成为一种弱监督学习问题。在这项研究中,我们提出了一种名为RankMix的数据增强方法,该方法将一对WSI中的排名特征混合在一起。RankMix引入了伪标记和排名的概念,以提取对WSI分类任务有贡献的关键WSI区域。此外,进一步提出了两阶段训练以提高稳定的训练和模型性能。据我们所知,从数据增强的角度研究弱监督学习以应对WSI分类问题中缺乏训练数据和类别不平衡的问题相对较少。
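下面是“混合两张WSI的排序特征与标签”这一思路的示意实现(排序依据、保留实例数keep、混合系数lam等细节均为基于摘要的假设):

```python
import torch

def rankmix(feat_a, score_a, label_a, feat_b, score_b, label_b, lam=0.5, keep=256):
    """RankMix思路示意:先按(伪标签/注意力)得分对两张WSI的实例特征排序,
    各取前keep个关键实例,再按系数lam对特征与标签做线性混合。"""
    def topk(feat, score):
        idx = torch.argsort(score, descending=True)[:keep]
        return feat[idx]
    a, b = topk(feat_a, score_a), topk(feat_b, score_b)
    n = min(len(a), len(b))                      # 两个包实例数不同时取公共长度
    mixed_feat = lam * a[:n] + (1 - lam) * b[:n]
    mixed_label = lam * label_a + (1 - lam) * label_b
    return mixed_feat, mixed_label

# 用法示意:两个实例数不同的包也可混合
fa, sa = torch.randn(1000, 512), torch.rand(1000)
fb, sb = torch.randn(700, 512), torch.rand(700)
la, lb = torch.tensor([1.0, 0.0]), torch.tensor([0.0, 1.0])
mf, ml = rankmix(fa, sa, la, fb, sb, lb, lam=0.7)
```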
Paper58 Grounding Counterfactual Explanation of Image Classifiers to Textual Concept Space
摘要原文: Concept-based explanation aims to provide concise and human-understandable explanations of an image classifier. However, existing concept-based explanation methods typically require a significant amount of manually collected concept-annotated images. This is costly and runs the risk of human biases being involved in the explanation. In this paper, we propose counterfactual explanation with text-driven concepts (CounTEX), where the concepts are defined only from text by leveraging a pre-trained multi-modal joint embedding space without additional concept-annotated datasets. A conceptual counterfactual explanation is generated with text-driven concepts. To utilize the text-driven concepts defined in the joint embedding space to interpret target classifier outcome, we present a novel projection scheme for mapping the two spaces with a simple yet effective implementation. We show that CounTEX generates faithful explanations that provide a semantic understanding of model decision rationale robust to human bias.
中文总结: 这段话主要讨论了基于概念的解释方法旨在为图像分类器提供简洁且易于理解的解释。然而,现有的基于概念的解释方法通常需要大量手动收集的概念标注图像,这是昂贵的,并存在人为偏见的风险。在这篇论文中,我们提出了一种名为CounTEX的基于反事实解释方法,其中概念仅从文本中定义,通过利用预训练的多模态联合嵌入空间而无需额外的概念标注数据集。利用文本驱动的概念生成概念反事实解释。为了利用联合嵌入空间中定义的文本驱动概念来解释目标分类器的结果,我们提出了一种新颖的投影方案,用于将两个空间进行映射,具有简单而有效的实现。我们展示了CounTEX生成的解释是忠实的,提供了对模型决策基础的语义理解,同时对人为偏见具有鲁棒性。
Paper59 TopDiG: Class-Agnostic Topological Directional Graph Extraction From Remote Sensing Images
摘要原文: Rapid development in automatic vector extraction from remote sensing images has been witnessed in recent years. However, the vast majority of existing works concentrate on a specific target, fragile to category variety, and hardly achieve stable performance crossing different categories. In this work, we propose an innovative class-agnostic model, namely TopDiG, to directly extract topological directional graphs from remote sensing images and solve these issues. Firstly, TopDiG employs a topology-concentrated node detector (TCND) to detect nodes and obtain compact perception of topological components. Secondly, we propose a dynamic graph supervision (DGS) strategy to dynamically generate adjacency graph labels from unordered nodes. Finally, the directional graph (DiG) generator module is designed to construct topological directional graphs from predicted nodes. Experiments on the Inria, CrowdAI, GID, GF2 and Massachusetts datasets empirically demonstrate that TopDiG is class-agnostic and achieves competitive performance on all datasets.
中文总结: 这段话主要讲述了近年来遥感图像自动矢量提取领域取得了快速发展,但现有大多数研究集中在特定目标上,对类别多样性脆弱,并且在不同类别之间很难实现稳定性能。作者提出了一种创新的无类别模型TopDiG,直接从遥感图像中提取拓扑方向图,并解决了这些问题。TopDiG首先利用拓扑集中的节点检测器(TCND)来检测节点并获得拓扑组件的紧凑感知。其次,提出了一种动态图监督(DGS)策略,从无序节点动态生成邻接图标签。最后,设计了方向图(DiG)生成器模块,从预测的节点构建拓扑方向图。在Inria、CrowdAI、GID、GF2和Massachusetts数据集上的实验证明,TopDiG是无类别的,并在所有数据集上实现了竞争性能。
Paper60 InstructPix2Pix: Learning To Follow Image Editing Instructions
摘要原文: We propose a method for editing images from human instructions: given an input image and a written instruction that tells the model what to do, our model follows these instructions to edit the image. To obtain training data for this problem, we combine the knowledge of two large pretrained models–a language model (GPT-3) and a text-to-image model (Stable Diffusion)–to generate a large dataset of image editing examples. Our conditional diffusion model, InstructPix2Pix, is trained on our generated data, and generalizes to real images and user-written instructions at inference time. Since it performs edits in the forward pass and does not require per-example fine-tuning or inversion, our model edits images quickly, in a matter of seconds. We show compelling editing results for a diverse collection of input images and written instructions.
中文总结: 本文提出了一种根据人类指令编辑图像的方法:给定一个输入图像和一条书面指令,告诉模型该如何操作,我们的模型会按照这些指令来编辑图像。为了解决这个问题,我们结合了两个大型预训练模型的知识——一个是语言模型(GPT-3),另一个是文本到图像模型(Stable Diffusion),生成了一个大型的图像编辑示例数据集。我们的有条件扩散模型InstructPix2Pix是在我们生成的数据上训练的,并且在推断时可以泛化到真实图像和用户编写的指令。由于它在正向传递中执行编辑,不需要每个示例进行微调或反演,我们的模型可以在几秒钟内快速编辑图像。我们展示了对各种输入图像和书面指令的引人注目的编辑结果。
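若想直接体验按指令编辑图像的效果,可参考下面这段基于Hugging Face diffusers的调用示意(假设环境已安装diffusers且可获取社区发布的timbrooks/instruct-pix2pix权重;接口与参数名以所用版本的文档为准):

```python
import torch
from diffusers import StableDiffusionInstructPix2PixPipeline
from PIL import Image

# 加载社区发布的InstructPix2Pix扩散模型(模型名与精度设置以实际环境为准)
pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(
    "timbrooks/instruct-pix2pix", torch_dtype=torch.float16
).to("cuda")

image = Image.open("input.jpg").convert("RGB")
edited = pipe(
    "make it look like a snowy winter day",   # 书面编辑指令
    image=image,
    num_inference_steps=20,
    image_guidance_scale=1.5,                 # 控制对原图的保真程度
).images[0]
edited.save("edited.jpg")
```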
Paper61 Local Implicit Normalizing Flow for Arbitrary-Scale Image Super-Resolution
摘要原文: Flow-based methods have demonstrated promising results in addressing the ill-posed nature of super-resolution (SR) by learning the distribution of high-resolution (HR) images with the normalizing flow. However, these methods can only perform a predefined fixed-scale SR, limiting their potential in real-world applications. Meanwhile, arbitrary-scale SR has gained more attention and achieved great progress. Nonetheless, previous arbitrary-scale SR methods ignore the ill-posed problem and train the model with per-pixel L1 loss, leading to blurry SR outputs. In this work, we propose “Local Implicit Normalizing Flow” (LINF) as a unified solution to the above problems. LINF models the distribution of texture details under different scaling factors with normalizing flow. Thus, LINF can generate photo-realistic HR images with rich texture details in arbitrary scale factors. We evaluate LINF with extensive experiments and show that LINF achieves the state-of-the-art perceptual quality compared with prior arbitrary-scale SR methods.
中文总结: 这段话主要讨论了基于流的方法在解决超分辨率(SR)问题中表现出的有前途的结果,通过学习高分辨率(HR)图像的分布以及正规化流来应对SR的不适定性。然而,这些方法只能执行预定义的固定比例的SR,限制了它们在现实世界应用中的潜力。与此同时,任意比例的SR引起了更多关注并取得了巨大进展。然而,先前的任意比例SR方法忽视了不适定性问题,并使用逐像素L1损失来训练模型,导致模糊的SR输出。在这项工作中,我们提出了“局部隐式正规化流”(LINF)作为上述问题的统一解决方案。LINF使用正规化流来建模不同缩放因子下的纹理细节分布。因此,LINF能够生成具有丰富纹理细节的逼真HR图像,适用于任意比例因子。我们通过广泛的实验评估LINF,并展示LINF相比先前的任意比例SR方法实现了最先进的感知质量。
Paper62 Referring Image Matting
摘要原文: Different from conventional image matting, which either requires user-defined scribbles/trimap to extract a specific foreground object or directly extracts all the foreground objects in the image indiscriminately, we introduce a new task named Referring Image Matting (RIM) in this paper, which aims to extract the meticulous alpha matte of the specific object that best matches the given natural language description, thus enabling a more natural and simpler instruction for image matting. First, we establish a large-scale challenging dataset RefMatte by designing a comprehensive image composition and expression generation engine to automatically produce high-quality images along with diverse text attributes based on public datasets. RefMatte consists of 230 object categories, 47,500 images, 118,749 expression-region entities, and 474,996 expressions. Additionally, we construct a real-world test set with 100 high-resolution natural images and manually annotate complex phrases to evaluate the out-of-domain generalization abilities of RIM methods. Furthermore, we present a novel baseline method CLIPMat for RIM, including a context-embedded prompt, a text-driven semantic pop-up, and a multi-level details extractor. Extensive experiments on RefMatte in both keyword and expression settings validate the superiority of CLIPMat over representative methods. We hope this work could provide novel insights into image matting and encourage more follow-up studies. The dataset, code and models are available at https://github.com/JizhiziLi/RIM.
中文总结: 这段话主要介绍了一种新的图像抠图任务称为Referring Image Matting (RIM),与传统的图像抠图不同,RIM旨在根据给定的自然语言描述提取最符合描述的特定对象的细致 alpha 蒙版,从而为图像抠图提供更自然和简单的指导。作者首先建立了一个大规模的具有挑战性的数据集RefMatte,该数据集包含230个对象类别、47,500张图像、118,749个表达区域实体和474,996个表达。此外,作者构建了一个包含100张高分辨率自然图像的真实测试集,并手动注释复杂短语,以评估RIM方法的跨领域泛化能力。作者还提出了一种新颖的基线方法CLIPMat,包括上下文嵌入提示、文本驱动的语义弹出和多级细节提取器。在RefMatte数据集上进行了广泛实验,验证了CLIPMat在关键字和表达设置下优于代表性方法。希望这项工作能为图像抠图提供新的见解,并鼓励更多的后续研究。数据集、代码和模型可在https://github.com/JizhiziLi/RIM 上获得。
Paper63 Polarized Color Image Denoising
摘要原文: Single-chip polarized color photography provides both visual textures and object surface information in one snapshot. However, the use of an additional directional polarizing filter array tends to lower photon count and SNR, when compared to conventional color imaging. As a result, such a bilayer structure usually leads to unpleasant noisy images and undermines performance of polarization analysis, especially in low-light conditions. It is a challenge for traditional image processing pipelines owing to the fact that the physical constraints exerted implicitly in the channels are excessively complicated. In this paper, we propose to tackle this issue through a noise modeling method for realistic data synthesis and a powerful network structure inspired by vision Transformer. A real-world polarized color image dataset of paired raw short-exposed noisy images and long-exposed reference images is captured for experimental evaluation, which has demonstrated the effectiveness of our approaches for data synthesis and polarized color image denoising.
中文总结: 这段话主要讨论了单芯片偏振彩色摄影技术在一次快照中提供了视觉纹理和物体表面信息的优势,但使用额外的定向偏振滤波器阵列会降低光子计数和信噪比,导致噪音图像和性能下降的问题,特别是在低光条件下。传统图像处理流程面临挑战,因为通道中隐含的物理约束过于复杂。为了解决这一问题,作者提出了通过噪声建模方法进行实际数据合成,并结合受视觉Transformer启发的强大网络结构的方法。作者还拍摄了一个真实世界的偏振彩色图像数据集,包括原始短曝光噪声图像和长曝光参考图像,用于实验评估,结果表明了他们的方法在数据合成和偏振彩色图像去噪方面的有效性。