Fei Pan
feipan [at] umich.edu
CV | Google Scholar
I am a Research Fellow in EECS at the University of Michigan,
where I am fortunate to work with Prof. Stella X. Yu.
My research lies in Computer Vision and Machine Learning.
I am interested in developing large-scale learning algorithms
for visual tasks with strong generalization, robustness, and minimal human supervision.
I obtained my Ph.D. degree in 2023 under the supervision of Prof. In So Kweon at KAIST.
I received the Qualcomm Innovation Fellowship and
a Ph.D. scholarship from Bosch during my Ph.D. studies.
Zero-shot Building Attribute Extraction from Large-Scale Vision and Language Models.
Fei Pan, Sangryul Jeon, Brian Wang, Frank McKenna, Stella Yu.
IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2024. [pdf] [code] [poster]
Modern building recognition methods, exemplified by the BRAILS framework, utilize supervised learning to extract information from satellite and street-view images for image classification and semantic segmentation tasks. However, each task module requires human-annotated data, which hinders scalability and robustness to regional variations and annotation imbalances. In response, we propose a new zero-shot workflow for building attribute extraction that utilizes large-scale vision and language models to mitigate reliance on external annotations. The proposed workflow contains two key components: image-level captioning and segment-level captioning for building images, based on vocabularies pertinent to structural and civil engineering. These two components generate descriptive captions by computing feature representations of the image and the vocabularies and facilitating a semantic match between the visual and textual representations. Consequently, our framework offers a promising avenue to enhance AI-driven captioning for building attribute extraction in the structural and civil engineering domains, ultimately reducing reliance on human annotations while bolstering performance and adaptability.
Key Words: Zero-shot Learning, Building Attribute Extraction, Large-Scale Vision & Language Models.
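The heart of this workflow is the semantic match between image features and an engineering vocabulary in a shared embedding space. Below is a minimal sketch of that matching step using a public CLIP checkpoint via Hugging Face transformers; the captions and the input filename are illustrative placeholders, not the vocabulary or models used in the paper.

    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    # Load a public CLIP checkpoint (illustrative; the paper's exact models may differ).
    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    # A hypothetical building-attribute vocabulary rendered as captions.
    captions = [
        "a building with a flat roof",
        "a building with a gabled roof",
        "a masonry building",
        "a wooden building",
    ]

    image = Image.open("street_view.jpg")  # assumed input street-view image
    inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        probs = model(**inputs).logits_per_image.softmax(dim=-1)

    # The highest-scoring caption serves as the zero-shot attribute prediction.
    print(captions[probs.argmax().item()])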
Masking-augmented Collaborative Domain Congregation for
Multi-target Domain Adaptation in Semantic Segmentation.
Fei Pan, Dong He, Xu Yin, Chengshuang Zhang, Munchurl Kim.
Under Review, 2023.
This paper addresses the challenges in multi-target domain adaptive segmentation, which aims at learning a single model that adapts to multiple diverse target domains. Existing methods show limited performance as they only consider the difference in visual appearance (style) while ignoring the (contextual) variations among multiple target domains. In contrast, we propose a novel approach termed Masking-augmented Collaborative Domain Congregation (MacDC) to handle the style gap and contextual gap simultaneously. The proposed MacDC comprises two key parts: collaborative domain congregation (CDC) and multi-context masking consistency (MCMC). Our CDC handles the style and contextual gaps among target domains by data mixing, which generates image-level and region-level intermediate domains among the target domains. To further strengthen contextual alignment, our MCMC applies a masking-based self-supervised augmentation consistency that jointly enforces the model's understanding of diverse contexts. MacDC directly learns a single model for multi-target domain adaptation without requiring multiple network training and subsequent distillation. Despite its simplicity, MacDC shows efficacy in mitigating the style and contextual gaps among multiple target domains and demonstrates superior performance on multi-target domain adaptation segmentation benchmarks compared to existing state-of-the-art approaches.
Key Words: Multi-target Domain Adaptation, Semantic Segmentation, Masking Consistency, Self-supervised Data Augmentation.
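To make the data mixing and masking consistency concrete, here is a minimal sketch of region-level mixing between two target-domain images plus random patch masking, assuming a student/teacher pair; it follows the general recipe in the abstract rather than the paper's exact implementation.

    import torch
    import torch.nn.functional as F

    def region_mix(img_a, img_b, grid=4):
        """Mix two target-domain images by swapping random rectangular regions,
        producing a region-level intermediate domain between them."""
        b, c, h, w = img_a.shape
        mask = (torch.rand(b, 1, grid, grid) > 0.5).float()
        mask = F.interpolate(mask, size=(h, w), mode="nearest")
        return mask * img_a + (1 - mask) * img_b, mask

    def random_patch_mask(img, patch=32, ratio=0.5):
        """Zero out a random subset of patches (masking-based augmentation)."""
        b, c, h, w = img.shape
        keep = (torch.rand(b, 1, h // patch, w // patch) > ratio).float()
        keep = F.interpolate(keep, size=(h, w), mode="nearest")
        return img * keep

    # Hypothetical usage with a student model and an EMA teacher:
    # mixed, _ = region_mix(target_a, target_b)
    # pseudo = teacher(mixed).argmax(1)  # teacher pseudo-labels on the mixed image
    # loss = F.cross_entropy(student(random_patch_mask(mixed)), pseudo)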
Fine-grained Background Representation for Weakly Supervised Semantic Segmentation.
Xu Yin, Woobin Im, Dongbo Min, Yuchi Huo, Fei Pan, Sungeui Yoon.
Under Review, 2023.
Generating reliable pseudo masks from image-level labels is challenging in the weakly supervised semantic segmentation (WSSS) task due to the lack of spatial information. Prevalent class activation map (CAM)-based solutions are challenged to discriminate the foreground (FG) objects from the suspicious background (BG) pixels (a.k.a. co-occurring) and learn the integral object regions. This paper proposes a simple fine-grained background representation (FBR) method to discover and represent diverse BG semantics and address the co-occurring problems. We abandon the use of the class prototype or pixel-level features for BG representation. Instead, we develop a novel primitive, negative region of interest (NROI), to capture the fine-grained BG semantic information and conduct the pixel-to-NROI contrast to distinguish the confusing BG pixels. We also present an active sampling strategy to mine the FG negatives on-the-fly, enabling efficient pixel-to-pixel intra-foreground contrastive learning to activate the entire object region. Thanks to its simple design and ease of use, our FBR is architecture-agnostic and can be seamlessly plugged into various WSSS models. Experimental results on PASCAL VOC 2012 and MS COCO 2014 show that FBR exceeds existing studies, obtaining new state-of-the-art performance in different settings.
Key Words: Representation Learning, Class Activation Map, Semantic Segmentation.
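The pixel-to-NROI contrast can be read as an InfoNCE-style loss in which each foreground pixel embedding is attracted to a positive prototype and repelled from a set of NROI (background) embeddings. A minimal sketch under that reading, with illustrative shapes and temperature:

    import torch
    import torch.nn.functional as F

    def pixel_to_nroi_contrast(pixels, positive, nrois, tau=0.1):
        """InfoNCE-style contrast: `pixels` are N x D foreground pixel embeddings,
        `positive` is a D-dim positive prototype, and `nrois` are M x D background
        (negative region of interest) embeddings acting as negatives."""
        pixels = F.normalize(pixels, dim=1)
        positive = F.normalize(positive, dim=0)
        nrois = F.normalize(nrois, dim=1)
        pos = pixels @ positive / tau       # (N,) positive similarities
        neg = pixels @ nrois.t() / tau      # (N, M) negative similarities
        logits = torch.cat([pos.unsqueeze(1), neg], dim=1)
        labels = torch.zeros(len(pixels), dtype=torch.long)  # positive sits at index 0
        return F.cross_entropy(logits, labels)

    loss = pixel_to_nroi_contrast(torch.randn(64, 128), torch.randn(128), torch.randn(16, 128))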
MoDA: Leveraging Motion Prior from Videos for Advancing Unsupervised Domain
Adaptation in Semantic Segmentation.
Fei Pan*, Xu Yin*, Seokju Lee, Sungeui Yoon, In So Kweon.
Under Review, 2023. [pdf]
Unsupervised domain adaptation (UDA) is an effective approach to handle the lack of annotations in the target domain for the semantic segmentation task. In this work, we consider a more practical UDA setting where the target domain contains sequential frames of unlabeled videos, which are easy to collect in practice. A recent study suggests self-supervised learning of the object motion from unlabeled videos with geometric constraints. We design a motion-guided domain adaptive semantic segmentation framework (MoDA) that utilizes self-supervised object motion to learn effective representations in the target domain. MoDA differs from previous methods that use temporal consistency regularization for the target domain frames. Instead, MoDA deals separately with the domain alignment on the foreground and background categories using different strategies. Specifically, MoDA contains foreground object discovery and foreground semantic mining to align the foreground domain gaps by taking instance-level guidance from the object motion. Additionally, MoDA includes background adversarial training, which contains a background category-specific discriminator to handle the background domain gaps. Experimental results on multiple benchmarks highlight the effectiveness of MoDA against existing approaches on domain adaptive image segmentation and domain adaptive video segmentation. Moreover, MoDA is versatile and can be used in conjunction with existing state-of-the-art approaches to further improve performance.
Key Words: Unsupervised domain adaptation, Semantic Segmentation, Domain Adaptive Video Segmentation, Geometric Learning.
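As a rough illustration of foreground object discovery from motion, the sketch below thresholds the magnitude of a residual object-motion field (ego-motion already removed) and extracts connected components as instances; the thresholding and component analysis are a simplification, not the paper's exact procedure.

    import numpy as np
    from scipy import ndimage

    def discover_foreground(motion_field, thresh=0.5, min_area=100):
        """motion_field: H x W x 2 residual object-motion field (ego-motion removed).
        Returns a labeled mask where each connected moving region gets an id."""
        magnitude = np.linalg.norm(motion_field, axis=-1)
        moving = magnitude > thresh              # pixels with significant residual motion
        labels, num = ndimage.label(moving)      # connected components as instances
        for i in range(1, num + 1):              # drop tiny, noisy components
            if (labels == i).sum() < min_area:
                labels[labels == i] = 0
        return labels

    masks = discover_foreground(np.random.randn(256, 512, 2).astype(np.float32))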
CCTV-Calib: a Toolbox to Calibrate Surveillance Cameras Around the Globe.
Francois Rameau, Jaesung Choe, Fei Pan, Seokju Lee, In So Kweon.
Machine Vision and Applications, 2023. [pdf] [code]
In this paper, we propose CCTV-Calib, a user-friendly toolbox to calibrate traffic cameras using satellite views. Specifically, CCTV-Calib can estimate the intrinsic and extrinsic parameters as well as the GPS location of one or multiple CCTV cameras in a few clicks. Previous surveillance camera calibration strategies rely on various assumptions on the camera parameters (e.g., absence of radial distortion), location, or detected objects in the scene. In contrast, our system is able to calibrate both perspective and fisheye cameras without restrictive structural or semantic assumptions. In fact, only a few correspondences between an image and its satellite view are sufficient to accurately calibrate a camera. This kind of camera geo-localization and calibration via satellite imagery has so far attracted little attention. As a result, most existing techniques naively rely on manually clicked keypoint correspondences between the satellite view and the CCTV image, leading to poor accuracy and repeatability. To cope with these limitations and to ease the calibration process, we propose an automated keypoint matching stage and a refinement process improving the accuracy of the computed parameters. Our toolbox has been qualitatively and quantitatively evaluated using synthetic and real data from various traffic cameras around the globe. We made these unique datasets freely available to the community. Finally, in order to illustrate the relevance of our calibration strategy, we demonstrate its applicability to 3D vehicle geolocalization. Our novel calibration pipeline is integrated into an easy-to-use GUI and is freely available via the following link: https://github.com/rameau-fr/CCTV-Calib.
Key Words: Camera Calibration, CCTV, Vehicle Geolocalization.
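Because the matched keypoints lie on the roughly planar ground, a handful of image-to-satellite correspondences already determine a homography from CCTV pixels to satellite coordinates (and, via the satellite image's geo-reference, to GPS). A minimal OpenCV sketch of that core step, with made-up correspondences; the full toolbox additionally recovers intrinsics, distortion, and pose:

    import cv2
    import numpy as np

    # Hypothetical matched keypoints: CCTV image pixels and the corresponding
    # satellite-view pixels (in practice produced by the automated matching stage).
    cctv_pts = np.array([[412, 530], [890, 512], [640, 700], [300, 680]], dtype=np.float32)
    sat_pts = np.array([[120, 340], [260, 335], [190, 420], [110, 415]], dtype=np.float32)

    # Robustly estimate the ground-plane homography with RANSAC.
    H, inliers = cv2.findHomography(cctv_pts, sat_pts, cv2.RANSAC, 5.0)

    # Map an arbitrary CCTV pixel (e.g., a detected vehicle) into the satellite view;
    # satellite pixels can then be converted to GPS using the image's geo-reference.
    vehicle = np.array([[[600.0, 640.0]]], dtype=np.float32)
    print(cv2.perspectiveTransform(vehicle, H))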
ML-BPM: Multi-teacher Learning with Bidirectional Photometric Mixing for Open
Compound Domain Adaptation in Semantic Segmentation.
Fei Pan, Sungsu Hur, Seokju Lee, Junsik Kim, In So Kweon.
European Conference on Computer Vision (ECCV), 2022. [pdf]
Open compound domain adaptation (OCDA) considers the target domain as the compound of multiple unknown homogeneous subdomains. The goal of OCDA is to minimize the domain gap between the labeled source domain and the unlabeled compound target domain, which benefits the model's generalization to unseen domains. Current OCDA methods for semantic segmentation adopt manual domain separation and employ a single model to simultaneously adapt to all the target subdomains. However, adapting to a target subdomain might hinder the model from adapting to other dissimilar target subdomains, which leads to limited performance. In this work, we introduce a multi-teacher framework with bidirectional photometric mixing to separately adapt to every target subdomain. First, we present an automatic domain separation to find the optimal number of subdomains. On this basis, we propose a multi-teacher framework in which each teacher model uses bidirectional photometric mixing to adapt to one target subdomain. Furthermore, we conduct an adaptive distillation to learn a student model and apply consistency regularization to improve the student's generalization. Experimental results on benchmark datasets show the efficacy of the proposed approach for both the compound domain and the open domains against existing state-of-the-art approaches.
Key Words: Domain Adaptation, Open Compound Domain Adaptation, Semantic Segmentation, Multi-teacher Distillation.
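The distillation step can be pictured as a weighted KL term pulling the student toward each subdomain teacher. A minimal sketch under that reading; the uniform weights below stand in for the paper's adaptive weighting:

    import torch
    import torch.nn.functional as F

    def multi_teacher_distill(student_logits, teacher_logits_list, weights=None, T=1.0):
        """KL-divergence distillation from several subdomain teachers to one student.
        student_logits: B x C x H x W; teacher_logits_list: same-shaped tensors."""
        if weights is None:  # uniform weighting as a placeholder for adaptive weights
            weights = [1.0 / len(teacher_logits_list)] * len(teacher_logits_list)
        log_p = F.log_softmax(student_logits / T, dim=1)
        loss = 0.0
        for w, t_logits in zip(weights, teacher_logits_list):
            q = F.softmax(t_logits / T, dim=1)
            loss = loss + w * F.kl_div(log_p, q, reduction="batchmean") * T * T
        return loss

    loss = multi_teacher_distill(torch.randn(2, 19, 64, 64),
                                 [torch.randn(2, 19, 64, 64) for _ in range(3)])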
Labeling Where Adapting Fails: Cross-Domain Semantic Segmentation with Point
Supervision via Active Learning.
Fei Pan, Francois Rameau, Junsik Kim, In So Kweon.
arXiv, 2022. [pdf]
Training models dedicated to semantic segmentation requires a large amount of pixel-wise annotated data. Due to their costly nature, these annotations might not be available for the task at hand. To alleviate this problem, unsupervised domain adaptation approaches aim at aligning the feature distributions between the labeled source and the unlabeled target data. While these strategies lead to noticeable improvements, their effectiveness remains limited. To guide the domain adaptation task more efficiently, previous works attempted to include human interactions in this process in the form of sparse single-pixel annotations in the target data. In this work, we propose a new domain adaptation framework for semantic segmentation with annotated points via active selection. First, we conduct an unsupervised domain adaptation of the model; from this adaptation, we use an entropy-based uncertainty measurement for target points selection. Finally, to minimize the domain gap, we propose a domain adaptation framework utilizing these target points annotated by human annotators. Experimental results on benchmark datasets show the effectiveness of our methods against existing unsupervised domain adaptation approaches. The proposed pipeline is generic and can be included as an extra module to existing domain adaptation strategies.
Key Words: Active Learning, Unsupervised Domain Adaptation, Semantic Segmentation.
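A minimal sketch of the entropy-based point selection described above: rank the pixels of a target image by predictive entropy and return the most uncertain ones for annotation (the per-image budget is illustrative).

    import torch
    import torch.nn.functional as F

    def select_points(logits, budget=10):
        """logits: C x H x W segmentation logits for one target image.
        Returns (y, x) coordinates of the `budget` highest-entropy pixels."""
        probs = F.softmax(logits, dim=0)
        entropy = -(probs * torch.log(probs + 1e-8)).sum(dim=0)  # H x W uncertainty map
        flat_idx = entropy.flatten().topk(budget).indices        # most uncertain pixels
        h, w = entropy.shape
        return [(i.item() // w, i.item() % w) for i in flat_idx]

    points = select_points(torch.randn(19, 128, 256))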
Attentive and Contrastive Learning for Joint Depth and Motion Field Estimation.
Seokju Lee, Francois Rameau, Fei Pan, In So Kweon.
International Conference on Computer Vision (ICCV), 2021. [pdf] [code]
Estimating the motion of the camera together with the 3D structure of the scene from a monocular vision system is a complex task that often relies on the so-called scene rigidity assumption. When observing a dynamic environment, this assumption is violated, which leads to an ambiguity between the ego-motion of the camera and the motion of the objects. To solve this problem, we present a self-supervised learning framework for 3D object motion field estimation from monocular videos. Our contributions are two-fold. First, we propose a two-stage projection pipeline to explicitly disentangle the camera ego-motion and the object motions with a dynamics attention module, called DAM. Specifically, we design an integrated motion model that estimates the motion of the camera and object in the first and second warping stages, respectively, controlled by the attention module through a shared motion encoder. Second, we propose an object motion field estimation through contrastive sample consensus, called CSAC, taking advantage of weak semantic prior (bounding box from an object detector) and geometric constraints (each object respects the rigid body motion model). Experiments on KITTI, Cityscapes, and Waymo Open Dataset demonstrate the relevance of our approach and show that our method outperforms state-of-the-art algorithms for the tasks of self-supervised monocular depth estimation, object motion segmentation, monocular scene flow estimation, and visual odometry.
Key Words: Motion Field Estimation, Monocular Depth Prediction, Geometric Learning.
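A minimal sketch of the first (ego-motion) warping stage, assuming a pinhole camera model: back-project the target pixels with the predicted depth, apply the relative pose, and re-project to sample the source view. The paper's full pipeline adds the second, per-object warping stage and the attention module.

    import torch
    import torch.nn.functional as F

    def warp_with_depth_pose(src, depth, K, T):
        """Synthesize the target view by warping `src` (B x 3 x H x W) using the
        target depth (B x 1 x H x W), intrinsics K (3 x 3), and relative pose
        T (B x 4 x 4)."""
        b, _, h, w = src.shape
        ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
        pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=0).float().view(3, -1)
        cam = (K.inverse() @ pix).unsqueeze(0) * depth.view(b, 1, -1)  # back-project
        cam_h = torch.cat([cam, torch.ones(b, 1, h * w)], dim=1)       # homogeneous
        proj = K.unsqueeze(0) @ (T @ cam_h)[:, :3]                     # transform, re-project
        uv = proj[:, :2] / (proj[:, 2:3] + 1e-7)
        u = 2 * uv[:, 0] / (w - 1) - 1                                 # normalize to [-1, 1]
        v = 2 * uv[:, 1] / (h - 1) - 1
        grid = torch.stack([u, v], dim=-1).view(b, h, w, 2)
        return F.grid_sample(src, grid, align_corners=True)

    K = torch.tensor([[100., 0., 64.], [0., 100., 32.], [0., 0., 1.]])
    out = warp_with_depth_pose(torch.rand(1, 3, 64, 128), torch.rand(1, 1, 64, 128) + 1.0,
                               K, torch.eye(4).unsqueeze(0))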
Two-phase Pseudo Label Densification for Self-training based Domain Adaptation.
Inkyu Shin, Sanghyun Woo, Fei Pan, In So Kweon.
European Conference on Computer Vision (ECCV), 2020. [pdf]
Recently, deep self-training approaches have emerged as a powerful solution to unsupervised domain adaptation. The self-training scheme involves iterative processing of target data; it generates target pseudo labels and retrains the network. However, since only the confident predictions are taken as pseudo labels, existing self-training approaches inevitably produce sparse pseudo labels in practice. This is critical because the resulting insufficient training signals lead to a suboptimal, error-prone model. To tackle this problem, we propose a novel Two-phase Pseudo Label Densification framework, referred to as TPLD. In the first phase, we use sliding window voting to propagate the confident predictions, utilizing the intrinsic spatial correlations in the images. In the second phase, we perform a confidence-based easy-hard classification. For the easy samples, we now employ their full pseudo labels. For the hard ones, we instead adopt adversarial learning to enforce hard-to-easy feature alignment. To ease the training process and avoid noisy predictions, we introduce a bootstrapping mechanism to the original self-training loss. We show the proposed TPLD can be easily integrated into existing self-training based approaches and improves the performance significantly. Combined with the recently proposed CRST self-training framework, we achieve new state-of-the-art results on two standard UDA benchmarks.
Key Words: Self-training, Domain Adaptation, Pseudo Label Correction.
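The first densification phase can be approximated by letting confident pixels vote for their ignored neighbors within a sliding window; the sketch below implements the vote with pooled one-hot labels (window size illustrative, not the paper's exact scheme).

    import torch
    import torch.nn.functional as F

    def densify(pseudo, num_classes, ignore=255, window=7):
        """pseudo: H x W pseudo-label map with `ignore` at unconfident pixels.
        Confident neighbors vote through average pooling of one-hot labels."""
        valid = pseudo != ignore
        safe = torch.where(valid, pseudo, torch.zeros_like(pseudo))
        onehot = F.one_hot(safe, num_classes).permute(2, 0, 1).float()
        onehot = onehot * valid.unsqueeze(0)                   # zero out ignored pixels
        votes = F.avg_pool2d(onehot.unsqueeze(0), window, stride=1,
                             padding=window // 2).squeeze(0)   # per-class neighbor votes
        filled = votes.argmax(dim=0)                           # winning class per pixel
        has_vote = votes.sum(dim=0) > 0                        # any confident neighbor?
        return torch.where(valid, pseudo, torch.where(has_vote, filled, pseudo))

    dense = densify(torch.randint(0, 19, (128, 256)), num_classes=19)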
Unsupervised Intra-domain Adaptation for Semantic Segmentation through Self-supervision.
Fei Pan, Inkyu Shin, Francois Rameau, Seokju Lee, In So Kweon.
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020. [pdf] [code] Oral Presentation
Convolutional neural network-based approaches have achieved remarkable progress in semantic segmentation. However, these approaches heavily rely on annotated data, which are labor-intensive to produce. To cope with this limitation, automatically annotated data generated from graphic engines are used to train segmentation models. However, models trained on synthetic data are difficult to transfer to real images. To tackle this issue, previous works have considered directly adapting models from the source data to the unlabeled target data (to reduce the inter-domain gap). Nonetheless, these techniques do not consider the large distribution gap within the target data itself (the intra-domain gap). In this work, we propose a two-step self-supervised domain adaptation approach to minimize the inter-domain and intra-domain gaps together. First, we conduct the inter-domain adaptation of the model; from this adaptation, we separate the target domain into an easy and a hard split using an entropy-based ranking function. Finally, to decrease the intra-domain gap, we propose to employ a self-supervised adaptation technique from the easy to the hard split. Experimental results on numerous benchmark datasets highlight the effectiveness of our method against existing state-of-the-art approaches. The source code is available at https://github.com/feipanir/IntraDA.
Key Words: Domain Adaptation, Adversarial Training, Semantic Segmentation, Self-supervised Learning.
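A minimal sketch of the entropy-based ranking used to separate the target domain: score each image by its mean prediction entropy, sort, and take the lowest-entropy fraction as the easy split (the ratio below is illustrative).

    import torch
    import torch.nn.functional as F

    def split_easy_hard(per_image_logits, lam=0.67):
        """per_image_logits: list of C x H x W logits, one per target image.
        Returns indices of the easy (low-entropy) and hard (high-entropy) splits."""
        scores = []
        for logits in per_image_logits:
            p = F.softmax(logits, dim=0)
            ent = -(p * torch.log(p + 1e-8)).sum(dim=0)  # per-pixel entropy map
            scores.append(ent.mean().item())             # image-level ranking score
        order = sorted(range(len(scores)), key=lambda i: scores[i])
        cut = int(lam * len(order))
        return order[:cut], order[cut:]                  # easy, hard

    easy, hard = split_easy_hard([torch.randn(19, 64, 64) for _ in range(10)])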
Variational Prototyping-Encoder: One-shot Learning with Prototypical Images.
Junsik Kim, Tae-hyun Oh, Seokju Lee, Fei Pan, In So Kweon.
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019. [pdf] [code]
In daily life, graphic symbols, such as traffic signs and brand logos, are ubiquitously utilized around us due to their intuitive expression beyond language boundaries. We tackle an open-set graphic symbol recognition problem by one-shot classification with prototypical images as a single training example for each novel class. We take an approach to learn a generalizable embedding space for novel tasks. We propose a new approach called variational prototyping-encoder (VPE) that learns the image translation task from real-world input images to their corresponding prototypical images as a meta-task. As a result, VPE learns image similarity as well as prototypical concepts, which differs from widely used metric learning based approaches. Our experiments with diverse datasets demonstrate that the proposed VPE performs favorably against competing metric learning based one-shot methods. Also, our qualitative analyses show that our meta-task induces an effective embedding space suitable for unseen data representation.
Key Words: One-Shot Learning, Prototypical Learning, Variational Auto-encoder.
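VPE can be viewed as a VAE whose decoder is trained to reconstruct the clean prototype image rather than the input photo. A minimal sketch of that objective; the architecture sizes are illustrative:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class VPE(nn.Module):
        """Encode a real-world symbol photo; decode its clean prototype image."""
        def __init__(self, zdim=64):
            super().__init__()
            self.enc = nn.Sequential(nn.Conv2d(3, 32, 4, 2, 1), nn.ReLU(),
                                     nn.Conv2d(32, 64, 4, 2, 1), nn.ReLU(), nn.Flatten())
            self.mu = nn.Linear(64 * 16 * 16, zdim)
            self.logvar = nn.Linear(64 * 16 * 16, zdim)
            self.dec = nn.Sequential(nn.Linear(zdim, 64 * 16 * 16), nn.ReLU(),
                                     nn.Unflatten(1, (64, 16, 16)),
                                     nn.ConvTranspose2d(64, 32, 4, 2, 1), nn.ReLU(),
                                     nn.ConvTranspose2d(32, 3, 4, 2, 1), nn.Sigmoid())

        def forward(self, photo):
            h = self.enc(photo)
            mu, logvar = self.mu(h), self.logvar(h)
            z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterize
            return self.dec(z), mu, logvar

    model = VPE()
    photo, prototype = torch.rand(4, 3, 64, 64), torch.rand(4, 3, 64, 64)
    recon, mu, logvar = model(photo)
    # Reconstruct the *prototype* (not the input) plus the usual KL regularizer.
    loss = F.binary_cross_entropy(recon, prototype) \
           - 0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())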
Driver Drowsiness Detection System Based on Feature Representation Learning Using Various Deep Networks.
Sanghyuk Park, Fei Pan, Sunghun Kang, Chang D. Yoo.
Asian Conference on Computer Vision Workshops (ACCVW), 2016. [pdf]
Statistics have shown that 20% of all road accidents are fatigue-related, and drowsiness detection is a car safety feature that can alert a drowsy driver in hopes of preventing an accident. This paper proposes a deep architecture referred to as the deep drowsiness detection (DDD) network for learning effective features and detecting drowsiness given an RGB input video of a driver. The DDD network consists of three deep networks for attaining global robustness to background and environmental variations and learning local facial movements and head gestures important for reliable detection. The outputs of the three networks are integrated and fed to a softmax classifier for drowsiness detection. Experimental results show that DDD achieves 73.06% detection accuracy on the NTHU drowsy driver detection benchmark dataset.
Key Words: Driver Drowsiness Detection, Representation learning.
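A minimal sketch of the fusion described above: three feature streams whose outputs are concatenated and fed to a softmax classifier. The toy streams below are placeholders for the deep networks used in the paper:

    import torch
    import torch.nn as nn

    class DDDFusion(nn.Module):
        """Three feature streams (e.g., global scene, face, head pose) fused by a
        linear softmax classifier for drowsy / not-drowsy prediction."""
        def __init__(self, feat_dim=128, num_classes=2):
            super().__init__()
            def stream():  # placeholder backbone; the real streams are deep CNNs
                return nn.Sequential(nn.Conv2d(3, 16, 3, 2, 1), nn.ReLU(),
                                     nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                     nn.Linear(16, feat_dim), nn.ReLU())
            self.streams = nn.ModuleList([stream() for _ in range(3)])
            self.classifier = nn.Linear(3 * feat_dim, num_classes)

        def forward(self, frame):
            feats = [s(frame) for s in self.streams]         # one feature per stream
            return self.classifier(torch.cat(feats, dim=1))  # fused softmax logits

    logits = DDDFusion()(torch.rand(2, 3, 112, 112))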
The three fundamental problems of computer vision are correspondence, correspondence, and correspondence! -- Takeo Kanade