EdgeCrafter
Compact ViTs for Edge Dense Prediction via Task-Specialized Distillation


Demo

ECSeg-L Demo (34M Params, 111 GFLOPs)

Abstract

We introduce EdgeCrafter, a unified compact ViT framework for edge dense prediction centered on ECDet, a detection model built from a distilled compact backbone and an edge-friendly encoder–decoder design. We first adapt a large DINOv3-pretrained ViT to object detection and use it as a task-specialized teacher to distill rich representations into compact student backbones.
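The page does not spell out the distillation objective, but a common choice for teacher-to-student representation transfer is a feature-alignment loss: project the student backbone's features to the teacher's dimension and minimize their distance. The sketch below is a hedged illustration of that generic recipe, not EdgeCrafter's exact loss; all names and shapes are assumptions.

```python
import numpy as np

def feature_distill_loss(student_feats, teacher_feats, proj):
    """Mean-squared error between linearly projected student features
    and frozen teacher features (generic sketch, not the paper's loss).

    student_feats: (N, d_s) student backbone tokens
    teacher_feats: (N, d_t) teacher tokens (treated as constants)
    proj:          (d_s, d_t) learned projection matrix
    """
    aligned = student_feats @ proj        # map student tokens to teacher dim
    diff = aligned - teacher_feats
    return float(np.mean(diff ** 2))

# Toy check: if the projected student features match the teacher exactly,
# the loss is zero.
s = np.ones((4, 8))
p = np.eye(8, 16)                         # pad-style projection 8 -> 16
t = s @ p
assert feature_distill_loss(s, t, p) == 0.0
```

In a real training loop the projection (and the student) would be optimized by gradient descent while the teacher stays frozen; the numpy version above only illustrates the quantity being minimized.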

On the COCO dataset, ECDet-S achieves 51.7 AP with fewer than 10M parameters using only COCO annotations. For instance segmentation, ECInsSeg achieves performance comparable to RF-DETR while using substantially fewer parameters. For pose estimation, ECPose-X reaches 74.8 AP, significantly outperforming YOLO26Pose-X (71.6 AP) despite the latter's reliance on extensive Objects365 pretraining.

Unified Framework

A single framework seamlessly supports object detection, instance segmentation, and human pose estimation.

State-of-the-Art

Superior accuracy-to-parameter ratio across multiple challenging vision tasks.

Real-Time Edge Inference

An edge-friendly architectural design built for real-time inference in practical, real-world applications.

51.7 Detection mAP · 10M Parameters

Methodology

Task-Specialized Distillation Pipeline

Distillation Pipeline Diagram
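In outline, the task-specialized pipeline described in the abstract has two stages: first adapt the large DINOv3-pretrained ViT to the target task to obtain a teacher, then train the compact student on the task loss plus a distillation term. The combined objective below is a sketch under assumptions (the weighting scheme and `lam` are hypothetical, not taken from the paper).

```python
def student_training_loss(task_loss, distill_loss, lam=1.0):
    """Hypothetical combined student objective: the dense-prediction task
    loss (e.g. detection) plus a weighted feature-distillation term."""
    return task_loss + lam * distill_loss

# Example: equal weighting of a detection loss and a distillation loss.
assert student_training_loss(2.0, 0.5, lam=2.0) == 3.0
```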

ECDet Architecture

ECDet Architecture Diagram

Results

Object Detection

Object Detection Results

Pose Estimation

Pose Estimation Results

Instance Segmentation

Instance Segmentation Results

Citation

If you find our work useful, please consider citing:

BibTeX
@article{liu2026edgecrafter,
  title={EdgeCrafter: Compact ViTs for Edge Dense Prediction via Task-Specialized Distillation},
  author={Liu, Longfei and Hou, Yongjie and Li, Yang and Wang, Qirui and Sha, Youyang and Yu, Yongjun and Wang, Yinzhi and Ru, Peizhe and Yu, Xuanlong and Shen, Xi},
  journal={arXiv},
  year={2026}
}