We introduce EdgeCrafter, a unified compact ViT framework
for edge dense prediction centered on ECDet, a detection model built from a distilled compact
backbone and an edge-friendly encoder–decoder design. We first adapt a large DINOv3-
pretrained ViT to object detection and use it as a task-specialized teacher to distill rich
representations into compact student backbones.
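The teacher-to-student transfer described above can be sketched as a feature-distillation loss. This is a minimal illustrative example, not the paper's actual training code: the projector design, token shapes, and dimensions (384-d student, 1024-d teacher) are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureDistillLoss(nn.Module):
    """Pulls compact-student ViT tokens toward frozen teacher tokens.

    A linear projector lifts the student's channel dimension to the
    teacher's, then an MSE loss aligns the projected student features
    with the (detached) teacher features.
    """
    def __init__(self, student_dim: int, teacher_dim: int):
        super().__init__()
        self.proj = nn.Linear(student_dim, teacher_dim)

    def forward(self, student_feats: torch.Tensor,
                teacher_feats: torch.Tensor) -> torch.Tensor:
        # student_feats: (B, N, student_dim) tokens from the compact backbone
        # teacher_feats: (B, N, teacher_dim) tokens from the frozen,
        #                task-specialized teacher
        return F.mse_loss(self.proj(student_feats), teacher_feats.detach())

# Example: distill hypothetical 1024-d teacher tokens into a 384-d student.
loss_fn = FeatureDistillLoss(student_dim=384, teacher_dim=1024)
s = torch.randn(2, 196, 384)   # student backbone tokens
t = torch.randn(2, 196, 1024)  # teacher tokens
loss = loss_fn(s, t)           # scalar distillation loss
```

In practice such a loss would be added to the detection objective while the teacher stays frozen, so only the student and projector receive gradients.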
On the COCO dataset, ECDet-S achieves 51.7 AP with fewer than 10M parameters using only COCO
annotations. For instance segmentation, ECInsSeg achieves performance comparable to RF-
DETR while using substantially fewer parameters. For pose estimation, ECPose-X reaches
74.8 AP, significantly outperforming YOLO26Pose-X (71.6 AP) despite the latter’s reliance
on extensive Objects365 pretraining.
- Seamless support for object detection, instance segmentation, and human pose estimation in one framework.
- Superior accuracy-to-parameter ratio across multiple challenging vision tasks.
- Edge-friendly architectural design for practical, real-world deployment.
If you find our work useful, please consider citing:
@article{liu2026edgecrafter,
  title={EdgeCrafter: Compact ViTs for Edge Dense Prediction via Task-Specialized Distillation},
  author={Liu, Longfei and Hou, Yongjie and Li, Yang and Wang, Qirui and Sha, Youyang and Yu, Yongjun and Wang, Yinzhi and Ru, Peizhe and Yu, Xuanlong and Shen, Xi},
  journal={arXiv},
  year={2026}
}