We introduce EdgeCrafter, a unified compact ViT framework
for edge dense prediction centered on ECDet, a detection model built from a distilled compact
backbone and an edge-friendly encoder–decoder design. We first adapt a large DINOv3-
pretrained ViT to object detection and use it as a task-specialized teacher to distill rich
representations into compact student backbones.
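The teacher-to-student transfer described above can be sketched as a feature-distillation loss. This is a minimal illustrative example, not the paper's actual training code: the projector design, token shapes, and dimensions (384-d student, 1024-d teacher) are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureDistillLoss(nn.Module):
    """Pulls compact-student ViT tokens toward frozen teacher tokens.

    A linear projector lifts the student's channel dimension to the
    teacher's, then an MSE loss aligns the projected student features
    with the (detached) teacher features.
    """
    def __init__(self, student_dim: int, teacher_dim: int):
        super().__init__()
        self.proj = nn.Linear(student_dim, teacher_dim)

    def forward(self, student_feats: torch.Tensor,
                teacher_feats: torch.Tensor) -> torch.Tensor:
        # student_feats: (B, N, student_dim) tokens from the compact backbone
        # teacher_feats: (B, N, teacher_dim) tokens from the frozen,
        #                task-specialized teacher
        return F.mse_loss(self.proj(student_feats), teacher_feats.detach())

# Example: distill hypothetical 1024-d teacher tokens into a 384-d student.
loss_fn = FeatureDistillLoss(student_dim=384, teacher_dim=1024)
s = torch.randn(2, 196, 384)   # student backbone tokens
t = torch.randn(2, 196, 1024)  # teacher tokens
loss = loss_fn(s, t)           # scalar distillation loss
```

In practice such a loss would be added to the detection objective while the teacher stays frozen, so only the student and projector receive gradients.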
On the COCO dataset, ECDet-S achieves 51.7 AP with fewer than 10M parameters using only COCO
annotations. For instance segmentation, ECInsSeg achieves performance comparable to RF-
DETR while using substantially fewer parameters. For pose estimation, ECPose-X reaches
74.8 AP, significantly outperforming YOLO26Pose-X (71.6 AP) despite the latter’s reliance
on extensive Objects365 pretraining.
- Seamless support for object detection, instance segmentation, and human pose estimation in one framework.
- Superior accuracy-to-parameter ratio across multiple challenging vision tasks.
- Edge-friendly architectural design for practical, real-world deployment.
If you find our work useful, please consider citing:
@article{liu2026edgecrafter,
  title={EdgeCrafter: Compact ViTs for Edge Dense Prediction via Task-Specialized Distillation},
  author={Liu, Longfei and Hou, Yongjie and Li, Yang and Wang, Qirui and Sha, Youyang and Yu, Yongjun and Wang, Yinzhi and Ru, Peizhe and Yu, Xuanlong and Shen, Xi},
  journal={arXiv},
  year={2026}
}