Part II: The Frontier of Robot Manipulation

Chapter 5: Jim Fan and GEAR, a Map of Research on Generalized Embodied Agents

Written: 2026-06-08; last revised: 2026-06-24

Overview

The work of Jim Fan and the GEAR group should be read not as a "list of robot papers" but as a long-term program for building generalized embodied agents. MineDojo and Voyager demonstrated open-ended environments, automatic curricula, and executable skill libraries; VIMA carried multimodal prompts into manipulation; and Eureka and DrEureka opened a path toward automating the design of rewards and sim-to-real parameters ^[1]; ^[2]; ^[3]; ^[4]; ^[5]. The 2025-2026 GEAR publications go one step further and expand into two families. The first covers data generation, world/action models, coding agents, and simulation infrastructure: DreamGen, FLARE, DreamDojo, DreamZero, CaP-X, SCIZOR, PLD, and Isaac Lab. The second covers human-motion-to-humanoid-control: SOMA, BONES-SEED, GENMO, Kimodo, MotionBricks, SONIC, and GRAIL ^[11]; ^[18]; ^[20]; ^[21]; ^[22]; ^[23]; ^[24].

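For manufacturing robotics, the significance of this lineage is neither Minecraft nor benchmark performance. What matters is the methodology for turning workers' tacit knowledge into skill graphs, rewards, world models, human-video pretraining, and failure feedback.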

After reading this chapter, you will be able to:
- Organize the GEAR line of work along five axes: open-ended agents, reward automation, world/action models, human-video scaling, and coding-agent robotics.
- Explain where human motion representation, controllable motion generation, whole-body tracking, and synthetic task demonstration sit in the GEAR flywheel.
- Explain the indirect lessons that MineDojo, Voyager, and VIMA offer for designing a manufacturing skill library.
- Judge why Eureka and DrEureka matter for designing rewards and domain randomization in a factory cell.
- Distinguish the guardrails and the data ownership questions that arise when moving GEAR-style research into a production cell.

Figure 5.1: How a language model generates robot skill code in the Code-as-Policies family. source: S3 reused figure

5.1 Translating GEAR's Research Map into Manufacturing Terms

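GEAR's central question goes beyond "an agent that solves one task in one environment" and asks how to build "an agent that, when it meets a new goal, explores, creates skills, and corrects itself from feedback." MineDojo combined open-ended tasks, internet-scale knowledge, and a video-language reward ^[1]. Voyager showed a structure in which an LLM builds an automatic curriculum, accumulates executable code skills, and feeds failure logs back into the prompt ^[2].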

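A factory is far narrower than Minecraft, but the cost of failure is far higher. Manufacturing applications therefore call for bounded open-endedness rather than open-ended autonomy. The agent does not try new behaviors at will. It explores variants inside an approved skill graph, and when it fails it must leave a log that operators and the quality team can understand.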
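
One way to picture bounded open-endedness is a check that accepts a proposed variant only when it stays inside an approved envelope and otherwise returns a readable reason. The sketch below is illustrative only: the skill name `insert_pin`, the parameter names, and every numeric bound are hypothetical and do not come from any GEAR paper or real cell.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParamBound:
    low: float
    high: float


# Hypothetical approved envelope for one skill; names and numbers are illustrative.
APPROVED_BOUNDS = {
    "insert_pin": {
        "approach_speed_mm_s": ParamBound(5.0, 20.0),
        "max_force_n": ParamBound(1.0, 15.0),
    }
}


def check_variant(skill: str, params: dict[str, float]) -> list[str]:
    """Return readable violations; an empty list means the variant is inside the envelope."""
    bounds = APPROVED_BOUNDS.get(skill)
    if bounds is None:
        return [f"skill '{skill}' is not in the approved skill graph"]
    violations = []
    for name, value in params.items():
        bound = bounds.get(name)
        if bound is None:
            violations.append(f"parameter '{name}' is not approved for '{skill}'")
        elif not (bound.low <= value <= bound.high):
            violations.append(f"{name}={value} outside [{bound.low}, {bound.high}]")
    return violations
```

The returned strings double as the failure log: a rejected variant records which bound it crossed, so an operator can read the reason without inspecting the agent itself.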

GEAR axis	Meaning in research	Meaning translated to a manufacturing cell
Open-ended environment	Diverse goals and long horizons	Product variants, fixture variants, rework paths
Skill library	Accumulation of executable code or policies	Approved work primitives and recovery recipes
Automated feedback	Verifier, execution errors, self-correction	Quality verdicts, safety interlocks, operator approval
Data flywheel	Video, simulation, human feedback	On-site demonstrations, failure logs, QA images
World/action model	Joint modeling of future states and actions	Policy evaluation, synthetic rollouts, prediction before failure

5.2 The Indirect Significance of MineDojo, Voyager, and VIMA

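MineDojo and Voyager cannot be moved into a factory as they are. Both nevertheless make a strong case that manufacturers should treat physical AI as "operating a skill library" rather than "adopting a model" ^[1]; ^[2]. Skills in a manufacturing cell should be version-controlled like code. The record must show under which preconditions a skill is called, which sensor checks it passes through, which recovery path it takes on failure, and under which model version it was approved.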
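
As a minimal sketch of such a versioned skill record, the fields below mirror the four items listed above. The class names, field names, and the `may_call` rule are assumptions made for illustration, not an interface from Voyager or any GEAR codebase.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SkillRecord:
    name: str
    version: str
    preconditions: tuple[str, ...]   # conditions that must hold before the call
    sensor_checks: tuple[str, ...]   # checks the skill passes through while running
    recovery_path: str               # skill to move to on failure
    approved_model_version: str      # model version under which this skill was approved


class SkillLibrary:
    def __init__(self) -> None:
        self._records: dict[tuple[str, str], SkillRecord] = {}

    def register(self, record: SkillRecord) -> None:
        key = (record.name, record.version)
        if key in self._records:
            # Approved records are immutable: a change requires a new version.
            raise ValueError(f"{record.name} {record.version} is already registered")
        self._records[key] = record

    def may_call(self, name: str, version: str, model_version: str,
                 satisfied: set[str]) -> bool:
        """Allow a call only for a registered skill, the approved model, and met preconditions."""
        record = self._records.get((name, version))
        if record is None:
            return False
        return (record.approved_model_version == model_version
                and set(record.preconditions) <= satisfied)
```

Refusing to overwrite an existing `(name, version)` pair is the design choice that makes the library auditable: any behavioral change shows up as a new version with its own approval.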

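VIMA showed early on that a robotics task specification does not arrive through language alone ^[3]. Instructions on the shop floor mix text work standards, CAD geometry, camera views, fixture poses, prior successful cases, and images of defective samples. A good manufacturing agent is therefore closer to a multimodal prompt interpreter than to a natural-language command processor.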

Figure 5.2: How the execute-debug loop of a coding agent is carried over to robot manipulation, as in CaP-X. source: S3 reused figure

5.3 Eureka and DrEureka: What Reward Automation Means for Manufacturing

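Eureka showed that an LLM can generate reward code, improve it iteratively, and surpass human-designed rewards on a variety of RL tasks ^[4]. DrEureka goes one step further: the LLM proposes not only the reward but also the domain randomization settings, pointing toward LLM-assisted sim-to-real transfer ^[5].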

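In a manufacturing cell, reward automation is attractive but dangerous. A reward that raises insertion success can scratch the part surface or load the fixture with excessive force. A cycle-time reward can push against safe speed limits. A reward that lowers the defect rate can push ambiguous cases toward reject and ruin throughput.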

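Factory rewards may therefore be generated, but they must be approved. An LLM can produce reward candidates and randomization candidates, while the quality and safety teams must fix prohibited conditions, force limits, inspection criteria, and rework cost as separate constraints. The reward is the objective; the constraint is the production responsibility.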
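
The separation between a generated reward and fixed constraints can be sketched as follows. This is a toy illustration, not the Eureka or DrEureka pipeline: `candidate_reward` stands in for an LLM-proposed reward, and the episode keys and limit values are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class HardConstraints:
    # Owned by the safety and quality teams; never edited by the reward generator.
    max_force_n: float
    max_speed_mm_s: float


def evaluate_episode(reward_fn: Callable[[dict], float], episode: dict,
                     limits: HardConstraints) -> tuple[Optional[float], str]:
    """Score an episode with a candidate reward only if no hard constraint was violated."""
    if episode["peak_force_n"] > limits.max_force_n:
        return None, "rejected: force limit exceeded"
    if episode["peak_speed_mm_s"] > limits.max_speed_mm_s:
        return None, "rejected: speed limit exceeded"
    return reward_fn(episode), "accepted"


def candidate_reward(episode: dict) -> float:
    # Stand-in for a generated reward: favor insertion success, penalize cycle time.
    return 1.0 * episode["inserted"] - 0.01 * episode["cycle_time_s"]
```

Because the constraint check runs before the reward is ever computed, a candidate reward cannot trade a force or speed violation for a higher score, however it is written.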

5.4 DreamGen, FLARE, DreamDojo, DreamZero: How World Models Become Policies

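The biggest change in GEAR's recent landscape is that the world model is no longer a background technology but has moved to the center of the robot policy stack. DreamGen uses a video world model to generate synthetic robot trajectories and recovers pseudo-action sequences from them, aiming to go beyond limited teleoperation data ^[12]. FLARE adds tokens inside the VLA that predict future latent representations, so that a standard diffusion-transformer policy takes long-term outcomes into account ^[13].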

DreamDojo and DreamZero push this direction more explicitly. DreamDojo learns interaction and dexterous control from about 44k hours of egocentric human video, is post-trained on robot data, and targets applications such as live teleoperation, policy evaluation, and model-based planning ^[6]. DreamZero interprets a world action model as a "zero-shot policy" and shows that when a video diffusion backbone models future world states and actions jointly, generalization to unseen physical motions, a weak point of VLAs, can improve ^[14].

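From a manufacturing perspective, this trend connects directly to Cosmos/Isaac. A factory cannot replay every failure in the real cell. A world model can link approved simulation assets with on-site video and become the layer that first evaluates "how will the part and the fixture change if this recovery is attempted?" A generative world model is not a quality inspector, however. Production approval still requires real sensor logs, force/tactile traces, and final inspection data.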

5.5 CaP-X and Coding-Agent Robotics

CaP-X is another axis on the GEAR map. In contrast to a VLA, Code-as-Policy does not train one large action policy directly; it generates an executable program that composes perception and control primitives. Through CaP-Gym, CaP-Bench, CaP-Agent0, and CaP-RL, CaP-X evaluates how much a coding agent depends on designer scaffolding in manipulation tasks, and how much robustness multi-turn execution feedback and skill synthesis recover ^[15].

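This approach is especially realistic for a manufacturing cell. Rather than handing every action to a VLA, verifiable primitives such as moving to an inspection position, looking up a CAD pose, checking a force threshold, and running a retry sequence can be composed in code. The point is not to let the LLM write code freely but to have it iterate only inside an approved API, a simulator, unit tests, and a safety checker.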
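
One small piece of that boundary, a static check that generated code calls only approved primitives, can be sketched with Python's standard `ast` module. The primitive names below are hypothetical stand-ins for the four examples in the text and are not a CaP-X API. This is a pre-filter, not a sandbox: it would sit in front of the simulator, unit tests, and safety checker, not replace them.

```python
import ast

# Hypothetical approved primitives, mirroring the examples in the paragraph above.
APPROVED_PRIMITIVES = {
    "move_to_inspection_pose",
    "lookup_cad_pose",
    "check_force_threshold",
    "run_retry_sequence",
}


def disallowed_constructs(source: str) -> list[str]:
    """List imports and calls in generated code that fall outside the approved API."""
    problems = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            problems.append("import statement")
        elif isinstance(node, ast.Call):
            func = node.func
            if not isinstance(func, ast.Name):
                # Attribute or computed calls (obj.method(), f()()) are rejected outright.
                problems.append("indirect or attribute call")
            elif func.id not in APPROVED_PRIMITIVES:
                problems.append(f"call to {func.id}")
    return problems
```

A program that only chains `lookup_cad_pose` and `move_to_inspection_pose` passes with an empty list, while anything that imports a module or calls a built-in such as `open` is reported before it reaches a simulator.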

GEAR 2025-2026 trend	Representative work	Position in the manufacturing stack
World model data generation	DreamGen, DreamDojo	Synthetic rollouts, policy evaluation, data augmentation
Implicit/action world model	FLARE, DreamZero	Future-state prediction and closed-loop action generation
Coding-agent robotics	CaP-X	Verifiable primitive composition and debug loop
Self-improving VLA	PLD	Failure-region exploration and deployment-aligned data collection
Simulation infrastructure	Isaac Lab	Runtime for multi-modal robot learning and domain randomization

Adding the 2026 humanoid motion and control work makes GEAR's middle layer clearer. SOMA and BONES-SEED organize human motion into a common skeleton and a robot-compatible format, and GENMO unifies human motion estimation and generation under video, text, keypoint, music, and keyframe conditions ^[19]; ^[20]; ^[18]. On top of this representation, Kimodo and MotionBricks generate reference motions constrained by text, keyframes, waypoints, paths, and contacts, and SONIC and GRAIL bring these down to whole-body tracking and task-level loco-manipulation demonstrations ^[21]; ^[22]; ^[24]; ^[23].

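Additional layer	Representative work	Position in the manufacturing stack
Motion representation	SOMA, BONES-SEED, GENMO	Converts worker motion and human video into an intermediate representation that robots can learn from
Reference motion generation	Kimodo, MotionBricks	Generates controllable humanoid references from motions that people used to perform
Whole-body/task execution	SONIC, GRAIL	Brings reference motions and 3D assets down to real humanoid control and candidate task demos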

5.6 EgoScale: Turning Human Video into Robot Data

DreamDojo and EgoScale show that the GEAR line treats human video not as mere reference material but as a source for robot world models and dexterous motor priors ^[6]; ^[7]. EgoScale pretrains a VLA on more than 20,854 hours of action-labeled egocentric human video and reports log-linear scaling between human data volume and validation loss, as well as a correlation between validation loss and real-robot performance ^[7]. Egocentric video of human hands can be collected at scale, captures fine-grained behavior before and after contact, and covers a wide range of objects and situations.
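
As a rough reading of the log-linear relation mentioned above, one assumed functional form is given below. The symbols $D$ (hours of human video), $L$ (validation loss), $\alpha$, and $\beta$ are introduced here for illustration only; the chapter does not state the paper's exact parameterization or fitted values.

```latex
L(D) \approx \alpha - \beta \log D, \qquad \beta > 0
\quad\Longrightarrow\quad
L(2D) - L(D) \approx -\beta \log 2
```

Under this assumed form, each doubling of the data volume lowers validation loss by roughly the same fixed amount, which is why returns look steady on a log axis but shrink per added hour.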

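The same potential exists for manual work in manufacturing. The hand movements of a skilled operator, the gaze shifts that detect a defect, the small cues that trigger rework, the order in which tools are picked up, and the auxiliary motion of briefly supporting a part do not appear in conventional PLC logs. If this data is collected and linked to task id, product lot, fixture version, and quality outcome, human-video pretraining can become a factory knowledge asset.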

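Privacy, security, process confidentiality, and worker consent come first, however. Applying GEAR-style scaling to manufacturing requires designing data governance before installing more cameras. It must be clear which footage is used for training, which footage is discarded, and which features alone are retained.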
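
The linkage and the governance gate described above might be recorded together, so that a clip becomes training data only when both are in place. The sketch below is a hypothetical schema: the field names follow the four linkage keys named in the text, while `consent_on_file` and `anonymized` are assumed flags, not a standard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClipRecord:
    clip_id: str
    task_id: str
    product_lot: str
    fixture_version: str
    quality_outcome: str    # e.g. "pass", "rework", "reject"
    consent_on_file: bool   # worker consent recorded for this clip
    anonymized: bool        # faces and badges removed before storage


def usable_for_training(clip: ClipRecord) -> bool:
    """A clip qualifies only if it is fully linked and cleared by governance."""
    linked = all([clip.task_id, clip.product_lot,
                  clip.fixture_version, clip.quality_outcome])
    return linked and clip.consent_on_file and clip.anonymized
```

Keeping the consent and anonymization flags on the same record as the production keys means the training pipeline cannot select a clip by lot or task without also seeing whether it is cleared for use.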

5.7 Selection Criteria for Manufacturing Adoption

Connecting every GEAR paper directly to factory robots overstates the case. A better criterion is to ask which layer of the manufacturing agent stack each work strengthens. Voyager is closest to skill memory and the debug loop, VIMA to multimodal instruction, Eureka/DrEureka to reward and randomization, DreamDojo to the world model, EgoScale to the human-video motor prior, and GR00T to the humanoid execution interface.

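Manufacturers do not need to adopt all of these layers at once. The first step is to build a task schema and a skill library in a single cell. The second is to connect simulation with real demonstrations. The third is to route failure feedback back into rewards, randomization, and the retraining queue. Only at the fourth step should an open-ended agent be allowed to propose or automatically adjust parts of cell operation.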

5.8 Manufacturing Cell Checkpoint

The first production checkpoint for a GEAR-style approach is not whether the agent is smart but whether the agent's learning loop fits inside factory governance.

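Check item	Pass criterion
Skill graph	Work primitives, preconditions, postconditions, and recovery paths are version-controlled
Feedback source	The priority among verifier, QA verdict, operator approval, and failure logs is defined
Reward guardrail	Reward candidates and hard constraints are approved separately
Human video policy	Recording scope, anonymization, retention period, and conditions for training use are documented
Deployment boundary	Changes the agent may propose are separated from changes a person must approve
Simulator/test harness	Agent code and policy updates pass Isaac or a verification harness before reaching the real cell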
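
The checkpoint table can be treated as an all-or-nothing gate. A minimal sketch, assuming each item is tracked as a boolean and using key names invented here from the table rows:

```python
# Hypothetical keys, one per row of the checkpoint table above.
CHECKPOINT_ITEMS = (
    "skill_graph",
    "feedback_source",
    "reward_guardrail",
    "human_video_policy",
    "deployment_boundary",
    "simulator_test_harness",
)


def missing_items(status: dict[str, bool]) -> list[str]:
    """Return checkpoint items that are absent or not yet passed, in table order."""
    return [item for item in CHECKPOINT_ITEMS if not status.get(item, False)]


def ready_for_cell(status: dict[str, bool]) -> bool:
    # The gate passes only when every item passes; an unreported item counts as failed.
    return not missing_items(status)
```

Treating an unreported item as failed is deliberate: a cell should not pass the checkpoint because someone forgot to fill in a row.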

5.9 What Comes Next

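This chapter laid out the map of embodied agent research and its relation to a manufacturing skill library. The latest GEAR work ties world models, coding-agent loops, human-video scaling, and simulation infrastructure into a single data flywheel. The next chapter returns to the physical world. However good the VLA and the agent loop are, if hands, tactile sensing, contact, and bimanual coordination are weak, automation of manual work stalls in front of the fixture.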

References

Linxi Fan et al. (2022). MineDojo: Building Open-Ended Embodied Agents with Internet-Scale Knowledge. arXiv preprint. https://arxiv.org/abs/2206.08853
Guanzhi Wang et al. (2023). Voyager: An Open-Ended Embodied Agent with Large Language Models. arXiv preprint. https://arxiv.org/abs/2305.16291
Yunfan Jiang et al. (2022). VIMA: General Robot Manipulation with Multimodal Prompts. arXiv preprint. https://arxiv.org/abs/2210.03094
Yecheng Jason Ma et al. (2023). Eureka: Human-Level Reward Design via Coding Large Language Models. arXiv preprint. https://arxiv.org/abs/2310.12931
Yecheng Jason Ma et al. (2024). DrEureka: Language Model Guided Sim-To-Real Transfer. arXiv preprint. https://arxiv.org/abs/2406.01967
Shenyuan Gao et al. (2026). DreamDojo: A Generalist Robot World Model from Large-Scale Human Videos. arXiv preprint. https://arxiv.org/abs/2602.06949
Ruijie Zheng et al. (2026). EgoScale: Scaling Dexterous Manipulation with Diverse Egocentric Human Data. arXiv preprint. https://arxiv.org/abs/2602.16710
Johan Bjorck et al. (2025). GR00T N1: An Open Foundation Model for Generalist Humanoid Robots. arXiv preprint. https://arxiv.org/abs/2503.14734
Yongchao Chen et al. (2025). Code-as-Symbolic-Planner: Foundation Model-Based Robot Planning via Symbolic Code Generation. IROS. https://arxiv.org/abs/2503.01700
Karl Pertsch et al. (2024). Towards Generalist Robot Policies: What Matters in Building Vision-Language-Action Models. arXiv preprint. https://arxiv.org/abs/2412.14058
NVIDIA GEAR (2026). Publications. NVIDIA Research. https://research.nvidia.com/labs/gear/publications/
Joel Jang et al. (2025). DreamGen: Unlocking Generalization in Robot Learning through Video World Models. arXiv preprint. https://arxiv.org/abs/2505.12705
Ruijie Zheng et al. (2025). FLARE: Robot Learning with Implicit World Modeling. arXiv preprint. https://arxiv.org/abs/2505.15659
Seonghyeon Ye et al. (2026). World Action Models are Zero-shot Policies. arXiv preprint. https://arxiv.org/abs/2602.15922
Max Fu et al. (2026). CaP-X: A Framework for Benchmarking and Improving Coding Agents for Robot Manipulation. arXiv preprint. https://arxiv.org/abs/2603.22435
Wenli Xiao et al. (2025). Self-Improving Vision-Language-Action Models with Data Generation via Residual RL. arXiv preprint. https://arxiv.org/abs/2511.00091
NVIDIA et al. (2025). Isaac Lab: A GPU-Accelerated Simulation Framework for Multi-Modal Robot Learning. arXiv preprint. https://arxiv.org/abs/2511.04831
Jiefeng Li et al. (2025). GENMO: A GENeralist Model for Human MOtion. arXiv preprint. https://arxiv.org/abs/2505.01425
NVIDIA SOMA team (2026). SOMA: Unifying Parametric Human Body Models. arXiv preprint. https://arxiv.org/abs/2603.16858
BONES Studio (2026). BONES-SEED: Skeletal Everyday Embodiment Dataset. Hugging Face dataset. https://huggingface.co/datasets/bones-studio/seed
Davis Rempe et al. (2026). Kimodo: Scaling Controllable Human Motion Generation. arXiv preprint. https://arxiv.org/abs/2603.15546
Tingwu Wang et al. (2026). MotionBricks: Scalable Real-Time Motions with Modular Latent Generative Model and Smart Primitives. ACM Transactions on Graphics / arXiv. https://arxiv.org/abs/2604.24833
Tianyi Xie et al. (2026). GRAIL: Generating Humanoid Loco-Manipulation from 3D Assets and Video Priors. arXiv preprint. https://arxiv.org/abs/2606.05160
Zhengyi Luo et al. (2026). SONIC: Supersizing Motion Tracking for Natural Humanoid Whole-Body Control. arXiv preprint. https://arxiv.org/abs/2511.07820