Part II: Frontier Robot Manipulation

Chapter 5: Jim Fan and GEAR — A Map of Generalist Embodied Agents

Written: 2026-06-08
Last updated: 2026-06-24

Overview

Jim Fan's work and the GEAR line are best read as one long-running program for generalist embodied agents, not as a list of isolated robot papers. MineDojo and Voyager contribute open-ended environments, automatic curricula, and executable skill libraries; VIMA brings multimodal prompts into manipulation; Eureka and DrEureka automate reward design and sim-to-real parameter design ^[1]; ^[2]; ^[3]; ^[4]; ^[5]. The 2025-2026 GEAR publications extend this program in two directions. The first covers data generation, world/action models, coding-agent robotics, and simulation infrastructure through DreamGen, FLARE, DreamDojo, DreamZero, CaP-X, SCIZOR, PLD, and Isaac Lab. The second covers human-motion-to-humanoid-control infrastructure through SOMA, BONES-SEED, GENMO, Kimodo, MotionBricks, SONIC, and GRAIL ^[11]; ^[18]; ^[20]; ^[21]; ^[22]; ^[23]; ^[24].

For manufacturing robotics, the lesson is not Minecraft performance or benchmark novelty. The lesson is a method for turning tacit operator knowledge into skill graphs, rewards, world models, human-video pretraining, and failure feedback.

After reading this chapter, you will be able to:
- Organize the GEAR lineage around open-ended agents, reward automation, world/action models, human-video scaling, and coding-agent robotics.
- Explain where human-motion representation, controllable motion generation, whole-body tracking, and synthetic task demonstrations sit in the GEAR flywheel.
- Explain the indirect manufacturing lessons from MineDojo, Voyager, and VIMA.
- Judge why Eureka and DrEureka matter for factory-cell reward and domain-randomization design.
- Separate the guardrails and data ownership required before GEAR-style research enters production cells.

Figure 5.1: Code-as-Policies style language-model-generated robot skills. source: S3 reused figure

5.1 Translating the GEAR Map Into Manufacturing Language

The core GEAR question is how to move beyond an agent that solves one task in one environment toward an agent that explores new goals, creates skills, and repairs itself through feedback. MineDojo combined open-ended tasks, internet-scale knowledge, and video-language reward ^[1]. Voyager showed an LLM-driven loop that builds an automatic curriculum, stores executable code skills, and feeds execution failures back into prompting ^[2].
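The loop described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not Voyager's implementation: `propose_code` stands in for an LLM call and `run_in_env` for an environment, both hypothetical stubs, and the curriculum is a fixed task list rather than an automatically generated one.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical stand-ins: a real system would call an LLM and an environment here.
ProposeFn = Callable[[str, List[str]], str]      # (task, error history) -> candidate code
ExecuteFn = Callable[[str], Tuple[bool, str]]    # candidate code -> (success, error message)


def learn_skill(task: str, propose_code: ProposeFn, run_in_env: ExecuteFn,
                library: Dict[str, str], max_attempts: int = 3) -> Optional[str]:
    """Try to acquire one skill, feeding each execution error into the next proposal."""
    errors: List[str] = []
    for _ in range(max_attempts):
        code = propose_code(task, errors)
        succeeded, message = run_in_env(code)
        if succeeded:
            library[task] = code      # store the executable skill for later reuse
            return code
        errors.append(message)        # failure feedback shapes the next attempt
    return None


def run_curriculum(tasks: List[str], propose_code: ProposeFn,
                   run_in_env: ExecuteFn) -> Dict[str, str]:
    """Work through an ordered task list, keeping only skills that executed successfully."""
    library: Dict[str, str] = {}
    for task in tasks:
        learn_skill(task, propose_code, run_in_env, library)
    return library
```

The design point is that the library only ever contains code that has actually run successfully, and that every failed attempt leaves a message behind for the next proposal.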

A factory is narrower than Minecraft, but the cost of failure is much higher. Manufacturing adoption therefore needs bounded open-endedness. The agent should not freely try new behaviors; it should explore variants inside an approved skill graph and leave logs that operators, safety teams, and quality teams can understand.

GEAR axis	Meaning in research	Manufacturing translation
Open-ended environment	Many goals and long horizons	Product variants, fixture variants, rework paths
Skill library	Accumulated executable code or policies	Approved task primitives and recovery recipes
Automated feedback	Verifier, execution error, self-correction	Quality judgment, safety interlock, operator approval
Data flywheel	Video, simulation, human feedback	Field demonstrations, failure logs, QA images
World/action model	Joint modeling of future state and action	Policy evaluation, synthetic rollout, failure prediction
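
Bounded open-endedness can be made concrete as exploration over an approved skill graph. The sketch below is illustrative only: the skill names and edges in `APPROVED_GRAPH` are hypothetical, and a real cell would load them from a versioned, reviewed source.

```python
from typing import Dict, List, Set

# Hypothetical approved skill graph: each key lists the skills allowed to follow it.
APPROVED_GRAPH: Dict[str, Set[str]] = {
    "pick_part": {"inspect_part"},
    "inspect_part": {"insert_part", "route_to_rework"},
    "insert_part": {"verify_seating"},
    "verify_seating": set(),
    "route_to_rework": set(),
}


def enumerate_variants(graph: Dict[str, Set[str]], start: str,
                       max_len: int = 6) -> List[List[str]]:
    """List every sequence from `start` to a terminal skill that uses approved edges only."""
    if start not in graph:
        raise KeyError(f"unknown skill: {start}")
    complete: List[List[str]] = []
    stack: List[List[str]] = [[start]]
    while stack:
        path = stack.pop()
        successors = sorted(graph[path[-1]])
        if not successors:
            complete.append(path)     # reached a terminal skill
            continue
        if len(path) >= max_len:
            continue                  # drop paths that exceed the length budget
        for nxt in successors:
            stack.append(path + [nxt])
    return sorted(complete)


def is_approved(graph: Dict[str, Set[str]], sequence: List[str]) -> bool:
    """True only if the sequence starts at a known skill and every step follows an approved edge."""
    if not sequence or sequence[0] not in graph:
        return False
    return all(b in graph.get(a, set()) for a, b in zip(sequence, sequence[1:]))
```

An agent restricted to `enumerate_variants` can still explore, but every sequence it tries is one that operators could have read and approved in advance.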

5.2 What MineDojo, Voyager, and VIMA Indirectly Teach

MineDojo and Voyager cannot be moved directly into a factory. They do show why physical AI should be treated as skill-library operations rather than model procurement ^[1]; ^[2]. A manufacturing skill should be versioned like code, with each version recording when the skill can be called, which sensor checks it performs, where control goes on failure, and which model version was approved.
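
A versioned skill record along these lines might look like the following. It is a minimal sketch: the field names and the `can_call` rule are assumptions chosen for illustration, not a schema from any GEAR paper.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple


@dataclass(frozen=True)
class SkillVersion:
    """One approved, immutable version of a manufacturing skill (hypothetical schema)."""
    name: str
    version: str
    preconditions: Tuple[str, ...]   # cell facts that must hold before the skill is called
    sensor_checks: Tuple[str, ...]   # checks the skill performs while running
    on_failure: str                  # recovery skill that takes over on failure
    approved_model: str              # model version that was signed off for this skill


def can_call(skill: SkillVersion, cell_facts: FrozenSet[str], deployed_model: str) -> bool:
    """Allow the call only with the approved model and with every precondition satisfied."""
    if deployed_model != skill.approved_model:
        return False
    return all(fact in cell_facts for fact in skill.preconditions)
```

Freezing the dataclass mirrors the versioning idea: changing a precondition or the approved model means creating a new version, not editing the old one in place.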

VIMA showed early that robot task specification is not only language ^[3]. Factory instructions mix text work standards, CAD geometry, camera views, fixture poses, previous successful cases, and defect sample images. A useful manufacturing agent is therefore closer to a multimodal prompt interpreter than a natural-language command parser.

Figure 5.2: Robot manipulation as an execute-debug loop inspired by coding agents. source: S3 reused figure

5.3 Eureka and DrEureka: Reward Automation for Factories

Eureka showed that an LLM can generate and iteratively improve reward code, outperforming human-designed rewards across many RL tasks ^[4]. DrEureka extended that idea by using LLM guidance for both rewards and domain-randomization settings in sim-to-real transfer ^[5].

Reward automation is attractive in a factory, but it is also dangerous. A reward that improves insertion success may scratch a surface or overload a fixture. A cycle-time reward can push motion past safe speed limits. A yield reward can reject ambiguous parts too aggressively and reduce throughput.

Factory rewards should therefore be generated automatically but approved by people. An LLM can propose reward candidates and randomization ranges, but quality and safety teams must freeze forbidden states, force limits, inspection rules, and rework costs as separate constraints. A reward is an objective to optimize; a constraint is a production responsibility that optimization must not trade away.
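
The separation between proposed rewards and frozen constraints can be sketched as a screening step. Everything named here is hypothetical: the state keys, the numeric limits, and the candidate names are placeholders, and the rollouts are assumed to come from policies trained under each reward candidate.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

State = Dict[str, float]


@dataclass(frozen=True)
class HardConstraint:
    name: str
    violated: Callable[[State], bool]


# Hypothetical limits; in practice, quality and safety teams own and freeze these.
FROZEN_CONSTRAINTS = [
    HardConstraint("force_limit", lambda s: s["peak_force_n"] > 40.0),
    HardConstraint("speed_limit", lambda s: s["tcp_speed_mps"] > 0.25),
]


def violated_constraints(rollout: List[State],
                         constraints: List[HardConstraint]) -> List[str]:
    """Names of every hard constraint broken anywhere in the rollout, in constraint order."""
    return [c.name for c in constraints if any(c.violated(s) for s in rollout)]


def screen_candidates(rollouts_by_candidate: Dict[str, List[State]],
                      constraints: List[HardConstraint] = FROZEN_CONSTRAINTS
                      ) -> Dict[str, List[str]]:
    """Map each reward candidate to the constraints it broke; an empty list means
    the candidate may proceed to human approval, not that it is approved."""
    return {name: violated_constraints(rollout, constraints)
            for name, rollout in rollouts_by_candidate.items()}
```

Note that the reward value never enters the screen. A candidate with an excellent score is still rejected if its rollout breaks a frozen constraint.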

5.4 DreamGen, FLARE, DreamDojo, DreamZero: When World Models Become Policies

The largest recent shift in GEAR is that world models have moved from background infrastructure into the center of the robot policy stack. DreamGen uses video world models to generate synthetic robot trajectories and recover pseudo-action sequences, expanding beyond limited teleoperation data ^[12]. FLARE adds future-latent prediction tokens to a VLA-style diffusion-transformer policy so the policy can reason about longer-term consequences ^[13].

DreamDojo and DreamZero push that direction further. DreamDojo learns interaction and dexterous control from 44k hours of egocentric human video, then post-trains with robot data for live teleoperation, policy evaluation, and model-based planning ^[6]. DreamZero interprets a world action model as a zero-shot policy: by jointly modeling future world states and actions with a video diffusion backbone, it targets the physical-motion generalization that standard VLAs often lack ^[14].

For manufacturing, this connects directly to the Cosmos/Isaac story. A factory cannot repeat every failure in the real cell. A world model can connect approved simulation assets with field video to ask what a recovery may do to a part, fixture, or surrounding process before trying it physically. But a generated world model is not a quality inspector. Production approval still needs real sensor logs, force/tactile traces, and final inspection data.
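
The distinction between screening with a world model and approving for production can be expressed as a gate with only two outcomes. This is a sketch under assumptions: `predict_peak_force` is a hypothetical interface to a world model, and peak force is just one example of a predicted risk signal.

```python
from typing import Callable, List

# Hypothetical world-model interface: (recovery id, sample index) -> predicted peak force in newtons.
PredictFn = Callable[[str, int], float]


def screen_recovery(recovery_id: str, predict_peak_force: PredictFn,
                    force_limit_n: float, n_samples: int = 16) -> str:
    """Block a recovery if any imagined rollout exceeds the limit;
    otherwise pass it on for validation with real sensors."""
    if n_samples < 1:
        raise ValueError("n_samples must be at least 1")
    predictions: List[float] = [predict_peak_force(recovery_id, i) for i in range(n_samples)]
    if max(predictions) > force_limit_n:
        return "blocked"
    # Passing the imagined rollouts is never an approval by itself.
    return "needs_physical_validation"
```

The function deliberately has no "approved" return value: the generated rollouts can rule a recovery out, but only real sensor logs and inspection data can rule it in.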

5.5 CaP-X and Coding-Agent Robotics

CaP-X adds another axis to the GEAR map. Code-as-Policy differs from a large learned action policy: it asks a model to generate executable programs that compose perception and control primitives. CaP-X introduces CaP-Gym, CaP-Bench, CaP-Agent0, and CaP-RL to study how much coding agents depend on designer scaffolding, and how multi-turn execution feedback and automatic skill synthesis recover robustness ^[15].

This is especially practical in factory cells. Instead of letting a VLA own every action, a manufacturer can expose verified primitives for moving to inspection poses, querying CAD pose, checking force thresholds, and executing retry sequences. The key is not free-form code generation. It is iteration inside approved APIs, simulators, unit tests, and safety checkers.
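
One way to keep generated programs inside approved APIs is a static check before anything runs. The sketch below uses Python's standard `ast` module; the primitive names in `APPROVED_PRIMITIVES` are hypothetical, and this check is an assumption-laden illustration rather than the mechanism CaP-X uses. A static allowlist is also not a sandbox, so it would complement simulators, unit tests, and safety checkers rather than replace them.

```python
import ast
from typing import List, Set, Tuple

# Hypothetical verified primitives exposed to the coding agent.
APPROVED_PRIMITIVES: Set[str] = {
    "move_to_inspection_pose",
    "query_cad_pose",
    "check_force_threshold",
    "retry_sequence",
}


def find_violations(source: str, approved: Set[str] = APPROVED_PRIMITIVES) -> List[str]:
    """Report imports and any call that is not a direct call to an approved primitive."""
    tree = ast.parse(source)          # raises SyntaxError on malformed programs
    problems: List[Tuple[int, str]] = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            problems.append((node.lineno, "imports are not allowed"))
        elif isinstance(node, ast.Call):
            is_plain_name = isinstance(node.func, ast.Name)
            if not (is_plain_name and node.func.id in approved):
                problems.append((node.lineno, "call is not an approved primitive"))
    return [f"line {lineno}: {message}" for lineno, message in sorted(problems)]
```

A program that composes only the approved primitives returns an empty list, while something like `import os` followed by `os.system(...)` is flagged on both lines before it ever reaches a simulator.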

GEAR 2025-2026 line	Representative work	Place in the manufacturing stack
World-model data generation	DreamGen, DreamDojo	Synthetic rollout, policy evaluation, data augmentation
Implicit/action world model	FLARE, DreamZero	Future-state prediction and closed-loop action generation
Coding-agent robotics	CaP-X	Verifiable primitive composition and debug loop
Self-improving VLA	PLD	Failure-region probing and deployment-aligned data collection
Simulation infrastructure	Isaac Lab	Multi-modal robot learning and domain-randomization runtime

Adding the 2026 humanoid motion/control sources makes the middle of the GEAR map clearer. SOMA and BONES-SEED organize human motion into common skeletons and robot-compatible formats, while GENMO unifies human-motion estimation and generation from video, text, keypoints, music, and 3D keyframes ^[19]; ^[20]; ^[18]. Kimodo and MotionBricks generate reference motion with text, keyframe, waypoint, path, and contact constraints; SONIC and GRAIL then lower those references into whole-body tracking and task-level loco-manipulation demonstrations ^[21]; ^[22]; ^[24]; ^[23].

Additional layer	Representative work	Place in the manufacturing stack
Motion representation	SOMA, BONES-SEED, GENMO	Converts worker motion and human video into learnable intermediate representations
Reference motion generation	Kimodo, MotionBricks	Turns human-like motion into controllable humanoid references
Whole-body/task execution	SONIC, GRAIL	Lowers reference motion and 3D assets into humanoid control and task-demo candidates

5.6 EgoScale: Turning Human Video Into Robot Data

DreamDojo and EgoScale show that the GEAR line treats human video as a source for robot world models and dexterous motor priors, not merely as reference footage ^[6]; ^[7]. EgoScale pretrains a VLA on more than 20,854 hours of action-labeled egocentric human video, reports log-linear scaling between human-data scale and validation loss, and links that loss to real-robot performance ^[7]. Egocentric human video can be collected at scale, captures pre-contact and post-contact behavior, and contains diverse objects and situations.
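
A log-linear relationship of this kind can be written as follows. The functional form is an assumed reading of "log-linear", and the symbols are introduced here for illustration: $H$ is hours of human video, $L_{\text{val}}$ is validation loss, and $\alpha$, $\beta$ are fitted constants that are not taken from the paper.

```latex
L_{\text{val}}(H) \approx \alpha - \beta \log H, \qquad \beta > 0
\quad\Longrightarrow\quad
L_{\text{val}}(2H) - L_{\text{val}}(H) \approx -\beta \log 2
```

Under this form, each doubling of human-video hours lowers validation loss by the same fixed amount, which is why returns look steady on a logarithmic axis and diminishing on a linear one.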

Manufacturing manual work has the same opportunity. Skilled workers' hand motions, gaze shifts before defect detection, small cues that trigger rework, tool-grasp order, and temporary support motions are not captured in ordinary PLC logs. If that video is linked to task id, product lot, fixture version, and quality outcome, human-video pretraining can become a factory knowledge asset.

Privacy, security, process confidentiality, and worker consent come first. Applying GEAR-style scaling to manufacturing is less about installing more cameras and more about designing data governance. The company must specify which footage can train models, which footage is discarded, and which derived features may be retained.
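
A data-governance rule of this kind can be written down as a small, auditable function. The sketch below is hypothetical throughout: the record fields follow the linkage described above, but the three-way policy in `allowed_use` is an invented example, not a recommended legal or HR standard.

```python
from dataclasses import dataclass
from enum import Enum


class Use(Enum):
    TRAIN = "train"
    DERIVED_ONLY = "derived_features_only"
    DISCARD = "discard"


@dataclass(frozen=True)
class ClipRecord:
    """Metadata that links one video clip to production context (hypothetical schema)."""
    clip_id: str
    task_id: str
    product_lot: str
    fixture_version: str
    quality_outcome: str          # for example "pass", "rework", or "scrap"
    worker_consented: bool
    faces_anonymized: bool
    confidential_process: bool


def allowed_use(clip: ClipRecord) -> Use:
    """Example policy: no consent means discard; confidential or
    non-anonymized footage may only contribute derived features."""
    if not clip.worker_consented:
        return Use.DISCARD
    if clip.confidential_process or not clip.faces_anonymized:
        return Use.DERIVED_ONLY
    return Use.TRAIN
```

Writing the policy as code makes the order of the rules explicit: consent is checked first, so no later condition can bring unconsented footage back into training.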

5.7 Selection Criteria for Manufacturing

It would be misleading to connect every GEAR paper directly to a factory robot. A better filter is to ask which layer of a manufacturing agent stack each work strengthens. Voyager informs skill memory and debugging loops. VIMA informs multimodal instruction. Eureka and DrEureka inform reward and randomization. DreamDojo informs world models. EgoScale informs human-video motor priors. GR00T N1 informs humanoid execution interfaces ^[8].

Manufacturers do not need to adopt all layers at once. The first step is to build a task schema and skill library for one cell. The second is to connect simulation and real demonstrations. The third is to route failure feedback into rewards, randomization, and retraining queues. Only after that should an open-ended agent propose or automatically tune parts of cell operation.

5.8 Manufacturing Cell Checkpoint

The first production checkpoint for a GEAR-style approach is not whether the agent seems smart. It is whether the agent's learning loop fits inside factory governance.

Check	Passing condition
Skill graph	Task primitives, preconditions, postconditions, and recovery paths are versioned
Feedback source	Verifier, QA judgment, operator approval, and failure logs have a priority order
Reward guardrail	Reward candidates and hard constraints are separated and approved
Human-video policy	Capture scope, anonymization, retention, and training use are documented
Deployment boundary	Changes the agent may propose are separated from changes humans must approve
Simulator/test harness	Agent code and policy updates pass Isaac or another validation harness before the real cell
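
The "priority order" condition for feedback sources can be sketched as a resolver in which the highest-ranked source that has spoken decides. The order in `FEEDBACK_PRIORITY` is a hypothetical example; each plant would define and approve its own.

```python
from typing import Dict, Optional, Sequence

# Hypothetical priority order, highest first; each plant would set its own.
FEEDBACK_PRIORITY = ("operator_approval", "qa_judgment", "verifier", "failure_log")


def resolve_feedback(signals: Dict[str, bool],
                     priority: Sequence[str] = FEEDBACK_PRIORITY) -> Optional[bool]:
    """Return the verdict of the highest-priority source present, or None if no source reported."""
    unknown = set(signals) - set(priority)
    if unknown:
        # An unranked source is a governance gap, so fail loudly instead of guessing.
        raise ValueError(f"unranked feedback sources: {sorted(unknown)}")
    for source in priority:
        if source in signals:
            return signals[source]
    return None
```

With this order, a QA rejection overrides a passing verifier, and a source missing from the ranking raises an error instead of being silently ignored.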

5.9 What To Learn Next

This chapter mapped embodied-agent research to manufacturing skill-library design. The latest GEAR line ties world models, coding-agent loops, human-video scaling, and simulation infrastructure into one data flywheel. The next chapter returns to the physical layer. Even strong VLAs and agent loops stall at the fixture if hands, touch, contact, and bimanual coordination are weak.

References

[1] Linxi Fan et al. (2022). MineDojo: Building Open-Ended Embodied Agents with Internet-Scale Knowledge. arXiv preprint. https://arxiv.org/abs/2206.08853
[2] Guanzhi Wang et al. (2023). Voyager: An Open-Ended Embodied Agent with Large Language Models. arXiv preprint. https://arxiv.org/abs/2305.16291
[3] Yunfan Jiang et al. (2022). VIMA: General Robot Manipulation with Multimodal Prompts. arXiv preprint. https://arxiv.org/abs/2210.03094
[4] Yecheng Jason Ma et al. (2023). Eureka: Human-Level Reward Design via Coding Large Language Models. arXiv preprint. https://arxiv.org/abs/2310.12931
[5] Yecheng Jason Ma et al. (2024). DrEureka: Language Model Guided Sim-To-Real Transfer. arXiv preprint. https://arxiv.org/abs/2406.01967
[6] Shenyuan Gao et al. (2026). DreamDojo: A Generalist Robot World Model from Large-Scale Human Videos. arXiv preprint. https://arxiv.org/abs/2602.06949
[7] Ruijie Zheng et al. (2026). EgoScale: Scaling Dexterous Manipulation with Diverse Egocentric Human Data. arXiv preprint. https://arxiv.org/abs/2602.16710
[8] Johan Bjorck et al. (2025). GR00T N1: An Open Foundation Model for Generalist Humanoid Robots. arXiv preprint. https://arxiv.org/abs/2503.14734
[9] Yongchao Chen et al. (2025). Code-as-Symbolic-Planner: Foundation Model-Based Robot Planning via Symbolic Code Generation. IROS. https://arxiv.org/abs/2503.01700
[10] Xinghang Li et al. (2024). Towards Generalist Robot Policies: What Matters in Building Vision-Language-Action Models. arXiv preprint. https://arxiv.org/abs/2412.14058
[11] NVIDIA GEAR (2026). Publications. NVIDIA Research. https://research.nvidia.com/labs/gear/publications/
[12] Joel Jang et al. (2025). DreamGen: Unlocking Generalization in Robot Learning through Video World Models. arXiv preprint. https://arxiv.org/abs/2505.12705
[13] Ruijie Zheng et al. (2025). FLARE: Robot Learning with Implicit World Modeling. arXiv preprint. https://arxiv.org/abs/2505.15659
[14] Seonghyeon Ye et al. (2026). World Action Models are Zero-shot Policies. arXiv preprint. https://arxiv.org/abs/2602.15922
[15] Max Fu et al. (2026). CaP-X: A Framework for Benchmarking and Improving Coding Agents for Robot Manipulation. arXiv preprint. https://arxiv.org/abs/2603.22435
[16] Wenli Xiao et al. (2025). Self-Improving Vision-Language-Action Models with Data Generation via Residual RL. arXiv preprint. https://arxiv.org/abs/2511.00091
[17] NVIDIA et al. (2025). Isaac Lab: A GPU-Accelerated Simulation Framework for Multi-Modal Robot Learning. arXiv preprint. https://arxiv.org/abs/2511.04831
[18] Jiefeng Li et al. (2025). GENMO: A GENeralist Model for Human MOtion. arXiv preprint. https://arxiv.org/abs/2505.01425
[19] NVIDIA SOMA team (2026). SOMA: Unifying Parametric Human Body Models. arXiv preprint. https://arxiv.org/abs/2603.16858
[20] BONES Studio (2026). BONES-SEED: Skeletal Everyday Embodiment Dataset. Hugging Face dataset. https://huggingface.co/datasets/bones-studio/seed
[21] Davis Rempe et al. (2026). Kimodo: Scaling Controllable Human Motion Generation. arXiv preprint. https://arxiv.org/abs/2603.15546
[22] Tingwu Wang et al. (2026). MotionBricks: Scalable Real-Time Motions with Modular Latent Generative Model and Smart Primitives. ACM Transactions on Graphics / arXiv. https://arxiv.org/abs/2604.24833
[23] Tianyi Xie et al. (2026). GRAIL: Generating Humanoid Loco-Manipulation from 3D Assets and Video Priors. arXiv preprint. https://arxiv.org/abs/2606.05160
[24] Zhengyi Luo et al. (2026). SONIC: Supersizing Motion Tracking for Natural Humanoid Whole-Body Control. arXiv preprint. https://arxiv.org/abs/2511.07820