The Egocentric Gold Rush: Who's Striking It Rich, and How?
Four industries are betting big on first-person data. Only some of them are right.
This is Part II of a three-part series. Part I covered why egocentric data matters. Part III will ask: can it scale to general intelligence, or will it collapse into a race to the bottom?
In Part I, we established a simple thesis: text taught AI to speak, egocentric data will teach it to act. First-person video encodes intent, affordances, and context: the verbs and stories that LLMs never learned.
But here’s what I didn’t say: most of the egocentric data being collected right now is worthless.
I've watched teams strap GoPros to workers' heads, record a few hundred hours, dump it on hard drives, and call it a data strategy. No annotations. No task labels. No outcome tracking. Just terabytes of shaky first-person footage that teaches a model nothing. They're collecting volume when they should be collecting signal.
The gap between “we have egocentric data” and “we have egocentric data that’s actually useful” is the entire game. And it plays out very differently depending on the industry.
Let’s walk through four sectors where the gold rush is real, and be honest about who’s finding gold and who’s just digging holes.
1. Surgery: The Highest-Value Data on Earth
Surgery is the clearest case for egocentric data, and it’s not even close.
A third-person camera in an operating room shows you a procedure happening. An egocentric camera on a surgeon’s head shows you how a master thinks with their hands. Where their eyes linger before an incision. How they adjust grip when tissue tension changes. The micro-hesitation before they commit to a cut.
This isn’t abstract. The EgoSurgery-Phase dataset captured 15 hours of real open surgery from head-mounted cameras with eye gaze tracking, encoding not just what the surgeon did, but where they were looking when they did it. EgoExOR, presented at NeurIPS 2025, went further: fusing first-person and third-person views of spine procedures with gaze, hand pose, and spatial relationship annotations across 84,000 frames. Meanwhile, teams are using AR headsets to collect surgical robot training data 41% faster than real-robot setups.
Why does this matter commercially? Because surgical training is broken. It takes 5–7 years to train a surgeon, and most of that training is apprenticeship: first watching, then doing, then being corrected. Egocentric data compresses that loop. Instead of standing behind a master surgeon and hoping you catch what they caught, you literally see through their eyes. And unlike a human mentor, the data doesn’t retire, burn out, or move to a different hospital.
Would you trust a surgeon trained entirely on textbooks, with no time in the OR? That's what we're building when we skip the first-person view.
But here’s where people get it wrong. An hour of surgical footage with no annotations is a liability, not an asset. It sits on a server, costs money to store, and teaches nothing. The datasets that matter pair video with phase labels, gaze data, hand tracking, and post-operative outcomes. That last part, tying the footage to whether the patient actually got better, is what separates a research curiosity from a training revolution.
The hard truth: most surgical video being recorded today has none of this. The data is there. The annotation isn’t.
What makes it real: Outcome-linked annotations, multimodal capture (gaze + hand + video), and integration into actual training workflows. Not just footage on a server.
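To make the difference between footage and a dataset concrete, here is a minimal sketch of what an annotated clip record might look like. The schema, the field names, and the `missing_annotations` helper are all hypothetical, not taken from EgoSurgery-Phase, EgoExOR, or any other real dataset; they simply mirror the four annotation types named above.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SurgicalClip:
    """One egocentric clip plus its annotations (hypothetical schema)."""
    video_path: str
    phase_labels: List[str] = field(default_factory=list)  # e.g. "incision"
    has_gaze: bool = False         # eye-gaze track recorded with the video
    has_hand_pose: bool = False    # hand tracking recorded with the video
    outcome: Optional[str] = None  # linked post-operative outcome, if any


def missing_annotations(clip: SurgicalClip) -> List[str]:
    """Return the names of the annotation types this clip lacks."""
    missing = []
    if not clip.phase_labels:
        missing.append("phase_labels")
    if not clip.has_gaze:
        missing.append("gaze")
    if not clip.has_hand_pose:
        missing.append("hand_pose")
    if clip.outcome is None:
        missing.append("outcome")
    return missing


# Raw footage on a server: a video file and nothing else.
raw = SurgicalClip(video_path="clip_001.mp4")

# The same kind of footage after annotation and outcome linkage.
labeled = SurgicalClip(
    video_path="clip_002.mp4",
    phase_labels=["incision", "closure"],
    has_gaze=True,
    has_hand_pose=True,
    outcome="recovered",
)

print(missing_annotations(raw))
print(missing_annotations(labeled))
```

The raw clip fails every check and the labeled clip fails none. A real pipeline would store the gaze and hand tracks themselves rather than boolean flags; the flags just keep the sketch short.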
2. Manufacturing & Warehousing: The Quiet Goldmine
If surgery is the highest-value play, manufacturing is the most underestimated.
There are roughly 10 million manufacturing facilities worldwide. The vast majority run on human labor doing repetitive manipulation tasks like picking, placing, assembling, inspecting, and packing. Every one of those workers is generating potential training data with their hands, eight hours a day, five days a week.
This is where the robotics data crisis is most acute. The "bitter lesson" behind every recent AI breakthrough — that sheer scale of data beats clever algorithms — has hit a wall in robotics. You can’t scrape manipulation data from the internet the way OpenAI scraped text. Robot teleoperation (humans remote-controlling robots to record demonstrations) is expensive, slow, and chained to specific hardware.
So the field is turning to humans instead. Apple’s EgoDex dataset was a landmark: 829 hours of people doing 194 different tabletop tasks, including tying shoelaces, unscrewing caps, and flipping book pages, all captured through Vision Pro with full 3D tracking of every finger joint. The whole point was to show that you can learn manipulation from watching people, not robots.
The MECCANO and ENIGMA-51 datasets took this logic straight to the factory floor, capturing egocentric footage of workers assembling objects and using tools in industrial settings. Egocentric.io claims 92% of their factory footage is directly mappable from human actions to robot actions, compared to roughly 60% for Ego4D and just 12% for general assembly datasets.
Here’s why manufacturing is such fertile ground:
A worker on an assembly line picks and places the same component hundreds of times a day. That’s not noise, that’s the richest manipulation dataset you could ask for. The environments are controlled (consistent lighting, known objects, fixed layouts), which slashes annotation costs. And there’s a direct buyer for the output: robotics labs building foundation models for warehouse and factory automation — a market projected at $35 billion.
Where people get it wrong: Treating collection as a side project. Bolt a camera to a hard hat, record a few shifts, toss it on a drive. But the difference between usable and unusable factory data comes down to calibration, frame rate, and whether you captured depth alongside RGB. A shaky 720p stream of someone’s forehead is not a dataset. It’s a GoPro reel.
What makes it real: High task density, controlled environments, and a direct buyer — robotics labs that will pay for quality manipulation data today, not someday.
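The capture criteria above (calibration, frame rate, depth alongside RGB) can be checked before a single frame is annotated. The following is a rough sketch under stated assumptions: the `CaptureMetadata` fields are hypothetical, and the numeric thresholds are illustrative placeholders, since the criteria are named here but no cutoffs are given. The resolution check is an added assumption prompted by the 720p example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CaptureMetadata:
    """Per-recording metadata (hypothetical fields)."""
    fps: float
    height_px: int
    is_calibrated: bool  # camera intrinsics/extrinsics were measured
    has_depth: bool      # depth stream captured alongside RGB


# Illustrative thresholds only; pick values that suit your own pipeline.
MIN_FPS = 30.0
MIN_HEIGHT_PX = 1080


def capture_problems(meta: CaptureMetadata) -> List[str]:
    """List the reasons a recording would be rejected; empty means usable."""
    problems = []
    if not meta.is_calibrated:
        problems.append("uncalibrated camera")
    if meta.fps < MIN_FPS:
        problems.append(f"frame rate below {MIN_FPS:g} fps")
    if meta.height_px < MIN_HEIGHT_PX:
        problems.append(f"resolution below {MIN_HEIGHT_PX}p")
    if not meta.has_depth:
        problems.append("no depth stream")
    return problems


# A camera bolted to a hard hat with no setup work.
hard_hat = CaptureMetadata(fps=24.0, height_px=720,
                           is_calibrated=False, has_depth=False)

# A purpose-built rig with calibration and a depth sensor.
rig = CaptureMetadata(fps=60.0, height_px=1080,
                      is_calibrated=True, has_depth=True)

print(capture_problems(hard_hat))
print(capture_problems(rig))
```

Running a gate like this at ingest time is cheap, and it stops unusable footage from reaching the expensive annotation stage.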
3. Construction & Field Service: The Trust Gap
Here’s where the tension starts to build.
Construction accounts for over 20% of worker fatalities in the United States. It’s one of the least digitized industries on Earth. The need for egocentric data for safety monitoring, skills transfer, and quality inspection is enormous. The willingness to adopt it is almost zero.
The technical case is strong. Fixed cameras on construction sites have fundamental blind spots. Workers move through chaotic, shifting environments where hazards change hourly. A security camera can tell you someone wasn’t wearing a hard hat. An egocentric camera on a worker’s helmet can tell you why they took it off: maybe a confined space, heat exhaustion, or an adjustment after crawling under scaffolding. The EgoSafe framework demonstrated exactly this: helmet-mounted cameras detecting tripping hazards from the worker’s own perspective, catching things fixed cameras would never see.
Safe-Construct, presented at CVPR 2025, is pushing toward egocentric and 360° vision as the next layer of construction site AI. The wearable technology market for construction is projected to hit $7.3 billion by 2030. On paper, everything lines up.
But construction isn’t a paper problem. It’s a people problem.
This is where the gold rush runs into a wall. Construction workers, reasonably, don’t want cameras on their heads. Unions have opinions about surveillance. Site managers worry about liability when footage captures a safety violation. The workers generating the most valuable data (experienced tradespeople with decades of knowledge in their hands) are exactly the ones most skeptical of the technology.
And the stakes for skills transfer are real. The average age of an electrician in the U.S. is 43. Plumbers, welders, and HVAC technicians are all aging, and their trades all face severe labor shortages. The knowledge lives in their hands and eyes, and it’s walking out the door every day. Egocentric data could capture it. But only if someone solves the consent, incentive, and data ownership puzzle first.
The companies that crack construction won’t be the ones with the best computer vision. They’ll be the ones who figure out how to make a 55-year-old ironworker actually want to put the helmet on.
What makes it real: Labor shortages create genuine urgency for knowledge capture. But the unlock is trust and incentives, not better computer vision.
4. The Home: Everyone’s Pitch Deck, Nobody’s Reality
Every robotics company ends their deck the same way: a humanoid folding laundry in a sunlit kitchen.

It’s compelling. It’s also where the gap between aspiration and reality is widest.
The home is the hardest environment for robots, and it's not even close. Physical Intelligence wrote a great piece reframing Moravec's paradox as a data problem: AI can solve gold-medal math olympiad questions but can't spread peanut butter on bread, not because spreading is "harder," but because there's no internet-scale dataset of peanut butter spreading to learn from. The home is Moravec's paradox at full volume.
In Part I, I showed myself chopping an avocado from a first-person view: the rhythm, the wrist angle, the pause to clear the board. Now multiply that by every task in a household. A robot needs thousands of those demonstrations, and every home is different. Different layouts, different objects, different lighting, different humans who all disagree about how the towel should be folded. A factory workstation has maybe 50 components. Your kitchen has 500 objects, half of which are in the wrong drawer.
This is exactly why 1X Technologies is deploying its $20,000 NEO into homes through teleoperation first, not because autonomy is impossible, but because they need the data. As their CEO told the Wall Street Journal: if we don’t have your data, we can’t make the product better. Sunday Robotics took a different route, shipping over 2,000 “Skill Capture Gloves” to households, converting everyday hand movements into robot training data. LG showed its CLOiD robot retrieving milk from a fridge and folding clothes at CES 2026.
The research supports this path. EgoMimic, out of Georgia Tech, showed that adding one hour of human egocentric data improves manipulation performance more than adding one more hour of robot teleoperation data. Apple’s EgoDex confirmed the scaling law: more diverse human hand data reliably improves robot policy performance. EPIC-KITCHENS, one of the longest-running egocentric datasets, has given the field years of cooking and kitchen data, but it was never designed for robot transfer, which is why purpose-built datasets are now taking the lead.
The home robotics market is already $20 billion and growing at 15% annually. The demand is real. The hardware is arriving. The missing piece isn’t the robot; it’s the map of human domestic life that would let the robot actually help.
And here’s the uncomfortable truth: that map requires cameras inside people’s homes. Not for a day. For months. Watching how you cook, clean, organize, and live. Every company in this space will live or die by how they handle data consent, storage, and anonymization. The most sophisticated manipulation model in the world is worthless if no one trusts you enough to generate the training data.
Would you let a camera watch you cook, clean, and live for six months straight — if it meant never doing dishes again?
What makes it real: A $20B+ market with clear consumer demand. But the data moat requires solving privacy at a scale no one has cracked yet.
The Pattern No One’s Talking About
Zoom out across these four sectors and something becomes clear:
The value of egocentric data is inversely proportional to how easy it is to collect.
Surgery data is the most valuable per hour and the hardest to get. Factory data is high-value and increasingly collectible. Construction is high-need but stuck behind a trust wall. Home data is the ultimate prize and the ultimate privacy minefield.
This is the tension at the heart of the egocentric gold rush. The industries where the data would be most transformative are exactly the ones where collection is slowest. And the industries where collection is easiest? Well, “easiest” is relative when your “easy” case still requires custom hardware, calibrated sensors, task annotations, and consent frameworks.
Raw FPV footage is already becoming a commodity. You can record thousands of hours of first-person video for almost nothing.
But volume without structure is just surveillance.
The companies that win this race won’t be the ones with the most hours on a hard drive. They’ll be the ones who:
Embed into real workflows — make collection invisible and make the value obvious to the person wearing the camera.
Annotate at depth — pair footage with gaze, hand pose, 3D scene understanding, and task outcomes. Video without labels is noise.
Build trust rails — consent frameworks, data ownership models, and anonymization that actually work.
Miss any one of these three and you’re not building a data moat. You’re just burning storage.
What’s Next
We’ve covered the why (Part I) and the where (this piece). The biggest question remains: can egocentric data scale to general-purpose intelligence, or will it fragment into a thousand vertical silos?
Part III will make that call.
Thanks for reading. Subscribe for Part III.

Great read, with valuable information about the real bottlenecks in egocentric data collection. I agree the long‑term goal should be to get as close as possible to well-labeled, outcome‑linked egocentric data.
I’m not an expert in this space, but do we mostly write off unannotated egocentric footage that exists today? My intuition is that with newer training approaches (self-supervised learning, better VLMs, etc.), messy egocentric footage can be mined for patterns over time, especially in repetitive settings like kitchens and factories.
First of all - fire read. I do think one big hurdle is missing in your argument about construction. You explained that trust and incentives are the main hurdle. However, I think the primary hurdle is systemic: the economic and structural features of the construction industry. Construction has brutally thin margins, is very dependent on contracted workers, and has a large undocumented workforce. This makes not collecting egocentric data a very rational choice.
If they did collect data - 1. The compliance headache and costs for companies would increase massively, costs which fall very low on the priority list given that they already have thin margins. 2. The training and re-training costs would be super high for those who want their employees to capture good data (due to contracted employee churn). 3. It exposes undocumented workers to a risk of on-camera evidence. With that lens, I think breaking into the construction industry to collect egocentric data is much more than a trust issue, but more of an economic/structural issue. Let me know your thoughts.