A delicate touch: teaching robots to handle the unknown
William Xie, a first-year PhD student in computer science, is using large language models (LLMs) to teach a robot to reason about how gently it should grasp objects it has never encountered before.
DeliGrasp, Xie's project, is an intriguing step beyond the custom, piecemeal solutions currently used to avoid pinching or crushing novel objects.
In addition, DeliGrasp helps the robot translate what it can 'touch' into meaningful information for people.
"William has gotten some neat results by leveraging common sense information from large language models. For example, the robot can estimate and explain the ripeness of various fruits after touching them," said his advisor, Professor Nikolaus Correll.
Let's learn more about DeliGrasp, Xie's journey to robotics, and his plans for the conference in Japan and beyond.
[video:https://www.youtube.com/watch?v=OMzTgY1gxLw]
How would you describe this research?
As humans, we’re able to quickly intuit how exactly we need to pick up a variety of objects, including delicate produce or unwieldy, heavy objects. We’re informed by the visual appearance of an object, what prior knowledge we may have about it, and most importantly, how it feels to the touch when we initially grasp it.
Robots don’t have this all-encompassing intuition though, and they don’t have end-effectors (grippers/hands) as effective as human hands. So solutions are piecemeal: the community has researched “hands” across the spectrum of mechanical construction, sensing capabilities (tactile, force, vibration, velocity), material (soft, rigid, hybrid, woven, etc…). And then the corresponding machine learning models and/or control methods to enable “appropriately forceful” gripping are bespoke for each of these architectures.
Embedded in LLMs, which are trained on an internet’s worth of data, is common-sense physical reasoning that crudely approximates a human’s (as the saying goes: “all models are wrong, some are useful”). We use the LLM-estimated mass and friction to simplify the grasp controller and deploy it on a two-finger gripper, a prevalent and relatively simple architecture. Key to the controller working is the force feedback sensed by the gripper as it grasps an object, and knowing at what force threshold to stop. The LLM-estimated values directly determine this threshold for any arbitrary object, and our initial results are quite promising.
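To make the idea of a force threshold concrete, here is a minimal sketch in Python. It is not the DeliGrasp controller: it assumes a textbook Coulomb friction model for a two-finger pinch grasp (each finger must press with at least m·g / (2·μ) to keep the object from slipping), and every name in it (`ObjectEstimate`, `min_grip_force`, `close_until_threshold`, `FakeGripper`) is hypothetical. The mass and friction values stand in for numbers an LLM might estimate from a text description.

```python
from dataclasses import dataclass
from typing import Callable

G = 9.81  # gravitational acceleration, m/s^2


@dataclass
class ObjectEstimate:
    """Hypothetical container for LLM-estimated object properties."""
    mass_kg: float
    friction_coeff: float


def min_grip_force(est: ObjectEstimate, safety_factor: float = 1.0) -> float:
    """Per-finger normal force (N) that just prevents slip.

    Assumption: two opposing fingers with Coulomb friction,
    so 2 * mu * F >= m * g.
    """
    if est.mass_kg <= 0 or est.friction_coeff <= 0:
        raise ValueError("mass and friction must be positive")
    return safety_factor * est.mass_kg * G / (2.0 * est.friction_coeff)


def close_until_threshold(
    read_force: Callable[[], float],
    step_close: Callable[[], None],
    threshold_n: float,
    max_steps: int = 200,
) -> int:
    """Close the gripper in small steps and stop once the sensed
    force reaches the threshold. Returns the number of steps taken."""
    for step in range(max_steps):
        if read_force() >= threshold_n:
            return step
        step_close()
    return max_steps


class FakeGripper:
    """Simulated gripper: no force until contact, then force grows
    linearly with each closing step (a stand-in for real hardware)."""

    def __init__(self, contact_step: int, newtons_per_step: float) -> None:
        self.position = 0
        self.contact_step = contact_step
        self.newtons_per_step = newtons_per_step

    def read_force(self) -> float:
        return self.newtons_per_step * max(0, self.position - self.contact_step)

    def step_close(self) -> None:
        self.position += 1


# Example values only: a 200 g object with a friction coefficient of 0.5.
estimate = ObjectEstimate(mass_kg=0.2, friction_coeff=0.5)
threshold = min_grip_force(estimate)
gripper = FakeGripper(contact_step=10, newtons_per_step=0.5)
steps = close_until_threshold(gripper.read_force, gripper.step_close, threshold)
print(f"threshold = {threshold:.3f} N, stopped after {steps} steps")
```

The point of the sketch is the division of labor: the estimates set the stopping force, and the force feedback decides when the gripper has reached it. A heavier or more slippery object raises the threshold, so the same loop squeezes harder without any object-specific tuning.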
How did you get inspired to pursue this research?
I wouldn’t say that I was inspired to pursue this specific project. I think, like a lot of robotics research, I had been working away at a big problem for a while, and stumbled into a solution for a much smaller problem. My goal since I arrived here has been to research techniques for assistive robots and devices that restore agency for the elderly and/or mobility-impaired in their everyday lives. I’m particularly interested in shopping (but eventually generalist) robots. One problem we found is that it is really hard to identify, let alone pick, ripe fruit and produce with a typical robot gripper and just a camera. In early February, I took a day to try out picking up variably sized objects by hand-tuning the force sensing of our MAGPIE gripper (an affordable, open-source gripper developed by the Correll Lab). It worked well. I then let ChatGPT calibrate the gripper, which worked even better, and it evolved very quickly into DeliGrasp.
What would you say is one of your most interesting findings so far?
LLMs do a reasonable job of estimating an arbitrary object’s mass (friction, not as well) from just a text description. This isn’t in the paper, but when paired with a picture, they can extend this reasoning for oddballs—gigantic paper airplanes, or miniature (plastic) fruits and vegetables.
With our grasping method, we can sense the contact forces on the gripper as it closes around an object—this is a really good measure of ripeness, it turns out. We can then further employ LLMs to reason about these contact forces to pick out ripe fruit and vegetables!
What does the day-to-day of this research look like?
Leading up to submission, I was running experiments on the robot and picking up different objects with different strategies pretty much every day. A little repetitive, but also exciting. Prior to that, and now that I’m trying to improve the project for the next conference, I spend most of my time reading papers, thinking/coming up with ideas, and setting up small, one-off experiments to try out those ideas.
How did you come to study at CU Boulder?
For a few years, I’ve known that I really wanted to build robots that could directly, immediately help my loved ones and community. I had a very positive first research experience in my last year of undergrad and learned what it felt like to have true personal agency in pursuing work that I cared about. At the same time I knew I’d be relocating to Boulder after graduation. I was very fortunate that Nikolaus accepted me and let me keep pursuing this goal of mine.
It’d be unfathomable if I could keep doing this research in academia or industry, though of course that would be ideal. But I’m biased toward academia, particularly teaching. I’ve been teaching high school robotics for 5 years now, and I’m now teaching and mentoring undergrads at CU. Each day is as fulfilling as the first. I have great mentors across the robotics faculty and senior PhD students. We work in ECES 111, a giant, well-equipped space that 3 robotics labs share, and it’s great for collaboration and brainstorming.
What are your hopes for this international conference (and what conference is it)?
The venue is a workshop at the 2024 International Conference on Robotics and Automation (ICRA 2024), happening in Yokohama, Japan from May 13-17. The name of the workshop is a mouthful: Vision-Language Models for Navigation and Manipulation (VLMNM).
A workshop is detached from the main conference, and kind of is its own little bubble (like a big supermarket—the conference—hosting a pop-up food tasting event—the workshop). I'm really excited to meet other researchers and pick their brains. As a first-year, I’ve spent the past year reading papers from practically everyone on the workshop panel, and from their students. I’ll probably also spend half my time exploring (eating) around the Tokyo area.