Training Machines and Forging Ahead in our AI Era
KRISTEN GRAUMAN, Ph.D.
Kristen Grauman is a professor of computer science and specializes in machine learning and computer vision.
Interviewed by Esther Robards-Forbes.
With large language models suddenly bursting onto the scene, it seems like AI is everywhere. What does it look like from your perspective?
Over the last decade or so, there have been two big moments that really shifted what was possible with machine learning and AI. The first was the AlexNet advance in 2012 for image classification and object recognition. There was a confluence of GPU computing power, abundant crowdsourced annotated images and deep neural networks that allowed for a big leap forward in computer vision. It caused the needle to jump. The second was the recent advances in large language models like ChatGPT. Such advances reset researchers’ own expectations about what is possible and exactly where we can direct our attention for bigger, even more challenging advances down the road. For example, the success in image classification paved the way to tackling deeper challenges in video and embodied AI over the last 10 years. The success with large language models now allows us to bridge visual perception problems with those in knowledge and reasoning more meaningfully than ever before.
You’re working with UT’s new Center for Generative AI. What kind of challenges do you hope to work on there?
Video understanding is a very important frontier for computer vision and AI. The more mature areas of computer vision to date have focused on still images, often with a 2D and object-centric view of the world. In contrast, video presents substantial new challenges for capturing activity, events and causal relationships. “Egocentric” or “first-person” video taken from a human or agent’s perspective is particularly important for fields like robotics, virtual reality and augmented reality. But technical challenges abound.
What else are you and your team working on?
My group is also exploring new questions in multimodal perception. We are particularly interested in audio-visual and vision-language problems. For example, we have developed reinforcement learning agents that discover how to efficiently navigate a new environment based on both how it looks and how things sound; video models that can isolate a clean speech signal within a noisy environment; and methods for leveraging the connections between what people describe in a video and the visible actions they perform, for example, when a person narrates their activity in a how-to video.
Where is UT positioned in the world of AI research?
I think we’re in a very strong position. The Department of Computer Science has been a pioneer in AI going back decades. If you look at the success of the NSF Institute for the Foundations of Machine Learning [which is based at UT] and this new Center for Generative AI, if you look at the support from the administration with the Year of AI programming, if you look at our graduates being in high demand from top companies, if you look at our recent faculty hires, I think it’s easy to see why our department ranks among the top 10 programs in the country for both computer science and AI.
Watch a video where Grauman explains related work she leads with Meta and 14 universities.