Why Vision is the Superior Interface: Rethinking Natural Language as a Medium for Human-AI Interaction
May 7, 2025

Abstract
The current wave of AI interfaces has overwhelmingly embraced natural language as the primary mode of interaction—particularly through chat-based systems. While natural language enables flexible communication, it is fundamentally inefficient for transferring complex or large-scale information. In contrast, vision—anchored in the visual cortex—offers high-bandwidth, low-friction information transfer with minimal cognitive effort. This paper argues that graphical user interfaces (GUIs), which leverage vision, are inherently better suited for interaction with intelligent systems than chat interfaces. We explore the cognitive load imposed by natural language, the data loss in translation from thought to text, and the intuitive efficiency of visual perception. By re-evaluating interface paradigms, we advocate for a return to visual-first design in AI systems that aligns more closely with the strengths of human cognition.
1. Introduction
In the AI-dominated design landscape, chat interfaces have emerged as the default method for interacting with large language models (LLMs). Tech products increasingly rely on conversational paradigms, embedding AI assistants into workflows through natural language. While this approach appears accessible, it introduces profound inefficiencies. The central premise of this paper is simple: interaction is about transferring data—clearly, accurately, and quickly. Vision outperforms language in this domain.
2. Background and Related Work
Computing interfaces evolved from command-line interfaces (CLIs) to graphical user interfaces (GUIs), each generation shaped by technological constraints and human factors. GUIs became dominant for their ability to reduce cognitive load and visually represent complex data. More recently, conversational user interfaces (CUIs) have gained traction alongside LLMs. HCI literature has long investigated modalities and cognitive demands; however, few studies have critically examined the inefficiencies of language as an interaction medium compared to vision.
3. The Limitations of Natural Language
Natural language is inherently lossy. Turning complex internal thoughts into text demands precision and structure, and even then, context often gets lost. A simple instruction like "organize my files by project timeline" is ambiguous without shared context: sort by creation date, by deadline, or group files by project phase? Users must first conceptualize their intent, then serialize it linguistically, and finally hope the system interprets it correctly. Each step introduces potential for misunderstanding and friction, and the cognitive effort of composing precise prompts limits speed and spontaneity.
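To make that ambiguity concrete, the sketch below enumerates a few structured interpretations a system could plausibly assign to that single sentence. The type and field names are invented for illustration and reflect no particular system:

```typescript
// Hypothetical sketch: structured readings of the instruction
// "organize my files by project timeline". Every name here is invented.

type Interpretation =
  | { kind: "sortByDate"; date: "created" | "modified" | "deadline" }
  | { kind: "groupByProject"; orderProjectsBy: "startDate" | "endDate" }
  | { kind: "timelineView"; granularity: "week" | "month" };

// Each candidate is a legitimate reading of the same sentence; the text
// alone gives the system no way to pick the one the user actually meant.
const candidates: Interpretation[] = [
  { kind: "sortByDate", date: "deadline" },
  { kind: "groupByProject", orderProjectsBy: "startDate" },
  { kind: "timelineView", granularity: "month" },
];

console.log(`${candidates.length} plausible readings of one instruction`);
```

The point is not the specific types but the fan-out: a single serialized intent maps onto many structured actions, and the disambiguating context stays in the user's head.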
4. The Power of Vision
Visual processing is the brain's most efficient channel. With a glance, a user can understand the layout of a dashboard, the hierarchy of folders, or the status of multiple systems. Consider a kitchen scene: in under a second, a person can identify objects, spatial relationships, and even estimate use cases or recent activity. Translating that same scene into language would take paragraphs, introduce omissions, and still fail to communicate it fully. GUIs exploit this bandwidth by enabling point-and-click actions that demand minimal interpretation.
5. Chatbots vs. Graphical Interfaces
Chat interfaces, while flexible, depend on precise language and unfold as serial, linear exchanges. In contrast, GUIs allow simultaneous perception of multiple options, immediate feedback, and intuitive discovery. Most productivity tools (e.g., design software, dashboards, OS file systems) still rely on graphical metaphors because they optimize for speed, control, and familiarity. Chatbots disrupt that flow, often requiring users to describe what could otherwise be selected visually.
6. Design Implications for the AI Era
The AI boom offers a chance to rethink interfaces. Designers must resist the trend of chat-first experiences and explore ways to visualize LLM outputs and interactions. Interfaces should use visual controls, semantic previews, drag-and-drop elements, and guided discovery to help users navigate AI functionality. Hybrid designs that combine visual and linguistic inputs might offer a better path, leveraging the strengths of both without overloading the user.
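As one possible shape for such a hybrid, the sketch below assumes the model can be prompted to return a small JSON array of candidate actions instead of prose; the interface then renders those candidates as visual controls the user can scan and click. The output format, field names, and functions here are assumptions for illustration, not any product's actual API:

```typescript
// Hypothetical shape for structured LLM output: a set of concrete,
// selectable actions rather than a paragraph of text.
interface ActionOption {
  id: string;      // stable identifier for the action
  label: string;   // short text shown on the control
  preview: string; // semantic preview of what the action will do
}

// Render the options as buttons so the user perceives all of them at once,
// instead of reading and replying through a serial chat exchange.
function renderOptions(
  options: ActionOption[],
  onPick: (id: string) => void,
): HTMLElement {
  const container = document.createElement("div");
  for (const opt of options) {
    const button = document.createElement("button");
    button.textContent = opt.label;
    button.title = opt.preview; // semantic preview on hover
    button.addEventListener("click", () => onPick(opt.id));
    container.appendChild(button);
  }
  return container;
}

// Usage: parse the model's JSON output and mount the controls.
const options: ActionOption[] = JSON.parse(
  '[{"id":"by-deadline","label":"Sort by deadline",' +
  '"preview":"Orders files by project due date"}]',
);
document.body.appendChild(
  renderOptions(options, (id) => console.log("picked", id)),
);
```

The design intent is that language does what it is good at (expressing open-ended intent once), while vision does the rest: the user compares all candidates at a glance and commits with a single click rather than another round of text.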
7. Conclusion and Future Work
Natural language feels natural, but it is not optimal for interaction. Vision, with its capacity for instantaneous comprehension and minimal cognitive friction, is a better foundation for future AI interfaces. As we shape the next generation of intelligent tools, we must re-embrace visual design and build systems that align with human cognition. Future work should prototype and test vision-based AI experiences that replace or augment language-based interaction.