Architecture

tapioka.ai uses a hybrid architecture that combines Large Language Models (LLMs) with computer vision to achieve human-like interaction with digital interfaces.

1. The Core Pipeline

  1. Instruction Interpreter (LLM): Translates natural-language instructions into an ordered series of logical intents and expected outcomes.
  2. AI Execution Agent: Operates on the target device and perceives the UI not only through the DOM or source code, but also through visual analysis.
  3. Visual Validator: Uses computer vision to identify and number interactive elements (buttons, inputs, icons) in real time, ensuring the agent clicks the correct target even if the underlying code changes. A sketch of how these stages hand off to one another follows this list.
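
As a rough illustration of how the three stages could fit together, here is a minimal TypeScript sketch. Every name in it (Intent, LabeledElement, interpretInstruction, perceiveScreen, runStep) is hypothetical and not part of the tapioka.ai API; the stage implementations are stubs and the element-matching logic is a placeholder.

```typescript
// Hypothetical sketch of the three-stage pipeline; names and shapes are
// illustrative only and do not reflect the tapioka.ai API.

interface Intent {
  action: "tap" | "type" | "assert"; // logical intent derived from the instruction
  targetDescription: string;         // semantic target, e.g. "the Login button"
  expectedOutcome: string;           // what should be true after the step
}

interface LabeledElement {
  label: number;                     // number drawn on the screenshot overlay
  role: "button" | "input" | "icon"; // element class inferred by computer vision
  box: { x: number; y: number; w: number; h: number };
}

// Stage 1: the LLM translates a natural-language instruction into ordered intents (stubbed).
async function interpretInstruction(instruction: string): Promise<Intent[]> {
  return [{ action: "tap", targetDescription: instruction, expectedOutcome: "next screen loads" }];
}

// Stages 2 and 3: the agent captures the screen and the validator numbers its elements (stubbed).
async function perceiveScreen(deviceId: string): Promise<LabeledElement[]> {
  return [{ label: 1, role: "button", box: { x: 40, y: 900, w: 300, h: 80 } }];
}

// One pipeline iteration: interpret, perceive, then act on the semantically matched element.
async function runStep(deviceId: string, instruction: string): Promise<void> {
  const [intent] = await interpretInstruction(instruction);
  const elements = await perceiveScreen(deviceId);
  const target = elements.find((e) => e.role === "button"); // placeholder matching logic
  if (!target) throw new Error(`No element matches "${intent.targetDescription}"`);
  console.log(`Would ${intent.action} element #${target.label} at (${target.box.x}, ${target.box.y})`);
}
```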

2. Testing Infrastructure

Our SaaS platform manages a diverse fleet of execution environments:

  • Virtual Devices: Scalable cloud-based Android and iOS emulators/simulators.
  • Physical Device Fleet: Real hardware (e.g., Samsung S23, iPhone 15) for high-fidelity mobile and TV testing.
  • Cross-Platform Support: Native coverage for Web, Mobile (App), Desktop, and TV applications; a sample target descriptor is sketched after this list.
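
The following TypeScript sketch shows one way an execution target could be described when requesting an environment. The ExecutionTarget type and its field names are assumptions for illustration, not the platform's actual configuration schema.

```typescript
// Hypothetical execution-target descriptor; illustrative only.

type Platform = "web" | "mobile" | "desktop" | "tv";

interface ExecutionTarget {
  platform: Platform;
  kind: "virtual" | "physical";  // cloud emulator/simulator vs. real hardware
  os: "android" | "ios" | "windows" | "macos" | "linux" | "tvos";
  model?: string;                // e.g. "Samsung S23" or "iPhone 15" for physical devices
}

// Example: request a real iPhone 15 for a high-fidelity mobile run.
const target: ExecutionTarget = {
  platform: "mobile",
  kind: "physical",
  os: "ios",
  model: "iPhone 15",
};
```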

3. The Learning Engine

The architecture includes a dedicated Learning Mode that persists interaction paths. Once a step reaches the "Learned" state, the test script remains resilient to minor UI updates (such as a changed button ID or a relocated menu item), because the AI understands the semantic purpose of the element rather than relying on its hardcoded location.
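
As an illustration of what a persisted interaction path might look like, here is a hedged TypeScript sketch. The LearnedStep shape, its field names, and the fallback behaviour described in the comments are assumptions about how such a store could work, not the platform's actual format.

```typescript
// Hypothetical shape of a persisted "learned" step; illustrative only.
// The key idea: the element is stored by semantic purpose, with concrete
// locators kept only as hints that may go stale without breaking the step.

interface LearnedStep {
  purpose: string;                 // semantic description, e.g. "submit the login form"
  action: "tap" | "type";
  locatorHints: {
    lastKnownId?: string;          // may change between releases
    lastKnownText?: string;        // visible label at learning time
    lastKnownRegion?: { x: number; y: number }; // approximate on-screen position
  };
  learnedAt: string;               // ISO timestamp of the learning session
}

// When replaying, the agent would resolve the element by purpose first and
// use the stored hints only as fallbacks, so an ID rename or a moved menu
// item does not invalidate the step.
const step: LearnedStep = {
  purpose: "submit the login form",
  action: "tap",
  locatorHints: { lastKnownId: "btn-login", lastKnownText: "Log in" },
  learnedAt: "2024-01-01T00:00:00Z",
};
```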

4. Visual Evidence & Artifacts

Every step in the architecture generates artifacts, including high-resolution visual evidence (a sketch of the report shape follows this list):

  • Labeled Screenshots: Every interaction is documented with visual overlays.
  • Execution Metadata: Detailed technical logs exported as JSON/XML for reporting.
  • Session Video: Full recording of the AI agent's session on the device.
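
To make the reporting output concrete, here is a minimal TypeScript sketch of what the exported per-step metadata and session report could look like. The StepArtifact and SessionReport shapes and the summarize helper are assumptions for illustration; the actual JSON/XML schema exported by the platform may differ.

```typescript
// Hypothetical shapes for the exported execution metadata; illustrative only.

interface StepArtifact {
  stepIndex: number;
  instruction: string;   // the natural-language instruction for this step
  status: "passed" | "failed";
  screenshotUrl: string; // labeled screenshot with the numbered overlay
  startedAt: string;     // ISO timestamps for duration reporting
  finishedAt: string;
}

interface SessionReport {
  sessionId: string;
  videoUrl: string;      // full recording of the agent's session
  steps: StepArtifact[];
}

// Example: summarize a report for CI output.
function summarize(report: SessionReport): string {
  const failed = report.steps.filter((s) => s.status === "failed").length;
  return `${report.steps.length} steps, ${failed} failed, video: ${report.videoUrl}`;
}
```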