Toward A Transformerless Future

Breaking Free from Matmul - Distributed AI Model Training

Note: This article was updated September 27, 2025, incorporating insights from recent research and a recent Richard Sutton interview that affirm many of the tenets we have put forward over the years.

We’re exploring designs that take innovative approaches to distributed model training, extending beyond the constraints of matrix multiplication. Matrix multiplication has served as the computational cornerstone of deep learning for over a decade, yet examining its dominance reveals an architectural assumption that may be limiting the field’s potential. The emergence of reinforcement learning components in post-training workflows - from RLHF to constitutional AI to test-time compute optimization - suggests the industry is already backing into a different paradigm, one where models learn from interaction and experience, not just from static datasets.

The ML community has made significant strides in optimizing training and inference across diverse hardware. OpenXLA represents an important step forward, providing mechanisms for host offloading and managing memory transfers between devices. When training large models, OpenXLA enables operations to be distributed between accelerators (GPUs, TPUs) and host CPUs.

The Current Landscape: OpenXLA and Its Evolution

Examining OpenXLA’s approach reveals a fundamental assumption: memory spaces are distinct and data movement between them is inevitable. This leads to a focus on optimizing copies, accepting them as necessary overhead in distributed computation.

%%{init: {'theme': 'neutral'}}%%
flowchart
    subgraph "OpenXLA Approach"
        A[Data on Accelerator] -->|Copy to Host| B[Data on Host CPU]
        B -->|Process| C[Modified Data on Host]
        C -->|Copy back to Accelerator| D[Data on Accelerator]
        style A fill:#f9d5e5,stroke:#333
        style B fill:#eeac99,stroke:#333
        style C fill:#eeac99,stroke:#333
        style D fill:#f9d5e5,stroke:#333
    end

The OpenXLA team has built sophisticated mechanisms to schedule these copies asynchronously and overlap them with computation:

# Conceptual representation of OpenXLA's approach
def process_with_host_offloading(data, model_params):
    # Copy data from device to host (explicit transfer)
    host_data = copy_to_host(data)
    
    # Process on host CPU
    processed_data = host_computation(host_data)
    
    # Copy back to device (explicit transfer)
    device_data = copy_to_device(processed_data)
    
    # Continue device computation
    result = device_computation(device_data, model_params)
    return result

This approach schedules copies efficiently but doesn’t challenge the underlying model. OpenXLA performs important work in managing these transfers, yet the fundamental paradigm of separate memory spaces remains unquestioned.
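
To ground the copy-overlap idea, here is a conceptual F# sketch of the pattern OpenXLA’s scheduler automates; the transfer and compute functions are simulated stand-ins rather than real accelerator calls, so the overlap itself can be run and inspected:

// Conceptual sketch only - not OpenXLA's API. The "device" and "host" steps
// are simulated so the asynchronous overlap pattern is runnable as-is.
let simulatedCopyToHost (data: float32[]) = async {
    do! Async.Sleep 50                        // stand-in for DMA transfer latency
    return Array.copy data
}

let simulatedDeviceCompute (data: float32[]) =
    data |> Array.map (fun x -> x * 2.0f)     // stand-in for an accelerator kernel

let processWithOverlap (batch: float32[]) = async {
    // Launch the device-to-host transfer without blocking further "device" work
    let! pendingTransfer = Async.StartChild (simulatedCopyToHost batch)
    // Overlap: accelerator-side computation proceeds while the copy is in flight
    let deviceResult = simulatedDeviceCompute batch
    // Join the transfer, then run the host-side step on the copied data
    let! hostCopy = pendingTransfer
    let hostResult = hostCopy |> Array.sum
    return deviceResult, hostResult
}

// Example: processWithOverlap [| 1.0f; 2.0f; 3.0f |] |> Async.RunSynchronously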

The SpeakEZ Difference: Zero-Copy Architecture with BAREWire

At SpeakEZ Technologies we’ve developed BAREWire, our patent-pending technology that represents a fundamental rethinking of memory management across heterogeneous computing environments. BAREWire uses a zero-copy architecture providing direct access to memory across different devices without unnecessary transfers:

// Type-safe memory management with units of measure
module BAREWire =
    // Units of measure for memory safety
    [<Measure>] type addr      // Memory address
    [<Measure>] type bytes     // Size in bytes
    [<Measure>] type gpu_mem   // GPU memory space
    [<Measure>] type cpu_mem   // CPU memory space
    [<Measure>] type unified   // Unified memory space

    // Shared buffer tagged with its element type and memory space
    type SharedBuffer<'T, [<Measure>] 'Space> = {
        Address: nativeint
        Size: int<bytes>
        Layout: MemoryLayout
    }

    // Create a shared buffer without copying
    let createShared<'T> (size: int<bytes>) : SharedBuffer<'T, unified> =
        // Allocate memory accessible to both CPU and GPU
        let ptr = allocateUnifiedMemory<'T>(size)

        {
            Address = ptr
            Size = size
            Layout = MemoryLayout.getOptimized<'T>()
        }

    // Create views without copying data - only the phantom memory-space
    // parameter changes; the underlying bytes are never touched
    let createCpuView<'T> (buffer: SharedBuffer<'T, unified>) : SharedBuffer<'T, cpu_mem> =
        { Address = buffer.Address; Size = buffer.Size; Layout = buffer.Layout }

%%{init: {'theme': 'neutral'}}%%
flowchart
    subgraph "BAREWire Zero-Copy Approach"
        A[Memory Region] --- B[CPU View of Memory]
        A --- C[GPU View of Memory]
        B <-->|Synchronization Only| C
        style A fill:#d0f0c0,stroke:#333
        style B fill:#a8d8ea,stroke:#333
        style C fill:#a8d8ea,stroke:#333
    end

This paradigm shift replaces data copying with unified memory abstraction and typed views that maintain strict type safety. Our F# implementation leverages units of measure to ensure memory operations remain type-safe at compile time, preventing common errors before they occur.
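
As a minimal, self-contained illustration of that idea (the View type and functions below are written for this example, not part of the BAREWire API), a phantom memory-space measure means host code handed a GPU-space view simply will not compile:

// Units of measure used as phantom tags: they never affect runtime layout
[<Measure>] type bytes
[<Measure>] type cpu_mem
[<Measure>] type gpu_mem

// A view over shared memory, tagged with the space it is valid in
type View<[<Measure>] 'Space> = { Length: int<bytes> }

// Host-side code only accepts CPU-space views
let readOnHost (view: View<cpu_mem>) =
    printfn "Host reads %d bytes" (int view.Length)

let cpuView : View<cpu_mem> = { Length = 4096<bytes> }
readOnHost cpuView

// let gpuView : View<gpu_mem> = { Length = 4096<bytes> }
// readOnHost gpuView   // compile-time error: gpu_mem is not cpu_mem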

The approach changes distributed computation fundamentally by allowing heterogeneous compute devices to safely access shared memory regions through typed interfaces. Memory overhead decreases dramatically, transfer bottlenecks are eliminated, and training efficiency improves measurably.

Beyond MatMul: New Frontiers in Distributed Model Training

The transition from transformer architectures to emerging alternatives represents more than an optimization strategy. Current models train on what amounts to fossilized intelligence - the final outputs of human reasoning captured in text, without access to the iterative processes that produced those outputs. A model learning from Wikipedia articles about scientific discoveries never experiences the failed experiments, revised hypotheses, or conceptual breakthroughs that led to those discoveries. This distinction between learning from outcomes versus learning from processes becomes particularly relevant when considering distributed training of new architectures.

Our patent-pending BAREWire zero-copy architecture becomes powerful when applied to emerging model architectures that operate without traditional matrix multiplication constraints.

MatMul-Free Models: Rethinking Fundamental Operations

Transformers revolutionized deep learning, yet their computational backbone remains matrix multiplication. We continue to explore models that replace traditional matmul operations with alternative computational primitives that are more efficient and scalable when distributed across multiple compute resources.

%%{init: {'theme': 'neutral'}}%%
flowchart
    subgraph "Traditional Model"
        A1[Input Embedding] --> B1[MatMul Attention]
        B1 --> C1[MatMul FFN]
        C1 --> D1[Output Layer]
    end
    style A1 fill:#f9d5e5,stroke:#333
    style B1 fill:#f9d5e5,stroke:#333
    style C1 fill:#f9d5e5,stroke:#333
    style D1 fill:#f9d5e5,stroke:#333

%%{init: {'theme': 'neutral'}}%%
flowchart
    subgraph "MatMul-Free"
        A2[Input Embedding] --> B2[Alternative Pattern Matching]
        B2 --> C2[Sparse Operations]
        C2 --> D2[Output Layer]
    end
    style A2 fill:#d0f0c0,stroke:#333
    style B2 fill:#d0f0c0,stroke:#333
    style C2 fill:#d0f0c0,stroke:#333
    style D2 fill:#d0f0c0,stroke:#333

These approaches require fundamentally different memory access patterns that traditional frameworks struggle to support efficiently. BAREWire’s pre-optimized memory layouts are well suited to these novel computational patterns, enabling distributed training of innovative architectures:

// Type-safe ternary weight matrix with dimensionality checking
type TernaryMatrix<[<Measure>] 'Rows, [<Measure>] 'Cols> = {
    Values: sbyte[,]       // -1, 0, 1 values
    ScaleFactor: float     // Learned scaling factor
    Rows: int<'Rows>
    Cols: int<'Cols>
}

// Zero-copy distributed computation for MatMul-free operations
let distributedTernaryComputation (input: Vector<float32, 'InDim>)
                                  (weights: TernaryMatrix<'InDim, 'OutDim>) =
    // Create result with type-safety guarantees
    let result = Vector.zero<float32, 'OutDim>()
    
    // Distribute computation across processing nodes
    let partitions = 4 // Number of compute nodes
    let partitionSize = dimensions<'OutDim> / partitions
    
    // Zero-copy distribution using BAREWire
    let partitionedResults = 
        [0..partitions-1]
        |> List.map (fun p -> 
            let startRow = p * partitionSize
            let endRow = min ((p+1) * partitionSize - 1) (dimensions<'OutDim> - 1)
            
            // Execute on specific hardware without data copying; computePartition
            // (assumed to be defined elsewhere) accumulates the ternary contributions
            // for rows startRow..endRow against the shared input and weights
            if p % 2 = 0 then
                // Execute on GPU (even partitions)
                GPU.execute (fun () -> computePartition startRow endRow)
            else
                // Execute on CPU (odd partitions)
                CPU.execute (fun () -> computePartition startRow endRow)
        )
    
    // Merge results (zero-copy when possible)
    partitionedResults |> List.iteri (fun p partResult ->
        Vector.blit partResult 0 result (p * partitionSize) partResult.Length
    )
    
    result

This implementation leverages static typing to guarantee dimensional consistency while enabling efficient distribution of compute workloads across heterogeneous hardware without unnecessary copies.
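
To make the dimensional-consistency point concrete, here is a small, self-contained sketch (the Vec and Layer types below are illustrative stand-ins, not the production Vector or TernaryMatrix definitions) in which composing layers whose dimensions don’t line up is rejected by the compiler:

// Phantom dimension measures: the measure parameter exists only at compile time
[<Measure>] type inDim
[<Measure>] type hiddenDim
[<Measure>] type outDim

type Vec<[<Measure>] 'Dim> = { Data: float32[] }
type Layer<[<Measure>] 'In, [<Measure>] 'Out> = { Apply: Vec<'In> -> Vec<'Out> }

// Composition only type-checks when the inner dimensions agree
let compose (first: Layer<'A, 'B>) (second: Layer<'B, 'C>) : Layer<'A, 'C> =
    { Apply = fun v -> second.Apply (first.Apply v) }

let encoder : Layer<inDim, hiddenDim> =
    { Apply = fun v -> { Data = v.Data |> Array.map (fun x -> max 0.0f x) } }
let decoder : Layer<hiddenDim, outDim> =
    { Apply = fun v -> { Data = v.Data |> Array.map (fun x -> x * 0.5f) } }

let network = compose encoder decoder   // Layer<inDim, outDim>
// let broken = compose decoder encoder // does not compile: dimensions mismatch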

BitNet Ternary Operations: AI for Resource-Constrained Environments

BitNet and other extremely quantized models represent another frontier in AI, replacing high-precision floating-point operations with ternary (-1, 0, 1) or binary operations. Traditional training frameworks expect uniform precision throughout the model, creating implementation challenges.

Our distributed training approach enables:

  1. Progressive Quantization: Incrementally convert model components from floating-point to ternary while training continues
  2. Mixed-Precision Training: Maintain high-precision gradients while using low-precision weights
  3. CPU Optimization: Direct bit-level operations optimized for CPU SIMD instructions

Models trained with our zero-copy distributed approach can run efficiently on consumer CPUs while maintaining accuracy comparable to much larger models.
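
For illustration, here is a simplified absmean-style ternary quantizer, a self-contained sketch rather than our production training-time quantizer, showing how full-precision weights collapse into the {-1, 0, +1} values and scale factor that a TernaryMatrix stores:

// Simplified sketch: absmean-style ternary quantization of a weight matrix.
// The scale is the mean absolute weight; each weight is then snapped to the
// nearest ternary level. Production training would keep high-precision
// gradients alongside these low-precision weights.
let quantizeTernary (weights: float32[,]) : sbyte[,] * float32 =
    let rows = Array2D.length1 weights
    let cols = Array2D.length2 weights
    // Scale factor: mean absolute value of the full-precision weights
    let mutable total = 0.0f
    for r in 0 .. rows - 1 do
        for c in 0 .. cols - 1 do
            total <- total + abs weights.[r, c]
    let scale = total / float32 (rows * cols)
    // Snap each rescaled weight to the nearest ternary level
    let values =
        Array2D.init rows cols (fun r c ->
            let w = weights.[r, c] / scale
            if w > 0.5f then 1y elif w < -0.5f then -1y else 0y)
    values, scale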

MLA and MAMBA: Enhancing Inference with Dynamic Updates

Multi-Head Latent Attention (MLA) and MAMBA’s state space models represent approaches to making models more capable during inference through continuous refinement. The rigid separation between training and inference phases in current systems prevents models from learning through deployment. Our framework challenges this separation by enabling continuous model evolution.

Our actor-based incremental inference system enables progressive enhancement of deployed models:

%%{init: {'theme': 'neutral'}}%%
flowchart
    subgraph "Progressive Model Enhancement"
        A[Live Running Model] --> B{Enhance Component?}
        B -->|Yes| C[Create Enhanced Replacement]
        C --> D[Hot-Swap Component]
        D --> A
        B -->|No| A
    end
    style A fill:#a8d8ea,stroke:#333
    style B fill:#eeac99,stroke:#333
    style C fill:#d0f0c0,stroke:#333
    style D fill:#d0f0c0,stroke:#333

This allows continuous improvement of models in production using our zero-copy memory model:

// Zero-copy actor-based model enhancement
type ModelComponent<'Input, 'Output> = {
    Id: ComponentId
    Forward: 'Input -> 'Output
    Implementation: Implementation
}

// Implementation variants - MLA and MAMBA use different approaches
type Implementation =
    | StandardAttention of AttentionConfig
    | MultiHeadLatentAttention of MLAConfig
    | StateSpaceModel of SSMConfig

// Upgrade component without service interruption
let enhanceModelComponent<'Input, 'Output> 
    (model: DeployedModel) 
    (componentId: ComponentId) 
    (newImplementation: Implementation) =
    
    // Look up the live component and create a shared buffer sized for its state
    let currentComponent = model.GetComponent componentId
    let sharedState = BAREWire.createShared<byte>(currentComponent.StateSize)
    
    // Extract current state via zero-copy
    model.ExtractComponentState(componentId, sharedState)
    
    // Create new implementation with zero-copy state initialization
    let newComponent = createComponent newImplementation sharedState
    
    // Use zero-copy memory for in-place component swapping
    model.ReplaceComponent(componentId, newComponent)

We can convert standard attention modules to MLA or MAMBA implementations on-the-fly, without service interruption, using our zero-copy memory approach to ensure efficient state transfer. This capability acknowledges that learning doesn’t end at deployment - models should continue evolving through interaction with real-world data, a principle fundamental to biological intelligence yet absent from current AI systems.
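
To make the actor-based mechanics concrete, here is a minimal, self-contained sketch. It is deliberately simplified: it swaps a plain function and omits the zero-copy state transfer shown above, but it shows how a component keeps serving requests while its implementation is replaced between messages:

// Simplified actor sketch: a MailboxProcessor serves inference requests and
// accepts a swap message between requests, so callers never observe downtime.
type ComponentMsg =
    | Infer of input: float32[] * reply: AsyncReplyChannel<float32[]>
    | Swap of newForward: (float32[] -> float32[])

let componentActor (initialForward: float32[] -> float32[]) =
    MailboxProcessor.Start(fun inbox ->
        let rec loop forward = async {
            match! inbox.Receive() with
            | Infer (input, reply) ->
                reply.Reply (forward input)   // serve with the current implementation
                return! loop forward
            | Swap newForward ->
                return! loop newForward       // replace the implementation in place
        }
        loop initialForward)

// Usage: swap a doubling "layer" for a tripling one between two requests
let actor = componentActor (Array.map (fun x -> x * 2.0f))
let before = actor.PostAndReply (fun ch -> Infer ([| 1.0f; 2.0f |], ch))
actor.Post (Swap (Array.map (fun x -> x * 3.0f)))
let after = actor.PostAndReply (fun ch -> Infer ([| 1.0f; 2.0f |], ch))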

Building the Future of Distributed Training

OpenXLA and SPIR-V provide foundations for distributed computation across heterogeneous hardware. Our vision extends beyond current capabilities. By combining zero-copy memory management with actor-based architecture, we’re creating a system that addresses fundamental limitations in how models learn and evolve:

// Extensible platform configuration for distributed training
type PlatformConfig = {
    MemoryModel: MemoryModelType
    DeviceType: DeviceType
    DistributionStrategy: DistributionStrategy
}

// Memory models with capabilities beyond OpenXLA's model
type MemoryModelType =
    | DiscreteDevices          // Similar to current OpenXLA model
    | UnifiedAddressSpace      // BAREWire zero-copy model
    | PartiallyUnifiedHybrid   // Mix of unified and discrete memory spaces

// Distribution strategies with zero-copy where architecturally possible
type DistributionStrategy =
    | Pipelined of NumStages: int
    | DataParallel of Shards: int
    | TensorParallel of Splits: int
    | ExpertParallel of NumExperts: int * ActiveExperts: int
    | Hybrid of (int * DistributionStrategy) list
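
As a hypothetical configuration built from these types (the HeterogeneousCluster case is assumed here since the DeviceType union isn’t shown, and the Hybrid pairings are illustrative), a mixed strategy might pair pipelining across node groups with data parallelism inside each group:

// Hypothetical configuration sketch; HeterogeneousCluster is an assumed
// DeviceType case and the Hybrid pairings are illustrative only
let trainingConfig = {
    MemoryModel = UnifiedAddressSpace
    DeviceType = HeterogeneousCluster
    DistributionStrategy = Hybrid [ (2, Pipelined 2); (4, DataParallel 4) ]
}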

This approach enables:

  1. Distributed Training Across Heterogeneous Hardware: Leverage CPUs, GPUs, and specialized accelerators in concert with zero-copy memory sharing
  2. Support for Novel Computational Patterns: Enable architectures that break free from traditional matmul constraints
  3. Continuous Model Evolution: Update deployed models through interaction without retraining or downtime
  4. Efficient Scaling: Minimize unnecessary data movement to maximize computational efficiency

The distinction between training and inference phases increasingly appears as an artifact of current architectures. Biological systems don’t have separate training and deployment phases - they learn continuously through experience. The incorporation of reinforcement learning into post-training workflows, from RLHF to constitutional AI, represents the industry’s gradual recognition of this principle. Our framework makes this continuous learning paradigm explicit and efficient.

Memory Management Beyond OpenXLA

Our approach to memory management represents a fundamental departure from OpenXLA’s paradigm. The industry’s evolution toward continuous learning - evident in the proliferation of RL-based post-training methods - requires memory systems that can support seamless transitions between learning and inference.

BAREWire fundamentally changes the paradigm by:

  1. Creating Unified Memory Abstractions: Representing memory as shared resources with device-specific views
  2. Providing Type-Safe Memory Management: Using units of measure to prevent address and size errors
  3. Optimizing Memory Layouts: Pre-configuring memory layouts optimal for each hardware target
  4. Eliminating Unnecessary Copies: Enabling true zero-copy operation where architecturally possible

Conclusion: Designed Intelligence for Continuous Learning

The fundamental difference between OpenXLA and SpeakEZ’s approach reflects different assumptions about the nature of intelligence:

| OpenXLA | SpeakEZ BAREWire |
| --- | --- |
| Memory spaces are distinct | Memory can be unified or shared |
| Data movement is necessary | Data movement can often be eliminated |
| Focus on scheduling copies efficiently | Focus on eliminating copies where possible |
| Optimize for copy overlap with computation | Optimize for zero-copy direct access |
| Training and inference are separate phases | Learning is continuous |

The shift from imitation-based learning to experience-based learning requires more than incremental improvements to existing frameworks. Current models learn from crystallized knowledge without understanding the processes that created it. The increasing adoption of reinforcement learning in post-training - what might be called a “ship of Theseus” transformation of the field - indicates the industry is already moving toward continuous learning paradigms, albeit without acknowledging the fundamental architectural changes required.

OpenXLA provides a solid foundation for heterogeneous computation within current architectural assumptions. Our BAREWire technology reimagines memory management for AI systems that learn continuously through experience. By eliminating unnecessary copies and providing unified memory abstraction with type safety guarantees, we’re creating an efficient, scalable approach to distributed model training that aligns with how intelligence actually develops.

The underlying technology, built on our “System and Method for Zero-Copy Inter-Process Communication Using BARE Protocol” (US 63/786,247), creates new possibilities for AI systems that can efficiently distribute computation across heterogeneous hardware while minimizing overhead traditionally associated with data movement. This patent-pending software innovation from SpeakEZ Technologies represents a significant advancement in distributed AI model training.

This shift from copy scheduling to zero-copy architecture, combined with the transition from static training to continuous learning, represents a paradigm change in how distributed AI systems can be implemented. We’re not building organic mimicry but designed intelligence - systems that combine the efficiency of engineered solutions with the adaptability of continuous learning. This approach enables efficient training of next-generation AI models that move beyond traditional transformer architectures and matrix multiplication operations, creating systems that understand processes, not just patterns. SpeakEZ is pleased to pioneer this important facet to the ecosystem that will guide the future of intelligent workload development.

Author: Houston Haynes
Date: May 13, 2025
Category: AI