LowLevelInstanceData & animation

AppleOS 26 introduces LowLevelInstanceData that can reduce CPU draw calls significantly by instancing. However, I have noticed trouble with animating each individual instance.

Because I wanted low-level control, I'm using a custom system and LowLevelInstanceData.replace(using:) to update the transforms each frame. The update closure itself is extremely efficient (Xcode Instruments reports nearly no cost), but I noticed extremely high runloop time, reaching around 20 ms. Time Profiler shows that the CPU is blocked by kernel.release.t6401.

I think it is caused by synchronization between the CPU and GPU. However, since I am already using an MTLCommandBuffer to coordinate it, I don't understand why I am still seeing such high CPU time.

Answered by DTS Engineer in 885508022

The 20ms CPU stall you're seeing is likely caused by a CPU/GPU synchronization hazard when updating the instance transform buffer.

LowLevelInstanceData provides three distinct methods for writing transform data, each with different synchronization behavior:

  • withMutableTransforms — Gives you a mutable pointer to the current backing buffer. If the GPU is still reading this buffer from the previous frame's render, the CPU will block until the GPU finishes. This is the most likely cause of the stall you're seeing — the kernel.release.t6401 time in the Time Profiler is the CPU waiting for the GPU to release the buffer.

  • replaceMutableTransforms — Gives you a mutable pointer to a fresh buffer. RealityKit handles buffer rotation internally, so there's no synchronization stall. When your closure completes, RealityKit swaps the new buffer in for subsequent renders. This is the correct method for per-frame CPU-side animation.

  • replace(using:) with an MTLCommandBuffer — Returns an MTLBuffer for GPU compute shader writes. Best for very large instance counts where GPU parallelism outperforms CPU iteration.

This three-tier pattern (with… / withMutable… / replaceMutable…) is consistent across all of RealityKit's low-level types (LowLevelBuffer, LowLevelMesh, LowLevelInstanceData).
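To make the difference between writing in place and replacing more concrete, here is a minimal, purely conceptual sketch of buffer rotation in plain Swift. The type RotatingBuffers and its members are hypothetical names invented for illustration; this is not RealityKit's implementation, only the general idea of writing into a buffer the reader is not using and then publishing it.

```swift
/// Hypothetical illustration of buffer rotation; not RealityKit's implementation.
struct RotatingBuffers<Element> {
    private var buffers: [[Element]]
    /// Index of the buffer a reader (the GPU, in this analogy) currently sees.
    private(set) var readIndex = 0

    init(count: Int, repeating value: Element, capacity: Int) {
        precondition(count >= 2, "rotation needs at least two buffers")
        buffers = Array(repeating: Array(repeating: value, count: capacity), count: count)
    }

    /// The contents that are currently published to the reader.
    var current: [Element] { buffers[readIndex] }

    /// Lets the caller fill a buffer the reader is not using, then publishes it.
    /// The writer never touches the published buffer, so it never has to wait.
    mutating func replace(_ body: (inout [Element]) -> Void) {
        let writeIndex = (readIndex + 1) % buffers.count
        body(&buffers[writeIndex])
        readIndex = writeIndex
    }
}

var transforms = RotatingBuffers(count: 2, repeating: 0, capacity: 3)
let published = transforms.current          // what the reader sees before the update
transforms.replace { $0[0] = 7 }            // write into the other buffer, then swap
print(published, transforms.current)
```

In this sketch the buffer handed to the closure still holds whatever was written two updates earlier, so the closure should rewrite every element it cares about rather than rely on the previous frame's values.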

Regarding your MTLCommandBuffer coordination — RealityKit manages its own internal render pipeline, so your command buffer doesn't affect when RealityKit reads the instance data. The synchronization needs to happen through the LowLevelInstanceData API itself (via the replace variants), not through an external command buffer.

If you're currently using withMutableTransforms for your per-frame updates, switching to replaceMutableTransforms should eliminate the stall — the closure signature is the same, so it's a one-line change.

If this doesn't resolve the issue, could you share the code where you're updating the transforms each frame? Seeing the actual update call and the surrounding context would help narrow down what's happening.

Hey there, thank you so much for your reply.

Yes, I realized that replace(using:) won't work as soon as I noticed that the returned buffer is GPU-only, so I am using replaceMutableTransforms, on the assumption that replacing could be faster than having the GPU read the buffer and hand it over to the CPU.

The current code is something like this:

public struct InstanceAnimatorSystem: System {
    
    static let query = EntityQuery(where: .has(SyncInstanceAnimationComponent.self))
    
    public func update(context: SceneUpdateContext) {
        ... // update states
        
        context.entities(matching: Self.query, updatingSystemWhen: .rendering).forEach { entity in
            guard let instanceAnimation = entity.components[SyncInstanceAnimationComponent.self] else { return }
            guard let component = entity.components[MeshInstancesComponent.self] else { return }
            
            for part in component.parts {
                part.data.replaceMutableTransforms { transforms in
                    for i in 0 ..< 88 { // 88 instances.
                        let transform = ... // obtain transform
                        
                        transforms.initializeElement(at: i, to: transform)
                    }
                }
            }
        }
    }
}

I am using a DispatchSource instead of a System here, as the System fires update(context:) over 90 times per second, and the CPU just can't keep up.

The closure itself is extremely efficient though, Xcode Instruments reports nearly no cost.

However, results on both the simulator and a physical Apple Vision Pro are pretty poor: runloops hit 20 ms all the time, with Custom RealityKit Systems reported as severe bottlenecks. Time Profiler also shows heavy CPU usage, constantly over 200%, but about 95% of the work is kernel.release.

I'm running Xcode Version 26.4.1 (17E202) and visionOS 26.4.

Thank you for sharing the code. Since you're already using replaceMutableTransforms — which is the correct method for per-frame updates — and the closure itself is cheap, the 20ms stall with 95% kernel.release time for only 88 instances (~5.5 KB of transform data) seems disproportionate.
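For reference, the ~5.5 KB figure can be reproduced with a quick calculation. This is a sketch that assumes each instance transform is stored as a 4×4 matrix of 32-bit floats (64 bytes); the actual storage layout used by LowLevelInstanceData is not confirmed here.

```swift
// Assumption: one transform = 4x4 matrix of Float (16 floats, 64 bytes).
let instanceCount = 88
let floatsPerTransform = 4 * 4
let bytesPerTransform = floatsPerTransform * MemoryLayout<Float>.stride
let totalBytes = instanceCount * bytesPerTransform
let totalKiB = Double(totalBytes) / 1024.0

print("\(totalBytes) bytes, \(totalKiB) KiB per update")
```

Under that assumption, each update rewrites 5,632 bytes, which is why the write itself should be negligible.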

One thing worth trying: you mention driving updates with a DispatchSource instead of the RealityKit System. Updating instance data from outside the render loop could introduce synchronization overhead, since RealityKit may need to coordinate the buffer swap with its own frame timing. If the System update rate is the concern, consider keeping the update inside the System but throttling it — track elapsed time and skip frames where the delta is below your target interval, rather than moving the update outside the render loop entirely.
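The throttling suggestion above could look roughly like this. It is a minimal sketch in plain Swift: UpdateThrottle and shouldUpdate(deltaTime:) are hypothetical names, and the delta passed in is assumed to be the per-frame time step that the System receives (for example the deltaTime on SceneUpdateContext).

```swift
import Foundation

/// Hypothetical helper: accumulates frame deltas and reports when a target
/// interval has elapsed, so expensive work can run less often than every frame.
struct UpdateThrottle {
    let interval: TimeInterval
    private var accumulated: TimeInterval = 0

    init(interval: TimeInterval) {
        self.interval = interval
    }

    /// Returns true when at least `interval` seconds have accumulated.
    mutating func shouldUpdate(deltaTime: TimeInterval) -> Bool {
        accumulated += deltaTime
        guard accumulated >= interval else { return false }
        // Keep only the remainder so one long frame does not trigger a burst.
        accumulated = accumulated.truncatingRemainder(dividingBy: interval)
        return true
    }
}

// Simulated frames: with a 0.5 s interval and 0.25 s steps, every second frame updates.
var throttle = UpdateThrottle(interval: 0.5)
for frame in 1...4 {
    let run = throttle.shouldUpdate(deltaTime: 0.25)
    print("frame \(frame): \(run ? "update transforms" : "skip")")
}
```

Inside update(context:), the call to replaceMutableTransforms would then only run on frames where shouldUpdate returns true. Where the throttle state lives (in the system itself or in a component) is left open here.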

That said, I'd recommend filing a Feedback report with your sample project and the Instruments trace. The performance you're seeing doesn't match what you'd expect for this workload, and the RealityKit team would be in the best position to investigate whether the internal buffer synchronization is behaving as expected on visionOS 26.4.

Bug Reporting: How and Why? has tips on creating your bug report.
