Apple researchers have developed LiTo, an AI model that reconstructs detailed 3D objects from a single image while preserving realistic lighting effects such as reflections and specular highlights across multiple viewing angles. [41][43]
LiTo: Unified 3D Latent Representation
The model introduces a 3D latent representation that captures object geometry and view-dependent appearance at the same time. Previous approaches prioritize either shape reconstruction or basic diffuse textures; LiTo also handles complex visual effects under varied lighting conditions. [40]
LiTo treats RGB-depth images as samples of a surface light field, a 5D function that maps 3D surface points and viewing directions to outgoing radiance. The researchers encode random subsamples of these light fields into a compact set of 8192 latent vectors, each with 32 dimensions, using a Perceiver IO-based tokenizer. [43]
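To make the light-field idea concrete, here is a minimal sketch of how one RGB-depth pixel could be turned into a single sample (surface point, viewing direction, radiance). It assumes a pinhole camera at the origin of its own coordinate frame; the function names and intrinsics (`fx`, `fy`, `cx`, `cy`) are illustrative assumptions, not taken from the LiTo paper.

```python
import math

# Latent size reported in the article: 8192 vectors of 32 dimensions each.
LATENT_TOKENS = 8192
LATENT_DIM = 32

def unproject_pixel(u, v, depth, fx, fy, cx, cy):
    """Back-project pixel (u, v) with z-depth into camera coordinates (pinhole model)."""
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    return (x, y, depth)

def light_field_sample(u, v, depth, rgb, fx, fy, cx, cy):
    """Return (surface point, unit direction from the surface to the camera, radiance).

    Hypothetical helper: the camera is assumed to sit at the origin, so the
    viewing direction is simply the negated, normalized surface point.
    """
    point = unproject_pixel(u, v, depth, fx, fy, cx, cy)
    norm = math.sqrt(sum(c * c for c in point))
    direction = tuple(-c / norm for c in point)
    return point, direction, rgb

# One sample taken at the image center, 2 units in front of the camera.
sample = light_field_sample(u=320, v=240, depth=2.0, rgb=(0.8, 0.7, 0.6),
                            fx=500.0, fy=500.0, cx=320.0, cy=240.0)
print(sample)
print("latent values per object:", LATENT_TOKENS * LATENT_DIM)
```

Each sample pairs a 3D position with a direction, which is why the function is described as five-dimensional: three coordinates for the point and two degrees of freedom for the unit direction.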
Innovative Training Approach
To train LiTo, the team rendered high-quality 3D assets from Objaverse-XL (over 500,000 objects) across 150 viewing angles and three distinct lighting setups per object: fixed smooth area lights, all-white environment maps, and random lighting configurations. [43]
The tokenizer processes small random subsets of these multi-view renders and compresses them into latent codes. A dedicated decoder then reconstructs the full 3D geometry and appearance, learning to generalize from sparse inputs. This is what lets the system infer complete models from just one image. Training runs on large GPU clusters, with the tokenizer trained for 90,000 iterations. [43]
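The subsampling step can be pictured with a small sketch. It assumes every one of the 150 views is rendered under each of the three lighting setups, which the article does not state explicitly, and it samples whole renders rather than individual pixels for simplicity. The names and the subset size `k` are hypothetical.

```python
import random

NUM_VIEWS = 150
# Labels for the three lighting setups named in the article (identifiers are made up).
LIGHTING_SETUPS = ("area_lights", "white_env_map", "random_lighting")

def all_renders():
    """Enumerate every (view index, lighting setup) pair for one object."""
    return [(view, light) for light in LIGHTING_SETUPS for view in range(NUM_VIEWS)]

def sample_subset(renders, k, seed=None):
    """Pick k distinct renders uniformly at random, as a stand-in for tokenizer input."""
    rng = random.Random(seed)
    return rng.sample(renders, k)

renders = all_renders()
subset = sample_subset(renders, k=8, seed=0)
print(len(renders), "renders per object under this assumption")
print(subset)
```

Because the encoder only ever sees a small random slice of the available renders, the decoder is pushed to reconstruct views it was not given, which is the generalization the paragraph above describes.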
Superior Reconstruction and Generation
A Gaussian decoder outputs view-dependent 3D Gaussians with spherical harmonics up to degree three, supporting effects like Fresnel reflections. Results on benchmarks like Toys4k show LiTo outperforming competitors such as TRELLIS, achieving higher PSNR (34.16 vs. 31.12 on simple views), SSIM (0.985), and lower LPIPS scores. [43]
For single-image generation, a flow-matching model conditioned on input images produces 3D assets aligned with the source photo’s lighting and materials. This yields better fidelity and visual quality, with lower FID (6.216) and KID metrics compared to baselines. [43]
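Two of the quantities above follow from standard formulas, sketched here for reference. Real spherical harmonics up to degree L have (L + 1)^2 basis functions, and PSNR is 10 * log10(MAX^2 / MSE). The helper names are illustrative, and the code is not the paper's evaluation script.

```python
import math

def sh_coeff_count(max_degree):
    """Number of real spherical-harmonic basis functions up to max_degree."""
    return (max_degree + 1) ** 2

def psnr(reference, estimate, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences."""
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / mse)

# Degree three gives 16 coefficients per color channel, so 48 for RGB.
coeffs_per_channel = sh_coeff_count(3)
print(coeffs_per_channel, coeffs_per_channel * 3)

# The reported PSNR gap (34.16 vs. 31.12 dB) expressed as a ratio of mean squared errors.
mse_ratio = 10 ** ((34.16 - 31.12) / 10)
print(round(mse_ratio, 2))
```

Read this way, the roughly 3 dB gap over TRELLIS corresponds to about half the mean squared error on those views.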
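For readers unfamiliar with flow matching, the sampling side can be sketched generically: start from noise and integrate a velocity field from t = 0 to t = 1. The sketch below uses a toy closed-form velocity that points at a fixed target in place of a learned, image-conditioned network, so it illustrates the general technique under that assumption and not LiTo's actual model.

```python
import random

def toy_velocity(x, t, target):
    """Hypothetical stand-in for a learned velocity network.

    For a straight-line path that ends at `target`, the velocity at time t
    is (target - x) / (1 - t).
    """
    return [(g - xi) / (1.0 - t) for xi, g in zip(x, target)]

def euler_sample(x0, target, steps=10):
    """Integrate dx/dt = v(x, t) from t = 0 to t = 1 with fixed-step Euler."""
    x = list(x0)
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt  # t stays strictly below 1, so the division above is safe
        v = toy_velocity(x, t, target)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(3)]  # starting point drawn from a Gaussian
target = [1.0, 2.0, 3.0]                         # plays the role of a latent code
result = euler_sample(noise, target)
print(result)
```

In a real system the target is unknown at sampling time; the network predicts the velocity from the current state, the time step, and the conditioning image.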
LiTo advances 3D AI by unifying geometry and realistic appearance in a single latent space, paving the way for applications in AR, content creation, and beyond.

