For separating a single convex polytope with m faces, the paper establishes an upper bound for 2-layer networks (width m suffices) and a lower bound that any deeper network must satisfy (constraints on how the widths across layers combine); a constructive sketch of the width-m case is given below.

Inverse problem (trained network → extracting the polytope structure of the data): viewing trained ReLU networks as intrinsically encoding the polytope structure of the data, the paper presents an algorithm that finds minimal representations by 'compressing' polytopes through neuron removal and rescaling.

On toy geometries such as "two triangles" or "outer hexagon + inner pentagonal hole," training with BCE/MSE converges to the boundary hyperplane/polytope structures predicted by the theory, which the paper demonstrates visually.
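To make the width-m upper bound concrete, here is a minimal NumPy sketch of the kind of construction it relies on: one hidden ReLU per face of the polytope {x : Ax ≤ b}, followed by a linear readout of the face violations. The function name and the triangle example are illustrative only, not taken from the paper.

```python
import numpy as np

def polytope_separator(A, b):
    """2-layer ReLU network of width m for the polytope {x : A x <= b} (A is m x d).

    Hidden layer : one ReLU per face, measuring how much that face is violated.
    Output layer : negated sum of the violations, so the score is 0 on the
                   (closed) polytope and strictly negative outside it.
    """
    def score(x):
        violations = np.maximum(A @ x - b, 0.0)   # width-m hidden layer
        return -violations.sum()                  # linear readout
    return score

# Illustrative example (not from the paper): the triangle x >= 0, y >= 0, x + y <= 1.
A = np.array([[-1.0, 0.0], [0.0, -1.0], [1.0, 1.0]])
b = np.array([0.0, 0.0, 1.0])
f = polytope_separator(A, b)
print(f(np.array([0.2, 0.2])) >= 0)   # True  -> inside
print(f(np.array([1.0, 1.0])) >= 0)   # False -> outside
```

Classifying by the sign of the score (inside iff score ≥ 0) separates the polytope exactly, and this one-neuron-per-face structure is what the toy-geometry experiments recover after training.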
Polytope simplicity of real datasets: analyzing MNIST/Fashion-MNIST/CIFAR10 one-vs-all for each class, every class can be covered by at most 2 convex polytopes (face counts are generally limited to 30 or fewer, with a class split into 2 polytopes only when necessary), and the paper gives concrete widths for a small 3-layer architecture that achieves complete (constructive) separation on MNIST/Fashion-MNIST/CIFAR10. The resulting networks have very sparse connections, suggesting a possible connection to the Lottery Ticket Hypothesis (LTH); in particular, the paper notes that preserving weight signs is important for maintaining the convex polytope structure.
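As a rough illustration of how two convex polytopes per class yield a small 3-layer one-vs-all network, here is a sketch assuming the two polytopes are given as halfspace systems A1 x ≤ b1 and A2 x ≤ b2; the margin parameter and function name are hypothetical, and the concrete face counts and widths the paper reports for MNIST/Fashion-MNIST/CIFAR10 are not reproduced here.

```python
import numpy as np

def two_polytope_class_score(A1, b1, A2, b2, margin=1.0):
    """3-layer ReLU sketch for a one-vs-all class covered by two convex
    polytopes {x : A1 x <= b1} and {x : A2 x <= b2}.

    Layer 1: one ReLU per face of either polytope (width = total face count).
    Layer 2: one ReLU per polytope, positive only when all of its faces hold
             (up to the margin tolerance near the boundary).
    Output : sum of the layer-2 units, i.e. an OR over the two polytopes.
    """
    def score(x):
        v1 = np.maximum(A1 @ x - b1, 0.0)          # face violations, polytope 1
        v2 = np.maximum(A2 @ x - b2, 0.0)          # face violations, polytope 2
        s1 = np.maximum(margin - v1.sum(), 0.0)    # membership score, polytope 1
        s2 = np.maximum(margin - v2.sum(), 0.0)    # membership score, polytope 2
        return s1 + s2                             # > 0 iff x is (approximately) in polytope 1 or 2
    return score
```

In this sketch, every nonzero weight in the second and third layers has a fixed sign (negated violation sums, then a positive OR), which loosely illustrates why sign-preserving pruning would keep the polytope structure intact.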

Seonglae Cho