Greedy Independent Set Thresholding
GIST (Greedy Independent Set Thresholding) is an algorithm for selecting a small, high-quality, representative training subset from large-scale datasets by jointly maximizing diversity and utility. It sets a distance threshold so that no two selected data points are too close (too similar) to each other, then greedily selects the highest-utility data under that constraint. The underlying problem is NP-hard, so an exact solution is impractical to compute. GIST comes with a provable guarantee: its solution is worth at least 1/2 of the optimal one.
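The thresholded greedy step can be illustrated with a minimal Python sketch. This is not the reference implementation: the per-point `utility` scores, the Euclidean distance, the candidate `thresholds` list, the weight `lam`, and the objective "sum of utilities plus `lam` times the minimum pairwise distance" are all assumptions made for illustration, and the 1/2 guarantee is not claimed for this simplified version.

```python
import math


def greedy_independent_set(points, utility, k, threshold):
    """Pick up to k points in descending utility order, skipping any point
    closer than `threshold` to a point that is already selected."""
    order = sorted(range(len(points)), key=lambda i: utility[i], reverse=True)
    selected = []
    for i in order:
        if len(selected) == k:
            break
        if all(math.dist(points[i], points[j]) >= threshold for j in selected):
            selected.append(i)
    return selected


def min_pairwise_distance(points, selected):
    """Diversity term: smallest distance between any two selected points.
    Returning 0.0 for fewer than two points is an assumption of this sketch."""
    if len(selected) < 2:
        return 0.0
    return min(
        math.dist(points[a], points[b])
        for idx, a in enumerate(selected)
        for b in selected[idx + 1:]
    )


def gist_sketch(points, utility, k, thresholds, lam=1.0):
    """Run the thresholded greedy step once per candidate threshold and keep
    the subset with the best combined utility + diversity score."""
    best, best_score = [], -math.inf
    for d in thresholds:
        subset = greedy_independent_set(points, utility, k, d)
        score = sum(utility[i] for i in subset)
        score += lam * min_pairwise_distance(points, subset)
        if score > best_score:
            best, best_score = subset, score
    return best, best_score


# Toy data (hypothetical): two tight clusters plus one isolated point.
points = [(0.0, 0.0), (0.1, 0.0), (5.0, 0.0), (5.1, 0.0), (8.0, 0.0)]
utility = [1.0, 0.9, 0.8, 0.7, 0.6]
subset, score = gist_sketch(points, utility, k=3, thresholds=[0.0, 1.0, 6.0], lam=0.12)
print(subset, round(score, 3))
```

On this toy data the threshold of 1.0 wins: the sketch selects indices 0, 2 and 4, skipping the near-duplicates at indices 1 and 3 even though they have higher utility than point 4. A threshold of 0.0 would degenerate to plain top-k selection, while a very large threshold would leave too few points to reach k.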
research.google
https://research.google/blog/introducing-gist-the-next-stage-in-smart-sampling/

Seonglae Cho