Future prediction is integration, and rule extraction is differentiation. Differentiation obtains local rules (direction, velocity, slope), while integration advances the state forward according to those rules. Because a local rule is learnable from each specific data sample, training amounts to differentiation; inference then integrates those rules forward to approximate the uncertain future. In short, training is differentiation and inference is integration.
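The analogy above can be sketched concretely (a toy illustration, not from the original; all function names are hypothetical): "training" estimates a local rule, the slope, from data samples, and "inference" integrates that rule forward with Euler steps to predict a future state.

```python
import numpy as np

def fit_local_rule(xs, ys):
    # "Differentiation" / training: estimate the slope dy/dx
    # from observed samples via a least-squares line fit.
    slope, _intercept = np.polyfit(xs, ys, 1)
    return slope

def predict_forward(y0, slope, steps, dt):
    # "Integration" / inference: advance the state forward
    # according to the learned local rule (Euler's method).
    y = y0
    for _ in range(steps):
        y = y + slope * dt
    return y

xs = np.linspace(0.0, 1.0, 50)
ys = 2.0 * xs + 1.0              # data generated by the rule dy/dx = 2
slope = fit_local_rule(xs, ys)   # recovers slope ~= 2
pred = predict_forward(y0=3.0, slope=slope, steps=10, dt=0.1)
```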
In real-world applications, the test distribution usually does not match the training distribution.
- Iteration: one parameter-update step over a single batch; the number of iterations needed to complete one epoch is the dataset size divided by the batch size
- Batch size: the total number of training examples present in a single batch
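The relationship between the two terms above can be written as a one-liner (function name is illustrative):

```python
import math

def iterations_per_epoch(dataset_size: int, batch_size: int) -> int:
    # Number of batches (iterations) needed to see every example once;
    # the last, possibly partial, batch is counted via ceil.
    return math.ceil(dataset_size / batch_size)

# e.g. 10,000 examples with batch size 128 -> 79 iterations per epoch
n_iters = iterations_per_epoch(10_000, 128)
```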
Model Training Checklist
- Data preprocessing
- Model architecture
- Reasonable Loss Function
- Start with a small dataset and few iterations to confirm the model can overfit
- Find optimal Learning Rate and Model Regularization parameters
- If loss > 3 × initial loss, quit early (the learning rate is too high)
- Then increase the dataset size and the number of epochs
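A minimal sketch of the checklist above (all names and values are hypothetical, using plain linear regression as a stand-in model): overfit a tiny subset first, and quit early if the loss ever exceeds 3× the initial loss, a sign the learning rate is too high.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 4))             # small subset for the overfit check
true_w = np.array([1.0, -2.0, 0.5, 3.0])
y = X @ true_w

w = np.zeros(4)
lr = 0.05
initial_loss = np.mean((X @ w - y) ** 2)

for step in range(200):
    grad = 2 * X.T @ (X @ w - y) / len(X)   # gradient of mean squared error
    w -= lr * grad
    loss = np.mean((X @ w - y) ** 2)
    if loss > 3 * initial_loss:             # learning rate too high: stop early
        raise RuntimeError("loss diverged; lower the learning rate")

# On this tiny subset the model should (over)fit almost perfectly,
# confirming the pipeline works before scaling up data and epochs.
```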
Model Training Tips
Setting batch sizes to powers of 2 (e.g., 64, 128, 256, ...) was previously considered advantageous for performance, but according to recent research, strictly sticking to powers of 2 is not necessary. What matters more is finding the optimal value through actual experiments.
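One way to act on this advice is to sweep candidate batch sizes, including non-powers of 2, and keep the one with the best measured throughput. The sketch below is hypothetical: `measure_throughput` is a stand-in for real timing code (e.g. timing a few training steps per candidate).

```python
# Candidate batch sizes: not restricted to powers of 2.
candidates = [48, 64, 96, 128, 160, 192, 256]

def pick_best(candidates, measure_throughput):
    # Return the batch size with the highest measured throughput.
    return max(candidates, key=measure_throughput)

# Toy stand-in metric; a real run would time actual training steps.
# Here we pretend 160 happens to be the fastest on this hardware.
best = pick_best(candidates, lambda b: -abs(b - 160))
```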

Seonglae Cho
