Understanding Minimax Optimization in Modern Machine Learning

Zhang, Guojun

dc.contributor.author	Zhang, Guojun
dc.date.accessioned	2021-07-21 14:39:07 (GMT)
dc.date.available	2021-07-21 14:39:07 (GMT)
dc.date.issued	2021-07-21
dc.date.submitted	2021-07-20
dc.identifier.uri	http://hdl.handle.net/10012/17157
dc.description.abstract	Recent years have seen a surge of interest in building learning machines through adversarial training. One type of adversarial training uses a discriminator or an auxiliary classifier, as in Generative Adversarial Networks (GANs). In GANs, the discriminator aims to tell true data from fake data, while the generator aims to generate fake data that deceives the discriminator. Another type of adversarial training is with respect to the data: if the samples that we learn from are perturbed slightly, a learning machine should still be able to perform tasks such as classification relatively well, although for many state-of-the-art deep learning models this is not the case. People build robust learning machines in order to defend against attacks on the input data. In most cases, adversarial training is formulated as minimax optimization, or as smooth games in a broader sense. In minimax optimization, we have a bivariate objective function. The goal is to minimize the objective function with respect to one variable and to maximize it with respect to the other. Historically, such problems have been widely studied for convex-concave functions, where saddle points are a desirable solution concept. However, due to non-convexity, results for convex-concave functions often do not apply to adversarial training problems. It therefore becomes important to understand the theory of non-convex minimax optimization in these models. Recent minimax optimization research has two main focuses. One is on solution concepts: what is a desirable solution concept that is both meaningful in practice and easy to compute? Unfortunately, there is no definite answer, especially in GAN training. Besides, since non-convex minimax optimization includes non-convex minimization as a special case, there is no known efficient algorithm that can find global solutions. 
Therefore, local solution concepts, as surrogates, are necessary. Usually, people use local search methods such as gradient algorithms to find a good solution, so any such solution must be at least a stationary (critical) point. Based on the notion of stationarity, a solution concept called local minimax points was recently proposed. Local minimax points include local saddle points, and they are also stationary points. Moreover, they correspond to the well-known Gradient Descent Ascent (GDA) algorithm to some extent. I provide a comprehensive analysis of local minimax points, including their relation to global solutions and other local solution concepts, their optimality conditions, and the stability of gradient algorithms at local minimax points. My results show that although local minimax points are good surrogates of global solutions in, e.g., quadratic functions, we may have to go beyond this minimax formulation, since gradient algorithms may not be stable near local minimax points. Another focus of recent research in minimax optimization is on algorithms. Many old and new algorithms, including GDA, have been proposed or analyzed for non-convex minimax optimization. Convergence rates and lower bounds of gradient algorithms have been given, improved and compared. Compared to these notable contributions, my work focuses more on the stability of these algorithms, as it is widely known that gradient algorithms often exhibit cyclic behaviour around a desirable solution, e.g., in GAN training. I use the simplest bilinear case as an illustrative model for understanding stability. I show that for a wide array of gradient algorithms, updating the two variables one by one is often more stable than updating them simultaneously. My stability analysis for bilinear functions can also be extended to general non-linear smooth functions, which allows us to identify hyper-parameter choices for more stable algorithms. 
Finally, I propose new algorithms for minimax optimization. Most algorithms use gradient information for local search, with a few exceptions that also use Hessian information to improve stability. I give a synthetic view of the convergence rates of current algorithms that use second-order information, and propose Newton-type methods for minimax optimization. My methods alleviate the problem of ill-conditioning in a local neighborhood, which is inevitable for gradient algorithms. This claim is proved by my theory and verified in my experiments.	en
dc.language.iso	en	en
dc.publisher	University of Waterloo	en
dc.subject	minimax optimization	en
dc.subject	gradient algorithms	en
dc.subject	Newton-type methods	en
dc.subject	bilinear games	en
dc.subject	local minimax points	en
dc.subject	nonconvex	en
dc.title	Understanding Minimax Optimization in Modern Machine Learning	en
dc.type	Doctoral Thesis	en
dc.pending	false
uws-etd.degree.department	David R. Cheriton School of Computer Science	en
uws-etd.degree.discipline	Computer Science	en
uws-etd.degree.grantor	University of Waterloo	en
uws-etd.degree	Doctor of Philosophy	en
uws-etd.embargo.terms	0	en
uws.contributor.advisor	Poupart, Pascal
uws.contributor.advisor	Yu, Yaoliang
uws.contributor.affiliation1	Faculty of Mathematics	en
uws.published.city	Waterloo	en
uws.published.country	Canada	en
uws.published.province	Ontario	en
uws.typeOfResource	Text	en
uws.peerReviewStatus	Unreviewed	en
uws.scholarLevel	Graduate	en
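
The abstract's statement that updating the two variables one by one is often more stable than updating them simultaneously can be illustrated on a toy bilinear problem. The sketch below is not taken from the thesis: the objective f(x, y) = x * y, the step size 0.1, the starting point (1, 1), the step count, and all function names are assumptions chosen only for illustration. It runs plain gradient descent ascent in both update orders and reports the distance to the solution (0, 0).

```python
import math


def simultaneous_gda(x, y, lr, steps):
    """Both players update from the same old iterate (x_t, y_t)."""
    for _ in range(steps):
        gx, gy = y, x  # df/dx = y and df/dy = x for f(x, y) = x * y
        x, y = x - lr * gx, y + lr * gy
    return x, y


def alternating_gda(x, y, lr, steps):
    """The max player reacts to the min player's fresh update x_{t+1}."""
    for _ in range(steps):
        x = x - lr * y  # descent step on x (df/dx = y)
        y = y + lr * x  # ascent step on y, using the new x (df/dy = x)
    return x, y


def distance_to_solution(point):
    """Euclidean distance to the unique stationary point (0, 0)."""
    return math.hypot(*point)


# Illustrative settings (assumed, not from the thesis).
sim = simultaneous_gda(1.0, 1.0, lr=0.1, steps=1000)
alt = alternating_gda(1.0, 1.0, lr=0.1, steps=1000)

print("simultaneous:", distance_to_solution(sim))
print("alternating: ", distance_to_solution(alt))
```

Under these assumed settings the simultaneous iterates spiral away from (0, 0), because each step scales the squared distance by 1 + lr^2, while the alternating iterates stay on a bounded orbit near their starting distance. Neither variant converges here, which matches the cyclic behaviour the abstract mentions.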

Files in this item

Name: Zhang_Guojun.pdf
Size: 6.895Mb
Format: PDF
