Skip to Main content Skip to Navigation

Apprentissage à partir de données extrêmes multivariées : application au traitement du langage naturel

Abstract : Extremes surround us and appear in a large variety of data. Natural data likethe ones related to environmental sciences contain extreme measurements; inhydrology, for instance, extremes may correspond to floods and heavy rainfalls or on the contrary droughts. Data related to human activity can also lead to extreme situations; in the case of bank transactions, the money allocated to a sale may be considerable and exceed common transactions. The analysis of this phenomenon is one of the basis of fraud detection. Another example related to humans is the frequency of encountered words. Some words are ubiquitous while others are rare. No matter the context, extremes which are rare by definition, correspond to uncanny data. These events are of particular concern because of the disastrous impact they may have. Extreme data, however, are less considered in modern statistics and applied machine learning, mainly because they are substantially scarce: these events are out numbered –in an era of so-called ”big data”– by the large amount of classical and non-extreme data that corresponds to the bulk of a distribution. Thus, the wide majority of machine learning tools and literature may not be well-suited or even performant on the distributional tails where extreme observations occur. Through this dissertation, the particular challenges of working with extremes are detailed and methods dedicated to them are proposed. The first part of the thesisis devoted to statistical learning in extreme regions. In Chapter 4, non-asymptotic bounds for the empirical angular measure are studied. Here, a pre-established anomaly detection scheme via minimum volume set on the sphere, is further im-proved. Chapter 5 addresses empirical risk minimization for binary classification of extreme samples. The resulting non-parametric analysis and guarantees are detailed. The approach is particularly well suited to treat new samples falling out of the convex envelop of encountered data. This extrapolation property is key to designing new embeddings achieving label preserving data augmentation. Chapter 6 focuses on the challenge of learning the latter heavy-tailed (and to be precise regularly varying) representation from a given input distribution. Empirical results show that the designed representation allows better classification performanceon extremes and leads to the generation of coherent sentences. Lastly, Chapter7 analyses the dependence structure of multivariate extremes. By noticing that extremes tend to concentrate on particular clusters where features tend to be recurrently large simulatenously, we define an optimization problem that identifies the aformentioned subgroups through weighted means of features.
Document type :
Complete list of metadata
Contributor : Abes Star :  Contact
Submitted on : Monday, July 19, 2021 - 5:05:09 PM
Last modification on : Wednesday, November 3, 2021 - 6:20:35 AM
Long-term archiving on: : Wednesday, October 20, 2021 - 7:21:55 PM


Version validated by the jury (STAR)


  • HAL Id : tel-03291376, version 1



Hamid Jalalzai. Apprentissage à partir de données extrêmes multivariées : application au traitement du langage naturel. Machine Learning [stat.ML]. Institut Polytechnique de Paris, 2020. English. ⟨NNT : 2020IPPAT043⟩. ⟨tel-03291376⟩



Les métriques sont temporairement indisponibles