Welcome to the SIMERG(2)E webpage

Statistical Inference for the Management of Extreme Risks, Genetics and Global Epidemiology

SIMERGE (Statistical Inference for the Management of Extreme Risks and Global Epidemiology)

SIMERGE is a LIRIMA project-team started in January 2015. It includes researchers from Mistis/Statify (Inria Grenoble - Rhône-Alpes, France), LERSTAD (Laboratoire d'Etudes et de Recherches en Statistiques et Développement, Université Gaston Berger, Sénégal), IRD (Institut de Recherche pour le Développement, Unité de Recherche sur les Maladies Infectieuses et Tropicales Emergentes, Dakar, Sénégal) and LEM lab (Lille Economie et Management, Université Lille 1, 2, 3, Modal, Inria Lille Nord-Europe, France).

SIMERG2E (Statistical Inference for the Management of Extreme Risks, Genetics and Global Epidemiology)

In January 2018, SIMERGE was extended to SIMERG2E, and the Institut Pasteur de Dakar joined the team. This Associate team keeps the two research themes of SIMERGE, with some adaptations to new applications:

Axis 1. Spatial extremes, application to management of extreme risks

Weather variability, both in space and in time, is of prime importance in many hydrological, agricultural and energy contexts. Spatio-temporal modelling of environmental data is therefore well studied in the literature. The basic objectives are: (i) to infer the nature of the spatial variation of extreme precipitation and temperature based on meteorological observations and (ii) to model the pattern of variability of these data components. Different characterizations of multivariate extreme dependence structures have been proposed in the literature (see, for instance, Coles et al. (2000), Ledford and Tawn (1996)). These works were the basis of recent studies characterizing the dependence between extremes of a spatial process, see for instance Huser et al. (2017) or Wadsworth et al. (2017).

Once the modelling step is achieved, the inference of the associated risk can be tackled. One of the most popular risk measures is the Value-at-Risk (VaR), introduced in the 1990s. In statistical terms, the VaR at level alpha in (0, 1) corresponds to the upper alpha-quantile of the loss distribution. Even though the VaR was introduced to deal with financial risks, it is also of interest in meteorological applications, where it is interpreted as a return level. The Value-at-Risk however suffers from two main weaknesses. First, it provides only pointwise information: VaR(alpha) does not take into consideration what the loss will be beyond this quantile. Second, random loss variables with light-tailed distributions or heavy-tailed distributions may have the same Value-at-Risk (Embrechts et al., 1999). Consequently, the definition of new risk measures, the study of their properties in the case of extreme events (i.e. when alpha tends to zero), and their estimation from data are three major statistical challenges (Bellini and Di Bernardino, 2017).

Three tasks have been identified to conclude and extend our previous works.

1.1. Investigate the estimation of general risk measures in the case of extreme losses, making heavy use of extreme-value theory. We shall investigate both spectral risk measures and distortion risk measures. This work was initiated during SIMERGE in the framework of heavy-tailed distributions. It should be concluded and extended to light-tailed distributions.

1.2. We also aim to propose new estimators of such extreme risk measures that can deal with real-valued or functional covariates. We shall investigate the use of new semiparametric models. Such models should lead to more efficient estimators of extreme risks than the purely nonparametric ones introduced in SIMERGE.

1.3. We shall develop new models which take into account both the spatial and temporal nature of the data as well as the fact that the observations are extremes. For instance, we shall extend the linear regression model and spatio-temporal autoregressive moving average process to this context. We also aim at extending the extremality measures for independent functional data to determine extreme spatial observations of functional nature (for instance a rain flow curve at some station during a certain period of time).
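As a rough illustration of the Value-at-Risk discussion above and of the heavy-tailed setting of task 1.1, the following Python sketch computes an empirical VaR (upper alpha-quantile), the mean loss beyond it, and an extrapolated extreme quantile. The tail mean, the Hill estimator and the Weissman extrapolation are standard textbook tools chosen here for illustration only; they are not claimed to be the estimators developed in SIMERGE or SIMERG2E. All function names, including the helper quantile_grid, are hypothetical, and the data are deterministic pseudo-samples built from quantile functions rather than real observations.

```python
import math


def empirical_var(losses, alpha):
    """Empirical VaR: upper alpha-quantile, exceeded by at most a fraction alpha of losses."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie in (0, 1)")
    ordered = sorted(losses)
    n = len(ordered)
    # Rank of the order statistic at level 1 - alpha; the small tolerance
    # guards against floating-point noise in n * (1 - alpha).
    rank = math.ceil(n * (1.0 - alpha) - 1e-9)
    return ordered[min(n - 1, max(0, rank - 1))]


def tail_mean(losses, alpha):
    """Mean of the losses strictly beyond the empirical VaR (illustrative summary)."""
    var = empirical_var(losses, alpha)
    beyond = [x for x in losses if x > var]
    return sum(beyond) / len(beyond) if beyond else var


def hill_estimator(sample, k):
    """Hill estimator of the tail index gamma, using the k largest observations."""
    ordered = sorted(sample)
    n = len(ordered)
    if not 1 <= k < n:
        raise ValueError("k must satisfy 1 <= k < n")
    threshold = ordered[n - k - 1]
    if threshold <= 0.0:
        raise ValueError("the threshold order statistic must be positive")
    return sum(math.log(x / threshold) for x in ordered[n - k:]) / k


def weissman_quantile(sample, k, alpha):
    """Weissman-type extrapolation: threshold * (k / (n * alpha)) ** gamma_hat."""
    ordered = sorted(sample)
    n = len(ordered)
    gamma_hat = hill_estimator(ordered, k)
    return ordered[n - k - 1] * (k / (n * alpha)) ** gamma_hat


def quantile_grid(quantile_fn, n):
    """Hypothetical helper: deterministic pseudo-sample from a quantile function."""
    return [quantile_fn((i + 0.5) / n) for i in range(n)]


if __name__ == "__main__":
    n, alpha = 10_000, 0.05
    light = quantile_grid(lambda u: -math.log(1.0 - u), n)  # exponential tail
    heavy = quantile_grid(lambda u: (1.0 - u) ** -0.5, n)   # Pareto tail, gamma = 0.5
    # Rescale the heavy-tailed sample so that both share the same empirical VaR.
    scale = empirical_var(light, alpha) / empirical_var(heavy, alpha)
    heavy = [scale * x for x in heavy]
    print("VaR (light, heavy):", empirical_var(light, alpha), empirical_var(heavy, alpha))
    print("tail mean (light, heavy):", tail_mean(light, alpha), tail_mean(heavy, alpha))

    pareto = quantile_grid(lambda u: (1.0 - u) ** -0.5, 2_000)
    print("Hill estimate of gamma:", hill_estimator(pareto, 200))
    # alpha = 1e-4 is below 1/n, so this quantile lies beyond the sample range.
    print("extrapolated quantile:", weissman_quantile(pareto, 200, 1e-4))
```

In this sketch the two samples are rescaled to share the same VaR at level 0.05, yet their mean losses beyond that level differ, which is the second weakness noted above. The last lines show why extreme-value theory is needed when alpha is smaller than 1/n: the empirical quantile cannot go beyond the sample maximum, whereas the extrapolated one can.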

Axis 2. Classification, application to genetics and global epidemiology

We address the challenge of building statistical models to test the association between diseases and human host genetics in a context of genome-wide screening. Adequate models should make it possible to handle the complexity of genomic data (e.g. linkage disequilibrium or correlation between genetic markers, high dimensionality) and additional statistical issues present in data collected from a family-based longitudinal survey (e.g. non-independence between individuals due to familial relationship (kinship) and non-independence within individuals due to repeated measurements on the same person over time). Our genomic data consist of genotypes on 719,656 SNPs (Single Nucleotide Polymorphisms) typed on 481 individuals in Senegal, in a rural area where malaria and arboviral diseases are endemic. These SNP data can be considered as categorical variables in high dimension (p = 719,656 and n = 481). New unsupervised classification methods and co-clustering approaches will be proposed to classify individuals according to their disease status. Indeed, the situation p >> n is an obstacle to most statistical methods and, moreover, individuals may not be independent due to their parental links. This phenomenon further reduces the number of independent observations. Compared with SIMERGE, we would like to consider more general classification problems (task 2.1) and adapt our methods to genetic data (task 2.2).

2.1. We shall propose classification methods that are effective on non-standard data (e.g. categorical data) in high dimension and that can handle dependencies. In SIMERGE, new tools were proposed to take into account the curse-of-dimensionality issue in the context of verbal autopsy data. Here, we would like first to adapt and/or improve similar approaches in the context of SNP data. Second, classification methods based on Functional Data Analysis (FDA) will be used by transforming very high-dimensional data into functional data. Rather than using sparsity dependency patterns, SNP data can be analyzed by FDA methods that take advantage of the high dimensionality.

2.2. In the context of genetic data, the number of explanatory variables can be very large (from several hundred thousand to several million). Thus, selecting the relevant variables impacting the outcome is not a simple challenge. The role of variable selection is to find an optimal subset of variables that could explain the phenotype. In the framework of the SIMERG2E project, we aim at developing a variable selection method to pinpoint significant genes implicated in the occurrence of malaria or arboviral diseases in a Senegalese population. To this end, we intend to implement a statistic that takes into account intra- and inter-individual dependencies to measure the influence of a subset of variables on the disease phenotype. This statistic will enable us to develop an algorithm that searches over the subsets of variables for optimal solutions.

Links

- LIRIMA

- Statify, Inria Grenoble Rhône-Alpes

- Modal, Inria Lille Europe

Contact the members

- Head: Stéphane Girard

- Co-head: Abdou Kâ Diongue