Welcome to the SIMERG(2)E webpage ...
Statistical Inference for the Management of Extreme Risks, Genetics and Global Epidemiology
SIMERGE (Statistical Inference for the Management of Extreme Risks and Global Epidemiology)
SIMERGE is a LIRIMA project-team started in January 2015.
It includes researchers from
Mistis/Statify (Inria Grenoble - Rhône-Alpes, France), LERSTAD (Laboratoire d'Etudes et de Recherches en Statistiques et Développement,
Université Gaston Berger, Sénégal),
IRD (Institut de Recherche pour le Développement, Unité de Recherche sur les Maladies Infectieuses et Tropicales Emergentes, Dakar, Sénégal)
and LEM lab (Lille Economie et Management, Université Lille 1, 2, 3, Modal, Inria Lille Nord-Europe,
France).
SIMERG2E (Statistical Inference for the Management of Extreme Risks, Genetics and Global Epidemiology)
In January 2018, SIMERGE was extended to SIMERG2E. This Associate team
is built on the same two research themes as SIMERGE, with some adaptations to
new applications. The Institut Pasteur de Dakar joined the team.
The Associate team is built on two research themes:
Axis 1. Spatial extremes, application to management of extreme risks
Weather variability, both in space and in time, is of prime importance in many hydrological, agricultural and energy contexts. Therefore, spatio-temporal modelling of environmental data is well studied in
the literature. The basic objectives are: (i) to infer the nature of the spatial variation of extreme precipitation and temperature based on meteorological observations and (ii) to model the pattern of variability of these data components. Different characterizations of multivariate extreme dependence structures have been proposed in the literature (see, for instance, Coles et
al. (2000), Ledford and Tawn (1996)). These works formed the basis of recent studies characterizing the dependence between extremes of a spatial process; see, for instance, Huser et al. (2017) or Wadsworth et al. (2017). Once the modeling step is achieved, the inference of the associated risk
can be tackled. One of the most popular risk measures is the Value-at-Risk (VaR), introduced in
the 1990s. In statistical terms, the VaR at level alpha in (0, 1) is the upper alpha-quantile
of the loss distribution. Although the VaR was introduced to deal with financial
risks, it is also of interest in meteorological applications, where it is interpreted as a return
level. The Value-at-Risk, however, suffers from two main weaknesses. First, it provides only
pointwise information: VaR(alpha) does not take into account what the loss will be
beyond this quantile. Second, random loss variables with light-tailed or heavy-tailed
distributions may have the same Value-at-Risk (Embrechts et al., 1999). Consequently,
the definition of new risk measures, the study of their properties in the case of extreme events (i.e.
when alpha tends to zero), and their estimation from data are three major statistical challenges (Bellini and
Di Bernardino (2017)).
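The two weaknesses of the Value-at-Risk can be illustrated numerically. The following is a minimal sketch, not the team's code. The function names empirical_var and empirical_cte are hypothetical, the VaR is estimated by a plain order statistic, and the conditional tail expectation (the mean loss beyond the VaR) is used only as one example of a risk measure that looks past the quantile.

```python
import math


def empirical_var(losses, alpha):
    """Empirical VaR at level alpha: the order statistic that is exceeded
    by k = floor(n * alpha) observations (upper alpha-quantile)."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie in (0, 1)")
    xs = sorted(losses)
    n = len(xs)
    k = math.floor(n * alpha)
    if k < 1:
        raise ValueError("sample too small for this alpha")
    return xs[n - k - 1]


def empirical_cte(losses, alpha):
    """Mean of the losses strictly above the empirical VaR."""
    var = empirical_var(losses, alpha)
    tail = [x for x in losses if x > var]
    # With ties at the VaR the tail can be empty; fall back to the VaR itself.
    return sum(tail) / len(tail) if tail else var


# Toy data (illustrative only): two loss samples that agree up to the VaR
# and differ only beyond it.
light = [float(i) for i in range(1, 101)]
threshold = empirical_var(light, 0.1)
heavy = [x * 10 if x > threshold else x for x in light]

print(empirical_var(light, 0.1), empirical_var(heavy, 0.1))
print(empirical_cte(light, 0.1), empirical_cte(heavy, 0.1))
```

On this toy example both samples have the same VaR at level 0.1 (90.0), while the mean loss beyond it differs by a factor of ten (95.5 versus 955.0), which is exactly the information the VaR alone does not capture.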
Three tasks have been identified to
conclude and extend our previous work.
1.1. Investigate the estimation of general risk measures in the case of extreme losses, making heavy
use of extreme-value theory. We shall investigate both spectral risk measures
and distortion risk measures. This work was initiated during SIMERGE in the framework of
heavy-tailed distributions. It should be concluded and extended to light-tailed distributions.
1.2. We also aim at proposing new estimators of such extreme risk measures that can deal
with real-valued or functional covariates. We shall investigate the use of new semiparametric
models. Such models should lead to more efficient estimators of extreme risks than the purely
nonparametric ones introduced in SIMERGE.
1.3. We shall develop new models which take into account both the spatial and temporal nature
of the data as well as the fact that the observations are extremes. For instance, we shall extend
the linear regression model and spatio-temporal autoregressive moving average process to this
context. We also aim at extending the extremality measures for independent functional data to
determine extreme spatial observations of functional nature (for instance a rain flow curve at
some station during a certain period of time).
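As an illustration of task 1.1 in the heavy-tailed case, the sketch below estimates a quantile beyond the range of the data. It relies on the classical Hill estimator of the tail index and on the Weissman extrapolation formula. Both are standard tools of extreme-value theory, but their use here is an assumption: the text does not say which estimators the team works with. The function names, the simulated Pareto sample and the choice of k (number of upper order statistics) are illustrative only.

```python
import math
import random


def hill_estimator(sample, k):
    """Hill estimator of the tail index, computed from the k largest
    observations relative to the (k+1)-th largest one."""
    xs = sorted(sample)
    n = len(xs)
    if not 1 <= k < n:
        raise ValueError("k must satisfy 1 <= k < n")
    threshold = xs[n - k - 1]
    if threshold <= 0:
        raise ValueError("upper order statistics must be positive")
    return sum(math.log(x / threshold) for x in xs[n - k:]) / k


def weissman_quantile(sample, k, alpha):
    """Weissman extrapolation of the upper alpha-quantile, valid for
    heavy-tailed data and small alpha (possibly below 1/n)."""
    xs = sorted(sample)
    n = len(xs)
    gamma_hat = hill_estimator(xs, k)
    return xs[n - k - 1] * (k / (n * alpha)) ** gamma_hat


# Simulated Pareto losses with survival function x**(-2), i.e. tail index 0.5.
rng = random.Random(0)
losses = [rng.paretovariate(2.0) for _ in range(5000)]

gamma_hat = hill_estimator(losses, k=200)
q_hat = weissman_quantile(losses, k=200, alpha=1e-5)
print("estimated tail index:", gamma_hat)        # true value is 0.5
print("estimated extreme quantile:", q_hat)      # true value is 1e-5 ** -0.5
```

The level alpha = 1e-5 is smaller than 1/n, so the target quantile lies beyond the largest observation. The extrapolation rests entirely on the heavy-tail assumption, which is why extending such estimators to light-tailed distributions requires different tools.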
Axis 2. Classification, application to genetics and global epidemiology
We address the challenge of building statistical models to test for association between diseases and human host
genetics in the context of genome-wide screening. Adequate models should handle the complexity
of genomic data (e.g. linkage disequilibrium or correlation between genetic markers, high
dimensionality) and additional statistical issues present in data collected from a family-based
longitudinal survey (e.g. non-independence between individuals due to familial relationships
(kinship) and non-independence within individuals due to repeated measurements on the same
person over time). Our genomic data consist of genotypes at 719,656 SNPs (Single Nucleotide
Polymorphisms) typed on 481 individuals in Senegal, in a rural area where malaria and arboviral
diseases are endemic. These SNP data can be considered as categorical variables in high dimension (p = 719,656 and n = 481). New unsupervised classification methods and co-clustering
approaches will be proposed to classify individuals according to their disease status. Indeed, the situation p >> n is an obstacle to most statistical methods and, moreover, individuals
may not be independent due to their parental links. This phenomenon further reduces the
number of independent observations.
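To make the notion of correlation between genetic markers concrete, here is a minimal sketch under a common coding assumption that the text does not state: each SNP genotype is coded as a minor-allele count (0, 1 or 2), and the squared Pearson correlation between two coded markers serves as a simple proxy for linkage disequilibrium. The function name genotype_r2 and the toy genotypes are hypothetical.

```python
def genotype_r2(snp_a, snp_b):
    """Squared Pearson correlation between two SNPs coded as 0/1/2
    minor-allele counts, measured over the same individuals."""
    if len(snp_a) != len(snp_b) or len(snp_a) < 2:
        raise ValueError("need two genotype vectors of equal length >= 2")
    n = len(snp_a)
    mean_a = sum(snp_a) / n
    mean_b = sum(snp_b) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(snp_a, snp_b))
    var_a = sum((a - mean_a) ** 2 for a in snp_a)
    var_b = sum((b - mean_b) ** 2 for b in snp_b)
    if var_a == 0 or var_b == 0:
        raise ValueError("a monomorphic marker has no defined correlation")
    return cov * cov / (var_a * var_b)


# Toy genotypes for six individuals (illustrative values, not real data).
snp_1 = [0, 1, 2, 0, 1, 2]
snp_2 = [2, 1, 0, 2, 1, 0]   # mirror image of snp_1: fully redundant marker
snp_3 = [0, 2, 2, 0, 0, 1]

print(genotype_r2(snp_1, snp_2))
print(genotype_r2(snp_1, snp_3))
```

Markers with a squared correlation close to 1 carry nearly the same information, so the effective dimension of the data is smaller than p. This sketch treats individuals as independent and therefore ignores the kinship structure discussed above.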
Compared to SIMERGE, we would like to consider more general classification problems (task
2.1) and adapt our methods to genetic data (task 2.2).
2.1. We shall propose classification methods that are effective on non-standard data (e.g. categorical data)
in high dimension and able to handle dependencies. In SIMERGE, new tools were proposed
to address the curse of dimensionality in the context of verbal autopsy data. Here,
we would first like to adapt and/or improve similar approaches in the context of SNP data.
Second, classification methods based on Functional Data Analysis (FDA) will be used by
transforming very high dimensional data into functional data. Rather than using sparsity
dependency patterns, SNP data can be analyzed by FDA methods taking advantage of the high
dimensionality.
2.2. In the context of genetic data, the number of explanatory variables can be very large (from
several hundred thousand to several million). Selecting the relevant variables that affect
the outcome is therefore challenging. The role of variable selection is to find an optimal subset
of variables that could explain the phenotype. In the framework of the SIMERG2E project, we
aim at developing a variable selection method to pinpoint significant genes implicated in the
occurrence of malaria or arboviral diseases in a Senegalese population. To this end, we intend to
implement a statistic that takes into account intra- and inter-individual dependencies to measure
the influence of a subset of variables on the disease phenotype. This statistic will enable us to
develop an algorithm for browsing all variable subsets in search of optimal
solutions.
Links
- LIRIMA
- Statify, Inria Grenoble Rhône-Alpes
- Modal, Inria Lille Europe
Contact the members
- Head: Stéphane Girard
- Co-head: Abdou Kâ Diongue