Overview and Initial Setup

The aim of this document is to provide an expanded version of the data analyses presented in the paper Modeling Probability Density Functions as Data Objects (hereinafter referred to as the “main paper”). The analyses in the main paper and this document also utilize the following source code and packages:

  1. Cazelles et al. (2018) source code.
  2. Petersen, Liu, and Divani (2020) source code (please contact the authors for the source code, as this repository is private as of this writing).
  3. Machalova, Hron, and Monti (2016) source code.
  4. The HADES package source code.

IMPORTANT: All the GitHub repository files should be cloned under your R working directory with the file structure unchanged. Users should also follow the instructions in Setup.R and source this script to include all the libraries and utility functions needed. Since we will run MATLAB from R, please make sure that your MATLAB can be launched from the command line by the command matlab. If not, please add the path to your MATLAB binary to the PATH variable.
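
For example, the PATH can be adjusted for the current R session before launching MATLAB (the installation path below is illustrative; substitute your own):

Sys.setenv(PATH = paste(Sys.getenv("PATH"), "/usr/local/MATLAB/R2021a/bin", sep = ":"))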

1. Representative Mean Density

We start with a simple illustration of mean density evaluation using the approaches detailed in Sections 2 and 3 of the main paper. The analysis covered in this section can be reproduced by running the script MouseMean.R. Below we briefly go over the content of this script and highlight some of the functions and steps therein.

1.1. Ts1Cje/2N Mouse Data Set

We consider a data example from Amano et al. (2004) (also analyzed in Zhang and Müller (2011)) in which global gene expression profiles were obtained from the brains of six Ts1Cje mice (used to model Down syndrome) and six normal (2N) mice using DNA microarrays. Kernel density estimates are obtained by applying the density_estimation() function. E.g., to obtain the density functions for the six Ts1Cje mice, simply import the data set as mouse2_list and run

mouse2_dens = density_estimation(mouse2_list, band_choice = "Silverman", kernel = "gaussian")

The resulting density estimates are shown in Figure 1.

Figure 1: Density estimation: The mouse data.

One can observe that there are prominent location shifts in density modes in both groups of mice. The main paper also identified the height of the mode as a secondary source of variability.

1.2. Bayes Space

We start with a brief discussion of notation. Denote by \(\mathcal{D}\) a subset of probability density functions on the real line \(\mathbb{R},\) i.e., \[\begin{equation} \label{e:Ddef} \mathcal{D}\subset \{f:\, f \geq 0 \textrm{ and } \int_\mathbb{R} f(x) \mathrm{d}x = 1\}, \end{equation}\] and let \(d\) be a generic metric on \(\mathcal{D}.\)

A random density is a measurable map \(\mathfrak{f}: (\Omega, \mathcal{S}, P) \rightarrow (\mathcal{D}, \mathcal{B}_{\mathcal{D}}),\) where \((\Omega, \mathcal{S}, P)\) is a probability space and \(\mathcal{B}_{\mathcal{D}}\) contains the Borel sets on \((\mathcal{D},d).\) We assume that a sample of i.i.d. random densities \(\mathfrak{f}_i\), \(i = 1,\dots, n,\) is available for analysis. In practice, it is often the case that each density must first be estimated from a data vector generated by the random mechanism that the density characterizes.

A Bayes space \(B\) is given by \[\begin{equation} \label{e:Bdef} B = \{f: \, f > 0, \ \int_{I} \{ \log f(x) \}^2 \mathrm{d} x < \infty\}, \end{equation}\] where \(I\) is a closed interval. That is, \(B\) contains densities (and other positive functions) with square-integrable logarithms. Based on a Hilbert space structure on \(B\) (proposed by Egozcue, Dı́az–Barrero, and Pawlowsky–Glahn (2006) and Van den Boogaart, Egozcue, and Pawlowsky-Glahn (2014)), the centered log ratio (clr) transformation defines an isometric isomorphism between \(B\) and \(L_0^2(I) := \{h \in L^2(I): \int_I h(x)\mathrm{d}x = 0\}\), which is given by
\[\begin{equation} \label{e:clr} \psi_{\mathrm{clr}}(f) := \log(f) - \frac{1}{|I|} \int_I \log\{f(x)\} \mathrm{d} x. \end{equation}\]
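
As a minimal numerical illustration (not the package implementation), the clr transformation of a density vector f evaluated on a grid dSup can be sketched as follows, assuming the trapezoidal integrator trapzRcpp() (e.g., from the fdapace package, loaded via Setup.R) is available:

# psi_clr(f) = log(f) - (1/|I|) * integral of log(f) over I
clr_transform = function(f, dSup) {
  log_f = log(f)
  log_f - trapzRcpp(dSup, log_f) / diff(range(dSup))
}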

Thus the sample Fréchet mean under \(d_B\), which is a metric on \(B\) induced by its Hilbert space norm, is given by \[\begin{equation} \label{e:sBmean} \hat{f}_{\oplus}^{d_B}(\cdot) = \frac{\exp\{\hat{h}_\oplus(\cdot)\}}{\int_I \exp\{\hat{h}_\oplus(x)\} \mathrm{d}x},\quad \hat{h}_\oplus = \frac{1}{n}\sum_{i = 1}^n \psi_{\mathrm{clr}}(\mathfrak{f}_i). \end{equation}\]

Following the steps detailed in Talská et al. (2018), we adopt the B-spline representation for density functions in Bayes spaces, which utilizes the B-spline smoothing procedure proposed by Machalova, Hron, and Monti (2016) so that the constraint \(\int_I \psi_{clr}(\mathfrak{f}_i)(x) \mathrm{d}x = 0,\) \(\mathfrak{f}_i \in B,\) \(i = 1,\dots, n,\) is satisfied almost surely. The following lines of code, which are used to obtain the B-spline representation of the clr transformed densities for the six Ts1Cje mice, are presented as an illustration of this smoothing procedure.

# k, l: B-spline degree and order of derivative for smoothness penalty
# alpha: weight of smoothness penalty
# xcp, knots: control points and the number of B-spline knots
# cores: number of cores used for parallelization

# choosing smoothing penalty weight alpha
CV2 = smoothSplinesVal(k=3, l=2, alpha=10^seq(-3,3,by=1), data=mouse2_dens, xcp=mouse2_dSup, knots=100, cores=2)

# clr transformation
dens_clr_coef2 = smoothSplines(k=3, l=2, alpha=10^seq(-3,3,by=1)[which(CV2$CVerror == min(CV2$CVerror))], data=mouse2_dens, xcp=mouse2_dSup, knots=100)

# evaluate the clr-transformed data
clr_basis2 = create.bspline.basis(range(mouse2_dSup), nbasis = 100 - 2 + 4, norder = 4)

clr2 = t(eval.fd(mouse2_dSup, fd(t(dens_clr_coef2$bspline), clr_basis2)))

The mean density is then obtained by taking the pointwise average in the clr-transformed space, followed by a transformation back to the density space, as sketched below.
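
A minimal sketch of this back-transformation, assuming clr2 and mouse2_dSup from the code above and the trapezoidal integrator trapzRcpp(), is

# pointwise average in the clr space
clr_mean = colMeans(clr2)
# map back to the density scale and renormalize to integrate to one
bayes_mean = exp(clr_mean) / trapzRcpp(mouse2_dSup, exp(clr_mean))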

1.3. Wasserstein Metric

The Wasserstein metric, denoted by \(d_W\), is an optimal transport distance. The Wasserstein distance for elements in \(\mathcal{W}_2 = \{\mu : \mu \mathrm{\text{ is a probability measure on }} \mathbb{R} \mathrm{\text{ and }}\int_\mathbb{R} x^2 \mathrm{d}\mu(x) < \infty\}\) admits a closed form expression in terms of quantile functions, which gives rise to the sample Wasserstein mean

\[\begin{equation} \label{e:sWmean} \hat{f}_{\oplus}^{d_W} = \left[\hat{F}_\oplus^{d_W}\right]', \quad \left(\hat{F}_\oplus^{d_W}\right)^{-1}(t) = \frac{1}{n}\sum_{i = 1}^n \mathfrak{F}_i^{-1}(t), \end{equation}\] where \(\mathfrak{F}_i^{-1}\) is the random quantile function corresponding to \(\mathfrak{f}_i,\) \(i=1,\dots,n.\) Thus the evaluation of the sample Wasserstein mean density entails evaluating the sample quantile functions, taking their pointwise average, inverting the averaged quantile function, and differentiating over a suitable grid, as sketched below.
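
The following function is a minimal sketch of this recipe (not the implementation used in our scripts), assuming dens_mat stores the sample densities row-wise on the common grid dSup:

wasserstein_mean = function(dens_mat, dSup, M = 1000) {
  qSup = seq(0, 1, length.out = M)
  dx = diff(dSup)
  # quantile function of each density: trapezoidal CDF, inverted over qSup
  Q = apply(dens_mat, 1, function(f) {
    Fx = c(0, cumsum((head(f, -1) + tail(f, -1)) / 2 * dx))
    approx(Fx / tail(Fx, 1), dSup, xout = qSup, ties = "ordered", rule = 2)$y
  })
  Qbar = rowMeans(Q)  # pointwise average of the quantile functions
  # invert the averaged quantile function back to a CDF, then differentiate
  Fbar = approx(Qbar, qSup, xout = dSup, ties = "ordered", rule = 2)$y
  dens = c(diff(Fbar) / dx, 0)
  dens / trapzRcpp(dSup, dens)  # renormalize against numerical error
}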

1.4. Fisher-Rao Metric

As mentioned in the main paper, the existence and uniqueness of population and sample Fréchet means under the Fisher-Rao metric have not been investigated. However, Srivastava, Jermyn, and Joshi (2007) proposed an iterative algorithm which leverages the exponential and logarithmic maps between the square-root density space and the tangent space at a given \(f \in \mathcal{D}\) to approximate the sample Karcher mean. The algorithm is implemented as follows

# obs: Matrix of sample densities, row: observation, column: grid.
# dSup: Vector of density support.
# tolerance: Tolerance for approximation error, a numeric variable.
# step_size: Descending/shooting step size, a numeric variable less than 1.
Karcher_mean = function(obs, dSup, tolerance, step_size) {
  obs = sqrt(obs)
  start = obs[1,] # Initialize at the first (square-root) density.
  tngt_mean = apply(apply(obs, 1, function(xx) fr_Log_Map(dSup, start, xx)), 1, mean)
  error = sqrt(trapzRcpp(dSup, tngt_mean^2))
  counter = 1
  
  print(paste("Initial error:", error))
  
  while (error > tolerance & counter < 5001) { # cap at 5000 iterations
    start = fr_Exp_Map(dSup, step_size*tngt_mean, start)
    tngt_mean = apply(apply(obs, 1, function(xx) fr_Log_Map(dSup, start, xx)), 1, mean)
    error = sqrt(trapzRcpp(dSup, tngt_mean^2))
    counter = counter + 1
  }
  print(paste("Terminating error:", error))
  return(start)
}

More details on the implementation of the exponential map fr_Exp_Map() and the logarithmic map fr_Log_Map() can be found in Fisher-Rao.R.
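
Note that Karcher_mean() operates entirely on square-root densities and returns the mean in that space, so its output should be squared to recover a density. A hypothetical call (the tolerance and step size below are illustrative; Fisher-Rao.R must be sourced first) is

fr_mean = Karcher_mean(obs = mouse2_dens, dSup = mouse2_dSup, tolerance = 1e-3, step_size = 0.1)^2  # square to map back to the density space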

1.5. Mean Densities

Figure 2 shows the mean densities obtained under the three aforementioned approaches along with the cross-sectional mean (i.e., the ordinary functional mean). As observed in the main paper, both the cross-sectional and Fisher-Rao means appear bimodal, as they both attempt to average out pointwise behavior in the densities or square-root densities. The unimodal Fréchet means derived under the Bayes space and Wasserstein metrics are more representative of the sample.

Figure 2: Mean densities.

2. Modes of Variation Analysis

As mentioned in Section 3.2 of the main paper, functional principal component analysis (FPCA; e.g., Ramsay and Silverman (2005), Kokoszka and Reimherr (2017)) is a common tool in functional data analysis for performing exploratory analysis of the main sources of variability in the sample (Castro, Lawton, and Sylvestre 1986; Jones and Rice 1992) as well as dimension reduction. In this section, we will demonstrate the usage of the MdVar() function for modes of variation analysis on the Human Mortality Data (HMD).

2.1. The Human Mortality Data

The Human Mortality Database currently contains cross-sectional lifetables for 41 countries throughout the world, in which the number of people who died at a given age during a fixed year is recorded. For the purpose of our analysis, we focus on the data for \(n = 40\) countries in year 2008. The record for Belgium is discarded due to data quality issues (for more information, please visit the Human Mortality Database). For a given country and year, one can obtain the death counts by running get_year_counts.R, which automatically outputs the age_counts.mat file under the current working directory. This file is then fed into MATLAB to generate density estimates by histogram smoothing (discarding ages below 20). The estimated densities are already stored under ./Data/HMD/dens_est. However, one can always reproduce these results by running dens_est.m, either directly from MATLAB or from R using

run_matlab_script("dens_est.m")

run_matlab_script() is a function in the matlabr package that allows users to run MATLAB scripts from R. Please follow the Overview and Initial Setup section to configure MATLAB to be compatible with this function.

We will regularize the obtained densities by \(\alpha = 0.01\) for all the following analyses, i.e., setting the smallest density function values to be at least \(0.01\) to avoid numerical issues in the subsequent analyses (a sketch of this step is given below). This practice is particularly useful in analyses related to the LQD transformation. Figure 3 illustrates the obtained and regularized densities.
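
Under our reading of this step, a minimal sketch of the regularization (floor the density values at \(\alpha\), then renormalize each density) is

regularize_dens = function(dens, dSup, alpha = 0.01) {
  dens = pmax(dens, alpha)  # set the smallest density values to alpha
  t(apply(dens, 1, function(f) f / trapzRcpp(dSup, f)))  # renormalize each density
}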

Figure 3: HMD densities for 40 countries in year 2008.

It can be observed from the above plots that while a great portion of the sample densities are close to each other, there are certain shifts among individual densities. In particular, there is prominent variability in the location and sharpness of the mode. This observation leads naturally to our next topic: modes of variation and dimension reduction.

2.2. Transformation FPCA

FPCA is based on the Karhunen-Loève decomposition of a random element \(Y\) of a Hilbert space. Hence it is no surprise that traditional FPCA can be applied to densities transformed into elements of Hilbert spaces. Let \(\psi\) be a generic transformation mapping densities into a Hilbert space of functions \(L^2(\mathcal{T})\), where \(\mathcal{T} \subset \mathbb{R}\) is an interval. One begins with the transformation \(Y = \psi(\mathfrak{f})\) and the corresponding mean and covariance functions \(\nu(t) = \mathbb EY(t)\) and \(G(s,t) = \hbox{Cov}(Y(s), Y(t))\) (\(s,t \in \mathcal{T}\)) of \(Y\), leading to the Karhunen-Loève decomposition \[\begin{equation} \label{e:KL} Y(t) = \nu(t) + \sum_{j = 1}^\infty \xi_{j}\phi_j(t), \quad \xi_j = \int_{\mathcal{T}}\left[Y(t) - \nu(t)\right]\phi_j(t)\mathrm{d}t, \end{equation}\] where the \(\phi_j\) arise from the Mercer decomposition \(G(s,t) = \sum_{j = 1}^\infty \lambda_j \phi_j(s)\phi_j(t),\) and thus form an orthonormal basis of \(L^2(\mathcal{T}).\)

FPCA applied to the transformation \(\psi_{\mathrm{clr}}\) from Section 2 of the main paper (also termed simplicial principal component analysis due to its relation to compositional data analysis) is one such example. Petersen, Müller, and others (2016) proposed another instance of such a transformation, namely the log quantile density (LQD) transformation, which has been found to substantially increase the effectiveness of classical FPCA in density analysis (Chen et al. 2019; Salazar et al. 2020; Petersen, Chen, and Müller 2019). The LQD transformation is given by \[\begin{equation} \label{e:LQD} \psi_{\textrm{LQD}}(f)(t) = -\log\{f \circ Q_f(t)\}, \quad t \in [0,1], \end{equation}\]

where density \(f \in \mathcal{D},\) and \(Q_f\) denotes its quantile function.

The \(\phi_j\) in the Karhunen-Loève decomposition reflect the most dominant directions of variability in \(Y\) about its mean, and suggest the modes of variation \[\begin{equation} \label{e:modesY} \nu(t) + \alpha \sqrt{\lambda_j}\phi_j(t), \quad \alpha \in \mathbb{R}, t \in \mathcal{T}, \end{equation}\] as a device for visualizing this variability. In our analysis, for a fixed direction of variability \(\phi_j\), we select \(\alpha = 0, \pm 1, \pm 2\), which effectively visualizes the Hilbertian elements that are one and two standard deviations away from the mean along the chosen direction of variability. These modes of variation are then transformed back to the density space for interpretability, i.e., \[\begin{equation} \label{e:modesDens} g_j(x, \alpha) = \psi^{-1}\left(\nu + \alpha \sqrt{\lambda_j}\phi_j\right)(x), \quad \alpha, x \in \mathbb{R}. \end{equation}\]

The sample version of the above equation follows intuitively; we refer readers to the main paper for more discussion of the modes of variation analysis under transformation approaches. Dimension reduction is achieved by constructing a truncated version of the Karhunen-Loève decomposition using sample estimates. Using the first \(J\) functional principal component score estimates \(\hat{\xi}_{ij} = \int_{\mathcal{T}}(Y_i(t) - \hat{\nu}(t))\hat{\phi}_j(t)\mathrm{d}t,\) the \(J\)-dimensional representation of each density is \[\begin{equation} \label{e:KLtrunc} \hat{f}_{i,J}(x) = \psi^{-1}\left(\hat{\nu} + \sum_{j = 1}^J\hat{\xi}_{ij}\hat{\phi}_j\right)(x), \quad x\in \mathbb{R}, \, i = 1,\dots,n. \end{equation}\]
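
As a minimal sketch of how such modes can be computed with off-the-shelf tools (the MdVar() function below automates steps of this kind), suppose Y is the \(n \times m\) matrix of transformed densities on an equispaced grid and psi_inv() is a hypothetical inverse transformation:

fpca = prcomp(Y, center = TRUE)  # classical PCA on the transformed data
nu_hat = fpca$center             # sample mean of the transformed densities
# on an equispaced grid, sqrt(lambda_j) * phi_j reduces to sdev[j] * rotation[, j]
modes_1 = sapply(c(-2, -1, 0, 1, 2), function(a) nu_hat + a * fpca$sdev[1] * fpca$rotation[, 1])
# back-transform each column for interpretability, e.g., apply(modes_1, 2, psi_inv)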

To implement the modes of variation analysis with the transformations \(\psi_{\mathrm{clr}}\) and \(\psi_{\textrm{LQD}}\), simply call the functions

# yr_2008_dens is the matrix of HMD densities of year 2008,
# and dSup is the numeric vector that contains the grid over which
# densities are evaluated.

MdVar(origin_dens = yr_2008_dens, dSup = dSup, method = "clr")
MdVar(origin_dens = yr_2008_dens, dSup = dSup, method = "lqd")

The MdVar() function will automatically write the modes of variation plots to the current working directory. E.g., Figures 4 and 5 are generated by the above lines of code.

Figure 4: clr: The first mode of variation (left) and the second mode of variation (right)

Figure 5: LQD: The first mode of variation (left) and the second mode of variation (right)

2.3. Tangent Space Log-FPCA

In addition to the transformation FPCA, one can leverage the (pseudo) Riemannian structure of \(\mathcal{D}\) under specific metrics \(d\), such as the Wasserstein and Fisher-Rao metrics, to lift the densities to a tangent space (a Hilbert space) and implement classical FPCA as in the above transformation cases. A computationally simple method is to first compute the sample Fréchet mean \(\hat{f}_\oplus^d,\) followed by mapping the densities to the tangent space at \(\hat{f}_\oplus^d.\) Using \(\operatorname{Log}\) and \(\operatorname{Exp}\) as generic logarithmic and exponential maps, one computes \(Y_i = \operatorname{Log}_{\hat{f}_\oplus^{d}}(f_i).\) The \(j\)-th Log-FPCA sample mode of variation then becomes \[\begin{equation} \label{e:logModes} \hat{g}_j(x, \alpha) = \operatorname{Exp}_{\hat{f}_\oplus^{d}}\left(\hat{\nu} + \alpha\sqrt{\hat{\lambda}_j}\hat{\phi}_j\right)(x), \quad \alpha, x \in \mathbb{R}. \end{equation}\] Similarly, one can construct \(J\)-dimensional representations of the densities via \[\begin{equation} \label{e:logKLtrunc} \hat{f}_{i,J}(x) = \operatorname{Exp}_{\hat{f}_\oplus^d}\left(\hat{\nu} + \sum_{j = 1}^J \hat{\xi}_{ij}\hat{\phi}_j\right)(x), \quad x \in \mathbb{R}. \end{equation}\]

The detailed expressions of the logarithmic and exponential maps under the Wasserstein and Fisher-Rao metrics are discussed thoroughly in Section 2.3 of the main paper. Here, we emphasize that while this approach resembles the transformation approach, it is not an instance of it. For example, under the Wasserstein metric, the logarithmic map is not a bijection; thus it fails to qualify as a transformation approach. However, the merit of the tangent space Log-FPCA approach is that the tangent space distance is related to the inherent metric \(d\) on the density space, which is a powerful tool for modeling density functions.
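
For concreteness, under the Wasserstein metric the logarithmic map at a reference density \(g\) takes the form \(\operatorname{Log}_{g}(f)(u) = Q_f\{G(u)\} - u\), with \(G\) the CDF of \(g\) and \(Q_f\) the quantile function of \(f\) (see Section 2.3 of the main paper for the exact expressions). A minimal numerical sketch, assuming f and g are density vectors on the grid dSup, is

w_log_map = function(f, g, dSup) {
  dx = diff(dSup)
  cdf = function(h) {
    Fx = c(0, cumsum((head(h, -1) + tail(h, -1)) / 2 * dx))  # trapezoidal CDF
    Fx / tail(Fx, 1)
  }
  Qf = approxfun(cdf(f), dSup, ties = "ordered", rule = 2)  # quantile function of f
  Qf(cdf(g)) - dSup  # tangent vector at g, evaluated over dSup
}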

To showcase the tangent space Log-FPCA with the Fisher-Rao metric, we call the MdVar() function as follows.

MdVar(origin_dens = yr_2008_dens, dSup = dSup, method = "fr")

Figure 6 is generated by the above line of code.
Figure 6: Fisher-Rao: The first mode of variation (left) and the second mode of variation (right)

2.4. Wasserstein Geodesic PCA

The Wasserstein Geodesic PCA (GPCA) approach (Bigot et al. 2017; Cazelles et al. 2018) aims to find the orthogonal directions, within the image of \(\mathcal{D}\) under the logarithmic map, along which the variability quantified by the tangent space norm (the usual Hilbert space norm) is maximized. In particular, Cazelles et al. (2018) proposed two algorithms to implement the Geodesic PCA. We will demonstrate the iterative version of the GPCA algorithm.

The GPCA is implemented in MATLAB; the MdVar() function will run MATLAB scripts from R and exchange files between the two programs as the function runs. Readers can find more details about this process in MdVar.R. The file structure should remain unchanged from the GitHub repository; otherwise, the communication between R and MATLAB may encounter problems. To implement the GPCA, simply call

MdVar(origin_dens = yr_2008_dens, dSup = dSup, method = "GPCA", wd = "/path/to/your/working/directory")

The following plots are generated by the code above.
Figure 7: GPCA: The first mode of variation (left) and the second mode of variation (right)

Since we regularized the densities (minimum density values are set to \(0.01\)), we modified the pushforward_density() function from the GPCA package to prevent potential numerical issues near the boundaries of the density support. However, this modification does not change the core functionality of the GPCA source code. Interested readers can refer to the next section for more details on this modification.

2.5. The pushforward_density_new() Function

The original pushforward_density() function essentially takes the numerical derivative of the optimal transport function \(T\) (from the sample Wasserstein mean \(\hat{f}_\oplus\) to the target density \(f\)). Its reciprocal is then multiplied by the sample Wasserstein mean function, and the target density is recovered by interpolating this quantity over a desired grid. To see the rationale behind this procedure, note that \[\begin{equation} \frac{\mathrm{d}}{\mathrm{d} u} T(u) = \frac{\mathrm{d}}{\mathrm{d} u} F^{-1}\circ \hat{F}_{\oplus}(u) = \frac{\hat{f}_\oplus(u)}{f \circ T(u)} \Rightarrow \frac{\hat{f}_\oplus(u)}{\frac{\mathrm{d}}{\mathrm{d} u} T(u)} = f \circ T(u). \end{equation}\]

One thing to notice is that the precision of numerical differentiation, in most implementations, decays near the boundaries of the density support. In cases where the density values are evaluated over a dense grid and tend to zero, this is not a noticeable issue. However, in our analysis, the densities are only evaluated at \(90\) grid points and do not vanish to zero due to regularization, which causes the boundary points to oscillate irregularly due to numerical inaccuracy. For the purpose of our analysis, we fixed this issue by forcing \(\frac{\mathrm{d}}{\mathrm{d} u} T(u)\) to converge to \(1\) near the boundaries of the support. This fix is based on the observation that near the boundary points, 1) all the sample densities behave similarly, and 2) the derivative \(\frac{\mathrm{d}}{\mathrm{d} u} T(u)\) is close to \(1\).
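
The actual modification lives in the MATLAB source; the following R fragment merely sketches the idea of blending the numerical derivative toward \(1\) over a few boundary points (the number of blended points nb is illustrative):

force_boundary_derivative = function(Tprime, nb = 3) {
  m = length(Tprime)
  w = seq(0, 1, length.out = nb)  # weight 0 at the boundary point, 1 at the interior end
  Tprime[1:nb] = w * Tprime[1:nb] + (1 - w)                      # blend left boundary toward 1
  Tprime[m:(m - nb + 1)] = w * Tprime[m:(m - nb + 1)] + (1 - w)  # blend right boundary toward 1
  Tprime
}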

2.6. Fraction of Variance Explained

In the main paper, we compared the relative efficiency of different approaches in explaining the variability in the data using the fraction of variance explained (FVE) criterion (Petersen, Müller, and others 2016); there, the FVE was calculated with the Wasserstein metric. In this section, we will demonstrate the FVE analysis with additional metrics, namely the L\(^2\), clr (Aitchison), Wasserstein, and Fisher-Rao metrics. The methods being compared are the clr and LQD transformation FPCA and the tangent space Log-FPCA based on the Fisher-Rao metric. In the spirit of a review paper, the geodesic PCA is not included, as general out-of-sample projection algorithms for it, to the best of our knowledge, have not been thoroughly investigated yet.

The sample densities are randomly partitioned into training and testing sets of equal size \(100\) times. The training sets are used to estimate the components of the FPCA, and the testing sets are then used to compute the FVE. Notice that in the case of the tangent space Log-FPCA (based on the Fisher-Rao metric), the projection onto the principal components depends on the reference measure of the tangent space, which is estimated by the sample Fréchet/Karcher mean. Therefore, we treat each pair of training and testing sets as if they all live in the tangent space at the Karcher mean of the training set.
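
A minimal sketch of the FVE computation for a generic squared distance d2() is given below (the argument names are illustrative; Split_Sample_FVE.R contains the actual implementation):

FVE = function(test_dens, recon_J, fmean, d2) {
  # V_J: mean squared distance between test densities and their J-component reconstructions
  V_J = mean(sapply(seq_len(nrow(test_dens)), function(i) d2(test_dens[i, ], recon_J[i, ])))
  # V_inf: total Frechet variance of the test densities about the mean
  V_inf = mean(sapply(seq_len(nrow(test_dens)), function(i) d2(test_dens[i, ], fmean)))
  1 - V_J / V_inf
}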

The FVE boxplots presented in Figures 8 to 11 will be automatically written into the current working directory by sourcing Split_Sample_FVE.R.

Figure 8: Wasserstein metric

Figure 9: Fisher-Rao metric

Figure 10: Aitchison metric

Figure 11: L2 distance metric

Observe that the LQD captures more variability with \(1\) PC across all metrics. The efficiency of the other two approaches catches up as the number of PCs increases. Overall, all the FPCA approaches are able to capture a decent amount of the variability among the sample densities. With \(1\) PC, the median FVE values of all methods are around \(90\%\) across all FVE metrics. Even the worst outliers are around \(75\%\) (under the L\(^2\) distance metric). With one more PC, the median FVE values of all methods rise to around \(97\%\).

2.7. Dimension Reduction

As all the FPCA approaches achieve a decent FVE value with a small number of principal components, we proceed to investigate dimension reduction for samples of density functions. Dimension reduction is achieved by reconstructing the density functions from a small number of eigenfunctions. Details on the reconstruction procedure are given in Section 3.2 of the main paper, as well as Sections 2.2 and 2.3 of this document. We will only demonstrate the reconstructed densities in this section.

Density reconstruction is also implemented in the MdVar() function. Hence, one only needs to make the following calls to produce plots of the reconstructed densities.

MdVar(origin_dens = yr_2008_dens, dSup = dSup, method = "clr")
MdVar(origin_dens = yr_2008_dens, dSup = dSup, method = "lqd")
MdVar(origin_dens = yr_2008_dens, dSup = dSup, method = "fr")
MdVar(origin_dens = yr_2008_dens, dSup = dSup, method = "GPCA", wd = "/path/to/your/working/directory")

The outputs are presented in Figures 12 to 15.
Figure 12: LQD

Figure 13: clr

Figure 14: Fisher-Rao

Figure 15: GPCA

3. Density Response Regression

In this section, we will illustrate the implementation of density response regression (Section 3.3 of the main paper). The same year 2008 age-at-death distributions (the human mortality data) will be used as responses. We choose three economic indicators from 1998 as predictors to measure their effect on mortality after one decade. Following the notation of the main paper, let \(U\) be the vector of covariates. In our analysis, \(U_{i1}\) and \(U_{i2}\) are the percentage change in gross domestic product (gdp) and the inflation rate (inf) from the previous year, and \(U_{i3}\) is the unemployment rate (uem). These predictors are used to fit three versions of the Fréchet regression model, corresponding to Bayes space regression (BS), the functional response model of Faraway (1997) applied to the LQD-transformed densities (LQD), and regression under the Wasserstein metric (W). We dropped the record for Taiwan in our regression analysis, as the economic indicators pulled from the World Bank Open Data did not contain corresponding records.

3.1. Bayes Space Regression

The Bayes space regression model takes the form \[\begin{equation} \label{e:BayesReg} \psi_{\mathrm{clr}}(\mathfrak{f}_i) = \beta_0 + \sum_{j = 1}^p U_{ij}\beta_j + \epsilon_i,\, i = 1,\dots, n, \end{equation}\] which is an instance of the Fréchet regression proposed by Petersen and Müller (2019). Talská et al. (2018) illustrated how to apply tools designed for the standard functional linear model to the B-spline representation of the clr-transformed data.
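
In the spline-coefficient domain, this amounts to multivariate least squares. A minimal sketch, assuming C is the \(n \times K\) matrix of B-spline coefficients of the clr-transformed densities (Section 1.2) and U the \(n \times p\) predictor matrix, is

X = cbind(1, U)                        # attach the intercept column
B_hat = solve(t(X) %*% X, t(X) %*% C)  # (p+1) x K coefficient functions in the spline basis
C_fit = X %*% B_hat                    # fitted clr spline coefficients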

To implement the Bayes space regression, call the function Regression_Analysis() as below.

Regression_Analysis(predictors = pred_df, dens = yr_2008_dens_reg, trsfm = "clr_spline", dSup = dSup, CI = "bootstrap", new_df = effect_inf)

The pred_df argument is the matrix of predictors. In the case of the BS and LQD regressions, the function automatically attaches a column of 1's for the intercept. The CI parameter specifies the uncertainty estimation method; the available options are "bootstrap" and "pw". The "pw" option produces pointwise \(95\%\) confidence intervals around the regression functions; we refer readers to Chapter 13 of Ramsay and Silverman (2005) for computational details. The "bootstrap" option first fits the model and then resamples the residuals \(200\) times to bootstrap the regression functions, as sketched below. While none of the aforementioned confidence bands are simultaneous across the argument of the regression functions, the bootstrap version is usually preferred in practice as it does not explicitly impose a parametric assumption on the residual distribution.
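
Continuing the previous sketch, the residual bootstrap can be outlined as follows (200 resamples, as described above; all names are illustrative):

resid_mat = C - C_fit
boot_B = replicate(200, {
  idx = sample(nrow(resid_mat), replace = TRUE)    # resample residual curves
  C_star = C_fit + resid_mat[idx, , drop = FALSE]  # bootstrapped responses
  solve(t(X) %*% X, t(X) %*% C_star)               # refit the model
})
# pointwise 2.5% and 97.5% quantiles of the refitted coefficient functions
# (mapped back through the spline basis) give the confidence bands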

By running the above code, one obtains the plots of the regression functions along with their bootstrap confidence bands, as presented in Figures 16 and 17.

Figure 16: Regression functions with bootstrap confidence bands (BS): intercept (left), gdp (right)

Figure 17: Regression functions with bootstrap confidence bands (BS): inf (left), uem (right)

By visual inspection, the above plots suggest that the indicator inf might have the most significant effect on the mortality density. The function Regression_Analysis() also produces an effect plot based on the parameter new_df. In our case, we provide effect_inf as the argument to this parameter, where effect_inf is a matrix containing all three economic indicators, with gdp and uem held at their mean levels and inf ranging from its \(10^{th}\) to \(90^{th}\) sample quantile in constant increments. Figure 18 shows that an increase in the inflation rate shifts the mode of the mortality curve to lower ages and reduces the height of the mode, indicating reduced longevity and higher variability.

Figure 18: Effect plot of inf (BS).

3.2. LQD

Another approach to fitting a density response regression model is to transform the densities to the LQD space so that classical functional regression tools (Ramsay and Silverman 2005; Faraway 1997) can be applied. One can use a regularized basis expansion to impose smoothness restrictions on the transformed densities and regression functions, or apply direct pointwise minimization if there are no particular restrictions on how the regression functions should vary. We adopt the latter approach, as the LQD transformation effectively removes all the constraints on the density functions so that the regression functions can vary freely.

To implement the LQD regression, call the following line of code so that Figures 19 to 21 will be written to the current working directory.

Regression_Analysis(predictors = pred_df, dens = yr_2008_dens_reg, trsfm = "lqd", dSup = dSup, CI = "bootstrap", new_df = effect_inf)

Figure 19: Regression functions with bootstrap confidence bands (LQD): intercept (left), gdp (right)

Figure 20: Regression functions with bootstrap confidence bands (LQD): inf (left), uem (right)

Figure 21: Effect plot of inf (LQD).

We can observe similar patterns in Figures 19 to 21 as in the Bayes space regression case.

3.3. Fréchet Global Regression

We refer readers to Section 3.3 of the main paper for a detailed discussion of the Fréchet global regression. Here, we emphasize that global and partial \(F\) tests are made possible with the Fréchet global regression, and one can construct simultaneous confidence bands under this framework. The following line of code implements the Fréchet global regression with the same predictors as in Sections 3.1 and 3.2. Table 1 confirms our previous observation that the inf indicator indeed has a significant effect on the mortality curve. In addition, uem is also a significant predictor. Figure 22 demonstrates that the effect of inf under the Fréchet regression is consistent with that in the BS and LQD cases. Moreover, Figure 23 shows that uem has a similar, and visually even more prominent, effect than inf.

Regression_Analysis(predictors = pred_df, dens = yr_2008_dens_reg, trsfm = "Frechet", dSup = dSup, CI = "bootstrap", new_df = effect_inf)

Table 1: Partial \(F\) test for individual effects

|     | F-statistic | p-value (truncated) | p-value (Satterthwaite) |
|-----|-------------|---------------------|-------------------------|
| gdp | \(0.046\) | \(0.605\) | \(0.607\) |
| inf | \(4.708\) | \(0.002\) | \(<0.001\) |
| uem | \(1.602\) | \(0.014\) | \(0.008\) |

Global \(F\)-statistic (by the Satterthwaite method): \(49.213\) on \(3.328\) DF, p-value: \(1.965 \times 10^{-10}\).
Figure 22: Effect plot of inf (W).

Figure 23: Effect plot of uem (W).

Note that one powerful tool the Fréchet regression provides is the simultaneous confidence band. In Figure 24, we add these confidence bands to each level of the effect of uem shown in Figure 23.

Figure 24: Fréchet simultaneous confidence bands at each uem level

One can observe that the confidence bands become narrower as the indicator approaches its median/mean level, which coincides with the intuition behind the Fréchet regression, as mentioned in the main paper.

3.4. Adding an Eastern Europe (EE) Indicator

An exploratory plot, presented in Figure 25, appears to suggest that countries in Eastern Europe (EE) tended to have distributions indicative of shorter life spans. This observation motivated us to add a categorical variable (i.e., 1 for EE countries and 0 for others) to our regression analysis; a sketch of constructing this indicator follows the list below. The countries categorized as EE countries are

  1. Belarus
  2. Bulgaria
  3. Czech Republic
  4. Hungary
  5. Poland
  6. Russia
  7. Slovakia
  8. Ukraine
  9. Estonia
  10. Lithuania
  11. Latvia
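
A minimal sketch of constructing this indicator (assuming pred_df carries the country names as row names; otherwise, match on a country column) is

EE_countries = c("Belarus", "Bulgaria", "Czech Republic", "Hungary", "Poland",
                 "Russia", "Slovakia", "Ukraine", "Estonia", "Lithuania", "Latvia")
pred_df_E = cbind(pred_df, EE = as.numeric(rownames(pred_df) %in% EE_countries))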

Figure 25: HMD densities with EE highlighted

First, we run the Fréchet regression with the matrix of predictors pred_df_E, which includes the categorical variable EE. The effect_EE matrix contains all four predictors, with gdp, inf, and uem held at their sample averages and EE taking the values \(0\) and \(1\).

Regression_Analysis(predictors = pred_df_E, dens = yr_2008_dens_reg, trsfm = "Frechet", dSup = dSup, new_df = effect_EE, EE = TRUE)

Table 2 shows that adding the EE indicator substantially increases the \(R^2\) value and the significance of the model. Figure 26 shows the effect plot of EE with simultaneous confidence bands. The distinctive location and height of the density modes and the narrow confidence bands strongly suggest a significant difference in mortality curves between EE countries and the others, which confirms the output in Table 2.

Table 2: Partial \(F\) test for individual effects

|     | F-statistic | p-value (truncated) | p-value (Satterthwaite) |
|-----|-------------|---------------------|-------------------------|
| gdp | \(0.033\) | \(0.533\) | \(0.533\) |
| inf | \(0.445\) | \(0.002\) | \(<0.001\) |
| uem | \(0.153\) | \(0.130\) | \(0.145\) |
| EE  | \(2.933\) | \(0.002\) | \(<0.001\) |

Global \(F\)-statistic (by the Satterthwaite method): \(198.613\) on \(5.071\) DF, p-value: \(6.455 \times 10^{-41}\).

Figure 26: Effect plot with confidence bands (W): EE

The improvement from adding the EE indicator can also be observed in the fitted densities in Figure 27.

Figure 27: Fitted densities (W): without EE (left), with EE (right).

The regression analysis with the EE indicator for the Bayes space and LQD regressions can also be implemented by calling the Regression_Analysis() function. The \(R^2\) values under the clr and Wasserstein metrics are also part of the Regression_Analysis() output. We present the \(R^2\) values of all the aforementioned regression approaches in Tables 3 and 4.

Table 3: \(R^2\) under the Wasserstein metric

|            | BS | LQD | W |
|------------|----|-----|---|
| Without EE | \(0.539\) | \(0.536\) | \(0.539\) |
| With EE    | \(0.803\) | \(0.805\) | \(0.805\) |

Table 4: \(R^2\) under the clr (Aitchison) metric

|            | BS | LQD | W |
|------------|----|-----|---|
| Without EE | \(0.501\) | \(0.499\) | \(0.308\) |
| With EE    | \(0.761\) | \(0.757\) | \(0.547\) |

4. Conclusion

We provide this vignette as supplementary material to the main paper. In it, we demonstrated the detailed usage of the functions used in the main paper, as well as the full range of analyses these functions support, part of which is presented in the main paper. This vignette and the functions developed alongside the main paper will be updated with improvements and corrections as the project goes along.

References

Here, we only include the references that are directly cited in this vignette; please refer to the main paper for the full list of cited works.

Amano, Kenji, Haruhiko Sago, Chiharu Uchikawa, Taishi Suzuki, Svetlana E Kotliarova, Nobuyuki Nukina, Charles J Epstein, and Kazuhiro Yamakawa. 2004. “Dosage-Dependent over-Expression of Genes in the Trisomic Region of Ts1Cje Mouse Model for Down Syndrome.” Human Molecular Genetics 13 (13): 1333–40.

Bigot, Jérémie, Raúl Gouet, Thierry Klein, and Alfredo López. 2017. “Geodesic PCA in the Wasserstein Space by Convex PCA.” Annales de L’Institut Henri Poincaré B: Probability and Statistics 53: 1–26.

Castro, P. E., W. H. Lawton, and E. A. Sylvestre. 1986. “Principal Modes of Variation for Processes with Continuous Sample Curves.” Technometrics 28: 329–37.

Cazelles, Elsa, Vivien Seguy, Jérémie Bigot, Marco Cuturi, and Nicolas Papadakis. 2018. “Geodesic PCA Versus Log-PCA of Histograms in the Wasserstein Space.” SIAM Journal on Scientific Computing 40 (2): B429–B456.

Chen, Zhicheng, Yuequan Bao, Hui Li, and Billie F Spencer Jr. 2019. “LQD-RKHS-Based Distribution-to-Distribution Regression Methodology for Restoring the Probability Distributions of Missing SHM Data.” Mechanical Systems and Signal Processing 121: 655–74.

Egozcue, Juan José, José Luis Dı́az–Barrero, and Vera Pawlowsky–Glahn. 2006. “Hilbert Space of Probability Density Functions Based on Aitchison Geometry.” Acta Mathematica Sinica 22 (4): 1175–82.

Faraway, Julian J. 1997. “Regression Analysis for a Functional Response.” Technometrics 39 (3): 254–61.

Jones, M. C., and John A. Rice. 1992. “Displaying the Important Features of Large Collections of Similar Curves.” The American Statistician 46: 140–45.

Kokoszka, P., and M. Reimherr. 2017. Introduction to Functional Data Analysis. Chapman; Hall/CRC.

Machalova, Jitka, Karel Hron, and Gianna Serafina Monti. 2016. “Preprocessing of Centred Logratio Transformed Density Functions Using Smoothing Splines.” Journal of Applied Statistics 43 (8): 1419–35.

Petersen, Alexander, Chun-Jui Chen, and Hans-Georg Müller. 2019. “Quantifying and Visualizing Intraregional Connectivity in Resting-State Functional Magnetic Resonance Imaging with Correlation Densities.” Brain Connectivity 9 (1): 37–47.

Petersen, Alexander, Xi Liu, and Afshin A Divani. 2020. “Wasserstein \(F\)-Tests and Confidence Bands for the Fréchet Regression of Density Response Curves.” To Appear in Annals of Statistics.

Petersen, Alexander, and Hans-Georg Müller. 2019. “Fréchet Regression for Random Objects with Euclidean Predictors.” The Annals of Statistics 47 (2): 691–719.

Petersen, Alexander, Hans-Georg Müller, and others. 2016. “Functional Data Analysis for Density Functions by Transformation to a Hilbert Space.” The Annals of Statistics 44 (1): 183–218.

Ramsay, James O., and B. W. Silverman. 2005. Functional Data Analysis. Second. Springer Series in Statistics. New York: Springer.

Salazar, Pascal, Mario Di Napoli, Mostafa Jafari, Alibay Jafarli, Wendy Ziai, Alexander Petersen, Stephan A Mayer, Eric M Bershad, Rahul Damani, and Afshin A Divani. 2020. “Exploration of Multiparameter Hematoma 3D Image Analysis for Predicting Outcome After Intracerebral Hemorrhage.” Neurocritical Care 32 (2): 539–49.

Srivastava, Anuj, Ian Jermyn, and Shantanu Joshi. 2007. “Riemannian Analysis of Probability Density Functions with Applications in Vision.” In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1–8.

Talská, R, Alessandra Menafoglio, Jitka Machalová, Karel Hron, and E Fišerová. 2018. “Compositional Regression with Functional Response.” Computational Statistics & Data Analysis 123: 66–85.

Van den Boogaart, Karl Gerald, Juan José Egozcue, and Vera Pawlowsky-Glahn. 2014. “Bayes Hilbert Spaces.” Australian & New Zealand Journal of Statistics 56 (2): 171–94.

Zhang, Zhen, and Hans-Georg Müller. 2011. “Functional Density Synchronization.” Computational Statistics & Data Analysis 55 (7): 2234–49.


  1. Department of Statistics, Brigham Young University, UT, USA; Department of Statistics and Applied Probability, University of California, Santa Barbara, USA

  2. Department of Statistics and Applied Probability, University of California, Santa Barbara, USA

  3. Department of Statistics, Colorado State University, Fort Collins, CO, USA