Glossary

  • Distribution of missing data =
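    • MCAR = missing completely at random (missingness is unrelated to any variable, observed or not)
    • MAR = missing at random (the presence of missing data for a given variable is related to the values of another, observed variable)
    • MNAR = missing not at random (missingness is related to the unobserved value itself)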

\(\Rightarrow\) in functional trait databases, missingness is related to the frequency of the species and their abundances, and taxonomic and phylogenetic bias is commonplace /!\
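The three mechanisms can be made concrete by deleting values from a complete trait vector in three different ways. The sketch below is only an illustration: the function name `make_missing`, the deterministic deletion rules and the toy abundance and height values are assumptions, not taken from any of the papers summarised here.

```python
import random


def make_missing(trait, other, mechanism, rate=0.3, seed=0):
    """Return a copy of `trait` with None where a value is deleted.

    mechanism: "MCAR" (deletion independent of any value),
               "MAR"  (deletion driven by the observed variable `other`),
               "MNAR" (deletion driven by the deleted trait value itself).
    """
    rng = random.Random(seed)
    n = len(trait)
    n_del = round(rate * n)
    if mechanism == "MCAR":
        drop = set(rng.sample(range(n), n_del))
    elif mechanism == "MAR":
        # illustrative rule: species with the lowest `other` lose their value
        drop = set(sorted(range(n), key=lambda i: other[i])[:n_del])
    elif mechanism == "MNAR":
        # illustrative rule: the smallest trait values are never recorded
        drop = set(sorted(range(n), key=lambda i: trait[i])[:n_del])
    else:
        raise ValueError(mechanism)
    return [None if i in drop else v for i, v in enumerate(trait)]


# hypothetical abundances and plant heights for 10 species
abundance = [1, 2, 3, 5, 8, 13, 21, 34, 55, 89]
height = [0.9, 0.2, 0.5, 0.3, 1.5, 0.4, 2.0, 0.8, 1.1, 3.0]

mar = make_missing(height, abundance, "MAR")
mnar = make_missing(height, abundance, "MNAR")
```

With the MAR rule the three rarest species lose their height; with the MNAR rule the three shortest plants do, whatever their abundance.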


  • Deletion method =
    • delete species with missing data for the calculation of diversity indices
    • acceptable for estimation of the community-weighted mean trait value (CWM)
      as long as it only concerns the minor species (should not exceed 20\(\%\) of the total biomass of the community)
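A minimal sketch of the deletion rule for the CWM, assuming biomass and trait values are stored per species in plain dictionaries. The function name and the toy values are hypothetical; the 20% threshold is the one quoted above.

```python
def cwm_with_deletion(biomass, trait, max_missing_share=0.20):
    """Community-weighted mean after deleting species without a trait value.

    biomass: dict species -> biomass in the community
    trait:   dict species -> trait value, or None when missing
    Returns (cwm, missing_share, acceptable).
    """
    total = sum(biomass.values())
    known = {sp: b for sp, b in biomass.items() if trait.get(sp) is not None}
    known_total = sum(known.values())
    missing_share = 1 - known_total / total
    # weights are renormalised over the species that remain after deletion
    cwm = sum(b * trait[sp] for sp, b in known.items()) / known_total
    return cwm, missing_share, missing_share <= max_missing_share


# hypothetical community: species C has no trait value
biomass = {"A": 50, "B": 30, "C": 15, "D": 5}
sla = {"A": 20.0, "B": 30.0, "C": None, "D": 10.0}
cwm, share, ok = cwm_with_deletion(biomass, sla)
```

Here the deleted species accounts for 15% of the biomass, so the deletion stays under the threshold.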


  • Gower + PCoA =
    • compute Gower distance (with missing data) and project the distance with a Principal Coordinate Analysis,
      the axes being then used as functional traits
    • only relevant for functional diversity indices calculated from several traits
    • trait information gets lost and only multivariate approaches can be used
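The first step, a Gower distance that tolerates missing values, can be sketched as follows for quantitative traits. This is an assumed minimal implementation (function names and toy data are hypothetical), and the PCoA projection that would then be applied to the distance matrix is not shown.

```python
def gower_distance(x, y, ranges):
    """Gower distance between two species for quantitative traits.

    x, y:   lists of trait values, None when missing
    ranges: per-trait range (max - min) over the whole dataset
    Traits missing for either species are skipped; returns None when the
    two species share no documented trait.
    """
    total, n_shared = 0.0, 0
    for xi, yi, r in zip(x, y, ranges):
        if xi is None or yi is None:
            continue
        # a trait with zero range contributes a distance of 0
        total += abs(xi - yi) / r if r else 0.0
        n_shared += 1
    return total / n_shared if n_shared else None


def gower_matrix(traits):
    """Pairwise Gower distances for a list of species trait vectors."""
    ranges = []
    for j in range(len(traits[0])):
        col = [row[j] for row in traits if row[j] is not None]
        ranges.append(max(col) - min(col) if col else 0)
    return [[gower_distance(a, b, ranges) for b in traits] for a in traits]


# hypothetical species x trait table with gaps
traits = [
    [1.0, 10.0, None],
    [3.0, None, 0.5],
    [5.0, 30.0, 1.5],
    [None, None, 1.0],
]
d = gower_matrix(traits)
```

The first and last species have no documented trait in common, so their distance is undefined (`None`); a PCoA needs a complete distance matrix, which is why this approach breaks down when gaps are too numerous.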


  • Imputation method = replace missing data with substituted values

  • Simple imputation = a single value is imputed for each missing datum

  • Multiple imputation = Monte Carlo technique in which each missing value is replaced by \(m > 1\) imputed values; the \(m\) completed datasets are analysed separately and the results combined to produce estimates and confidence intervals that reflect missing-data uncertainty

  • Likelihood-based imputation = imputation and analysis are conducted simultaneously
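For multiple imputation, the combination step is usually done with Rubin's rules, which the notes do not name explicitly. The sketch below assumes one estimate and one squared standard error per imputed dataset; the function name and the numbers are hypothetical.

```python
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool m estimates and their variances with Rubin's rules.

    estimates: one estimate per imputed dataset
    variances: the squared standard error of each estimate
    Returns (pooled estimate, total variance).
    """
    m = len(estimates)
    q_bar = mean(estimates)
    within = mean(variances)          # average within-imputation variance
    between = variance(estimates)     # between-imputation variance (m - 1)
    total = within + (1 + 1 / m) * between
    return q_bar, total


# hypothetical results from m = 3 imputed datasets
q, t = pool_rubin([10.0, 12.0, 11.0], [1.0, 1.0, 1.0])
```

The between-imputation term is what carries the missing-data uncertainty: it vanishes only if all imputed datasets give the same estimate.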



Specific traits


2014 : Tamme (Ecology)

Measure the predictive power of simple plant traits to estimate species’ maximum dispersal distances :

  • dispersal distance is related to a dispersal syndrome (Willson 1993, Pärtel & Zobel 2007, Vittoz & Engler 2007) that can generally be deduced from seed morphology, but a direct association is difficult to establish (Nogales 2007)
  • plant height more important than seed mass in determining seed dispersal distances (Thomson et al. 2011)
  • maximum dispersal distance strongly correlated with mean dispersal distance (Thomson et al. 2011) :

\[ \log_{10}(\text{maximumDist}) = 0.795 + 0.984 \times \log_{10}(\text{meanDist}) \]
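As a worked example, the regression above can be applied directly. The function name is hypothetical, and distances are assumed to be expressed in the unit of the original regression, which these notes do not state.

```python
import math


def max_from_mean_dispersal(mean_dist):
    """Maximum dispersal distance predicted from the mean dispersal
    distance with the log10-log10 regression quoted above."""
    return 10 ** (0.795 + 0.984 * math.log10(mean_dist))


d1 = max_from_mean_dispersal(1.0)    # mean distance of 1 unit
d10 = max_from_mean_dispersal(10.0)  # mean distance of 10 units
```

A mean distance of 1 gives a predicted maximum of \(10^{0.795}\), i.e. about 6.2; because the slope is slightly below 1, the ratio of maximum to mean distance decreases slowly as the mean distance grows.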



depending on the set of species used and the trait data available,
different traits are important in describing dispersal distance patterns


\(\Rightarrow\) (dispersal syndrome + growth form + terminal velocity)
explain \(\sim\) 60\(\%\) of the variation in maximum dispersal distances

\(\Rightarrow\) (dispersal syndrome + growth form)
explain \(\sim\) 50\(\%\) of the variation in maximum dispersal distances

with

  • on average, maximum dispersal distances increase from :

    wind (no special mechanisms) < ballistic < ant < wind (special mechanisms) < animal
    herb < shrub < tree

  • models with terminal velocity only apply to species with wind or ballistic dispersal syndrome
  • in general, short-distance dispersal overestimated, and long-distance dispersal underestimated

\(\Rightarrow\) dispeRsal package



Comparison of methods


2014 : Taugourdeau

Effect of imputation methods on the evaluation of trait values at species level and on the subsequent calculation of functional diversity indices at community level :

none of the methods performed best for all traits and indices
at the species level, the most accurate imputation method differs among traits and cases,
but the dissimilarity and/or relationships methods are always the most accurate
among the single imputation methods
key parameter for choosing an adequate imputation method
= the distribution of the trait values in the dataset
transforming the trait values to improve their distribution before imputation
could improve the quality of the replacement
30% of missing data = upper limit for using these single imputation methods
  • Average method = least accurate
  • Functional proximity between species =
    • best method for traits with unbalanced distribution
    • but not an appropriate choice for a study on functional distance between species,
      as functional distance would then be underestimated
    • less affected by the percentage of deleted data (but the Gower dissimilarity cannot be calculated between two species that have no documented trait in common, so the method fails when missing data are too numerous)
  • Relationships between traits =
    • best method for traits with balanced distribution
    • but the most sensitive to the level of missing data
  • MICE = more accurate than all other methods for all traits except SLA
    • the correction model can be adapted to the distribution of the variable, but caution with unbalanced traits
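To illustrate the difference between the average method and a relationship-based method, here is a minimal sketch with one predictor trait and a simple linear regression. It is not the implementation used by Taugourdeau; the function names and the toy data are hypothetical, and the predictor is assumed to be complete.

```python
from statistics import mean


def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx


def impute_mean(target):
    """Average method: every gap receives the mean of the known values."""
    fill = mean(v for v in target if v is not None)
    return [fill if v is None else v for v in target]


def impute_regression(target, predictor):
    """Relationship method: gaps are predicted from another trait."""
    pairs = [(p, t) for p, t in zip(predictor, target) if t is not None]
    slope, intercept = fit_line([p for p, _ in pairs], [t for _, t in pairs])
    return [intercept + slope * p if t is None else t
            for p, t in zip(predictor, target)]


# hypothetical traits: the last species has no value for `target`
predictor = [1.0, 2.0, 3.0, 4.0, 5.0]
target = [2.0, 4.0, 6.0, 8.0, None]
by_mean = impute_mean(target)
by_regression = impute_regression(target, predictor)
```

On this toy example the average method fills the gap with 5.0 while the regression fills it with 10.0, which follows the trend of the known values; with strongly unbalanced traits the fitted line would instead be dominated by the extreme values.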


2014 : Penone

Evaluate the performance of four approaches for estimating missing values in trait databases and test whether imputed datasets retain underlying allometric relationships among traits :

adding phylogenetic information reduced the error for all approaches except kNN 
this result was stronger for missForest than for mice

without phylo, mice gave better results than missForest
this difference was not significant with phylo
bias was lower when missing data were imputed rather than deleted,
especially when more than 30% of the data were missing.

with phylo, all approaches had comparable bias, 
significantly lower than bias in datasets with missing data

phylo reduced the bias for missForest but not for mice
imputation approaches with phylogenetic variance-covariance matrix (Phylopars) 
and phylogenetic eigenvectors (missForest and mice) gave similar results

phylo did not improve estimation equally among traits,
being more important where phylogenetic signal is stronger 
and when there are no other traits with strong signal


2015 : Schrodt

A new approach (BHPMF) which imputes trait values based on the taxonomic hierarchy, structure within the trait matrix and trait–environment relationships at the same time as providing uncertainty estimates for each single trait prediction.

prediction accuracy is not related to the number of entries per trait
BHPMF outperforms the species MEAN baseline in all aspects: 
RMSE and \(R^2\) of predicted versus observed entries are smaller and larger, respectively, 
for each individual trait, and trait–trait correlations are better retrieved 

aHPMF provides a concept to extrapolate from point measurements to species ranges 
accounting for intraspecific variability
explicitly taking environmental constraints into account adds little or no improvement to BHPMF
Advancing BHPMF to account for phylogenetic distances instead of taxonomy will improve 
opportunities for efficient out-of-sample predictions, at least in well-resolved clades


2018 : Poyatos

Assess the performance of different imputation methods to fill simulated gaps at different missingness levels in a spatially explicit plant trait dataset

No method performed best consistently for the five studied traits, but, considering all traits 
and performance metrics, MICE informed by relevant ecological variables gave the best results.
However, at higher missingness (>30 %), species mean imputations and regression kriging 
tended to outperform MICE for some traits.
low missingness rates (10 %) = mice and kNN more accurate (NRMSE) than Mean
moderate/high missingness = mice and kNN comparable to or outperformed by Mean, and especially by OrdKrig

trait covariation did not improve imputations at high missingness

results for OrdKrig, compared to mice and kNN, show that spatial structure, rather than
trait covariation, may provide more accurate trait imputations when gaps are frequent
Introducing auxiliary variables as predictors improved MICE performance substantially 
but these improvements were dependent on the specific predictor set and trait
Using imputed or incomplete datasets did not lead to large differences 
in the studied trait relationships when missingness was < 50%
At high missingness, using imputed datasets led to comparatively larger 
departures from the relationships obtained with the complete dataset
No imputation method appeared to perform consistently better than others in preserving trait
relationships at high missingness levels and, under these conditions, using incomplete datasets
appeared to correctly reproduce the observed trait relationships in the complete dataset
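The general design of such a comparison (delete known values at a given rate, impute them, score the imputed values against the truth, repeat) can be sketched as below with the simplest method, mean imputation. This is an assumed skeleton, not the code of the study: the function names and toy trait are hypothetical, and the NRMSE is normalised here by the trait range, which is only one of several conventions.

```python
import math
import random


def nrmse(true, imputed, scale):
    """Root mean squared error divided by `scale` (here: the trait range)."""
    mse = sum((t - i) ** 2 for t, i in zip(true, imputed)) / len(true)
    return math.sqrt(mse) / scale


def benchmark_mean_imputation(values, rate, n_rep=10, seed=0):
    """Delete a fraction `rate` of `values` at random, fill the gaps with
    the mean of the remaining values, and return the NRMSE averaged over
    `n_rep` repetitions. Only the deleted entries are scored."""
    rng = random.Random(seed)
    n_del = max(1, round(rate * len(values)))
    scale = max(values) - min(values)
    scores = []
    for _ in range(n_rep):
        gaps = set(rng.sample(range(len(values)), n_del))
        kept = [v for i, v in enumerate(values) if i not in gaps]
        fill = sum(kept) / len(kept)
        true = [values[i] for i in gaps]
        scores.append(nrmse(true, [fill] * n_del, scale))
    return sum(scores) / n_rep


# hypothetical complete trait, scored at three missingness levels
trait = [float(v) for v in range(1, 21)]
scores = {rate: benchmark_mean_imputation(trait, rate) for rate in (0.1, 0.3, 0.8)}
```

Replacing the mean fill by another imputation method inside the loop gives a like-for-like comparison at each missingness level.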


2020 : Johnson

Evaluate the performance of approaches for handling missing values when considering biased datasets

Rphylopars = most accurate estimate of missing values, best preserve response–trait slope

estimates of missing data were still inaccurate, even with only 5% of values missing

Under severe biases, errors were high with every approach.
complete-case analysis frequently outperformed Mice imputation and, to a lesser degree, BHPMF
If the objective of the imputation is to produce estimates of missing values, 
single imputation is considered most effective
if the objective is to model imputed values against another variable, 
the added error in the multiple imputation is advantageous
Including phylogenetic information generally improved imputation performance in every method
Including the response in the imputation :
- when response–trait slope positive : decreased imputation error in all approaches, 
decreased slope error substantially in Mice (-> almost comparable with Rphylopars), 
increased slope error in Rphylopars and BHPMF
- when no relationship between trait and response : 
increased imputation and slope errors in every approach, but with a small effect

-> broadly advisable to include response with Mice, to exclude it with Rphylopars and BHPMF
Missingness, phylogenetic clustering and change in mean were important predictors of 
slope error, significant differences in slope error and imputation error.
there is no single best solution to deal with missing data
researchers need to assess the available data and consider the need for imputation versus 
limiting the scope of the study or completing analyses for separate groups. Use of data 
imputation should be scrutinized, checking for changes in the data before and after imputation, 
measuring: missingness, phylogenetic clustering, a change in mean and a change in slope
The threshold for deciding whether imputation is accurate depends on the research question.
  • Currently, at least 160 packages for handling missing data available on the R-CRAN repository (Josse et al., 2020) \(\rightarrow\) Testing all available imputation methods was not feasible
  • imputation can only be successful if it accounts for the mechanism by which data are missing
  • impact of phylogenetic signal strength on imputation performance already tested (Kim 2018, Molina-Venegas 2018); therefore, Pagel’s λ between phylogeny and traits was standardized at \(\sim\) 1
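Some of the before/after checks recommended above can be computed in a few lines. The sketch below covers missingness, change in mean and change in the response–trait slope; phylogenetic clustering is left out because it requires a phylogeny. The function names and toy data are hypothetical, and the slope is taken as the ordinary least squares slope of the response on the trait.

```python
from statistics import mean


def ols_slope(x, y):
    """Ordinary least squares slope of y regressed on x."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / sum((a - mx) ** 2 for a in x)


def imputation_diagnostics(response, trait_incomplete, trait_imputed):
    """Compare the observed (complete-case) data with the imputed data."""
    obs = [(r, t) for r, t in zip(response, trait_incomplete) if t is not None]
    obs_r = [r for r, _ in obs]
    obs_t = [t for _, t in obs]
    return {
        "missingness": 1 - len(obs) / len(trait_incomplete),
        "change_in_mean": mean(trait_imputed) - mean(obs_t),
        "change_in_slope": ols_slope(trait_imputed, response)
        - ols_slope(obs_t, obs_r),
    }


# hypothetical response, incomplete trait and imputed trait
response = [1.0, 2.0, 3.0, 4.0, 5.0]
trait_incomplete = [1.0, 2.0, 3.0, None, None]
trait_imputed = [1.0, 2.0, 3.0, 3.5, 3.0]
diag = imputation_diagnostics(response, trait_incomplete, trait_imputed)
```

What counts as an acceptable change is not fixed by the code; as noted above, the threshold depends on the research question.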



Synthesis


Methods description


| METHODS | Group | Details | Use phylo | PAPER | SOFTWARE |
|---|---|---|---|---|---|
| mean/median | simple | replace with the mean/median of available trait values | No | | |
| kNN | simple | k-Nearest Neighbour | (eigenvectors from PCoA) | Troyanskaya 2001 | VIM pkg |
| random forest | simple | | (eigenvectors from PCoA) | | missForest pkg |
| dissimilarity | simple | species with the same functional strategy have a similar set of functional traits; Gower distance + threshold | No | Westoby 2002, Diaz 2004, Taugourdeau 2014 | FD pkg |
| relationship | simple/multiple | regression models; but also: kNN, random forest, matrix factorization… | (eigenvectors from PCoA) | Wright et al. 2004, 2006 | dispeRsal function (Tamme 2014) |
| ordinary/regression kriging | simple | | No | | automap pkg |
| (R)phylopars | simple (but associated variance) | likelihood-based approach (Phylopars) using both phylogeny and allometric relationships among traits (phylogenetic variance-covariance matrix) | Yes | Bruggeman 2009, Goolsby 2017 | Phylopars soft, Rphylopars pkg |
| PMF | simple | Probabilistic Matrix Factorization, models a sparse matrix as the scalar product of 2 latent matrices to find a factorization minimizing the error between predicted and observed data | No | | |
| HPMF | simple | Hierarchical Probabilistic Matrix Factorization, coupled with phylogenetic information | (taxonomy) | Shan 2012 | |
| BHPMF | simple (but associated variance) | Bayesian Hierarchical Probabilistic Matrix Factorization, coupled with a Gibbs sampler | (taxonomy) | Schrodt 2015 | BHPMF pkg |
| MICE | multiple | Multivariate imputation by chained equations | (eigenvectors from PCoA) | Azur 2011 | mice pkg |
| jomo | multiple | JOint MOdelling approach for multiple imputation of multilevel data | | Quartagno 2019 | jomo pkg |
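The idea behind the PMF entry in the table above (two latent matrices whose product approximates the observed entries) can be sketched with a plain stochastic gradient descent. This is an assumed toy version, not the published PMF, HPMF or BHPMF code: function names, hyperparameters and the small matrix are hypothetical, and no taxonomic hierarchy or Gibbs sampler is involved.

```python
import random


def pmf_fit(matrix, rank=2, n_epochs=500, lr=0.01, reg=0.01, seed=0):
    """Factorise `matrix` (rows of floats, None when missing) as U V^T by
    stochastic gradient descent on the observed entries only, with an L2
    penalty on the latent factors."""
    rng = random.Random(seed)
    n_rows, n_cols = len(matrix), len(matrix[0])
    U = [[rng.gauss(0, 0.1) for _ in range(rank)] for _ in range(n_rows)]
    V = [[rng.gauss(0, 0.1) for _ in range(rank)] for _ in range(n_cols)]
    observed = [(i, j, matrix[i][j]) for i in range(n_rows)
                for j in range(n_cols) if matrix[i][j] is not None]
    for _ in range(n_epochs):
        rng.shuffle(observed)
        for i, j, x in observed:
            err = x - sum(u * v for u, v in zip(U[i], V[j]))
            for k in range(rank):
                u, v = U[i][k], V[j][k]
                U[i][k] += lr * (err * v - reg * u)
                V[j][k] += lr * (err * u - reg * v)
    return U, V


def pmf_predict(U, V, i, j):
    """Predicted value for row i, column j (observed or missing)."""
    return sum(u * v for u, v in zip(U[i], V[j]))


# hypothetical species x trait matrix with one gap (bottom-right entry)
X = [[1.0, 2.0, 3.0],
     [2.0, 4.0, 6.0],
     [3.0, 6.0, None]]
U, V = pmf_fit(X)
gap_estimate = pmf_predict(U, V, 2, 2)
```

The toy matrix is of rank 1, the favourable case for this family of methods; when traits are weakly correlated, the low-rank assumption no longer holds and the factorization carries little information about the gaps.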



Papers making comparisons


PAPERS Tamme (2014) Taugourdeau (2014) Penone (2014) Schrodt (2015) Poyatos (2018) Johnson (2020)
no fam 102 358 4
no sp 576 1054 273 14320 13 500 (sim)
no traits (resp) 1 9 4 13 5 1
no traits (used) 5 4
no methods + relationship + mean
+ median
+ dissimilarity
+ relationship
+ mice
+ kNN
+ mice
+ missForest
+ Phylopars
+ PMF
+ BHPMF
+ aHPMF
+ mean and species mean
+ ordinary and regression kriging
+ kNN
+ mice
+ BHPMF
+ Rphylopars
+ mice
missing data (sim) [10 prob of deletion \(0.01 \rightarrow 0.46\)] x [10 rep] (sim) [\(10\% \rightarrow 80\%\)] x [10 rep] (obs) mean/trait \(\sim 79.9\%\) (sim) [\(10\% \rightarrow 80\%\)] x [30 rep] (sim) [2 resp-trait slopes] x [2 cor levels] x [4 bias types] x [2 bias severity] x [\(5\% \rightarrow 80\%\)] x [10 rep]
evaluation \(R^2\) MRdAE - NRMSE
- effect of dataset/missing data/method on errors
- slope deviation
- RMSE
- \(R^2\)
- SMA regression
- procrustes analysis
- NRMSE, KGE
- correlation matrices deviation
- RMSE
- slope deviation
software dispeRsal function 2 authors’ functions, mice pkg VIM, mice, missForest pkg, Phylopars soft MATLAB automap, VIM, mice pkg BHPMF, Rphylopars, mice pkg



Methods +/- (according to papers)


| METHODS | Positive points | Negative points |
|---|---|---|
| mean/median | accurate for datasets with small percentages of missing values (Schafer 1999) | least accurate (Taugourdeau 2014); ignore the variance of the imputed variables (Blonder 2016); severely altered trait distributions, introduced larger errors in selected trait correlations, tended to cause larger deviations in the correlation matrix (Poyatos 2018) |
| kNN | can deal with categorical variables (nominal or ordinal) (Penone 2014) | produced larger errors and induced more bias in the allometric relationship; must specify a value of the tuning parameter k, which is difficult to determine a priori (Penone 2014); tends to introduce larger bias in bivariate trait relationships compared to MICE (Poyatos 2018) |
| random forest | performed better without including phylogeny; can deal with categorical variables (nominal or ordinal) (Penone 2014) | |
| dissimilarity | Gower dissimilarity can be computed with missing data; most accurate when the trait distribution is unbalanced (Taugourdeau 2014) | cannot be calculated between two species if no trait is documented for both species (Taugourdeau 2014) |
| ordinary/regression kriging | | altered distributions and trait correlations more than mice but they performed similarly in terms of delta cormat at all missingness levels (Poyatos 2018) |
| relationship | | depending on the set of species used and traits available, different traits are important (Tamme 2014); requires several traits per plant/species to be documented; in most cases less accurate than the dissimilarity method; does not perform well on very unbalanced traits (like SNP) because the multilinear model is strongly governed by extreme values; very sensitive to the percentage of missing data (Taugourdeau 2014) |
| (R)phylopars | uses a phylogeny and a sparse trait matrix to estimate simultaneously the across-species (phylogenetic) and within-species (phenotypic) trait covariance (similar to a phylogenetic mixed model) to reconstruct the ancestral state and impute missing values (Goolsby 2017) | depends on the phylogenetic signal in a trait; cannot deal with categorical traits |
| PMF | | like PCA, efficient if the original matrix is of low rank, i.e. if the axes of the original matrix provide strong correlations; accuracy worse than using species mean trait values to fill the gaps (Schrodt 2015) |
| HPMF | satisfactory to predict trait values when information at the genus level is available (Shan 2012); needs only at least one trait value per plant | |
| BHPMF | on average, across all traits, outperforms PMF, MEAN and aHPMF, with MEAN being significantly more accurate than PMF; advantage over MEAN largest for ‘physiological traits’ (leaf N, leaf P), and smaller for more ‘structural traits’ (seed mass, plant height); BHPMF and MEAN capture these general trait–trait correlations, but BHPMF reproduces extreme values more accurately than MEAN and is therefore generally better at capturing the shape of the scatter of observed trait data (Schrodt 2015) | presence of strong trait–trait correlations is a prerequisite for the accuracy of BHPMF (Schrodt 2015) |
| MICE | more accurate than all other methods for all traits except for the specific leaf area (Taugourdeau 2014); smaller error and bias as compared to other multiple imputation approaches (Ambler, Omar & Royston 2007); performed better without including phylogeny; can deal with categorical variables (nominal or ordinal) (Penone 2014); closely tracked observed trait distributions, introduced the least error in trait correlations under high missingness levels and yielded low delta cormat at extreme missingness levels (Poyatos 2018); perform well when there is no true relationship between response and trait (Johnson 2020) | affected by the percentage of missing data for 6 (whole subdatabase) and 7 (herbaceous subdatabase) traits (Taugourdeau 2014); linear dependencies between variables cause fatal errors and should be eliminated before imputation (Penone 2014); perform poorly when there is a positive relationship between response and trait, even when including the response; phylogenetic information should only be included when data are missing with no bias or a weak bias (Johnson 2020) |
| jomo | | |



Citations

  • Azur M J, Stuart E A, Frangakis C and Leaf P J. 2011. “Multiple Imputation by Chained Equations: What Is It and How Does It Work?” International Journal of Methods in Psychiatric Research 20 (1): 40–49. https://doi.org/10.1002/mpr.329
  • Diniz-Filho J A F, Villalobos F and Bini L M. 2015. “The Best of Both Worlds: Phylogenetic Eigenvector Regression and Mapping.” Genetics and Molecular Biology 38 (3): 396–400.
  • Goolsby E W, Bruggeman J and Ané C. 2017. “Rphylopars: Fast Multivariate Phylogenetic Comparative Methods for Missing Data and within-Species Variation.” Methods in Ecology and Evolution 8 (1): 22–27. https://doi.org/10.1111/2041-210X.12612
  • Johnson T F, Isaac N J B, Paviolo A and González‐Suárez M. 2020. “Handling Missing Values in Trait Data.” Global Ecology and Biogeography, no. December 2019: 1–12. https://doi.org/10.1111/geb.13185
  • Molina-Venegas R, Moreno-Saiz J C, Parga I C, Davies T J, Peres-Neto P R and Rodríguez M. 2018. “Assessing Among-Lineage Variability in Phylogenetic Imputation of Functional Trait Datasets.” Ecography 41 (10): 1740–49. https://doi.org/10.1111/ecog.03480
  • Penone C, Davidson A D, Shoemaker K T, Di Marco M, Rondinini C, Brooks T M, Young B E, Graham C H and Costa G C. 2014. “Imputation of Missing Data in Life-History Trait Datasets: Which Approach Performs the Best?” Methods in Ecology and Evolution 5 (9): 961–70. https://doi.org/10.1111/2041-210X.12232
  • Poyatos R, Sus O, Badiella L, Mencuccini M and Martínez-Vilalta J. 2018. “Gap-Filling a Spatially Explicit Plant Trait Database: Comparing Imputation Methods and Different Levels of Environmental Information.” Biogeosciences 15 (9): 2601–17. https://doi.org/10.5194/bg-15-2601-2018
  • Quartagno M, Grund S and Carpenter J. 2019. “Jomo: A Flexible Package for Two-Level Joint Modelling Multiple Imputation.” R Journal 11 (2): 205–28. https://doi.org/10.32614/RJ-2019-028
  • Schrodt F, Kattge J, Shan H, Fazayeli F, Joswig J, Banerjee A, Reichstein M, et al. 2015. “BHPMF - a Hierarchical Bayesian Approach to Gap-Filling and Trait Prediction for Macroecology and Functional Biogeography.” Global Ecology and Biogeography 24 (12): 1510–21. https://doi.org/10.1111/geb.12335
  • Shan H, Kattge J, Reich P B, Banerjee A, Schrodt F and Reichstein M. 2012. “Gap Filling in the Plant Kingdom - Trait Prediction Using Hierarchical Probabilistic Matrix Factorization.” In Proceedings of the 29th International Conference on Machine Learning, ICML 2012, 2:1303–10.
  • Tamme R, Götzenberger L, Zobel M, Bullock J M, Hooftman D AP, Kaasik A and Pärtel M. 2014. “Predicting Species’ Maximum Dispersal Distances from Simple Plant Traits.” Ecology 95 (2): 505–13. https://doi.org/10.1890/13-1000.1
  • Taugourdeau S, Villerd J, Plantureux S, Huguenin-Elie O and Amiaud B. 2014. “Filling the Gap in Functional Trait Databases: Use of Ecological Hypotheses to Replace Missing Data.” Ecology and Evolution 4 (7): 944–58. https://doi.org/10.1002/ece3.989