This document introduces two methods for the parameter estimation of lifetime models. Where Median Rank Regression (MRR) fits a straight line through transformed plotting positions (transformation is described precisely in vignette(topic = "Life_Data_Analysis_Part_I", package = "weibulltools")
), Maximum Likelihood (ML) strives to maximize a function of the parameters given the sample data. If the parameters are obtained a cumulative distribution function (CDF) can be computed and added to a probability plot.
In the theoretical part of this vignette we will focus on a two-parameter Weibull distribution. The second part is about the application of the provided estimation methods in weibulltools
. All implemented models can be found in the help pages of rank_regression()
and ml_estimation()
.
The Weibull distribution is a continuous probability distribution, which is specified by the shape parameter \(\beta\) and the scale parameter \(\eta\) (two-parameter representation). Its CDF and PDF (probability density function) are given by the following formula: \[F(t)=1-\exp\left[ -\left(\frac{t}{\eta}\right)^{\beta}\right]\] \[f(t)=\frac{\beta}{\eta} \cdot \left(\frac{t}{\eta}\right)^{\beta-1} \cdot \exp\left[-\left(\frac{t}{\eta}\right)^\beta\right]\] The practical benefit of the Weibull in the field of lifetime analysis is that the common profiles of failure rates, which are observed over the lifetime of a large number of technical products, can be described using this statistical distribution.
In the following, the estimation of the specific parameters \(\beta\) and \(\eta\) is explained.
In MRR the cumulative distribution function is linearized so that the true, unknown population is estimated by a straight line which is analytically placed among the plotting pairs.
The lifetime characteristic, entered on the x-axis, is displayed on a logarithmic scale. A double-logarithmic representation of the estimated failure probabilities is used for the y-axis. Using Ordinary Least Squares (OLS) we will determine a best-fit line in order that the sum of squared deviations between this fitted regression line and the plotted points is minimized.
In reliability analysis, it became prevalent that the line is placed in the probability plot so that the horizontal distances between the best-fit line and the points are minimized 1. This procedure is called x on y rank regression.
The formulas for estimating the slope and the intercept of the regression line according to the described method are given below.
Slope: \[\hat{b}=\frac{\sum_{i=1}^{n}(x_i-\bar{x})\cdot(y_i-\bar{y})}{\sum_{i=1}^{n}(y_i-\bar{y})^2}\]
Intercept:
\[\hat{a}=\bar{x}-\hat{b}\cdot\bar{y}\]
With
\[x_i=\log(t_i)\;;\; \bar{x}=\frac{1}{n}\cdot\sum_{i=1}^{n}\log(t_i)\;;\]
as well as
\[y_i=\log\left[-\log(1-F(t_i))\right]\;and \; \bar{y}=\frac{1}{n}\cdot\sum_{i=1}^{n}\log\left[-\log(1-F(t_i))\right].\]
In order to obtain the Weibull-specific parameters the slope and the intercept needs to be transformed 2.
\[\hat{\beta}=\frac{1}{\hat{b}}\]
\[\hat{\eta}=\exp(\hat{a})\] Using the location-scale parameterization, which is mentioned in vignette(topic = "Life_Data_Analysis_Part_I", package = "weibulltools")
, we can see that \(b\) equals \(\sigma\) and \(a\) equals \(\mu\).
The ML method of Ronald A. Fisher estimates the parameters by maximizing the likelihood function. Assuming a theoretical distribution, the idea of ML is that the specific parameters are chosen in such a way that the plausibility of obtaining the present sample is maximized.
The log-likelihood is given by the following equation:
\[\log L = n \cdot \log\left(\frac{\beta}{\eta}\right) - \sum_{i=1}^n\left(\frac{t_i}{\eta}\right)^\beta + \left(\beta - 1\right) \cdot \sum_{i = 1}^n\log\left(\frac{t_i}{\eta}\right)\]
Deriving and nullifying the log-likelihood function according to \(\beta\) results in: \[\frac{\partial \log L}{\partial \beta}=\frac{n}{\beta}+\sum_{i=1}^{n}\log(t_i)-\frac{n\cdot\sum_{i=1}^{n}(t_i^{\beta}\cdot \log(t_i))}{\sum_{i=1}^{n}(t_i)^\beta}=0\]
Since there is no closed-form expression \(\beta\) needs to be determined numerically. Once received, the parameter \(\eta\) can be calculated analytically. \[\hat{\eta}=[\frac{1}{n}\cdot\sum_{i=1}^{n}t_i^{\hat{\beta}}]^\frac{1}{\hat{\beta}}\]
In large samples, ML estimators have optimality properties. In addition, the simulation studies by Genschel and Meeker 3 have shown that even in small samples it is difficult to find an estimator that regularly has better properties than ML estimators.
To apply the introduced estimation methods and related visualizations the shock
data taken from SPREDA
package is used. In this dataset kilometer-dependent problems that occurred on shock absorbers are reported. In addition to failed items the dataset also contains non-defectives.
The data can be found in Statistical Methods for Reliability Data 4.
weibulltools
Where rank_regression()
determines a regression model (x ~ y
) based on the linearized CDF of a specified model, ml_estimation()
computes the parameters which maximize the log-likelihood of an underlying distribution. Both methods can be applied to complete failure data as well as failure and (multiple) right-censored data.
A special characteristic of these functions is that the implemented distributions are parameterized in such a way that they are part of a (log-)location-scale family with parameters \(\mu\) and \(\sigma\). This representation is often used in reliability analysis.
Subsequent transformations especially to get the Weibull-specific values in the scale-shape parameterization with \(\eta\) and \(\beta\) are included. This parameterization is also used in the stats
package (e.g., in the function pweibull
).
rank_regression()
and ml_estimation()
can also deal with models that have a threshold parameter \(\gamma\) like the three-parametric Weibull. If a three-parametric distribution is specified both methods call corresponding profiling functions.
Inside rank_regression()
the function r_squared_profiling()
is called and if ml_estimation()
is used loglik_profiling()
is executed.
To get information about the provided confidence intervals of the parameters the help pages of rank_regression()
and ml_estimation()
should be read carefully.
In the following we want to apply both methods to the dataset shock
. We assume that the data can be best described by a two-parametric Weibull distribution.
The estimated parameters will be used to calculate the population lines and for the purpose of comparison they will be visualized in one Weibull probability plot.
library(SPREDA) # for dataset shock
data(shock)
# generate random ids for units:
shock$id <- sample(c(letters, LETTERS), size = nrow(shock), replace = FALSE)
# Rank Regression:
# for rank_regression(), estimated failure probabilities are required:
df_shock <- johnson_method(id = shock$id, x = shock$Distance, event = shock$Censor)
# Using all models which are provided in rank_regression:
dists <- c("weibull", "lognormal", "loglogistic", "normal", "logistic", "sev",
"weibull3", "lognormal3", "loglogistic3")
mrr_list <- lapply(dists, rank_regression, x = df_shock$characteristic,
y = df_shock$prob, event = df_shock$status)
r_sq_vec <- sapply(mrr_list, "[[", "r_squared")
names(r_sq_vec) <- dists
r_sq_vec
#> weibull lognormal loglogistic normal logistic
#> 0.9901585 0.9641187 0.9820248 0.9846976 0.9720249
#> sev weibull3 lognormal3 loglogistic3
#> 0.9563309 0.9901585 0.9641187 0.9820248
The use of a two-parametric Weibull (weibull
) is acceptable since \(R^2\) is highest. It should also be noticed that \(R^2\) values for weibull
, lognormal
, loglogistic
are equal to their three-parametric models weibull3
, lognormal3
, loglogistic3
. This means that the estimate of threshold parameter is 0.
We will construct a probability plot for the Weibull distribution and add the estimated regression line.
# Again estimating weibull:
mrr_weibull <- rank_regression(x = df_shock$characteristic, y = df_shock$prob,
event = df_shock$status, distribution = "weibull")
mrr_weibull
#> $coefficients
#> eta beta
#> 28554.795629 2.753265
#>
#> $confint
#> 2.5 % 97.5 %
#> eta 23692.593465 36750.936313
#> beta 1.930218 3.927262
#>
#> $loc_sc_coefficients
#> mu sigma
#> 10.2595802 0.3632051
#>
#> $loc_sc_confint
#> 2.5 % 97.5 %
#> mu 10.0729178 10.5119190
#> sigma 0.2546304 0.5180763
#>
#> $r_squared
#> [1] 0.9901585
# Probability plot:
weibull_grid <- plot_prob(x = df_shock$characteristic, y = df_shock$prob,
event = df_shock$status, id = df_shock$id,
distribution = "weibull",
title_main = "Weibull Probability Plot",
title_x = "Mileage in km",
title_y = "Probability of Failure in %",
title_trace = "Defect Shock Absorbers")
library(plotly) # pipe operator
# Add regression line:
weibull_plot <- weibull_grid %>%
plot_mod(x = df_shock$characteristic, loc_sc_params = mrr_weibull$loc_sc_coefficients,
distribution = "weibull",
title_trace = "Median Rank Regression")
weibull_plot
# Using all models which are provided in ml_estimation:
ml_list <- lapply(dists, ml_estimation, x = df_shock$characteristic,
event = df_shock$status)
loglik_vec <- sapply(ml_list, "[[", "logL")
names(loglik_vec) <- dists
loglik_vec
#> weibull lognormal loglogistic normal logistic
#> -123.9954 -124.6085 -124.3654 -124.2301 -124.5476
#> sev weibull3 lognormal3 loglogistic3
#> -124.6229 -123.9954 -124.6085 -124.3654
We can see that the estimate for threshold parameter of the three-parametric distributions is 0, since the log-likelihood values of weibull
, lognormal
, loglogistic
are equal to their three-parametric representations weibull3
, lognormal3
, loglogistic3
. As the log-likelihood value of the two-parameter Weibull is highest, its usage is again justified.
For comparison Figure 1 will be extended by a straight line estimated with ML. Thus, we use the add_lines()
function which is part of package plotly
.
# Again estimating weibull:
ml_weibull <- ml_estimation(x = df_shock$characteristic, event = df_shock$status,
distribution = "weibull")
ml_weibull
#> $coefficients
#> eta beta
#> 27718.714218 3.160465
#>
#> $confint
#> 2.5 % 97.5 %
#> eta 22347.783283 34380.462178
#> beta 2.008731 4.972561
#>
#> $loc_sc_coefficients
#> mu sigma
#> 10.2298631 0.3164092
#>
#> $loc_sc_confint
#> 2.5 % 97.5 %
#> mu 10.0144824 10.4452437
#> sigma 0.2011036 0.4978267
#>
#> $loc_sc_vcov
#> mu sigma
#> mu 0.012075836 0.003990396
#> sigma 0.003990396 0.005353183
#>
#> $logL
#> [1] -123.9954
#>
#> $aic
#> [1] 251.9907
#>
#> $bic
#> [1] 255.2659
# Add ML estimation to weibull_plot:
## predict_prob to calculate CDF with ML-parameters:
ml_prob <- predict_prob(q = seq(6200, 30600, length.out = 100),
loc_sc_params = ml_weibull$loc_sc_coefficients,
distribution = "weibull")
weibull_both <- weibull_plot %>%
add_lines(x = seq(6200, 30600, length.out = 100), y = SPREDA::qsev(ml_prob),
name = "Maximum Likelihood", color = I("#006400"), hoverinfo = "text",
text = ~paste(paste("\u03B7<sub>ML</sub>", ":",
round(ml_weibull$coefficients[[1]], digits = 2)),
"<br>", paste("\u03B2<sub>ML</sub>", ":",
round(ml_weibull$coefficients[[2]], digits = 2))))
weibull_both
Finally we will use the dataset Alloy T7989
in which the cycles until a fatigue failure of a special alloy occurs are inspected. The data is also taken from Meeker and Escobar 5. The authors visualized the data in a Weibull and Log-normal probability plot and detected a right-curved pattern. Thus, two-parametric distributions won’t be adequate and they fitted three-parametric models.
In the following we want to compare the fit of a two- and three-parametric Log-normal using method ml_estimation()
.
# Data:
cycles <- c(300, 300, 300, 300, 300, 291, 274, 271, 269, 257, 256, 227, 226,
224, 213, 211, 205, 203, 197, 196, 190, 189, 188, 187, 184, 180,
180, 177, 176, 173, 172, 171, 170, 170, 169, 168, 168, 162, 159,
159, 159, 159, 152, 152, 149, 149, 144, 143, 141, 141, 140, 139,
139, 136, 135, 133, 131, 129, 123, 121, 121, 118, 117, 117, 114,
112, 108, 104, 99, 99, 96, 94)
state <- c(rep(0, 5), rep(1, 67))
id <- 1:length(cycles)
# Two-parameter Log-normal:
ml_lognormal <- ml_estimation(x = cycles, event = state,
distribution = "lognormal")
ml_lognormal
#> $loc_sc_coefficients
#> mu sigma
#> 5.1277862 0.3276406
#>
#> $loc_sc_confint
#> 2.5 % 97.5 %
#> mu 5.051722 5.2038508
#> sigma 0.275584 0.3895306
#>
#> $loc_sc_vcov
#> mu sigma
#> mu 1.506155e-03 3.700296e-05
#> sigma 3.700296e-05 8.365989e-04
#>
#> $logL
#> [1] -367.0069
#>
#> $aic
#> [1] 738.0138
#>
#> $bic
#> [1] 742.5672
# Three-parameter Log-normal:
ml_lognormal3 <- ml_estimation(x = cycles, event = state,
distribution = "lognormal3")
ml_lognormal3
#> $loc_sc_coefficients
#> mu sigma gamma
#> 4.5014798 0.6131893 72.0727338
#>
#> $loc_sc_confint
#> 2.5 % 97.5 %
#> mu 4.1253687 4.877591
#> sigma 0.4042741 0.930065
#> gamma 45.5887792 98.556689
#>
#> $loc_sc_vcov
#> mu sigma gamma
#> mu 0.03682443 -0.02092922 -2.399804
#> sigma -0.02092922 0.01698601 1.602684
#> gamma -2.39980443 1.60268418 182.586847
#>
#> $logL
#> [1] -364.2033
#>
#> $aic
#> [1] 734.4066
#>
#> $bic
#> [1] 741.2366
The two model selection criteria aic
and bic
are smaller for lognormal3
meaning that this model is preferable.
# Constructing probability plot:
df_alloy <- johnson_method(x = cycles, event = state, id = id)
lognormal_grid <- plot_prob(x = df_alloy$characteristic, y = df_alloy$prob,
event = df_alloy$status, id = df_alloy$id,
distribution = "lognormal",
title_main = "Log-normal Probability Plot",
title_x = "Cycles",
title_y = "Probability of Failure in %",
title_trace = "Failed Units")
# Add three-parametric model to grid:
lognormal_plot <- lognormal_grid %>%
plot_mod(x = df_alloy$characteristic, loc_sc_params = ml_lognormal3$loc_sc_coefficients,
distribution = "lognormal3",
title_trace = "Three-parametric Log-normal")
lognormal_plot
# Add two-parametric model to lognormal_plot:
## predict_prob to calculate CDF with ML-parameters:
ml_prob_lognormal <- predict_prob(q = seq(85, 325, length.out = 100),
loc_sc_params = ml_lognormal$loc_sc_coefficients,
distribution = "lognormal")
lognormal_both <- lognormal_plot %>%
add_lines(x = seq(85, 325, length.out = 100), y = qnorm(ml_prob_lognormal),
name = "Two-parametric Log-normal", color = I("#006400"), hoverinfo = "text",
text = ~paste(paste("\u03BC<sub>ML</sub>", ":",
round(ml_lognormal$loc_sc_coefficients[[1]], digits = 2)),
"<br>", paste("\u03C3<sub>ML</sub>", ":",
round(ml_lognormal$loc_sc_coefficients[[2]], digits = 2))))
lognormal_both
In Figure 4 we can see that the data is better described if the three-parametric model is used.
Berkson, J.: Are There Two Regressions?, Journal of the American Statistical Association 45 (250), DOI: 10.2307/2280676, 1950, pp. 164-180↩
ReliaSoft Corporation: Life Data Analysis Reference Book, online: ReliaSoft, accessed 09 January 2018↩
Genschel, U.; Meeker, W. Q.: A Comparison of Maximum Likelihood and Median-Rank Regression for Weibull Estimation, in: Quality Engineering 22 (4), DOI: 10.1080/08982112.2010.503447, 2010, pp. 236-255↩
Meeker, W. Q.; Escobar, L. A.: Statistical Methods for Reliability Data, New York, Wiley series in probability and statistics, 1998, p. 630↩
Meeker, W. Q.; Escobar, L. A.: Statistical Methods for Reliability Data, New York, Wiley series in probability and statistics, 1998, p. 131↩