October 16, 2016
Title Construction of Regular and Irregular Histograms with Different
Options for Automatic Choice of Bins
Author Thoralf Mildenberger [aut, cre],
Yves Rozenholc [aut],
David Zasada [aut]
Maintainer Thoralf Mildenberger
Description Automatic construction of regular and irregular histograms as described in Rozenholc/Mildenberger/Gather.
License GPL (>= 2)
Date/Publication 2016-10-16 17:39:09
R topics documented:
histogram
histogram with automatic choice of bins
Construction of regular and irregular histograms with different options for choosing the number
and widths of the bins. By default, both a regular and an irregular histogram using a data-dependent
penalty as described in detail in Rozenholc/Mildenberger/Gather (2009) are constructed. The final
estimate is the one with the larger penalized likelihood.
histogram(y, type = "combined", grid = "data",
greedy = TRUE, right=TRUE, control = list(),
verbose = TRUE, plot = TRUE)
y: a vector of values for which the histogram is desired.
type: use "irregular" for an irregular and "regular" for a regular histogram. If
type="combined" (default value), both a regular and an irregular histogram are
computed and the one with the larger penalized likelihood is chosen; see details.
grid: if type="irregular", grid chooses the set of possible partitions of the data
range. The default value "data" gives a set of partitions constructed from the
data points; "regular" uses a fine regular grid of points as possible break points.
A regular quantile grid can be chosen using "quantiles". Has no effect for
type="regular".
breaks: controls the maximum number of bins allowed in a regular histogram, or the
size of the finest grid in an irregular histogram when grid is set to "regular"
or "quantiles". Usually not needed, since the maximum bin number and the
size of the finest grid are calculated by a formula depending on the sample size
n; the defaults for this can be changed using the parameters g1, g2 and g3 in
the control argument. See also maxbin in the control argument for an
absolute upper bound on the number of bins if type="regular".
penalty: controls which penalty is used. See the description of penalties below.
greedy: logical; if TRUE and type="irregular", a subgrid of the finest grid is
constructed by a greedy step to make the search procedure feasible. Has no effect
for regular histograms.
right: logical; if TRUE, the histogram cells are right-closed (left-open) intervals.
control: list of additional control parameters. Meaning and default values depend on
settings of type, penalty and grid. See below.
verbose: logical; if TRUE (default), information is reported during histogram
construction and the resulting histogram object is printed.
plot: logical. If TRUE (default), the histogram is plotted.
The histogram procedure produces a histogram, i.e. a piecewise constant density estimate from a
univariate real-valued sample stored in a vector y. Let n denote the length of y. The range of the
data is partitioned into D intervals, called bins, and the density estimate on the i-th bin is given
by $N_i/(n w_i)$, where $N_i$ is the number of observations in the i-th bin and $w_i$ is its width. The
histogram thus defined is the maximum likelihood estimate among all densities that are piecewise
constant w.r.t. this partition. The arguments of histogram given above determine the way the
partition is chosen. In a regular histogram, the partition consists of D bins of the same width, and
the histogram is determined by the choice of D. Strategies based on different criteria can be chosen
using the penalty option. The maximum number of bins can be controlled by either the breaks
argument or the corresponding parameters in the control argument.
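The estimate described above can be sketched in base R. This is only an illustration under stated assumptions: the helper name `regular_density` and the simulated sample are hypothetical and not part of the package, and the number of bins D is simply fixed by hand rather than chosen by a penalty.

```r
# Piecewise constant density estimate N_i / (n * w_i) on D equal-width bins.
# `regular_density` is a hypothetical helper, not a function of the package.
regular_density <- function(y, D) {
  n <- length(y)
  breaks <- seq(min(y), max(y), length.out = D + 1)   # D bins of equal width
  counts <- as.vector(table(cut(y, breaks, include.lowest = TRUE)))  # N_i
  widths <- diff(breaks)                               # w_i
  list(breaks = breaks, counts = counts, density = counts / (n * widths))
}

set.seed(1)
y <- rnorm(200)
est <- regular_density(y, D = 8)

# The estimate integrates to one over the sample range.
sum(est$density * diff(est$breaks))
```

Because every observation falls into exactly one bin, the bin heights times the bin widths always sum to one, whatever value of D is used.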
An irregular histogram allows for bins of different widths. In this case, not only the number D
of bins but also the breakpoints between the bins must be chosen. The set of allowed breakpoints
is given by the finest partition selected using the grid argument. At the moment, a finest regular
grid is supported (grid="regular"), as well as grids with possible breakpoints either equal to the
observations or between the observations (grid="data" with between in the control argument set
to FALSE or TRUE, respectively). Setting grid="quantiles" gives a grid based on regular sample
quantiles. If the breaks argument is NULL, the quantity
$G(n) = g1 \cdot n^{g2} \cdot (\log n)^{g3}$
controls the grid in the following way: the smallest allowed bin width in a "data" grid is 1/G(n)
times the sample range, while for grid="regular" and grid="quantiles" the finest grid has
floor(G(n)) bins. The parameters g1, g2 and g3 can be changed by modifying the corresponding
components in the control argument. If breaks is a positive number, its integer part is used
instead of G(n). Different strategies for selection of D and the bin boundaries can be chosen using
the penalty option.
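To make the role of G(n) concrete, here is a small sketch that assumes G(n) has the form g1 * n^g2 * (log n)^g3. The helper name `grid_bound` is hypothetical, and the value g3 = -1 used as the function default is an assumption made for this illustration.

```r
# G(n) = g1 * n^g2 * (log n)^g3; `grid_bound` is a hypothetical helper name.
# The default g3 = -1 is an assumption for this sketch.
grid_bound <- function(n, g1 = 1, g2 = 1, g3 = -1) {
  g1 * n^g2 * log(n)^g3
}

n <- 500
G <- grid_bound(n)   # equals n / log(n) for these parameter values
floor(G)             # size of the finest regular or quantile grid

# With g3 = Inf the bound is infinite, so the minimum width 1 / G(n) is zero.
1 / grid_bound(n, g3 = Inf)
```

A positive value passed as breaks would take the place of G(n) in this calculation.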
To reduce calculation time for irregular histograms, a subset of the breakpoints of the finest grid
can be chosen by starting from a one-bin histogram and then successively finding the split of an
existing bin that leads to the largest increase in the loglikelihood. The full optimization is then
performed only over all partitions with endpoints from the subset thus constructed. This is achieved
by setting greedy=TRUE. To reduce calculation time for regular histograms, the maxbin parameter
in the control argument gives an upper bound for the number of bins. The default value is 1000.
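The greedy step can be sketched as follows. This is a simplified illustration, not the package's implementation: the function names `loglik` and `greedy_breaks` are hypothetical, and the stopping rule (a fixed number of splits, `n_splits`) is an assumption, since the text does not say when the greedy step stops.

```r
# Log-likelihood of the histogram density on the partition given by `breaks`.
loglik <- function(y, breaks) {
  counts <- as.vector(table(cut(y, breaks, include.lowest = TRUE)))
  widths <- diff(breaks)
  keep <- counts > 0   # empty bins contribute nothing (0 * log 0 is taken as 0)
  sum(counts[keep] * log(counts[keep] / (length(y) * widths[keep])))
}

# Start from one bin and repeatedly add the candidate breakpoint that gives
# the largest log-likelihood. `n_splits` is a hypothetical stopping rule.
greedy_breaks <- function(y, candidates, n_splits) {
  current <- range(y)
  for (s in seq_len(n_splits)) {
    free <- setdiff(candidates, current)
    if (length(free) == 0) break
    values <- vapply(free, function(b) loglik(y, sort(c(current, b))), numeric(1))
    current <- sort(c(current, free[which.max(values)]))
  }
  current
}

set.seed(1)
y <- c(rnorm(50), rnorm(50, mean = 10))
candidates <- seq(min(y), max(y), length.out = 21)[2:20]  # interior grid points
greedy_breaks(y, candidates, n_splits = 3)
```

The full optimization would then search only over partitions whose endpoints lie in the returned subset.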
Using type="combined" (the default value), both a regular and an irregular histogram are
constructed using a penalized likelihood approach and the one with the larger penalized likelihood is
chosen. In this case, the regular histogram is always constructed using the br penalty. The penalty
parameter and all other options control the construction of the irregular histogram. penalty must
be equal to "penA", "penB" or "penR", since otherwise comparison of penalized likelihood values
would not be meaningful.
An object of class "histogram", which is a list with the same components as the object returned by hist.
Most settings of penalty lead to a penalized maximum likelihood histogram. For a sample of size n
and a partition J that divides the sample range into D bins, define $N_i$ as the number of observations
in and $w_i$ as the width of the i-th bin, i = 1, ..., D. In this section, the
penalized loglikelihood is defined as
$\sum_{i=1}^{D} N_i \log(N_i/(n w_i)) - pen(J)$.
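As a small worked example of this definition, the sketch below evaluates the penalized loglikelihood for a hand-picked partition. The helper name `pen_loglik` and the toy data are hypothetical; the penalty is passed in as a plain number so that any of the penalties can be plugged in.

```r
# Penalized log-likelihood sum_i N_i * log(N_i / (n * w_i)) - pen for the
# partition given by `breaks`; `pen_loglik` is a hypothetical helper name.
pen_loglik <- function(y, breaks, pen = 0) {
  n <- length(y)
  counts <- as.vector(table(cut(y, breaks, include.lowest = TRUE)))
  widths <- diff(breaks)
  nonempty <- counts > 0   # empty bins contribute nothing (0 * log 0 taken as 0)
  sum(counts[nonempty] * log(counts[nonempty] / (n * widths[nonempty]))) - pen
}

y <- c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 1.5, 1.8, 2.0)
# Two bins, [0.1, 1] and (1, 2]: N = (7, 3), w = (0.9, 1), n = 10,
# so the unpenalized value is 7 * log(7 / 9) + 3 * log(3 / 10).
pen_loglik(y, breaks = c(0.1, 1, 2))
```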
The possible penalties are:
penA: $pen(J) = c \log\binom{n-1}{D-1} + \dots$, with a further term involving cα(D − 1)(log ...),
where the default values are c = 1, α = 0.5 and k = 2. These can be changed using the c,
alpha and k components of control.
penB: ... where the default values are c = 1 and α = 1. These can be changed using the c and alpha
components of control. Default penalty for irregular and combined histograms.
penR: ... where the default values are c = 1 and α = 0.5. These can be changed using the c and alpha
components of control.
aic: ... and may be changed using the alpha parameter in the control argument.
bic: Bayesian Information Criterion (BIC). Defined by $pen(J) = \alpha \log(n) D$, where α is 0.5
by default.
... (2009). Only available for regular histograms.
br: Improved version of AIC for regular histograms as given in Birgé and Rozenholc (2006). ...
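To illustrate how a penalty selects the number of bins, the following sketch chooses D for a regular histogram by maximizing the loglikelihood minus the BIC penalty α log(n) D with α = 0.5. The names `select_bins_bic` and `max_bins` are hypothetical, and the search range is an arbitrary choice for this example rather than the package's rule.

```r
# Choose the number of bins D of a regular histogram by maximizing
# loglik(D) - alpha * log(n) * D (the bic penalty with alpha = 0.5).
# `select_bins_bic` and `max_bins` are hypothetical names for this sketch.
select_bins_bic <- function(y, max_bins, alpha = 0.5) {
  n <- length(y)
  crit <- vapply(seq_len(max_bins), function(D) {
    breaks <- seq(min(y), max(y), length.out = D + 1)
    counts <- as.vector(table(cut(y, breaks, include.lowest = TRUE)))
    w <- diff(breaks)
    ok <- counts > 0   # empty bins contribute nothing to the log-likelihood
    sum(counts[ok] * log(counts[ok] / (n * w[ok]))) - alpha * log(n) * D
  }, numeric(1))
  list(D = which.max(crit), criterion = crit)
}

set.seed(42)
y <- c(rnorm(150), rnorm(150, mean = 6))
select_bins_bic(y, max_bins = 30)$D
```

Swapping in another penalty only changes the term subtracted from the loglikelihood.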
Some settings of penalty do not lead to maximization of a penalized likelihood but to optimization of
different measures. These are:
Leave-p-out crossvalidation. Different variants can be chosen by setting the cvformula and p
control parameters. ... regular and irregular histograms. These are different versions of leave-p-out L2-crossvalidation,
where the choice of a partition is achieved by minimizing
... − (n − p + 1) ...
respectively; see formulas (11) and (12) in Celisse and Robin (2008). Since formula 1 does
not depend on p, ... Kullback-Leibler crossvalidation can be performed by setting cvformula=3. This is only
available for regular histograms.
... − 1) + n log(D),
Stochastic Complexity criterion, only available for regular histograms. The number of bins is chosen
by maximizing ... (D − 1)!/(D + n − 1)!,
... The number of bins is chosen by maximizing
$\sum_{i=1}^{D} (N_i - 0.5) \log(N_i - 0.5) - (n - 0.5D) \log(n - 0.5D) + n \log D - 0.5 D \log n$;
see formula (2.5) in Hall and Hannan (1988).
The control argument is a list of additional parameters that control the construction of the
histogram. Meaning and default values depend on the setting of the other parameters.
alpha: Coefficient of the number of bins in penalties penA, penB, aic, bic. Coefficient of the ...
c: Controls the weight of the penalty component that corrects for the multiplicity of partitions with ...
cvformula: determines the type of crossvalidation to be performed. Can take the values 1, 2 and
3. The values 1 and 2 correspond to different versions of L2 crossvalidation, while cvformula=3
performs Kullback-Leibler crossvalidation, which is at the moment only available for regular
histograms. Note that cvformula=3 automatically forces every bin to include at least 2
observations. If p is set to a value greater than 1, cvformula=2 is used automatically.
g1, g2, g3: The parameters g1, g2 and g3 control the maximum number of bins in a regular histogram as
well as the size of the finest grid in an irregular histogram through
$G(n) = g1 \cdot n^{g2} \cdot (\log n)^{g3}$.
The maximum number of bins allowed in a regular histogram is given by floor(G(n)). For an
irregular histogram, if grid="regular", the finest grid is obtained by dividing the sample
range into floor(G(n)) equisized bins, and if grid="quantiles", the finest grid is obtained
by dividing the interval [0, 1] into equisized intervals and using the sample quantiles
corresponding to the boundary points. For an irregular histogram with grid="data", a minimum
allowed bin width of 1/G(n) times the sample range is enforced. This can be disabled by setting g3 to Inf, causing
1/G(n) to be zero. Default settings are g1=1 and g2=1 for all grids. Default values for g3 are
-1 for grid="regular" and grid="quantiles" and Inf for grid="data". Also see maxbin.
k: Tuning parameter that only has an effect if penalty="penA". Default value is 2.
maxbin: Gives an absolute upper bound on the number of bins in order to keep the calculations
feasible. Only affects regular histograms. Defaults to 1000.
p: Controls the number p of data points left out in the crossvalidation. Can take integer values
between ... If p is greater than 1, cvformula is set
to 2, since crossvalidation formula 1 does not depend on p and Kullback-Leibler crossvalidation
is only supported for p=1.
Determines the way the quantiles are calculated if grid="quantiles". Corresponds
to the type argument in quantile, whose default 7 is also the default here.
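The two finest grids described for grid="regular" and grid="quantiles" can be sketched in base R as follows. The number of grid bins K is fixed by hand here as a stand-in for floor(G(n)), and the variable names are illustrative only.

```r
# Finest grids with K bins; K is an illustrative stand-in for floor(G(n)).
set.seed(7)
y <- rexp(200)
K <- 10

# grid = "regular": K equisized bins over the sample range
regular_grid <- seq(min(y), max(y), length.out = K + 1)

# grid = "quantiles": sample quantiles at the boundaries of K equisized
# subintervals of [0, 1], using quantile type 7 (the default of quantile())
quantile_grid <- quantile(y, probs = seq(0, 1, length.out = K + 1),
                          type = 7, names = FALSE)

# Each quantile bin holds roughly n / K observations.
table(cut(y, quantile_grid, include.lowest = TRUE))
```

Both grids share the same endpoints, the sample minimum and maximum; they differ only in where the interior candidate breakpoints are placed.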
Thoralf Mildenberger, Yves Rozenholc, David Zasada.
Birgé, L. and Rozenholc, Y. (2006). How many bins should be put in a regular histogram? ESAIM:
Probability and Statistics, 10, 24-45.
Celisse, A. and Robin, S. (2008). Nonparametric density estimation by exact leave-p-out
cross-validation. Computational Statistics and Data Analysis, 52, 2350-2368.
Davies, P. L., Gather, U., Nordman, D. J. and Weinert, H. (2009). A comparison of automatic
histogram constructions. ESAIM: Probability and Statistics, 13, 181-196.
Hall, P. and Hannan, E. J. (1988). On stochastic complexity and nonparametric density estimation.
Biometrika, 75, 705-714.
Rozenholc, Y., Mildenberger, T. and Gather, U. (2009). Combining regular and irregular
histograms by penalized likelihood. Discussion Paper 31/2009, SFB 823, TU Dortmund.
Rozenholc, Y., Mildenberger, T. and Gather, U. (2010). Combining regular and irregular histograms by
penalized likelihood. Computational Statistics and Data Analysis, 54, 3313-3323.
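## draw a histogram from a standard normal sample (sample sizes and mixture parameters are illustrative)
histogram(rnorm(100))
## draw a histogram from a standard exponential sample
histogram(rexp(100))
## draw a histogram from a normal mixture
y <- c(rnorm(100), rnorm(100, mean = 4, sd = 0.4)); histogram(y)
## the same using a regular histogram with Kullback-Leibler CV (assuming penalty = "cv" selects crossvalidation)
histogram(y, type = "regular", penalty = "cv", control = list(cvformula = 3))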