Histograms

David Gerbing

library("lessR")

Histogram

One of the most frequently encountered visualizations for continuous variables is the histogram.

Histogram: Bin similar values into a group, then plot the frequency of occurrence of the data values in each bin as the height of the corresponding bar.
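The binning step in this definition can be sketched in base R, independent of lessR. The data values in x and the bin boundaries are made up for illustration only. With plot=FALSE, the base R hist() function returns the bin counts without drawing the plot.

```r
# Hypothetical data values, for illustration only
x <- c(41, 47, 52, 55, 58, 63, 66, 71, 78, 94)

# Bin boundaries: six bins of width 10, from 40 to 100
breaks <- seq(40, 100, by = 10)

# Tabulate the count of values in each bin without drawing the plot
h <- hist(x, breaks = breaks, plot = FALSE)
counts <- h$counts

# Each count becomes the height of the corresponding bar
print(data.frame(Midpnt = h$mids, Count = counts))
```

Every data value falls into exactly one bin, so the counts sum to the number of values.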

A call to a function that creates a histogram must include the name of the variable whose values are binned and counted. With the Histogram() function, that variable name is the first argument and often, as in this example, the only argument.

First read the Employee data included as part of lessR.

d <- Read("Employee")
## 
## >>> Suggestions
## Details about your data, Enter:  details()  for d, or  details(name)
## 
## Data Types
## ------------------------------------------------------------
## character: Non-numeric data values
## integer: Numeric data values, integers only
## double: Numeric data values with decimal digits
## ------------------------------------------------------------
## 
##     Variable                  Missing  Unique 
##         Name     Type  Values  Values  Values   First and last values
## ------------------------------------------------------------------------------------------
##  1     Years   integer     36       1      16   7  NA  15 ... 1  2  10
##  2    Gender character     37       0       2   M  M  M ... F  F  M
##  3      Dept character     36       1       5   ADMN  SALE  SALE ... MKTG  SALE  FINC
##  4    Salary    double     37       0      37   53788.26  94494.58 ... 56508.32  57562.36
##  5    JobSat character     35       2       3   med  low  low ... high  low  high
##  6      Plan   integer     37       0       3   1  1  3 ... 2  2  1
##  7       Pre   integer     37       0      27   82  62  96 ... 83  59  80
##  8      Post   integer     37       0      22   92  74  97 ... 90  71  87
## ------------------------------------------------------------------------------------------

To illustrate, consider the continuous variable Salary in the Employee data table. Use Histogram() to tabulate and display the number of employees in each salary bin, here relying upon the default data frame (table) named d.

Histogram(Salary)
Histogram of tabulated counts for the bins of Salary.

## >>> Suggestions 
## bin_width: set the width of each bin 
## bin_start: set the start of the first bin 
## bin_end: set the end of the last bin 
## Density(Salary)  # smoothed density curves plus histogram 
## Plot(Salary)  # Violin/Box/Scatterplot (VBS) plot 
## 
## 
## --- Salary --- 
##  
##      n   miss         mean           sd          min          mdn          max 
##      37      0    73795.557    21799.533    46124.970    69547.600   134419.230 
## 
## 
## (Box plot) Outliers: 1 
##  
## Small      Large 
## -----      ----- 
##             134419.2 
## 
## 
## Bin Width: 10000 
## Number of Bins: 10 
##  
##              Bin  Midpnt  Count    Prop  Cumul.c  Cumul.p 
## --------------------------------------------------------- 
##   40000 >  50000   45000      4    0.11        4     0.11 
##   50000 >  60000   55000      8    0.22       12     0.32 
##   60000 >  70000   65000      8    0.22       20     0.54 
##   70000 >  80000   75000      5    0.14       25     0.68 
##   80000 >  90000   85000      3    0.08       28     0.76 
##   90000 > 100000   95000      5    0.14       33     0.89 
##  100000 > 110000  105000      1    0.03       34     0.92 
##  110000 > 120000  115000      1    0.03       35     0.95 
##  120000 > 130000  125000      1    0.03       36     0.97 
##  130000 > 140000  135000      1    0.03       37     1.00

The Histogram() function provides a default color theme. In addition to the plot, the function provides summary statistics, an outlier analysis based on Tukey’s rules for box plots, and the frequency distribution: the table of bin counts from which the histogram is constructed.
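The remaining columns of the frequency distribution follow directly from the bin counts. As a small base R sketch, the counts below are copied from the output above. Rounding to two decimal digits is an assumption made here to match the displayed values, not a statement about how lessR computes them internally.

```r
# Bin counts copied from the frequency distribution of Salary
counts <- c(4, 8, 8, 5, 3, 5, 1, 1, 1, 1)
n <- sum(counts)                    # total number of employees

prop    <- round(counts / n, 2)     # Prop: proportion in each bin
cumul_c <- cumsum(counts)           # Cumul.c: running total of the counts
cumul_p <- round(cumul_c / n, 2)    # Cumul.p: running total as a proportion

print(data.frame(Count = counts, Prop = prop,
                 Cumul.c = cumul_c, Cumul.p = cumul_p))
```

The last cumulative count equals the sample size, and the last cumulative proportion equals 1.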

Customize the Histogram

The parameters bin_start, bin_width, and bin_end are available to customize the histogram.

Histogram(Salary, bin_start=35000, bin_width=14000)
Customized histogram.

## >>> Suggestions 
## bin_end: set the end of the last bin 
## Density(Salary)  # smoothed density curves plus histogram 
## Plot(Salary)  # Violin/Box/Scatterplot (VBS) plot 
## 
## 
## --- Salary --- 
##  
##      n   miss         mean           sd          min          mdn          max 
##      37      0    73795.557    21799.533    46124.970    69547.600   134419.230 
## 
## 
## (Box plot) Outliers: 1 
##  
## Small      Large 
## -----      ----- 
##             134419.2 
## 
## 
## Bin Width: 14000 
## Number of Bins: 8 
##  
##              Bin  Midpnt  Count    Prop  Cumul.c  Cumul.p 
## --------------------------------------------------------- 
##   35000 >  49000   42000      1    0.03        1     0.03 
##   49000 >  63000   56000     14    0.38       15     0.41 
##   63000 >  77000   70000      9    0.24       24     0.65 
##   77000 >  91000   84000      4    0.11       28     0.76 
##   91000 > 105000   98000      5    0.14       33     0.89 
##  105000 > 119000  112000      2    0.05       35     0.95 
##  119000 > 133000  126000      1    0.03       36     0.97 
##  133000 > 147000  140000      1    0.03       37     1.00
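The bin boundaries in this output follow from bin_start, bin_width, and the largest data value. The following base R sketch reproduces that arithmetic. It is an illustration of the calculation, not the code lessR runs, and the names x_max, n_bins, breaks, and mids are chosen here only for the example.

```r
bin_start <- 35000
bin_width <- 14000
x_max <- 134419.23    # maximum Salary, from the summary statistics

# Number of bins needed for the last bin to reach the maximum value
n_bins <- ceiling((x_max - bin_start) / bin_width)

# Bin boundaries and bin midpoints
breaks <- bin_start + bin_width * (0:n_bins)
mids <- head(breaks, -1) + bin_width / 2

print(n_bins)
print(breaks)
print(mids)
```

The result is 8 bins that span 35000 to 147000, which agrees with the bins listed in the frequency distribution.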

The color is easy to change, either by changing the color theme with style() or by changing only the fill color with the fill parameter. The fill can refer to standard R colors, as shown with the lessR function showColors(), or implicitly invoke the lessR color palette generating function getColors(). Each 30 degrees of the color wheel is named, such as "greens" or "rusts", and each name specifies a sequential color palette.

Use the color parameter to set the border color of the bars, here turned off by setting it to "transparent".

Histogram(Salary, fill="reds", color="transparent")
Customized histogram.

## >>> Suggestions 
## bin_width: set the width of each bin 
## bin_start: set the start of the first bin 
## bin_end: set the end of the last bin 
## Density(Salary)  # smoothed density curves plus histogram 
## Plot(Salary)  # Violin/Box/Scatterplot (VBS) plot 
## 
## 
## --- Salary --- 
##  
##      n   miss         mean           sd          min          mdn          max 
##      37      0    73795.557    21799.533    46124.970    69547.600   134419.230 
## 
## 
## (Box plot) Outliers: 1 
##  
## Small      Large 
## -----      ----- 
##             134419.2 
## 
## 
## Bin Width: 10000 
## Number of Bins: 10 
##  
##              Bin  Midpnt  Count    Prop  Cumul.c  Cumul.p 
## --------------------------------------------------------- 
##   40000 >  50000   45000      4    0.11        4     0.11 
##   50000 >  60000   55000      8    0.22       12     0.32 
##   60000 >  70000   65000      8    0.22       20     0.54 
##   70000 >  80000   75000      5    0.14       25     0.68 
##   80000 >  90000   85000      3    0.08       28     0.76 
##   90000 > 100000   95000      5    0.14       33     0.89 
##  100000 > 110000  105000      1    0.03       34     0.92 
##  110000 > 120000  115000      1    0.03       35     0.95 
##  120000 > 130000  125000      1    0.03       36     0.97 
##  130000 > 140000  135000      1    0.03       37     1.00
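To give a sense of what a sequential palette for a single hue looks like, here is a rough base R sketch that uses the hcl() function from grDevices. The helper seq_palette() is hypothetical and is not part of lessR. The chroma and luminance values are arbitrary choices, and the hue angle of 120, which falls in the green region of the HCL color wheel, is only an assumed example. The colors that lessR generates for a named palette may differ.

```r
# Hypothetical helper, not part of lessR: n colors of a single hue,
# ordered from light to dark in the HCL color space
seq_palette <- function(hue, n = 5) {
  hcl(h = hue, c = 50, l = seq(90, 30, length.out = n))
}

# One hue angle, five colors that vary only in luminance
pal <- seq_palette(hue = 120)
print(pal)
```

Stepping the hue argument in increments of 30 degrees gives twelve such palettes around the color wheel.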

Density Plot

The histogram portrays a continuous distribution with discrete bins.

Density plot: A smooth curve that estimates the underlying continuous distribution.

To obtain the density plot, set the density parameter to TRUE. The result is the filled density curve superimposed on the histogram.

Histogram(Salary, density=TRUE)
Histogram with density plot.

## 
## 
## --- Salary --- 
##  
##      n   miss         mean           sd          min          mdn          max 
##      37      0    73795.557    21799.533    46124.970    69547.600   134419.230
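For comparison, a density estimate can also be computed directly with the base R density() function. The sketch below uses made-up data values and the base R defaults, a Gaussian kernel with an automatically chosen bandwidth. These defaults are not necessarily the settings that Histogram() applies.

```r
# Hypothetical data values, for illustration only
x <- c(41, 47, 52, 55, 58, 63, 66, 71, 78, 94)

# Kernel density estimate with the base R default kernel and bandwidth
d <- density(x)

# Approximate the area under the estimated curve with the trapezoid rule
area <- sum(diff(d$x) * (head(d$y, -1) + tail(d$y, -1)) / 2)

print(d$bw)    # bandwidth chosen by the default rule
print(area)    # close to 1, as required of a density
```

To draw the overlay in base R, call hist(x, freq = FALSE) to plot the histogram on the density scale, then lines(d) to add the curve.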

VBS Plot

A more modern version of the density plot combines the violin plot, box plot, and scatter plot into a single visualization, called here the VBS plot.

Plot(Salary)
## [Violin/Box/Scatterplot graphics from Deepayan Sarkar's lattice package]
## >>> Suggestions
## Plot(Salary, out_cut=2, fences=TRUE, vbs_mean=TRUE)  # Label two outliers ...
## Plot(Salary, box_adj=TRUE)  # Adjust boxplot whiskers for asymmetry 
## 
## --- Salary --- 
## Present: 37 
## Missing: 0 
## Total  : 37 
##  
## Mean         : 73795.557 
## Stnd Dev     : 21799.533 
## IQR          : 31012.560 
## Skew         : 0.190   [medcouple, -1 to 1] 
##  
## Minimum      : 46124.970 
## Lower Whisker: 46124.970 
## 1st Quartile : 56772.950 
## Median       : 69547.600 
## 3rd Quartile : 87785.510 
## Upper Whisker: 122563.380 
## Maximum      : 134419.230 
## 
## 
## (Box plot) Outliers: 1 
##  
## Small      Large           
## -----      -----           
##            Correll, Trevon 134419.23 
## 
## 
## Number of duplicated values: 0 
## 
## 
## Parameter values (can be manually set) 
## ------------------------------------------------------- 
## size: 0.61      size of plotted points 
## jitter_y: 0.45  random vertical movement of points 
## jitter_x: 0.00  random horizontal movement of points 
## bw: 9529.04     set bandwidth higher for smoother edges
VBS plot.
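The single outlier reported above can be checked by hand with the standard form of Tukey's rule, which flags any value more than 1.5 times the IQR beyond the quartiles. The quartiles, minimum, and maximum below are copied from the output. The variable names are chosen only for this sketch, and the 1.5 multiplier is assumed to be the rule applied by default.

```r
# Values copied from the Plot(Salary) output
q1    <- 56772.95     # 1st quartile
q3    <- 87785.51     # 3rd quartile
x_min <- 46124.97     # minimum Salary
x_max <- 134419.23    # maximum Salary

iqr <- q3 - q1                     # interquartile range
lower_fence <- q1 - 1.5 * iqr      # values below are small outliers
upper_fence <- q3 + 1.5 * iqr      # values above are large outliers

print(c(lower_fence, upper_fence))
print(x_max > upper_fence)         # is the maximum a large outlier?
print(x_min < lower_fence)         # is the minimum a small outlier?
```

The upper fence is 134304.35, so the maximum salary of 134419.23 lies just beyond it, while the minimum is well above the lower fence. That agrees with the one large outlier and no small outliers in the output.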


Full Manual

Use the base R help system to view the full manual for Histogram(). Simply enter a question mark followed by the name of the function, which is shorthand for help(Histogram).

?Histogram