Statistical Modeling in R: A Complete Guide with glmmTMB and tidyverse

Step 1

Introduction to Statistical Modeling

Statistical modeling is the process of using mathematical relationships to understand and predict real-world phenomena. In this tutorial, we'll explore how car weight affects fuel efficiency using the classic mtcars dataset.

What You'll Learn

Understand the basic workflow of statistical modeling in R
Fit a simple linear model using glmmTMB
Tidy and interpret model results with broom.mixed
Visualize model predictions and make new predictions
Appreciate model assumptions and advanced modeling potential

Before You Begin:

Basic R knowledge (data frames, functions)
R packages: tidyverse, glmmTMB, broom.mixed
Understanding of correlation concepts

We'll predict miles per gallon (mpg) from car weight (wt) using a linear relationship. This is a classic problem that demonstrates core statistical concepts while remaining intuitive.

Step 2

Data Exploration

Before modeling, we need to understand our data. The mtcars dataset contains information about 32 car models from the 1970s.

# Load required libraries
library(tidyverse)
library(glmmTMB)
library(broom.mixed)

# Load and examine the data
data(mtcars)
head(mtcars)
str(mtcars)

# Focus on our variables of interest
summary(mtcars[c('mpg', 'wt')])

                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

      mpg              wt       
 Min.   :10.40   Min.   :1.513  
 1st Qu.:15.43   1st Qu.:2.581  
 Median :19.20   Median :3.325  
 Mean   :20.09   Mean   :3.217  
 3rd Qu.:22.80   3rd Qu.:3.610  
 Max.   :33.90   Max.   :5.424

Key Observations

MPG range: 10.4 to 33.9 miles per gallon
Weight range: 1.5 to 5.4 (thousands of pounds)
Sample size: 32 observations
Expectation: Heavier cars should have lower fuel efficiency

Quick Check: Look at the data summary above. Do you notice any patterns? What would you expect the relationship between weight and mpg to be?

Step 3

Model Fitting with glmmTMB

Now we'll fit a linear model to quantify the relationship between car weight and fuel efficiency. We use glmmTMB because it provides a unified interface for many types of models.

# Fit a simple linear model
model1 <- glmmTMB(mpg ~ wt, data = mtcars)

# View model summary
summary(model1)

 Family: gaussian  ( identity )
Formula:          mpg ~ wt
Data: mtcars

      AIC       BIC    logLik -2*log(L)  df.resid 
    166.0     170.4     -80.0     160.0        29 

Conditional model:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  37.2851     1.8180  20.509   <2e-16 ***
wt           -5.3445     0.5413  -9.873   <2e-16 ***

Understanding the Model

Our model equation is: mpg = 37.29 - 5.34 × weight

The negative coefficient (-5.34) confirms our intuition: for every 1000 pounds of additional weight, fuel efficiency decreases by about 5.3 miles per gallon. The tiny p-values (< 2e-16) indicate this relationship is statistically significant.

Step 4

Tidy Model Output with broom.mixed

The broom.mixed package provides clean, consistent output for statistical models, making them easier to work with in data analysis pipelines.

# Get tidy coefficient estimates
tidy_coefs <- tidy(model1)
print(tidy_coefs)

# Add predictions and residuals to original data
model_data <- augment(model1)
print(model_data)

# Get overall model statistics
model_stats <- glance(model1)
print(model_stats)

# Coefficients
# A tibble: 2 × 7
  effect component term        estimate std.error statistic  p.value
                                 
1 fixed  cond      (Intercept)    37.3      1.82      20.5  1.80e-93
2 fixed  cond      wt             -5.34     0.541     -9.87 5.48e-23

# Augmented data (first 10 rows)
# A tibble: 32 × 6
   .rownames           mpg    wt .fitted .se.fit .resid
                         
 1 Mazda RX4          21    2.62    23.3   0.613 -2.28 
 2 Mazda RX4 Wag      21    2.88    21.9   0.553 -0.920
 3 Datsun 710         22.8  2.32    24.9   0.713 -2.09 

# Model statistics
# A tibble: 1 × 7
   nobs sigma logLik   AIC   BIC deviance df.residual
                  
1    32  2.95  -80.0  166.  170.     278.          29

Key Components Explained

.fitted: Model predictions for each car
.resid: Difference between observed and predicted mpg
sigma: Standard deviation of residuals (2.95 mpg)
AIC: Model fit statistic (lower is better for model comparison)

Step 5

Visualization

Visualizing our model helps us understand the relationship and assess model quality.

# Create a visualization of the model
library(ggplot2)

# Plot observed data and model predictions
ggplot(model_data, aes(x = wt)) +
  geom_point(aes(y = mpg), color = 'steelblue', size = 2, alpha = 0.7) +
  geom_line(aes(y = .fitted), color = 'red', linewidth = 1) +
  labs(
    title = 'Relationship between Car Weight and Fuel Efficiency',
    subtitle = 'Observed data (blue points) and model predictions (red line)',
    x = 'Weight (1000 lbs)',
    y = 'Miles per Gallon (mpg)'
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 14, face = 'bold'),
    plot.subtitle = element_text(size = 12, color = 'gray60')
  )

The visualization should show a clear downward trend. Points close to the red line indicate good predictions, while points far from the line show where our simple model doesn't capture all the complexity.

Interpretation Challenge: Look for patterns in the residuals. Are there cars that consistently over-perform or under-perform relative to their weight? This might suggest other important factors we haven't included.

Step 6

Making Predictions

One of the primary uses of statistical models is making predictions for new data points.

# Create new data for predictions
new_cars <- data.frame(wt = c(2.5, 3.0, 3.5, 4.0))

# Generate predictions
predictions <- predict(model1, newdata = new_cars)

# Combine with original data for easy viewing
prediction_results <- data.frame(
  weight_1000lbs = new_cars$wt,
  predicted_mpg = round(predictions, 1)
)

print(prediction_results)

  weight_1000lbs predicted_mpg
1            2.5          23.9
2            3.0          21.3
3            3.5          18.6
4            4.0          15.9

Real-World Application

These predictions tell us:

A 2,500 lb car should get about 24 mpg
A 4,000 lb car should get about 16 mpg
Each additional 500 lbs reduces efficiency by ~2.7 mpg

Remember: these predictions assume the relationship we observed in 1970s cars still holds. Modern cars have different engines, aerodynamics, and materials that might change this relationship.

Step 7

Interpreting Results

Model Coefficients

Intercept (37.29): Theoretical mpg for a weightless car (not realistic, but mathematically necessary)
Weight coefficient (-5.34): Each 1000 lb increase decreases mpg by 5.34
Statistical significance: P-values < 0.001 indicate very strong evidence for these relationships

Model Quality Assessment

# Calculate R-squared manually
total_variance <- var(mtcars$mpg)
residual_variance <- var(model_data$.resid)
r_squared <- 1 - (residual_variance / total_variance)

cat("R-squared:", round(r_squared, 3), "\n")
cat("This means", round(r_squared * 100, 1), "% of mpg variation is explained by weight")

R-squared: 0.753
This means 75.3% of mpg variation is explained by weight

Practical Implications

Our model suggests that car weight is a strong predictor of fuel efficiency, explaining about 75% of the variation. However, 25% remains unexplained, indicating other important factors like:

Engine type and displacement
Number of cylinders
Transmission type
Aerodynamics and design

Step 8

Model Assumptions and Limitations

Every statistical model makes assumptions. Understanding these helps us know when our model is reliable and when it might fail.

Key Assumptions of Linear Models

Linearity: We assume the relationship between weight and mpg is truly linear. Our visualization suggests this is reasonable for this data range.

# Quick diagnostic: plot residuals vs fitted values
ggplot(model_data, aes(x = .fitted, y = .resid)) +
  geom_point(alpha = 0.7) +
  geom_hline(yintercept = 0, color = 'red', linetype = 'dashed') +
  labs(
    title = 'Residuals vs Fitted Values',
    subtitle = 'Random scatter around zero suggests good model fit',
    x = 'Fitted Values (Predicted MPG)',
    y = 'Residuals'
  ) +
  theme_minimal()

Why glmmTMB?

We used glmmTMB instead of basic lm() because it:

Handles more complex models (mixed effects, non-normal distributions)
Provides consistent syntax across model types
Integrates well with tidyverse workflows
Prepares you for advanced modeling techniques

Step 9

Summary and Next Steps

What You've Accomplished

✅ Completed Learning Objectives

Fitted your first statistical model using modern R tools
Learned the data → model → tidy results → visualization workflow
Made predictions and interpreted coefficients
Understood basic model assumptions and diagnostics
Gained foundation for more complex modeling

Immediate Next Steps

Practice Exercises:

Try predicting mpg using horsepower (hp) instead of weight
Fit a model with both weight and horsepower: mpg ~ wt + hp
Compare model performance using AIC values
Explore the relationship between engine displacement (disp) and mpg

Advanced Topics to Explore

Multiple regression: Including multiple predictors
Interaction effects: When predictors affect each other
Mixed-effects models: Handling grouped data
Non-linear relationships: Polynomial and spline models
Model validation: Cross-validation and bootstrap methods

Additional Resources

📖 R for Data Science: https://r4ds.had.co.nz/
📊 Statistical Inference via Data Science: ModernDive textbook
🔧 glmmTMB Documentation: https://glmmtmb.github.io/glmmTMB/
🧹 broom.mixed Package: CRAN documentation