Introduction to Statistical Modeling
Statistical modeling is the process of using mathematical relationships to understand and predict real-world phenomena. In this tutorial, we'll explore how car weight affects fuel efficiency using the classic mtcars dataset.
What You'll Learn
- Understand the basic workflow of statistical modeling in R
- Fit a simple linear model using glmmTMB
- Tidy and interpret model results with broom.mixed
- Visualize model predictions and make new predictions
- Appreciate model assumptions and advanced modeling potential
- Basic R knowledge (data frames, functions)
- R packages:
tidyverse,glmmTMB,broom.mixed - Understanding of correlation concepts
Data Exploration
Before modeling, we need to understand our data. The mtcars dataset contains information about 32 car models from the 1970s.
# Load required libraries
library(tidyverse)
library(glmmTMB)
library(broom.mixed)
# Load and examine the data
data(mtcars)
head(mtcars)
str(mtcars)
# Focus on our variables of interest
summary(mtcars[c('mpg', 'wt')])
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
mpg wt
Min. :10.40 Min. :1.513
1st Qu.:15.43 1st Qu.:2.581
Median :19.20 Median :3.325
Mean :20.09 Mean :3.217
3rd Qu.:22.80 3rd Qu.:3.610
Max. :33.90 Max. :5.424
Key Observations
- MPG range: 10.4 to 33.9 miles per gallon
- Weight range: 1.5 to 5.4 (thousands of pounds)
- Sample size: 32 observations
- Expectation: Heavier cars should have lower fuel efficiency
Model Fitting with glmmTMB
Now we'll fit a linear model to quantify the relationship between car weight and fuel efficiency. We use glmmTMB because it provides a unified interface for many types of models.
# Fit a simple linear model
model1 <- glmmTMB(mpg ~ wt, data = mtcars)
# View model summary
summary(model1)
Family: gaussian ( identity )
Formula: mpg ~ wt
Data: mtcars
AIC BIC logLik -2*log(L) df.resid
166.0 170.4 -80.0 160.0 29
Conditional model:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 37.2851 1.8180 20.509 <2e-16 ***
wt -5.3445 0.5413 -9.873 <2e-16 ***
Understanding the Model
Our model equation is: mpg = 37.29 - 5.34 ร weight
Tidy Model Output with broom.mixed
The broom.mixed package provides clean, consistent output for statistical models, making them easier to work with in data analysis pipelines.
# Get tidy coefficient estimates
tidy_coefs <- tidy(model1)
print(tidy_coefs)
# Add predictions and residuals to original data
model_data <- augment(model1)
print(model_data)
# Get overall model statistics
model_stats <- glance(model1)
print(model_stats)
# Coefficients # A tibble: 2 ร 7 effect component term estimate std.error statistic p.value1 fixed cond (Intercept) 37.3 1.82 20.5 1.80e-93 2 fixed cond wt -5.34 0.541 -9.87 5.48e-23 # Augmented data (first 10 rows) # A tibble: 32 ร 6 .rownames mpg wt .fitted .se.fit .resid 1 Mazda RX4 21 2.62 23.3 0.613 -2.28 2 Mazda RX4 Wag 21 2.88 21.9 0.553 -0.920 3 Datsun 710 22.8 2.32 24.9 0.713 -2.09 # Model statistics # A tibble: 1 ร 7 nobs sigma logLik AIC BIC deviance df.residual 1 32 2.95 -80.0 166. 170. 278. 29
Key Components Explained
- .fitted: Model predictions for each car
- .resid: Difference between observed and predicted mpg
- sigma: Standard deviation of residuals (2.95 mpg)
- AIC: Model fit statistic (lower is better for model comparison)
Visualization
Visualizing our model helps us understand the relationship and assess model quality.
# Create a visualization of the model
library(ggplot2)
# Plot observed data and model predictions
ggplot(model_data, aes(x = wt)) +
geom_point(aes(y = mpg), color = 'steelblue', size = 2, alpha = 0.7) +
geom_line(aes(y = .fitted), color = 'red', linewidth = 1) +
labs(
title = 'Relationship between Car Weight and Fuel Efficiency',
subtitle = 'Observed data (blue points) and model predictions (red line)',
x = 'Weight (1000 lbs)',
y = 'Miles per Gallon (mpg)'
) +
theme_minimal() +
theme(
plot.title = element_text(size = 14, face = 'bold'),
plot.subtitle = element_text(size = 12, color = 'gray60')
)
Making Predictions
One of the primary uses of statistical models is making predictions for new data points.
# Create new data for predictions
new_cars <- data.frame(wt = c(2.5, 3.0, 3.5, 4.0))
# Generate predictions
predictions <- predict(model1, newdata = new_cars)
# Combine with original data for easy viewing
prediction_results <- data.frame(
weight_1000lbs = new_cars$wt,
predicted_mpg = round(predictions, 1)
)
print(prediction_results)
weight_1000lbs predicted_mpg 1 2.5 23.9 2 3.0 21.3 3 3.5 18.6 4 4.0 15.9
Real-World Application
These predictions tell us:
- A 2,500 lb car should get about 24 mpg
- A 4,000 lb car should get about 16 mpg
- Each additional 500 lbs reduces efficiency by ~2.7 mpg
Interpreting Results
Model Coefficients
- Intercept (37.29): Theoretical mpg for a weightless car (not realistic, but mathematically necessary)
- Weight coefficient (-5.34): Each 1000 lb increase decreases mpg by 5.34
- Statistical significance: P-values < 0.001 indicate very strong evidence for these relationships
Model Quality Assessment
# Calculate R-squared manually
total_variance <- var(mtcars$mpg)
residual_variance <- var(model_data$.resid)
r_squared <- 1 - (residual_variance / total_variance)
cat("R-squared:", round(r_squared, 3), "\n")
cat("This means", round(r_squared * 100, 1), "% of mpg variation is explained by weight")
R-squared: 0.753 This means 75.3% of mpg variation is explained by weight
Practical Implications
Our model suggests that car weight is a strong predictor of fuel efficiency, explaining about 75% of the variation. However, 25% remains unexplained, indicating other important factors like:
- Engine type and displacement
- Number of cylinders
- Transmission type
- Aerodynamics and design
Model Assumptions and Limitations
Every statistical model makes assumptions. Understanding these helps us know when our model is reliable and when it might fail.
Key Assumptions of Linear Models
# Quick diagnostic: plot residuals vs fitted values
ggplot(model_data, aes(x = .fitted, y = .resid)) +
geom_point(alpha = 0.7) +
geom_hline(yintercept = 0, color = 'red', linetype = 'dashed') +
labs(
title = 'Residuals vs Fitted Values',
subtitle = 'Random scatter around zero suggests good model fit',
x = 'Fitted Values (Predicted MPG)',
y = 'Residuals'
) +
theme_minimal()
Why glmmTMB?
We used glmmTMB instead of basic lm() because it:
- Handles more complex models (mixed effects, non-normal distributions)
- Provides consistent syntax across model types
- Integrates well with tidyverse workflows
- Prepares you for advanced modeling techniques
Summary and Next Steps
What You've Accomplished
โ Completed Learning Objectives
- Fitted your first statistical model using modern R tools
- Learned the data โ model โ tidy results โ visualization workflow
- Made predictions and interpreted coefficients
- Understood basic model assumptions and diagnostics
- Gained foundation for more complex modeling
Immediate Next Steps
- Try predicting mpg using horsepower (hp) instead of weight
- Fit a model with both weight and horsepower:
mpg ~ wt + hp - Compare model performance using AIC values
- Explore the relationship between engine displacement (disp) and mpg
Advanced Topics to Explore
- Multiple regression: Including multiple predictors
- Interaction effects: When predictors affect each other
- Mixed-effects models: Handling grouped data
- Non-linear relationships: Polynomial and spline models
- Model validation: Cross-validation and bootstrap methods
Additional Resources
- ๐ R for Data Science: https://r4ds.had.co.nz/
- ๐ Statistical Inference via Data Science: ModernDive textbook
- ๐ง glmmTMB Documentation: https://glmmtmb.github.io/glmmTMB/
- ๐งน broom.mixed Package: CRAN documentation