library(tidyverse)
library(lmtest)
sales_data <- read_csv("MarketPlace_Sales_Data.csv")
summary(sales_data)
## Sales Ad_Spend Email_Campaigns Website_Traffic
## Min. : 11.5 Min. : 5.19 Min. : 0.000 Min. : 25.58
## 1st Qu.: 293.8 1st Qu.: 19.19 1st Qu.: 6.000 1st Qu.:109.84
## Median : 371.7 Median : 30.17 Median : 8.000 Median :155.77
## Mean : 385.6 Mean : 35.25 Mean : 8.008 Mean :172.93
## 3rd Qu.: 456.5 3rd Qu.: 45.74 3rd Qu.:10.000 3rd Qu.:216.99
## Max. :1078.3 Max. :166.29 Max. :20.000 Max. :645.62
## NA's :1
## Avg_Rating Discount_Pct Competitor_Price_Index Inventory_Level
## Min. :2.080 Min. : 0.09 Min. : 40.61 Min. : 117.0
## 1st Qu.:3.920 1st Qu.: 6.50 1st Qu.: 89.53 1st Qu.: 662.0
## Median :4.290 Median :10.72 Median : 99.78 Median : 916.0
## Mean :4.206 Mean :11.53 Mean :100.03 Mean : 988.3
## 3rd Qu.:4.580 3rd Qu.:15.66 3rd Qu.:110.19 3rd Qu.:1227.0
## Max. :4.990 Max. :33.49 Max. :151.87 Max. :3306.0
##
## Social_Media_Engagement Season Product_Type Category_ID
## Min. : 3.73 Length:3600 Length:3600 Min. : 1.00
## 1st Qu.:35.07 Class :character Class :character 1st Qu.: 50.75
## Median :49.35 Mode :character Mode :character Median :100.50
## Mean :49.42 Mean :100.50
## 3rd Qu.:63.98 3rd Qu.:150.25
## Max. :95.83 Max. :200.00
## NA's :1
## Month
## Min. : 1.0
## 1st Qu.: 5.0
## Median : 9.5
## Mean : 9.5
## 3rd Qu.:14.0
## Max. :18.0
##
sales_data <- sales_data %>%
mutate(
Season = as.factor(Season),
Product_Type = as.factor(Product_Type),
log_Sales = log(Sales),
log_Ad_Spend = log(Ad_Spend)
)
ggplot(sales_data, aes(x = Ad_Spend, y = Sales)) +
geom_point(alpha = 0.3) +
geom_smooth(method = "lm", col = "red", se = FALSE) + # Linear line
geom_smooth(method = "loess", col = "blue", se = FALSE) + # Non-linear (Loess) curve
labs(title = "Sales vs. Ad Spend (in thousands of dollars)",
subtitle = "Red = Linear Fit, Blue = Non-Linear (Loess) Fit") +
theme_bw()
Visual Analysis:
Upon visual inspection, the relationship between
Ad_Spend and Sales appears
predominantly linear, particularly within the primary
operating range of $0K to $100K, which contains the vast majority of our
observations. In this dense region, the linear fit (Red) and non-linear
LOESS fit (Blue) are virtually indistinguishable.
While the trend lines begin to diverge slightly when Ad Spend exceeds $100K, this upper range is characterized by data sparsity (very few data points) and high variance. Therefore, the visual evidence for “diminishing returns” is weak; the slight curvature at the tail could be an artifact of noise rather than a definitive non-linear pattern.
model_A_linear <- lm(Sales ~ Ad_Spend, data = sales_data)
summary(model_A_linear)
##
## Call:
## lm(formula = Sales ~ Ad_Spend, data = sales_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -314.15 -83.15 -14.77 60.71 630.94
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 295.96539 3.60511 82.10 <2e-16 ***
## Ad_Spend 2.54359 0.08698 29.25 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 113.7 on 3597 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.1921, Adjusted R-squared: 0.1919
## F-statistic: 855.3 on 1 and 3597 DF, p-value: < 2.2e-16
model_B_linlog <- lm(Sales ~ log(Ad_Spend), data = sales_data)
summary(model_B_linlog)
##
## Call:
## lm(formula = Sales ~ log(Ad_Spend), data = sales_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -296.21 -82.58 -15.20 63.71 633.18
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 100.481 10.733 9.362 <2e-16 ***
## log(Ad_Spend) 84.340 3.123 27.006 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 115.4 on 3597 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.1686, Adjusted R-squared: 0.1683
## F-statistic: 729.3 on 1 and 3597 DF, p-value: < 2.2e-16
model_C_quad <- lm(Sales ~ Ad_Spend + I(Ad_Spend^2), data = sales_data)
summary(model_C_quad)
##
## Call:
## lm(formula = Sales ~ Ad_Spend + I(Ad_Spend^2), data = sales_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -314.77 -83.09 -14.67 60.77 631.43
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.974e+02 5.626e+00 52.861 <2e-16 ***
## Ad_Spend 2.467e+00 2.504e-01 9.851 <2e-16 ***
## I(Ad_Spend^2) 7.563e-04 2.305e-03 0.328 0.743
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 113.7 on 3596 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.1921, Adjusted R-squared: 0.1917
## F-statistic: 427.6 on 2 and 3596 DF, p-value: < 2.2e-16
Statistical Analysis:
To rigorously test the CFO’s hypothesis of diminishing returns, we developed and compared three functional forms:
Sales ~ Ad_SpendSales ~ log(Ad_Spend)Sales ~ Ad_Spend + I(Ad_Spend^2)We reject the non-linear models. The Linear Model (Model A) is the superior choice based on model fit statistics and visual exploration. This indicates that within our current operating range, there is no statistical evidence of diminishing returns.
Optimal Point Analysis:
Because the relationship is linear and not curved, there is no single “optimal point” of maximum efficiency within the observed data range. We have not yet reached a saturation point where spending more becomes wasteful. Based on the plot, we should increase Ad Spend slowly beyond operating range (>$100K) to ensure linearity holds as spending is increased.
Business Recommendation
Recommendation: Maintain or Increase Advertising Spend
Based on this standalone analysis, we should not decrease advertising spend out of a fear of diminishing returns. The data shows a consistent, positive linear relationship between Ad Spend and Sales. Therefore, the current strategy is not demonstrably inefficient.
# Main model with all predictor variables
model_main <- lm(Sales ~ Ad_Spend + Email_Campaigns + Website_Traffic +
Avg_Rating + Discount_Pct + Competitor_Price_Index +
Inventory_Level + Social_Media_Engagement +
Season + Product_Type,
data = sales_data)
summary(model_main)
##
## Call:
## lm(formula = Sales ~ Ad_Spend + Email_Campaigns + Website_Traffic +
## Avg_Rating + Discount_Pct + Competitor_Price_Index + Inventory_Level +
## Social_Media_Engagement + Season + Product_Type, data = sales_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -94.160 -16.390 0.177 16.939 75.765
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.741e+01 5.062e+00 7.390 1.81e-13 ***
## Ad_Spend 2.482e+00 1.860e-02 133.435 < 2e-16 ***
## Email_Campaigns 1.038e+00 1.425e-01 7.282 4.02e-13 ***
## Website_Traffic 1.201e+00 4.612e-03 260.385 < 2e-16 ***
## Avg_Rating 1.475e+01 8.360e-01 17.643 < 2e-16 ***
## Discount_Pct 2.066e-01 6.316e-02 3.271 0.00108 **
## Competitor_Price_Index 4.101e-01 2.668e-02 15.370 < 2e-16 ***
## Inventory_Level -4.879e-02 9.002e-04 -54.198 < 2e-16 ***
## Social_Media_Engagement 2.207e-01 2.136e-02 10.331 < 2e-16 ***
## SeasonSpring 1.283e-01 1.216e+00 0.105 0.91599
## SeasonSummer -1.044e+01 1.256e+00 -8.314 < 2e-16 ***
## SeasonWinter 1.989e+01 1.315e+00 15.125 < 2e-16 ***
## Product_TypeFashion -4.469e+01 1.018e+00 -43.901 < 2e-16 ***
## Product_TypeHome -2.972e+01 9.825e-01 -30.246 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 24.31 on 3585 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.9632, Adjusted R-squared: 0.9631
## F-statistic: 7218 on 13 and 3585 DF, p-value: < 2.2e-16
To address the CMO’s strategic questions, we must investigate interaction effects.
Ad_Spend and Season. Specifically, the
marginal increase in Sales per dollar of Ad Spend is significantly
higher in Winter than in Summer.Avg_Rating and Product_Type. We
hypothesize that customer ratings are a stronger driver of Sales for
Fashion products compared to Electronics.Hypothesis 1:
Hypothesis 2:
model_interactions <- lm(Sales ~ Email_Campaigns + Website_Traffic +
Discount_Pct + Competitor_Price_Index +
Inventory_Level + Social_Media_Engagement +
(Ad_Spend * Season) + # Interaction 1
(Avg_Rating * Product_Type), # Interaction 2
data = sales_data)
summary(model_interactions)
##
## Call:
## lm(formula = Sales ~ Email_Campaigns + Website_Traffic + Discount_Pct +
## Competitor_Price_Index + Inventory_Level + Social_Media_Engagement +
## (Ad_Spend * Season) + (Avg_Rating * Product_Type), data = sales_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -93.868 -16.382 0.165 16.963 76.008
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.941e+01 7.467e+00 5.278 1.38e-07 ***
## Email_Campaigns 1.029e+00 1.428e-01 7.200 7.30e-13 ***
## Website_Traffic 1.201e+00 4.615e-03 260.249 < 2e-16 ***
## Discount_Pct 2.117e-01 6.324e-02 3.347 0.000826 ***
## Competitor_Price_Index 4.100e-01 2.669e-02 15.361 < 2e-16 ***
## Inventory_Level -4.877e-02 9.007e-04 -54.140 < 2e-16 ***
## Social_Media_Engagement 2.216e-01 2.139e-02 10.362 < 2e-16 ***
## Ad_Spend 2.486e+00 4.756e-02 52.263 < 2e-16 ***
## SeasonSpring -1.925e-02 2.340e+00 -0.008 0.993436
## SeasonSummer -1.059e+01 2.432e+00 -4.356 1.36e-05 ***
## SeasonWinter 2.103e+01 2.533e+00 8.303 < 2e-16 ***
## Avg_Rating 1.423e+01 1.488e+00 9.560 < 2e-16 ***
## Product_TypeFashion -5.598e+01 8.978e+00 -6.236 5.02e-10 ***
## Product_TypeHome -2.612e+01 8.566e+00 -3.049 0.002312 **
## Ad_Spend:SeasonSpring 4.391e-03 5.692e-02 0.077 0.938509
## Ad_Spend:SeasonSummer 6.118e-03 5.974e-02 0.102 0.918436
## Ad_Spend:SeasonWinter -3.019e-02 6.171e-02 -0.489 0.624654
## Avg_Rating:Product_TypeFashion 2.681e+00 2.119e+00 1.265 0.205777
## Avg_Rating:Product_TypeHome -8.615e-01 2.022e+00 -0.426 0.670129
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 24.31 on 3580 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.9632, Adjusted R-squared: 0.9631
## F-statistic: 5212 on 18 and 3580 DF, p-value: < 2.2e-16
# Perform ANOVA test for model comparison
anova_test <- anova(model_main, model_interactions)
print(anova_test)
## Analysis of Variance Table
##
## Model 1: Sales ~ Ad_Spend + Email_Campaigns + Website_Traffic + Avg_Rating +
## Discount_Pct + Competitor_Price_Index + Inventory_Level +
## Social_Media_Engagement + Season + Product_Type
## Model 2: Sales ~ Email_Campaigns + Website_Traffic + Discount_Pct + Competitor_Price_Index +
## Inventory_Level + Social_Media_Engagement + (Ad_Spend * Season) +
## (Avg_Rating * Product_Type)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 3585 2118679
## 2 3580 2116394 5 2285.1 0.7731 0.569
summary(model_main)$adj.r.squared
## [1] 0.9630679
summary(model_interactions)$adj.r.squared
## [1] 0.9630562
To test the CMO’s hypotheses, we extended our regression model to include interaction terms:
Sales ~ Ad_Spend + Avg_Rating + [Controls]Ad_Spend * Season and
Avg_Rating * Product_Type.Interpretation: The high p-values indicate that we cannot reject the null hypothesis. There is no statistical evidence that advertising effectiveness changes based on the season, nor that customer ratings impact sales differently across product categories.
Summary of Effects:
| Interaction Tested | Hypothesis | Statistical Result | Conclusion |
|---|---|---|---|
| Ads × Winter | Ads work better in Winter | Not Significant (p=0.62) | Winter ads are equally effective as other seasons. |
| Ratings × Fashion | Reviews matter more for Clothes | Not Significant (p=0.21) | Ratings have a consistent impact across all categories. |
Business Recommendation: While the interaction terms between Ad_Spend × Season and Avg_Rating × Product_Type were not statistically significant, this finding itself is important for strategic planning. It suggests that advertising effectiveness is relatively consistent across seasons, and customer ratings impact sales similarly across product categories. This challenges prior assumptions from the CMO and implies that a uniform approach to ad allocation and review strategy may be appropriate unless future data indicate otherwise. However, MarketPlace may want to revisit this analysis with more granular seasonal data or different product groupings to reassess these patterns.
vif_results <- car::vif(model_main)
print(vif_results)
## GVIF Df GVIF^(1/(2*Df))
## Ad_Spend 1.000732 1 1.000366
## Email_Campaigns 1.004947 1 1.002470
## Website_Traffic 1.001812 1 1.000906
## Avg_Rating 1.001692 1 1.000845
## Discount_Pct 1.001512 1 1.000756
## Competitor_Price_Index 1.001798 1 1.000899
## Inventory_Level 1.002762 1 1.001380
## Social_Media_Engagement 1.002230 1 1.001114
## Season 1.005701 3 1.000948
## Product_Type 1.003305 2 1.000825
Interpretation: All VIF scores are extremely low (all are ~1.0). This indicates that multicollinearity is not a problem in our main model. We will proceed with the full model, as removing variables at this stage would offer no statistical benefit and could potentially introduce omitted variable bias.
bp_test <- lmtest::bptest(model_main)
print(bp_test)
##
## studentized Breusch-Pagan test
##
## data: model_main
## BP = 25.733, df = 13, p-value = 0.01846
plot(model_main, which = 3)
Statistical Test (Breusch-Pagan): Since the p-value is less than 0.05, we reject the null hypothesis of homoscedasticity. There is statistically significant evidence that the variance of our errors is not constant.
Graphical Inspection: The “Scale-Location” plot (Residuals vs Fitted) shows a slight pattern where the spread of standardized residuals shifts across the range of fitted values.
To fix the issue without altering our model specification, we calculated Heteroscedasticity-Consistent (Robust) Standard Errors (specifically using the HC1 “sandwich” estimator).
library(sandwich)
robust_se_model <- coeftest(model_main, vcov = vcovHC(model_main, type = "HC1"))
print(robust_se_model)
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.7408e+01 4.9667e+00 7.5317 6.304e-14 ***
## Ad_Spend 2.4819e+00 1.7247e-02 143.9015 < 2.2e-16 ***
## Email_Campaigns 1.0378e+00 1.4221e-01 7.2977 3.588e-13 ***
## Website_Traffic 1.2010e+00 4.8149e-03 249.4280 < 2.2e-16 ***
## Avg_Rating 1.4749e+01 8.3000e-01 17.7702 < 2.2e-16 ***
## Discount_Pct 2.0662e-01 6.3110e-02 3.2739 0.001071 **
## Competitor_Price_Index 4.1012e-01 2.6763e-02 15.3240 < 2.2e-16 ***
## Inventory_Level -4.8786e-02 8.8744e-04 -54.9740 < 2.2e-16 ***
## Social_Media_Engagement 2.2069e-01 2.1405e-02 10.3101 < 2.2e-16 ***
## SeasonSpring 1.2833e-01 1.1493e+00 0.1117 0.911101
## SeasonSummer -1.0441e+01 1.2021e+00 -8.6855 < 2.2e-16 ***
## SeasonWinter 1.9895e+01 1.2624e+00 15.7595 < 2.2e-16 ***
## Product_TypeFashion -4.4693e+01 1.0220e+00 -43.7324 < 2.2e-16 ***
## Product_TypeHome -2.9716e+01 9.8467e-01 -30.1782 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
original_summary <- summary(model_main)$coefficients
robust_summary <- robust_se_model
# Create the comparison data frame
se_comparison <- data.frame(
Variable = rownames(original_summary),
Original_Std_Error = original_summary[, "Std. Error"],
Robust_Std_Error = robust_summary[, "Std. Error"],
Original_P_Value = original_summary[, "Pr(>|t|)"],
Robust_P_Value = robust_summary[, "Pr(>|t|)"]
)
print(se_comparison)
## Variable Original_Std_Error
## (Intercept) (Intercept) 5.0617595977
## Ad_Spend Ad_Spend 0.0186003002
## Email_Campaigns Email_Campaigns 0.1425091588
## Website_Traffic Website_Traffic 0.0046123202
## Avg_Rating Avg_Rating 0.8360035550
## Discount_Pct Discount_Pct 0.0631606498
## Competitor_Price_Index Competitor_Price_Index 0.0266833002
## Inventory_Level Inventory_Level 0.0009001567
## Social_Media_Engagement Social_Media_Engagement 0.0213626573
## SeasonSpring SeasonSpring 1.2164579475
## SeasonSummer SeasonSummer 1.2557638840
## SeasonWinter SeasonWinter 1.3153735496
## Product_TypeFashion Product_TypeFashion 1.0180456291
## Product_TypeHome Product_TypeHome 0.9824535338
## Robust_Std_Error Original_P_Value Robust_P_Value
## (Intercept) 4.9667033960 1.814475e-13 6.304129e-14
## Ad_Spend 0.0172474906 0.000000e+00 0.000000e+00
## Email_Campaigns 0.1422055502 4.019955e-13 3.587876e-13
## Website_Traffic 0.0048149280 0.000000e+00 0.000000e+00
## Avg_Rating 0.8300049784 7.181043e-67 8.947016e-68
## Discount_Pct 0.0631099121 1.080699e-03 1.070721e-03
## Competitor_Price_Index 0.0267629723 1.121458e-51 2.173446e-51
## Inventory_Level 0.0008874432 0.000000e+00 0.000000e+00
## Social_Media_Engagement 0.0214052949 1.131522e-24 1.393348e-24
## SeasonSpring 1.1493090801 9.159895e-01 9.111010e-01
## SeasonSummer 1.2020818073 1.295308e-16 5.632968e-18
## SeasonWinter 1.2623913347 3.803860e-50 3.733534e-54
## Product_TypeFashion 1.0219679356 0.000000e+00 0.000000e+00
## Product_TypeHome 0.9846737592 3.433305e-179 1.777185e-178
Conclusion: After applying the remedy, we compared the Robust Standard Errors to the Original OLS errors. While the standard errors shifted slightly, the statistical significance of our predictors did not change.
All variables that were significant in the original model remain highly significant, and insignificant variables remain insignificant. This confirms that our business conclusions are robust and safe to use for decision-making, despite the presence of mild heteroscedasticity.
# Perform Durbin-Watson Test for Autocorrelation
dw_test <- lmtest::dwtest(model_main)
print(dw_test)
##
## Durbin-Watson test
##
## data: model_main
## DW = 1.9492, p-value = 0.05709
## alternative hypothesis: true autocorrelation is greater than 0
Test Statistic: The DW statistic ranges from 0 to 4, where a value of 2.0 indicates zero autocorrelation. Our model produced a statistic of 1.9492. Significance: The p-value is 0.057, which is greater than the standard significance level (\(\alpha = 0.05\)).
Conclusion: We fail to reject the null hypothesis. The DW statistic is extremely close to the ideal value of 2.0. There is no statistically significant evidence of autocorrelation in the residuals. While the p-value is borderline (close to 0.05), the statistic is very close to 2.0 (the ideal value indicating zero autocorrelation). Therefore, we conclude that while there may be a weak temporal signal, there is no statistically significant evidence of serious autocorrelation that would invalidate the model.
Business Context:
Even though our statistical test passed, it was right for the CDO to be concerned. In a retail business context, autocorrelation is plausible due to:
We have successfully audited the model against all three concerns:
Thus, the model is statistically sound and safe for strategic decision-making.
sales_data <- sales_data %>%
mutate(log_Website_Traffic = log(Website_Traffic))
# Model 1: Linear-Linear (model_main)
summary(model_main)
##
## Call:
## lm(formula = Sales ~ Ad_Spend + Email_Campaigns + Website_Traffic +
## Avg_Rating + Discount_Pct + Competitor_Price_Index + Inventory_Level +
## Social_Media_Engagement + Season + Product_Type, data = sales_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -94.160 -16.390 0.177 16.939 75.765
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.741e+01 5.062e+00 7.390 1.81e-13 ***
## Ad_Spend 2.482e+00 1.860e-02 133.435 < 2e-16 ***
## Email_Campaigns 1.038e+00 1.425e-01 7.282 4.02e-13 ***
## Website_Traffic 1.201e+00 4.612e-03 260.385 < 2e-16 ***
## Avg_Rating 1.475e+01 8.360e-01 17.643 < 2e-16 ***
## Discount_Pct 2.066e-01 6.316e-02 3.271 0.00108 **
## Competitor_Price_Index 4.101e-01 2.668e-02 15.370 < 2e-16 ***
## Inventory_Level -4.879e-02 9.002e-04 -54.198 < 2e-16 ***
## Social_Media_Engagement 2.207e-01 2.136e-02 10.331 < 2e-16 ***
## SeasonSpring 1.283e-01 1.216e+00 0.105 0.91599
## SeasonSummer -1.044e+01 1.256e+00 -8.314 < 2e-16 ***
## SeasonWinter 1.989e+01 1.315e+00 15.125 < 2e-16 ***
## Product_TypeFashion -4.469e+01 1.018e+00 -43.901 < 2e-16 ***
## Product_TypeHome -2.972e+01 9.825e-01 -30.246 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 24.31 on 3585 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.9632, Adjusted R-squared: 0.9631
## F-statistic: 7218 on 13 and 3585 DF, p-value: < 2.2e-16
# Model 2: Log-Linear Model
model_log_linear <- lm(log_Sales ~ Ad_Spend + Email_Campaigns + Website_Traffic +
Avg_Rating + Discount_Pct + Competitor_Price_Index +
Inventory_Level + Social_Media_Engagement +
Season + Product_Type,
data = sales_data)
summary(model_log_linear)
##
## Call:
## lm(formula = log_Sales ~ Ad_Spend + Email_Campaigns + Website_Traffic +
## Avg_Rating + Discount_Pct + Competitor_Price_Index + Inventory_Level +
## Social_Media_Engagement + Season + Product_Type, data = sales_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.48312 -0.04702 0.01191 0.06667 0.23787
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.988e+00 2.261e-02 220.669 < 2e-16 ***
## Ad_Spend 6.406e-03 8.307e-05 77.122 < 2e-16 ***
## Email_Campaigns 2.528e-03 6.364e-04 3.973 7.25e-05 ***
## Website_Traffic 2.965e-03 2.060e-05 143.918 < 2e-16 ***
## Avg_Rating 4.695e-02 3.734e-03 12.575 < 2e-16 ***
## Discount_Pct 7.932e-04 2.821e-04 2.812 0.00495 **
## Competitor_Price_Index 1.315e-03 1.192e-04 11.037 < 2e-16 ***
## Inventory_Level -1.467e-04 4.020e-06 -36.483 < 2e-16 ***
## Social_Media_Engagement 6.178e-04 9.540e-05 6.476 1.07e-10 ***
## SeasonSpring 1.372e-03 5.433e-03 0.253 0.80058
## SeasonSummer -3.460e-02 5.608e-03 -6.170 7.61e-10 ***
## SeasonWinter 5.623e-02 5.874e-03 9.572 < 2e-16 ***
## Product_TypeFashion -1.340e-01 4.547e-03 -29.476 < 2e-16 ***
## Product_TypeHome -8.342e-02 4.388e-03 -19.011 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1086 on 3585 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.8939, Adjusted R-squared: 0.8935
## F-statistic: 2322 on 13 and 3585 DF, p-value: < 2.2e-16
# Model 3: Log-Log Model
model_log_log <- lm(log_Sales ~ log_Ad_Spend + Email_Campaigns + log_Website_Traffic +
Avg_Rating + Discount_Pct + Competitor_Price_Index +
Inventory_Level + Social_Media_Engagement +
Season + Product_Type,
data = sales_data)
summary(model_log_log)
##
## Call:
## lm(formula = log_Sales ~ log_Ad_Spend + Email_Campaigns + log_Website_Traffic +
## Avg_Rating + Discount_Pct + Competitor_Price_Index + Inventory_Level +
## Social_Media_Engagement + Season + Product_Type, data = sales_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.17870 -0.06163 -0.00431 0.06032 0.44920
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.410e+00 3.010e-02 80.061 < 2e-16 ***
## log_Ad_Spend 2.212e-01 2.946e-03 75.081 < 2e-16 ***
## Email_Campaigns 4.075e-03 6.377e-04 6.390 1.87e-10 ***
## log_Website_Traffic 5.101e-01 3.527e-03 144.628 < 2e-16 ***
## Avg_Rating 4.615e-02 3.741e-03 12.335 < 2e-16 ***
## Discount_Pct 3.702e-04 2.826e-04 1.310 0.190
## Competitor_Price_Index 1.323e-03 1.194e-04 11.076 < 2e-16 ***
## Inventory_Level -1.479e-04 4.028e-06 -36.714 < 2e-16 ***
## Social_Media_Engagement 5.750e-04 9.559e-05 6.015 1.98e-09 ***
## SeasonSpring 4.241e-03 5.443e-03 0.779 0.436
## SeasonSummer -3.092e-02 5.620e-03 -5.503 4.00e-08 ***
## SeasonWinter 5.213e-02 5.887e-03 8.855 < 2e-16 ***
## Product_TypeFashion -1.288e-01 4.557e-03 -28.265 < 2e-16 ***
## Product_TypeHome -8.201e-02 4.396e-03 -18.654 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1088 on 3585 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.8934, Adjusted R-squared: 0.893
## F-statistic: 2312 on 13 and 3585 DF, p-value: < 2.2e-16
model_comparison <- data.frame(
Model = c("1. Linear-Linear (model_main)", "2. Log-Linear", "3. Log-Log"),
Adjusted_R_Squared = c(
summary(model_main)$adj.r.squared,
summary(model_log_linear)$adj.r.squared,
summary(model_log_log)$adj.r.squared
)
)
print(model_comparison)
## Model Adjusted_R_Squared
## 1 1. Linear-Linear (model_main) 0.9630679
## 2 2. Log-Linear 0.8934656
## 3 3. Log-Log 0.8930358
Based on the CEO’s request and standard economic modeling practices, we considered transforming the following variables:
log_Sales) can normalize the residuals and helps us model
percentage growth rather than absolute dollar
growth.log_Ad_Spend)
allows us to test the idea that a 1% increase in spend leads to a
certain percentage increase in sales, rather than a fixed dollar
amount.Analysis & Recommendation:
The Linear Model (Model 1) is superior. It explains 96.3% of the variation in sales, whereas both log-transformed models drop to 89.3%. This confirms our earlier finding in Objective 1: the relationship between our predictors and sales is fundamentally linear, not multiplicative or curved. Forcing a log transformation actually degrades the model’s performance.