Data Science: Weather and Happiness Analysis

3 minute read

Can Data Solve the Age-Old Question: Does Weather Influence Happiness?

Have you ever wondered if those sunny days actually boost your mood, or if rainy days truly bring you down? This age-old question has puzzled scientists, philosophers, and just about everyone else for generations. But what if we could use data to finally find an answer? In this post, I’ll dive into the fascinating intersection of meteorology and psychology to explore whether the weather really has a tangible effect on our happiness. By leveraging data science, I aim to uncover patterns and insights that could shed light on how our environment influences our well-being. So, let’s embark on this journey to see if we can decode the weather’s impact on happiness using the power of data.

For the full analysis and more details, visit this link.

Part 0: Data Cleaning and Preparation

To begin, I needed comprehensive weather data from various stations across the UK. I wrote a Python script to download and process the data from the MetOffice, ensuring it was ready for analysis. Here’s a glimpse of the data preparation process:

import os
import requests

folder_name = '/path/to/weather_data'
os.makedirs(folder_name, exist_ok=True)

def download_file(station_name):
    url = f'http://www.metoffice.gov.uk/pub/data/weather/uk/climate/stationdata/{station_name}data.txt'
    response = requests.get(url)
    response.raise_for_status()
    save_path = os.path.join(folder_name, f'{station_name}data.txt')
    with open(save_path, 'wb') as file:
        file.write(response.content)

with open('data/stations.txt', 'r') as file:
    station_names = file.read().splitlines()

for station_name in station_names:
    download_file(station_name)

This script ensured I had all the necessary weather data, which was then processed into a structured format using pandas.

I then used R for more sophisticated data wrangling, converting coordinates and validating the data:

library(tidyverse)
library(sf)

S_weather <- read_csv('path/to/combined_weather_data.csv')

# Convert to spatial dataframe
weather_sf <- st_as_sf(S_weather, coords = c("easting", "northing"), crs = 27700) %>%
  st_transform(crs = 4326)

# Extract coordinates
coords <- st_coordinates(weather_sf)
S_weather <- S_weather %>% mutate(Longitude = coords[, 1], Latitude = coords[, 2])

# Data validation
validate_data <- function(df) {
  df %>% 
    filter(year >= 1850 & year <= 2024,
           month >= 1 & month <= 12,
           latitude >= 49.9 & latitude <= 60.9,
           longitude >= -8 & longitude <= 2,
           altitude >= -2.75 & altitude <= 1343,
           t_min >= -50 & t_min <= 50,
           t_max >= -50 & t_max <= 50,
           rain >= 0 & rain <= 1000,
           sun >= 0 & sun <= 1000)
}

S_weather <- validate_data(S_weather)

Part 1: Clustering by Weather Data

Once I had the data, I performed clustering to identify patterns. Using k-means clustering, I grouped weather stations based on key weather parameters like minimum temperature, maximum temperature, rainfall, sunshine duration, and air frost days.

library(tidyverse)
library(cluster)

set.seed(123)
scaled_data <- scale(S_weather %>% select(t_min, t_max, rain, sun, af_inmth))

wss <- sapply(2:15, function(k){
  sum(kmeans(scaled_data, centers=k, nstart=25)$withinss)
})

plot(2:15, wss, type='b', xlab='Number of Clusters', ylab='Within groups sum of squares')

I determined the optimal number of clusters using the elbow method and visualised the clusters with ggplot2:

library(ggpubr)

kmeans_result <- kmeans(scaled_data, centers=3, nstart=25)
S_weather <- S_weather %>% mutate(cluster = as.factor(kmeans_result$cluster))

ggscatter(S_weather, x = "sun", y = "af_inmth", color = "cluster", 
          ellipse = TRUE, legend = "right", palette = "jco") +
  ggtitle("Clusters based on Sunshine Duration and Air Frost Days")

Part 2: Predicting Geographic Regions

Next, I used the weather data to predict whether a station is located in the Northern, Middle, or Southern UK. This classification was done using a Random Forest model, implemented in the tidymodels framework.

library(tidymodels)

region_rec <- recipe(region ~ ., data = S_weather) %>% 
  step_impute_knn(all_predictors()) %>% 
  step_zv(all_predictors()) %>% 
  step_normalize(all_predictors())

rf_spec <- rand_forest(trees = 100) %>% 
  set_mode("classification") %>% 
  set_engine("ranger")

region_wf <- workflow() %>% 
  add_recipe(region_rec) %>% 
  add_model(rf_spec)

set.seed(234)
region_folds <- vfold_cv(S_weather, v = 10)
rf_rs <- region_wf %>% fit_resamples(region_folds)

region_pred <- rf_rs %>% collect_predictions() %>% mutate(correct = region == .pred_class)
accuracy <- accuracy(region_pred, truth = region, estimate = .pred_class)

My model showed an impressive accuracy, highlighting the potential of machine learning in interpreting geographic nuances based on meteorological data.

Conclusion

Through this data-driven exploration, I’ve seen that weather data can indeed reveal interesting patterns and even predict geographic locations based on climatic conditions. While this analysis doesn’t directly answer whether weather influences individual happiness, it lays the groundwork for understanding the broader environmental factors at play. Future research could integrate psychological data to directly link weather patterns with mood changes, providing deeper insights into this intriguing question.

Stay tuned as I continue to explore the fascinating connections between our environment and well-being! For the full analysis and more details, visit this link.

Share on

Twitter Facebook LinkedIn

U Bhalraam (Raam)