(i) Downloading the Data using the provided stations.txt
The Python chunk below downloads all the weather data required for this task. It reads the provided stations.txt file, iterates through each line, and downloads the corresponding data file by inserting the station name into the MetOffice URL.
Each downloaded file is then stored in the path set in the variable folder_name.
If you are re-running this Quarto file in RStudio, be sure to:
set the folder_name variable to a path accessible by your local machine.
ensure your conda environment has the following libraries installed with pip: [requests, pandas] (os, re, and warnings are part of the Python standard library and need no installation)
set the conda environment that has the required libraries in the .Rprofile file (a minimal sketch is shown below).
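For reference, here is a minimal sketch of what that .Rprofile entry could look like; the environment name r-weather is a placeholder for your own conda environment.

if (requireNamespace("reticulate", quietly = TRUE)) {
  # Point reticulate at the conda environment that has requests and pandas installed.
  reticulate::use_condaenv("r-weather", required = TRUE)
}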
# Make sure your reticulate package has these libraries installed in the PYENV.
# Set your conda environment in .Rprofile
import os
import requests

# SPECIFY THE FOLDER YOU WANT THE WEATHER FILES TO BE SAVED!!
folder_name = '/home/rstudio/Encrypted-Data/R/R Projects/DSTI_Assignment3/weather_data'
os.makedirs(folder_name, exist_ok=True)  # creates folder if not created yet.

def download_file(station_name):
    # Construct the URL
    url = f'http://www.metoffice.gov.uk/pub/data/weather/uk/climate/stationdata/{station_name}data.txt'
    # Attempt to download the file
    try:
        response = requests.get(url)
        response.raise_for_status()  # Raises an HTTPError if the response was an error
        # Define the path for saving the file
        save_path = os.path.join(folder_name, f'{station_name}data.txt')
        # Save the content to a file
        with open(save_path, 'wb') as file:
            file.write(response.content)
        print(f'Successfully downloaded: {station_name}')
    # Error Handling
    except requests.HTTPError as e:
        print(f'Failed to download {url}. HTTPError: {e}')
    except requests.RequestException as e:
        print(f'Failed to download {url}. Error: {e}')

# Open the text file and read lines
# The file is split by lines, so that this can be fed into the function above.
with open('data/stations.txt', 'r') as file:
    station_names = file.read().splitlines()

# Download files for each station
for station_name in station_names:
    download_file(station_name)
(ii) Parsing all the Datafiles downloaded from the MetOffice
The Python script described here extracts and preserves most of the information presented in each data file. For each data file, the following general information is extracted:
name = station name (line 1 of the file)
irish_grid is set to TRUE if the string “Irish Grid” is found anywhere in the file.
(This is important for downstream longitude latitude calculations)
location_info_1 = easting/northing/altitude/time-conditions for line 2
location_info_2 = easting/northing/altitude/time-conditions for line 3
The table is then extracted and processed line by line to create a pandas dataframe.
The .split() method was used to separate the table into its components.
If the script finds fewer than seven parts, it appends --- to the end of the line so parsing can continue. This fix is not perfect, since it only compensates for a missing value in the sun column and not in any other position. However, inspection of the data shows that this problem almost always occurs in the sun column, so the fix is adequate here.
Any value marked with * (an estimated value) is detected, and this flag is preserved in a Boolean variable for each column.
The sun_sensor used to log the sunshine data is also preserved: the script looks for a # suffix on sun values (indicating the newer Kipp & Zonen sensor rather than Campbell Stokes), alongside the * estimate marker.
If a line has the word Provisional in it, another Boolean is set for that line.
When the station’s dataframe is being assembled, the appropriate easting and northing values are attached to each row according to whether its year and month fall within the period covered by a particular location line.
import pandas as pdimport osimport reimport warnings# Filter out future warnings as this clutters the output.warnings.simplefilter(action='ignore', category=FutureWarning)def read_and_process_file(file_path):# Opens the datafiles and passes it to the process_weather_data functionwithopen(file_path, 'r', encoding='utf-8') asfile: data_str =file.read()return process_weather_data(data_str)def replace_blanks_with_dashes(line):# This function deals with datafiles that have "" instead of "---" as NaN in the sun column. I have attempted to code it such that it can replace blanks in any column, but I was unable to achieve this as I am approaching this problem in a line by line basis. If I had approached it differently, by removing the lines before the table and importing the table as a whole, I might have solved this appropriately. parts = line.split()iflen(parts) <7:# Basically, If there are less than 7 parts, add '---' to the end.# I have only observed this case when `sun` is blank.# A crude fix, but it works for all the files. line +=' ---'return line# This function finds the altitude in meters in line 2 and extracts it.def parse_altitude(altitude_str): match = re.search(r'(\d+)\s*(m|meter|metre)s?\b', altitude_str, re.IGNORECASE)returnint(match.group(1)) if match elseNone# This is a helper function which converts months to numbers if found.def month_to_number(month_str): months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"] month_str = month_str[:3].capitalize() # There have been some instances where Sept is noted instead of Sep. This deals with that. It converts the string to uppercase and takes the first 3.return months.index(month_str) +1if month_str in months elseNone# This is the most complicated function which looks for the following:# (MTH)YYYY to (MTH)YYYY# (MTH)YYYY - (MTH)YYYY# (MTH)YYYY onwards# from (MTH)YYYY# after (MTH)YYYY# before (MTH)YYYY# This function is necessary as some stations have changed location.def parse_location_info(line):try: easting, northing = re.findall(r'(\d+)E\s+(\d+)N', line)[0] altitude = parse_altitude(line)# Handle 'onwards' with just a year onwards_match = re.search(r'(\w+\s+)?(\d{4}) onwards', line)if onwards_match: start_month, start_year = onwards_match.groups() start_month_num = month_to_number(start_month.strip()) if start_month else1 start_year =int(start_year)# No end year, indicating this setup continues indefinitely from the start pointreturn (easting, northing, altitude, start_year, start_month_num, None)# Handle period defined by specific start and end years period_match = re.search(r'(?:\w+\s+)?(\d{4})\s*(?:to|\-)\s*(?:\w+\s+)?(\d{4})', line)if period_match: start_year, end_year = period_match.groups() start_month_num =1# Default to January if month is not specified end_month_num =12# Default to December if month is not specifiedreturn (easting, northing, altitude, int(start_year), start_month_num, int(end_year))# Additional logic to handle the "to Apr 2005" format# Extracting start and end months and years start_month_match = re.search(r'to (\w+)\s+(\d{4})', line)if start_month_match: start_month, end_year = start_month_match.groups() start_month_num = month_to_number(start_month) end_year =int(end_year)return (easting, northing, altitude, 0, start_month_num, end_year)# Handle the "to 2005" and "to Apr 2005" format to_match = re.search(r'to (\w+\s+\d{4})', line)if to_match: end_date = to_match.group(1)# If the end date includes a month, extract it end_month_match = re.search(r'(\w+) (\d{4})', 
end_date)if end_month_match: end_month, end_year = end_month_match.groups() end_month_num = month_to_number(end_month)else: end_month_num =12# Assuming December if month not specified end_year =int(end_date.split()[-1]) start_year =min(int(year.group()) for year in re.finditer(r'\b\d{4}\b', line))return (easting, northing, altitude, start_year, 1, int(end_year), end_month_num)# Handle "from Sept 2007" or "from Jan 2007" format from_match = re.search(r'from (\w+\s+)?(\d{4})', line)if from_match: start_month, start_year = from_match.groups() start_month_num = month_to_number(start_month.strip()) if start_month else1 start_year =int(start_year)# No end year, indicating this setup continues indefinitely from the start pointreturn (easting, northing, altitude, start_year, start_month_num, None)# Handle "after 1998" or "after Jan 1998" format after_match = re.search(r'after (\w+\s+)?(\d{4})', line)if after_match: start_month, after_year = after_match.groups() start_month_num = month_to_number(start_month.strip()) if start_month else1 after_year =int(after_year)# Treat it as "1998 onwards" or "Jan 1998 onwards"return (easting, northing, altitude, after_year, start_month_num, None)# Handle "before 1998" format before_match = re.search(r'before (\w+)\s+(\d{4})', line)if before_match: before_month, before_year = before_match.groups() before_month_num = month_to_number(before_month) if before_month else1 before_year =int(before_year)return (easting, northing, altitude, 1700, before_month_num, before_year)return (easting, northing, altitude, 1700, 1, 2029, 12) # Return None as a fallback if no recognized format is matched.except:returnNone# This function looks for * in values and sets the estimated variable to TRUE# It also removes it so that it can be parsed as a float.def extract_value_and_estimation(value_str):# Check if the value is estimated estimated =Trueif'*'in value_str elseFalse# Remove any "*" and "#" characters value_str = value_str.replace('*', '')# Convert the value to float value =float(value_str) if value_str !='---'elseNonereturn value, estimated# This function is similar to the previous one, except that this sets the sun_sensor variable so we can ensure that this data is also extracted.# I have optimised this function to deal with 5*#, 5#*, 5# and 5*def extract_sun_value_and_estimation(value_str):# First check if value_str is a placeholder for missing dataif value_str =='---':returnNone, False, None#Important line to preserve NA# Check if "#" is present, indicating new sun sensor use sun_sensor ='#'in value_str# Check if the value is estimated (presence of "*") estimated ='*'in value_str# Remove any "*" and "#" characters to isolate the numeric value. # Whitby has $ appended for no reason, so this section deals with that. clean_value_str = re.sub(r'[^\d.]+', '', value_str)# Convert the value to float value =float(clean_value_str)return value, estimated, sun_sensor#This function creates the dataframe for each stationdef process_weather_data(data_str): lines = data_str.strip().split('\n') name = lines[0]# Finds the "Irish Grid" in line 2 of the datafile. is_irish ="Irish Grid"in lines[1]# These are all the datapoints that will be extracted. + Provisional# I have decided to extract easting/northing as line 3 usually omits longitiude and latitide data. These can be calculated from easting/northing with greater accuracy. 
columns = ['name', 'year', 'month', 'easting', 'northing', 'altitude', 'tmax', 'tmin', 'af', 'rain', 'sun', 'sun_sensor'] data_table = pd.DataFrame(columns=columns)# Lines 2 and 3 can contain location information, depending if it is a 'complex' case or not. A complex case is defined as a station that moved partways in the datastream. location_info_1 = parse_location_info(lines[1]) #Line 2 analysis location_info_2 = parse_location_info(lines[2]) #Line 3 analysis# Dynamically find the line number where 'yyyy' is encountered yyyy_line_number =Nonefor i, line inenumerate(lines):if'yyyy'in line: yyyy_line_number = ibreak# Use the yyyy_line_number to adjust the starting index of the loopfor line in lines[yyyy_line_number+2:]: #data starts at yyyy line + 2 preprocessed_line = replace_blanks_with_dashes(line) parts = preprocessed_line.split()try: year =int(parts[0]) month =int(parts[1])exceptValueError:continue# Skip header or malformed lines#Using parts, we can extract these values easily. tmax, tmax_estimated = extract_value_and_estimation(parts[2]) tmin, tmin_estimated = extract_value_and_estimation(parts[3]) af, af_estimated = extract_value_and_estimation(parts[4]) rain, rain_estimated = extract_value_and_estimation(parts[5]) sun, sun_estimated, sun_sensor = extract_sun_value_and_estimation(parts[6]) sensor_used ='Kipp & Zonen'if sun_sensor ==Trueelse ('Campbell Stokes'if sun_sensor ==FalseelseNone) provisional ='provisional'in line.lower()# This section deals with the logic,# What to do if location_info_1 & location_info_2 are filled,# What to do if only location_info_1 is filled,# What to do if only location_info_2 is filled. easting, northing, altitude =None, None, None#Initialise vars.if location_info_1 isnotNone: easting, northing, altitude = location_info_1[:3]# If location_info_2 is available, use it to override valuesif location_info_2 isnotNone:if year >= location_info_2[3] and (location_info_2[5] isNoneor year <= location_info_2[5]): easting, northing, altitude = location_info_2[:3]# Ensure that easting, northing, and altitude are defined before using themif easting isnotNoneand northing isnotNoneand altitude isnotNone:# Create a row DataFrame from the dictionary row_df = pd.DataFrame([{'name': name,'year': year,'month': month,'tmax': tmax,'tmin': tmin,'af': af,'rain': rain,'sun': sun,'tmax_est': tmax_estimated,'tmin_est': tmin_estimated,'af_est': af_estimated,'rain_est': rain_estimated,'sun_est': sun_estimated,'sun_sensor': sensor_used,'easting': easting,'northing': northing,'altitude': altitude,'prov_est': provisional }])# Append the row to the data_table data_table = pd.concat([data_table, row_df], ignore_index=True)return data_table, is_irish, namedata_directory ='/home/rstudio/Encrypted-Data/R/R Projects/DSTI_Assignment3/weather_data'data_frames = []irish_files = [] # List to keep track of files based on the Irish Gridfor file_name in os.listdir(data_directory): file_path = os.path.join(data_directory, file_name)if os.path.isfile(file_path) and file_name.endswith('.txt'):try: df, is_irish, name = read_and_process_file(file_path) data_frames.append(df)print(f"Parsed {file_name}.")if is_irish: # If irish_grid, save the name of the station. irish_files.append(name)exceptExceptionas e:print(f"Error processing file {file_name}: {e}")
# The combined dataframe is stored in combined_weather_data.csv inside the weather data directory.
if data_frames:
    combined_data = pd.concat(data_frames, ignore_index=True)
    combined_data.to_csv(os.path.join(data_directory, 'combined_weather_data.csv'), index=False)
(iii) Calculating Longitude and Latitude in R
The analysis now moves permanently to R, which is better suited to the remaining data wrangling. The pandas dataframe produced in the previous chunk can be accessed directly from the R environment via the reticulate package using the py$ prefix.
If this fails, the saved combined_weather_data.csv file can easily be read into R with readr's read_csv instead.
Here we use the sp and sf packages to convert the easting and northing values from the British National Grid and the Irish Grid into latitudes and longitudes.
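As a minimal illustration of what this conversion does (a sketch using a made-up easting/northing pair; the chunk below performs the same operation for every station at once):

library(sf)

# One hypothetical British National Grid point (EPSG:27700).
# Irish Grid points would use crs = 29902 instead, as in the chunk below.
pt_bng <- st_sfc(st_point(c(393000, 806000)), crs = 27700)

# Transform to WGS84 (EPSG:4326) and extract longitude/latitude.
st_coordinates(st_transform(pt_bng, crs = 4326))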
The final weather dataframe is created. An interactive reactable table is displayed after the code chunk for the user to be able to browse through the data of the 37 stations.
In the table, I have implemented a quick filter for the station, year, and month columns using JavaScript, so the user can quickly find the required data.
suppressPackageStartupMessages(library(tidyverse)) # for the tidyversesuppressPackageStartupMessages(library(sp)) #required for lon/lat calculationssuppressPackageStartupMessages(library(sf)) # required for lon/lat calculationssuppressPackageStartupMessages(library(reactable))# To allow reactive TablesS_weather <- py$combined_data %>%tibble() # extract pandas df from pyenvS_weather[S_weather =='NaN'] <-NA#pandas marks its NA as NaNS_weather[S_weather =='NaN'] <-NA# If access to Python Environment from R fails, you can import from CSV.## ENABLE AND SET THIS LINE IF PYTHON IMPORT DOES NOT WORK ### S_weather <- read_csv("data/combined_weather_data.csv")S_weather_irish <- S_weather %>%filter(name %in% py$irish_files) # Step 1/3 : df just the irish stationsS_weather_rest <- S_weather %>%filter(!(name %in% py$irish_files)) # Step 2/3 : df without the irish stations# Step 3/3 : Reorder S_weather so rbind later works properly.S_weather_ordered <-rbind(S_weather_irish, S_weather_rest)# Convert Irish stations into a spatial dataframe with Irish Gridirish_stations_sf <-st_as_sf(S_weather_irish, coords =c("easting", "northing"), crs =29902)# Convert the entire dataset into a spatial dataframe with British National Gridnon_irish_stations_sf <-st_as_sf(S_weather_rest, coords =c("easting", "northing"), crs =27700)# Transform Irish stations to WGS84irish_stations_wgs84 <-st_transform(irish_stations_sf, crs =4326)# Transform the rest to WGS84non_irish_stations_wgs84 <-st_transform(non_irish_stations_sf, crs =4326)# Combine the Irish and rest stationsall_stations_wgs84 <-rbind(irish_stations_wgs84, non_irish_stations_wgs84)# Extract coordinatescoords <-st_coordinates(all_stations_wgs84)# Add Longitude and Latitude back to S_weatherS_weather_ordered$Longitude <- coords[, 'X']S_weather_ordered$Latitude <- coords[, 'Y']# This process extracts many more decimal points for longitude/latitude.S_weather <- S_weather_ordered# Create active working dataframe (with all columns handled)weather <- S_weather %>%transmute(station = name,year =as.integer(year),month =as.integer(month),latitude =as.numeric(Latitude),longitude =as.numeric(Longitude),altitude =as.integer(altitude),t_min =as.numeric(tmin),tmin_est =factor(tmin_est, levels =c("FALSE", "TRUE"), labels =c("No", "Yes")),t_max =as.numeric(tmax),tmax_est =factor(tmax_est, levels =c("FALSE", "TRUE"), labels =c("No", "Yes")),af_inmth =as.numeric(af),af_est =factor(af_est, levels =c("FALSE", "TRUE"), labels =c("No", "Yes")),rain =as.numeric(rain),rain_est =factor(rain_est, levels =c("FALSE", "TRUE"), labels =c("No", "Yes")),sun =as.numeric(sun),sun_est =factor(sun_est, levels =c("FALSE", "TRUE"), labels =c("No", "Yes")),sun_sensor =factor(sun_sensor, levels =c("Campbell Stokes", "Kipp & Zonen"),labels =c("Campbell Stokes", "Kipp & Zonen")),provisional =factor(prov_est, levels =c("FALSE", "TRUE"), labels =c("No", "Yes")), )set.seed(123)# Review the full extracted weather_data in a dataframe.# You can see that different lon/lat is attributed to Lowestoft depending on the year. 
(venue changed from 09/2007)# This is an interactive filterable table which you can use to find the station/year/month you desire.weather %>%reactable(columns =list(station =colDef(filterable =TRUE,# Using JavaScript to apply logic for filteringfilterMethod =JS("function(rows, columnId, filterValue) { return rows.filter(function(row) { var cellValue = row.values[columnId].toString().toLowerCase(); var filterText = filterValue.toLowerCase(); return cellValue.indexOf(filterText) !== -1; }) }") ),year =colDef(filterable =TRUE ),month =colDef(filterable =TRUE ) ), defaultPageSize =5)
(iv) Data Validation
The validate package is used here to ensure that there aren’t any glaring outliers or problems needing to be sorted.
Here are the set of rules passed onto the validate package:
year must be between 1850 and the current year
month can only be integers between 1 and 12
latitude can only be between 49.9 and 60.9
longitude can only be between -8 and 2
altitude can only be between -2.75 and 1343
temperature can only be between -50 and 50
af_inmth can only be an integer between 1 and 31
sun and rain can only be floats between 0 and 1000
sun_sensor can only be one of 2 possible sensors
suppressPackageStartupMessages(library(validate)) # To perform quick data validation

# Validation rules
rules <- validator(
  # Check station name length and type
  if (any(!is.na(station))) is.character(station),
  # Year constraints
  year >= 1850,
  year <= as.numeric(format(Sys.Date(), "%Y")), # Current year
  # Month constraints
  month >= 1,
  month <= 12,
  # Latitude and Longitude constraints (UK grid limits)
  latitude >= 49.9,
  latitude <= 60.9,
  longitude >= -8,
  longitude <= 2,
  # Altitude constraints (Lowest and highest points in the UK)
  altitude >= -2.75,
  altitude <= 1343,
  # Temperature, af_inmth, rain, and sun constraints
  t_min >= -50, t_min <= 50,
  t_max >= -50, t_max <= 50,
  af_inmth >= 1, af_inmth <= 31,
  rain >= 0, rain <= 1000,
  sun >= 0, sun <= 1000,
  # sun_sensor and provisional/estimated checks
  is.element(sun_sensor, c("Campbell Stokes", "Kipp & Zonen")),
  # Boolean checks
  is.logical(tmin_est),
  is.logical(tmax_est),
  is.logical(af_est),
  is.logical(rain_est),
  is.logical(sun_est),
  is.logical(provisional),
  # NA checks for sun and sun_sensor
  if (any(is.na(sun))) is.na(sun_sensor)
)

# Validate the dataframe
results <- confront(weather, rules)
results_summary <- summary(results)

# Calculate the total number of fails
total_fails <- sum(results_summary$violations)

# Print the total number of fails
print(paste("Total number of validation rule failures:", total_fails))
[1] "Total number of validation rule failures: 0"
Following the data validation process, it was essential to confirm the accuracy of the longitude and latitude coordinates. To achieve this, I plotted all coordinate values on a map for visual verification against their corresponding station names, ensuring their positions aligned precisely with real-world locations.
suppressPackageStartupMessages(library(leaflet))# for interactive maps# This section aims to create an interactive map which shows the rainfall for all stations in 2023yearly_averages <- weather %>%filter(year ==2023) %>%### SWITCH THE DESIRED YEAR HERE AND RERUN THIS CHUNK ###group_by(station, year, latitude, longitude) %>%summarise(average_rain =round(mean(rain, na.rm =TRUE),2),.groups ='drop' )# Define a custom color palettecustom_palette <-colorNumeric(palette =rev(c("#566f97","#566f97","#cbd2c0", "#b77d34")), domain =c(min(yearly_averages$average_rain, na.rm =TRUE), max(yearly_averages$average_rain, na.rm =TRUE)))legend_breaks <-pretty(range(yearly_averages$average_rain, na.rm =TRUE), n =5)legend_labels <-sprintf("%.2f mm", legend_breaks)# Create a basic leaflet mapleaflet_map <-leaflet(data = yearly_averages) %>%setView(lng =mean(yearly_averages$longitude), lat =mean(yearly_averages$latitude), zoom =5) %>%# Add default OpenStreetMap tilesaddTiles() %>%# Add circle markersaddCircleMarkers(lng =~longitude, lat =~latitude, # Apply custom color palette based on average_raincolor =~custom_palette(average_rain), radius =7,stroke =FALSE, fillOpacity =0.8,# Popup text for each pointpopup =~paste("<b>", station, "</b><br>","Year:", year, "<br>Avg Rainfall (mm):", average_rain) ) %>%# Add a color legendaddLegend(position ="bottomright",pal = custom_palette,values =~average_rain,title ="Average Rainfall (mm)",opacity =1,labFormat =labelFormat(suffix =" mm"),labels = legend_labels )#Its interactive! Try clicking on the circles.leaflet_map
Part 1 : Clustering by Weather Data
Look at the weather data for each of the weather stations.
Looking only at the weather data (that is, don’t include the latitude/longitude of the weather station) can you use a clustering algorithm to see if these stations can be clustered into groups that have “similar” weather? Note that you will have multiple weather readings for each station. You could pick the most recent, an average, an extreme, or you could consider all the points individually and look for clusters in the individual observations. You should try to justify your choice.
I have decided to analyse data averaged across the operational lifespan of each station, clustering the stations based on key weather parameters: minimum temperature (t_min), maximum temperature (t_max), rainfall (rain), sunshine duration (sun), and air frost days (af_inmth). The k-means clustering algorithm will facilitate this process. To determine the most effective number of clusters, I will employ the elbow-plot method. This approach seeks an optimal balance between the number of clusters and the within-cluster sum of squares, aiming to minimise variance within clusters without over-fitting.
Next, I plan to create a scatter plot showing the relationship between sunshine duration and air frost days, as this combination most effectively highlights the clusters. The results will then be integrated into the previously mentioned map, using colour coding for each cluster to visualise their distribution across the UK. Such analysis could provide useful insights, potentially revealing how well a machine learning model can predict geographic location based solely on meteorological data, which in turn could improve our understanding of the interplay between climate variables.
(i) Finding the Optimal Number of Clusters via the Elbow Method
suppressPackageStartupMessages(library(cluster)) # for wss (within sum of squares) calculation

set.seed(123) # For reproducibility

# Create a dataframe which has one line per station,
# And summarises [mean] all the weather variables:
# (Min Temperature, Max Temperature, Rainfall, Sunshine, Air Frost)
selected_data <- weather %>%
  group_by(station) %>%
  summarise(across(c(t_min, t_max, rain, sun, af_inmth), ~mean(., na.rm = TRUE))) %>%
  ungroup() %>%
  drop_na()

# This dataframe is basically the same but retaining the (longitude, latitude)
selected_data_wLocation <- weather %>%
  group_by(station) %>%
  summarise(across(c(t_min, t_max, rain, sun, af_inmth, longitude, latitude), ~mean(., na.rm = TRUE))) %>%
  # Longitude/Latitude is averaged, as some stations have two locations.
  ungroup() %>%
  drop_na()

# Scale the data
# The first column is 'station', which we don't scale
scaled_data <- scale(selected_data[-1])

# Elbow method (show 15 levels)
wss <- sapply(2:15, function(k){
  sum(kmeans(scaled_data, centers = k, nstart = 25)$withinss) # Best of 25 runs.
})

# Elbow method shows the optimal number of clusters == 3.
# Plot to determine the optimal number of clusters
plot(2:15, wss, type = 'b', xlab = 'Number of Clusters', ylab = 'Within groups sum of squares')
(ii) Cluster Plot of Sunshine (hours) against Air Frost (days)
suppressPackageStartupMessages(library(ggpubr)) # For better scatter plots

set.seed(123)

# Apply K-means with the chosen number of clusters. In this example, it's 3.
#### CHANGE `centers=` IF YOU WOULD LIKE TO SET MORE/LESS CLUSTERS. ####
kmeans_result <- kmeans(scaled_data, centers = 3, nstart = 25)

# Add the cluster results to the original data
selected_data_wLocation <- selected_data_wLocation %>%
  mutate(cluster = as.factor(kmeans_result$cluster))

# Analyze the clusters
cluster_centers <- as.data.frame(kmeans_result$centers)

custom_colors <- c("#003c82", "#b9945c", "#506c3f")

# Plotting clusters with ggpubr (`sun` vs `af_inmth` in this example)
cluster <- ggscatter(selected_data_wLocation, x = "sun", y = "af_inmth",
                     color = "cluster", palette = custom_colors,
                     shape = "cluster", ellipse = TRUE, # Draw ellipses around clusters
                     legend = "right") + # Adjust legend position as needed
  ggtitle("2D Scatter Plot of Clusters based on sunshine against air frost days") +
  theme_minimal() +
  labs(color = "Cluster")

cluster
Too few points to calculate an ellipse
(iii) Updating the Map of the Weather Stations, Now Coloured by Cluster.
# Function to return the color based on the cluster numbergetClusterColor <-function(cluster_number) {custom_colors[cluster_number]}# Create a leaflet map using the custom color paletteog_cluster_map <-leaflet(selected_data_wLocation) %>%addTiles() %>%addCircleMarkers(lng =~longitude, lat =~latitude,color =~getClusterColor(cluster),popup =~paste("<b>", station, "</b><br>","Cluster: ", cluster, "<br>","Min Temp: ", sprintf("%.2f",t_min), "°C<br>","Max Temp: ", sprintf("%.2f",t_max), "°C<br>","Rain: ", sprintf("%.2f",rain), " mm<br>","Sun: ", sprintf("%.2f",sun), " hours<br>","Air Frost Days: ", sprintf("%.2f", af_inmth)," days<br>","Longitude: ", longitude, "<br>","Latitude: ", latitude ), opacity =1, fillOpacity =1, radius =4 ) %>%addLegend('bottomright',colors = custom_colors, labels =c("1", "2", "3"), title ='Cluster', opacity =1)# Another interactive map. Click the circles to see MetOffice Data.og_cluster_map
(iv) Weather Characteristics of Each Cluster
In order to quantify the characteristics of each cluster, here is the tabular breakdown of where the cluster centers lie with regards to t_min, t_max, rain, sun and af_inmth.
Cluster 1 (Blue)
Geographical Spread: Predominantly covering Scotland and parts of Northern England.
Temperature: Exhibits cooler temperatures, with lower minimum (t_min) and maximum (t_max) temperature values.
Rainfall: Moderate amounts of rainfall, slightly higher than the southern cluster but less than the forested areas.
Sunshine: The least amount of sunshine compared to the other clusters, which is consistent with regions that are farther north.
Air Frost: Occurrences of air frost (af_inmth) are relatively infrequent but not as rare as in the southern cluster.
Cluster 2 (Brown)
Geographical Spread: Mainly located in the south of the UK, including London and surrounding areas.
Temperature: Warmer temperatures with the highest t_min and t_max values, which is characteristic of the more temperate southern regions.
Rainfall: This cluster experiences the least rainfall, indicating drier conditions.
Sunshine: Receives the most sunshine, aligning with southern areas typically enjoying more sunny days.
Air Frost: The occurrence of air frost is the least within this cluster, fitting the pattern of milder winters in the south.
Cluster 3 (Green)
Geographical Spread: Scattered, but likely represents weather stations in forested regions, possibly in areas of variable elevation.
Temperature: This cluster has the most significant deviation in both minimum and maximum temperatures, suggesting a wide range of conditions, possibly due to varied terrain and canopy cover in forests.
Rainfall: Substantially higher rainfall, the highest among the clusters, which is typical for forested and potentially upland areas.
Sunshine: Less sunshine, which could be due to factors such as increased cloud cover and shading from trees.
Air Frost: A very high frequency of air frost incidents, indicating that these stations might be in areas prone to colder conditions, especially at night or during certain months.
Cluster 1 represents the cooler, moderately wet, and less sunny conditions typical of the north. Cluster 2 showcases the milder, drier, and sunnier climate of the southern UK.
Lastly, Cluster 3 indicates areas with significant rainfall, varied temperatures, and more frequent frost, likely associated with forested and diverse terrain.
# This section breaks down the descriptives of each cluster.
cluster_centers
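Because the variables were standardised before clustering, these centres are reported in z-score units. Here is a sketch of how they could be translated back into the original measurement units, assuming scaled_data retains the attributes produced by scale():

# Reverse the scaling applied by scale() so the cluster centres can be read in
# their original units (degrees C, mm of rain, hours of sun, air frost days).
centers_original <- sweep(kmeans_result$centers, 2, attr(scaled_data, "scaled:scale"), `*`)
centers_original <- sweep(centers_original, 2, attr(scaled_data, "scaled:center"), `+`)
round(as.data.frame(centers_original), 2)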
Part 2 : Multi-Class Classification Problem (Predicting North, Middle South from Weather Data)
You should exclude 5 of the weather stations from this set.
(You could do this by picking the last 5 stations alphabetically).
Can you predict whether they fall in the Northern third of the UK, Central third of the UK or the Southern third of the UK using only the weather data? You have latitude data for all the weather stations, so that can give you the classification for each of the weather stations in your training set. To determine the latitude of the lines dividing the UK into three, you should note that the most northerly point has a latitude of 60.9 and the most southerly point has a latitude of 49.9.
(i) Updating the Cluster Map to show the latitude split lines: (North/Middle/South)
It’s crucial to note how the clusters align with the divisional latitude lines of North, Middle, and South. We enhance the cluster map by incorporating these lines for a clearer geographical context.
To streamline the process, I’ve optimised for automation; modifying the number_of_equal_regions variable allows for easy adjustment of the divisional criteria.
The resulting map suggests that a machine learning model should be able to distinguish Southern from Northern stations fairly easily, but may struggle to separate Middle-region stations from those in the adjacent bands, since it has only meteorological data to work with.
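As a quick sanity check of the three-way split described above (the chunk below derives the same breaks programmatically), dividing the 49.9 to 60.9 latitude range into three equal bands gives boundaries at roughly 53.57 and 57.23 degrees:

# Worked check: (60.9 - 49.9) / 3 = 3.667 degrees per band, so the band edges are:
seq(from = 49.9, to = 60.9, length.out = 4)
# 49.90000 53.56667 57.23333 60.90000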
##SET THE NUMBER OF EQUAL REGIONS REQUIRED HERE## number_of_equal_regions =3##SET THE NUMBER OF EQUAL REGIONS REQUIRED HERE### Ensure number_of_equal_regions is within 2 to 9number_of_equal_regions <-max(min(number_of_equal_regions, 9), 2)# Define the most southerly and northerly latitudessoutherly <-49.9northerly <-60.9# Calculate the total range and the size of each grouptotal_range <- northerly - southerlygroup_size <- total_range / number_of_equal_regions# Create dynamic breaks and labels based on number_of_equal_regionsbreaks <-seq(from = southerly, to = northerly, by = group_size)# Generate dynamic labels. if 2 regions, N/S, if 3, N/M/S, if >3, numeric labels.if(number_of_equal_regions ==2) { labels <-c("South", "North")} elseif(number_of_equal_regions ==3) { labels <-c("South", "Middle", "North")} else { labels <-paste0("Region ", seq(1, number_of_equal_regions))}#Draws the northerly and southerly latitude linescluster_map <- og_cluster_map %>%# Add lines at specific latitudesaddPolylines(lng =c(-10, 2), lat =c(southerly, southerly), color ="blue", weight =2) %>%addPolylines(lng =c(-10, 2), lat =c(northerly, northerly), color ="blue", weight =2)# Loop through each region to add the break polylines at the appropriate place.for(i in1:length(breaks)) { lat <- breaks[i] color <-ifelse(i ==1| i ==length(breaks), "blue", "red") cluster_map <- cluster_map %>%addPolylines(lng =c(-10, 2),lat =c(lat, lat),color = color,weight =2)}cluster_map
(ii) Finalising the Dataframe and Splitting the Dataset into Training and Testing Sets.
We start by finalising the dataframe that will serve as the foundation for training our model. Recognising that a single data point per station would insufficiently support the development of a statistically robust model, I have opted to include one record for each station per year instead. However, in alignment with the assignment’s guidelines, the year variable will not be incorporated into the training process, focusing solely on weather data as predictors.
Following the assignment’s recommendation, the last five stations, when sorted alphabetically, will be set aside for model validation. The data designated for training will be stored in the region_train dataframe, while the data for the stations allocated for testing will be maintained in the region_test dataframe. This approach ensures a clear separation between training and testing datasets, facilitating a more accurate assessment of the model’s predictive capabilities.
suppressPackageStartupMessages(library(tidymodels)) # For training a multiclass classification rf model

# Creates a df which has one line per station and year. (3310 obs.)
weather_stations <- weather %>%
  group_by(station, year) %>%
  reframe(t_min = mean(t_min, na.rm = T),
          t_max = mean(t_max, na.rm = T),
          rain = mean(rain, na.rm = T),
          sun = mean(sun, na.rm = T),
          af = mean(af_inmth, na.rm = T),
          longitude = mean(longitude, na.rm = T),
          latitude = mean(latitude, na.rm = T)) %>%
  distinct(station, year, .keep_all = T) %>%
  ungroup() %>%
  select(-year)

# Use mutate and cut to create a new column 'region'
# This process is fully automated based on the calculations above.
weather_stations_prep <- weather_stations %>%
  mutate(region = cut(latitude,
                      breaks = breaks,
                      labels = labels,
                      include.lowest = TRUE)) %>%
  select(-longitude, -latitude)

# Sort stations in alphabetical order
stations <- sort(unique(weather_stations_prep$station))

# Select the last 5 stations [alphabetically] for the test set (as recommended)
stations_test <- tail(stations, 5)

# Use the remaining stations for the training set
stations_train <- setdiff(stations, stations_test)

# Split the main dataframes
region_train <- weather_stations_prep %>% filter(station %in% stations_train)
region_test <- weather_stations_prep %>% filter(station %in% stations_test)
(iii) Using the TidyModels Library to streamline the Model Training Process.
I will be utilising the tidymodels package to make this model training process as streamlined as possible. Typically, the rsample::initial_split function is used to proportionately create training and testing sets. However, as we are specifying custom, pre-determined groups for the training and testing sets, the rsample::make_splits function is used instead:
# This is the way to create an rsplit object with custom train/test splits
region_split <- make_splits(x = region_train, assessment = region_test)
region_split
To mitigate the risk of overfitting and enhance the model’s generalisability, implementing a 10-fold cross-validation is an industry standard. The rsample::vfold_cv function within the tidymodels framework facilitates this process. By specifying the training dataset, this function automatically partitions it into 10 subsets, or “folds”. During the cross-validation process, the model will be trained on 9 folds and tested on the remaining fold. This cycle repeats 10 times, each time with a different fold serving as the test set, ensuring that every data point is used for both training and validation at least once.
This approach not only helps in reducing the likelihood of overfitting but also provides a more accurate estimate of the model’s performance on unseen data.
set.seed(234)

# Setting up 10 fold cross validation to prevent overfitting.
region_folds <- vfold_cv(region_train, v = 10)
region_folds
A recipe defines preprocessing steps for both training and testing datasets, avoiding data leakage:
The station variable is marked as an identifier, not a predictor.
Missing data is imputed using K-nearest-neighbours (default = 3).
Variables with zero variance are removed.
All variables are scaled for uniformity.
“Juicing” the recipe applies these steps, readying the data for model training.
region_rec <- recipe(region ~ ., data = region_train) %>% # formula
  update_role(station, new_role = "ID") %>% # ensure station name used as ID
  step_impute_knn(all_predictors()) %>% # imputing missing data with knn
  step_zv(all_predictors()) %>% # removing variables with zero variance
  step_normalize(all_predictors()) # scaling the variables

region_prep <- prep(region_rec) # preparing the recipe

juice(region_prep) %>% # What the df looks like after pre-processing steps
  group_by(station) %>%
  sample_n(size = 2, replace = FALSE) %>% # Show 2 per station
  ungroup()
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: rand_forest()
── Preprocessor ────────────────────────────────────────────────────────────────
3 Recipe Steps
• step_impute_knn()
• step_zv()
• step_normalize()
── Model ───────────────────────────────────────────────────────────────────────
Random Forest Model Specification (classification)
Main Arguments:
trees = 100
Computational engine: ranger
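The chunk that builds region_wf is not echoed above, but from the printed workflow it can be reconstructed along these lines (a sketch consistent with the output: a 100-tree random forest on the ranger engine, bundled with the recipe defined earlier):

# Random forest specification matching the printed workflow (100 trees, ranger engine).
rf_spec <- rand_forest(trees = 100) %>%
  set_engine("ranger") %>%
  set_mode("classification")

# Combine the preprocessing recipe and the model into a single workflow object.
region_wf <- workflow() %>%
  add_recipe(region_rec) %>%
  add_model(rf_spec)

region_wf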
This is the final step which fits the training data to the model.
rf_rs <- region_wf %>%
  fit_resamples(
    resamples = region_folds,
    control = control_resamples(save_pred = TRUE)
  )

rf_rs # results of the split for each cross-validation set
(iv) Evaluating the Trained RandomForest Model (with the training data).
This section assesses the model’s accuracy in predicting the regions of the stations involved in the training phase. The model makes a prediction for each row; using dplyr, I then reframe the results to show the average per-station predictions for simplicity.
It reveals that the model misclassified only two stations, indicating a high level of predictive accuracy for the training dataset.
region_pred <- rf_rs %>%collect_predictions() %>%# Get the predictionsmutate(correct = region == .pred_class,pred_class = .pred_class) %>%# Create new column which reports if the prediction was correct or not.left_join(weather_stations_prep %>%mutate(.row =row_number()), by =c(".row", "region")) %>%# Adds the original dataframe onto the prediction dfselect(-id, -.row, -.config, -.pred_class) %>%select(station, region, pred_class, correct, t_min, t_max, rain, sun, af, starts_with(".pred")) region_labels <-if(number_of_equal_regions ==2) {c("South", "North")} elseif(number_of_equal_regions ==3) {c("South", "Middle", "North")} else {paste0("Region ", seq(1, number_of_equal_regions))}region_pred_summary <- region_pred %>%group_by(station) %>%reframe(region = region,pred_class = pred_class,across(starts_with(".pred_"), ~mean(.x, na.rm =TRUE))) %>%ungroup() %>%distinct(station, .keep_all =TRUE) %>%rowwise() %>%mutate(prediction = region_labels[which.max(c_across(starts_with(".pred_")))],prediction =factor(prediction, levels = region_labels)) %>%ungroup() %>%mutate(region =factor(region, levels = region_labels),correct = region == prediction) %>%arrange(correct) %>%select(station, prediction, correct, everything(), -pred_class)region_pred_summary %>%print(n=32)
# A tibble: 32 × 7
station prediction correct region .pred_South .pred_Middle .pred_North
<chr> <fct> <lgl> <fct> <dbl> <dbl> <dbl>
1 Cwmystwyth Middle FALSE South 0.399 0.526 0.0750
2 Nairn there… Middle FALSE North 0.216 0.495 0.289
3 Aberporth South TRUE South 0.785 0.194 0.0206
4 Armagh Middle TRUE Middle 0.371 0.579 0.0505
5 Ballypatrick … Middle TRUE Middle 0.0999 0.473 0.428
6 Bradford Middle TRUE Middle 0.353 0.555 0.0926
7 Braemar Middle TRUE Middle 0.0446 0.926 0.0294
8 Camborne South TRUE South 0.906 0.0843 0.00957
9 Cambridge NIAB South TRUE South 0.864 0.114 0.0218
10 Cardiff Bute … South TRUE South 0.829 0.165 0.00646
11 Chivenor South TRUE South 0.866 0.114 0.0201
12 Dunstaffnage Middle TRUE Middle 0.170 0.737 0.0926
13 Durham Middle TRUE Middle 0.292 0.565 0.143
14 Eastbourne South TRUE South 0.977 0.0134 0.00929
15 Eskdalemuir Middle TRUE Middle 0.108 0.870 0.0221
16 Heathrow (Lon… South TRUE South 0.938 0.0539 0.00761
17 Hurn South TRUE South 0.906 0.0883 0.00580
18 Lerwick North TRUE North 0.0528 0.0671 0.880
19 Leuchars Middle TRUE Middle 0.295 0.600 0.105
20 Lowestoft / L… South TRUE South 0.892 0.0775 0.0303
21 Manston South TRUE South 0.941 0.0341 0.0247
22 Newton Rigg Middle TRUE Middle 0.212 0.718 0.0696
23 Oxford South TRUE South 0.820 0.159 0.0212
24 Paisley Middle TRUE Middle 0.265 0.689 0.0462
25 Ringway South TRUE South 0.614 0.359 0.0275
26 Ross-On-Wye South TRUE South 0.808 0.168 0.0240
27 Shawbury South TRUE South 0.696 0.267 0.0371
28 Sheffield South TRUE South 0.605 0.344 0.0513
29 Southampton South TRUE South 0.915 0.0811 0.00431
30 Stornoway North TRUE North 0.0512 0.339 0.610
31 Sutton Boning… South TRUE South 0.748 0.203 0.0488
32 Tiree Middle TRUE Middle 0.195 0.686 0.119
The model’s accuracy is excellent at 93.8%, with a sensitivity of 87.0% and a specificity of 96.8%.
# Calculate accuracy and store it as a data frame
accuracy_df <- accuracy(region_pred_summary, truth = region, estimate = prediction) %>%
  as.data.frame() %>%
  mutate(metric = "accuracy")

# Calculate sensitivity and specificity and store them as data frames
sensitivity_df <- sensitivity(region_pred_summary, truth = region, estimate = prediction) %>%
  as.data.frame() %>%
  mutate(metric = "sensitivity")

specificity_df <- specificity(region_pred_summary, truth = region, estimate = prediction) %>%
  as.data.frame() %>%
  mutate(metric = "specificity")

# Combine the three data frames into one table
combined_metrics <- bind_rows(accuracy_df, sensitivity_df, specificity_df) %>%
  select(metric, -.metric, .estimator, .estimate)

combined_metrics
Here is the confusion matrix of the stations used in the training set, when applied to the model, showing excellent classification ability.
conf_mat(region_pred_summary, truth = region, estimate = prediction)
Truth
Prediction South Middle North
South 17 0 0
Middle 1 11 1
North 0 0 2
Running a quick variable importance analysis with the vip package shows that the key factors that help the model make its decision are the maximum and minimum temperature values.
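The vip chunk itself is not echoed here; a hedged sketch of how that check could be reproduced (ranger only records importance when asked, so the workflow is refitted with impurity importance enabled):

suppressPackageStartupMessages(library(vip))

# Refit the same recipe with a ranger forest that records impurity importance,
# then plot the most influential predictors.
rf_vip_spec <- rand_forest(trees = 100) %>%
  set_engine("ranger", importance = "impurity") %>%
  set_mode("classification")

workflow() %>%
  add_recipe(region_rec) %>%
  add_model(rf_vip_spec) %>%
  fit(data = region_train) %>%
  extract_fit_parsnip() %>%
  vip(num_features = 5)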
The updated map shows the model accurately classifies the training stations, which is a good start. However, to truly measure its effectiveness, we need to test it on unseen data and see if it can maintain this accuracy outside its training environment.
# Function to determine the color based on the prediction being correct, incorrect, or not foundgetPredictionColor <-function(station_name) {# Check if the station exists in the region_pred_summaryif(station_name %in% region_pred_summary$station) {# Get the correct value for the station correct_value <- region_pred_summary$correct[region_pred_summary$station == station_name]# Return green if true, red if falseif(correct_value) {return("lightgreen") } else {return("red") } } else {# Return grey if not foundreturn("black") }}colors_vector <-lapply(selected_data_wLocation$station, getPredictionColor)trained_cluster_map <- cluster_map %>%addTiles() %>%addCircleMarkers(lng =~longitude, lat =~latitude,color = colors_vector, # Using the colors_vectoropacity =1, fillOpacity =1, radius =0.2 ) %>%addLegend('bottomleft',colors =c("lightgreen", "red", "black"), labels =c("Correct", "Incorrect", "TEST SET"), title ='Prediction Accuracy', opacity =1)# Display the maptrained_cluster_map
(v) Evaluating the Trained RandomForest Model (with the unseen testing data).
region_final <- region_wf %>%
  # Applying the model on the testing data is a simple process in TidyModels.
  last_fit(region_split)
Here, we can see that out of the 5 stations, the model managed to accurately determine the region of 4 of these stations.
region_final_pred <- region_final %>%collect_predictions() %>%mutate(correct = region == .pred_class,pred_class = .pred_class) %>%# Create new column which reports if the prediction was correct or not.left_join(weather_stations_prep %>%mutate(.row =row_number()), by =c(".row", "region")) %>%# Adds the original dataframe onto the prediction dfselect(-id, -.row, -.config, -.pred_class) %>%select(station, region, pred_class, correct, t_min, t_max, rain, sun, af, starts_with(".pred")) region_labels <-if(number_of_equal_regions ==2) {c("South", "North")} elseif(number_of_equal_regions ==3) {c("South", "Middle", "North")} else {paste0("Region ", seq(1, number_of_equal_regions))}region_final_pred_summary <- region_final_pred %>%group_by(station) %>%reframe(region = region,pred_class = pred_class,across(starts_with(".pred_"), ~mean(.x, na.rm =TRUE))) %>%ungroup() %>%distinct(station, .keep_all =TRUE) %>%rowwise() %>%mutate(prediction = region_labels[which.max(c_across(starts_with(".pred_")))],prediction =factor(prediction, levels = region_labels)) %>%ungroup() %>%mutate(region =factor(region, levels = region_labels),correct = region == prediction) %>%arrange(correct) %>%select(station, prediction, correct, everything(), -pred_class)region_final_pred_summary
# A tibble: 5 × 7
station prediction correct region .pred_South .pred_Middle .pred_North
<chr> <fct> <lgl> <fct> <dbl> <dbl> <dbl>
1 Whitby Coastgu… South FALSE Middle 0.696 0.175 0.129
2 Valley South TRUE South 0.920 0.0638 0.0166
3 Waddington South TRUE South 0.666 0.273 0.0603
4 Wick Airport North TRUE North 0.173 0.307 0.520
5 Yeovilton South TRUE South 0.917 0.0742 0.00871
The model’s performance on the testing data, an accuracy of 80% with a sensitivity of 67% and a specificity of 83.3%, points to reasonable generalisability, although five stations is a small test set. These metrics suggest the model is likely to remain effective in classifying other unseen stations, maintaining a balance between correctly identifying true positives (sensitivity) and true negatives (specificity).
accuracy_df <- accuracy(region_final_pred_summary, truth = region, estimate = prediction) %>%
  as.data.frame() %>%
  mutate(metric = "accuracy")

# Calculate sensitivity and specificity and store them as data frames
sensitivity_df <- sensitivity(region_final_pred_summary, truth = region, estimate = prediction) %>%
  as.data.frame() %>%
  mutate(metric = "sensitivity")

specificity_df <- specificity(region_final_pred_summary, truth = region, estimate = prediction) %>%
  as.data.frame() %>%
  mutate(metric = "specificity")

# Combine the three data frames into one table
combined_final_metrics <- bind_rows(accuracy_df, sensitivity_df, specificity_df) %>%
  select(metric, -.metric, .estimator, .estimate)

combined_final_metrics
conf_mat(region_final_pred_summary, truth = region, estimate = prediction)
Truth
Prediction South Middle North
South 3 1 0
Middle 0 0 0
North 0 0 1
(vi) Updating the Cluster Map with Predictions
The prediction map shows that the model classifies most of the unseen testing stations correctly (four of the five), suggesting it handles new data well.
# Function to determine the color based on the prediction being correct, incorrect, or not foundgetPredictionColor <-function(station_name) {# Check if the station exists in the region_pred_summaryif(station_name %in% region_final_pred_summary$station) {# Get the correct value for the station correct_value <- region_final_pred_summary$correct[region_final_pred_summary$station == station_name]# Return green if true, red if falseif(correct_value) {return("lightgreen") } else {return("red") } } else {# Return grey if not foundreturn("black") }}colors_vector <-lapply(selected_data_wLocation$station, getPredictionColor)testing_cluster_map <- cluster_map %>%addTiles() %>%addCircleMarkers(lng =~longitude, lat =~latitude,color = colors_vector, # Using the colors_vectoropacity =1, fillOpacity =1, radius =1 ) %>%addLegend('bottomleft',colors =c("lightgreen", "red", "black"), labels =c("Correct", "Incorrect", "TRAINING SET"), title ='Prediction Accuracy', opacity =1)# Display the maptesting_cluster_map
Part 3 : Real World Happiness Data from ONS
The Office for National Statistics collects and publishes data on how happy people think they are. This is self-reported data, so it may well have some inherent biases, but it is good enough for us to draw some general conclusions. Happiness is measured on a scale of 0-10. Let’s try to answer the question as to whether the weather affects how happy we are.
Try to join the weather station data with the happiness data. You will only be able to do this in a coarse-grained way, because the places where there are weather stations do not correspond directly to the census areas used in the happiness surveys. You could use bands of latitude to do this, or you could put a grid over the country and look at which grid cell a weather station or census region falls in. Don’t worry too much about the fact that weather data and census data don’t cover exactly the same time periods; for the purposes of this assessment, we can assume that to a first approximation the general climate at a given weather station is fairly constant over time and that happiness also does not change too much from year to year.
(i) Importing Happiness Data from ONS into R
To streamline the process with automation, I wrote an R function that downloads and imports the data directly from a specified URL. The function takes the URL as an input parameter and returns the imported dataset, ready for further analysis.
library(httr)
library(readxl)

# Define the function to download and read an Excel file directly from a URL
read_excel_from_url <- function(url) {
  # Use httr to handle the HTTP GET request
  response <- GET(url)

  # Check if the request was successful
  if (status_code(response) != 200) {
    stop("Failed to download the file: HTTP status code ", status_code(response))
  }

  # Use httr's content function to access the file content
  # and readxl's read_excel to read the content into R.
  # The tempfile() function is used to create a temporary file on disk,
  # so the workflow is not dependent on a fixed local path, making it more automated.
  temp_file <- tempfile(fileext = ".xls")
  writeBin(content(response, "raw"), temp_file)

  data <- suppressMessages(read_excel(temp_file,
                                      sheet = "Happiness",
                                      skip = 6, # Assumes all Data starts on line 7
                                      col_names = FALSE,
                                      na = "x",
                                      progress = FALSE))
  unlink(temp_file)

  # Return the data frame
  return(data)
}
Next, the data is imported from the specified URLs into S_Happiness1415, S_Happiness1314, S_Happiness1213 and S_Happiness1112.
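The four import calls themselves are not echoed above; as a usage sketch of the helper (the URL below is a placeholder rather than the actual ONS address):

# Placeholder URL - substitute the actual ONS personal well-being workbook address for each release.
happiness_1415_raw <- read_excel_from_url("https://www.ons.gov.uk/<happiness-workbook-2014-15>.xls")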
(ii) Use the Nominatim API to query longitude/latitude for each region in the 2014/15 sheet.
Due to request limits from Nominatim, this section of code should take at least 7 minutes to execute.
I recognise that the happiness data from other years contain more locations, but for simplicity I have opted to use only the latest (2014/15) dataset as the source of locations.
library(tidygeocoder)

# Using Nominatim API to get longitude/latitude from addresses.
# This section will take approximately 7 minutes to run due to request limits.
lat_longs_B1415 <- S_Happiness1415 %>%
  select(areaB) %>%
  drop_na() %>%
  geocode(areaB, method = 'osm', lat = latitude, long = longitude)
Passing 92 addresses to the Nominatim single address geocoder
Query completed in: 92.3 seconds
lat_longs_C1415 <- S_Happiness1415 %>%
  select(areaC) %>%
  drop_na() %>%
  geocode(areaC, method = 'osm', lat = latitude, long = longitude)
Passing 335 addresses to the Nominatim single address geocoder
Query completed in: 335.9 seconds
(iii) Data Cleaning to Produce the Final Working Dataframe with All Variables of Interest.
#Merging the Longitudes and Latitudes to the main dataframe.S_Happiness1415 <-left_join(S_Happiness1415, lat_longs_B1415, by =c("areaB")) S_Happiness1415 <-left_join(S_Happiness1415, lat_longs_C1415, by =c("areaC"))# Ensure that there are no entries without 'mean_rating'S_Happiness1415 <- S_Happiness1415 %>%distinct() %>%drop_na(mean_rating1415)Mean_S_Happiness1415_areaA <- S_Happiness1415 %>%group_by(areaA) %>%# For each areaA,# I am calculating the mean center of all the subregions myself.# As the area is regional areas are small, we will assume the Earth is flat here.summarise(longitude.a =mean(longitude.y, na.rm = T),latitude.a =mean(latitude.y, na.rm = T)) %>%filter(longitude.a !="NaN")S_Happiness1415 <-left_join(S_Happiness1415, Mean_S_Happiness1415_areaA, by =c("areaA"))S_Happiness1415 <- S_Happiness1415 %>%mutate(latitude =case_when(!is.na(latitude.y) ~ latitude.y,is.na(latitude.y) &!is.na(latitude.x) ~ latitude.x,is.na(latitude.y) &is.na(latitude.x) &!is.na(latitude.a) ~ latitude.a,TRUE~as.numeric(NA) # This line is actually redundant, as NA would be the default, but included for clarity. ),longitude =case_when(!is.na(longitude.y) ~ longitude.y,is.na(longitude.y) &!is.na(longitude.x) ~ longitude.x,is.na(longitude.y) &is.na(longitude.x) &!is.na(longitude.a) ~ longitude.a,TRUE~as.numeric(NA) # Same as above, included for clarity. )) %>%select(-latitude.a, -latitude.x, -latitude.y, -longitude.a, -longitude.x, -longitude.y)S_Happiness <-left_join(S_Happiness1415, S_Happiness1314, by ="area_code")S_Happiness <-left_join(S_Happiness, S_Happiness1213, by ="area_code")S_Happiness <-left_join(S_Happiness, S_Happiness1112, by ="area_code")
#This allows you to better understand the columns which are present in the dfHappiness <- S_Happiness %>%transmute(area_code =as.character(area_code), #area codeareaA =as.character(areaA), # Widest area zoneareaB =as.character(areaB), # Narrower zone than areaAareaC =as.character(areaC), # Narrower zone than areaBlow_perc1415 =as.numeric(low_perc1415), # Percentage who scored Lowmed_perc1415 =as.numeric(med_perc1415), # Percentage who scored Medhigh_perc1415 =as.numeric(high_perc1415), # Percentage who scored Highv_high_perc1415 =as.numeric(v_high_perc1415), # Percentage who scored Very Highlow_perc1314 =as.numeric(low_perc1314), # Percentage who scored Lowmed_perc1314 =as.numeric(med_perc1314), # Percentage who scored Medhigh_perc1314 =as.numeric(high_perc1314), # Percentage who scored Highv_high_perc1314 =as.numeric(v_high_perc1314), # Percentage who scored Very Highlow_perc1213 =as.numeric(low_perc1213), # Percentage who scored Lowmed_perc1213 =as.numeric(med_perc1213), # Percentage who scored Medhigh_perc1213 =as.numeric(high_perc1213), # Percentage who scored Highv_high_perc1213 =as.numeric(v_high_perc1213), # Percentage who scored Very Highlow_perc1112 =as.numeric(low_perc1112), # Percentage who scored Lowmed_perc1112 =as.numeric(med_perc1112), # Percentage who scored Medhigh_perc1112 =as.numeric(high_perc1112), # Percentage who scored Highv_high_perc1112 =as.numeric(v_high_perc1112), # Percentage who scored Very Highmean_rating1415 =as.numeric(mean_rating1415), # Mean Happiness Rating 1415mean_rating1314 =as.numeric(mean_rating1314), # Mean Happiness Rating 1314mean_rating1213 =as.numeric(mean_rating1213), # Mean Happiness Rating 1213mean_rating1112 =as.numeric(mean_rating1112), # Mean Happiness Rating 1112low_cv =as.numeric(low_cv),low_cv =as.numeric(low_cv), # Variation for Lowlow_llim =as.numeric(low_llim),low_ulim =as.numeric(low_ulim),med_cv =as.numeric(med_cv), # Variation for Medmed_llim =as.numeric(med_llim),med_ulim =as.numeric(med_ulim),high_cv =as.numeric(high_cv), # Variation for Highhigh_llim =as.numeric(high_llim),high_ulim =as.numeric(high_ulim),vhigh_cv =as.numeric(vhigh_cv), # Variation for Very Highvhigh_llim =as.numeric(vhigh_llim),vhigh_ulim =as.numeric(vhigh_ulim),mean_cv1415 =as.numeric(mean_cv1415), # Mean Variation 1415mean_cv1314 =as.numeric(mean_cv1314), # Mean Variation 1314mean_cv1213 =as.numeric(mean_cv1213), # Mean Variation 1213mean_cv1112 =as.numeric(mean_cv1112), # Mean Variation 1112mean_llim =as.numeric(mean_llim),mean_ulim =as.numeric(mean_ulim),sample_size =as.numeric(sample_size), # Sample Sizelatitude =as.numeric(latitude), # Latitude based on the narrowest Arealongitude =as.numeric(longitude) # Longitude based on the narrowest Area ) %>%# Calculating the mean happiness rating and mean cv of the happiness ratingmutate(mean_rating =rowMeans(select(., mean_rating1415, mean_rating1314, mean_rating1213, mean_rating1112), na.rm =TRUE),mean_cv =rowMeans(select(., mean_cv1415, mean_cv1314, mean_cv1213, mean_cv1112), na.rm =TRUE))
(iv) Calculating Weighted Weather Data for Each Location from All 37 Weather Stations
First, we subset the weather dataframe to use only weather data from 5 years before and after the Happiness survey's median year of 2013, i.e., only weather data from 2008 to 2018 is included.
weather_byStation <- weather %>%
  # Filter for 5 years before and 5 years after the survey median point (2013).
  filter(year >= 2008 & year <= 2018) %>%
  group_by(station) %>%
  summarise(across(c(t_min, t_max, rain, sun, af_inmth, longitude, latitude),
                   ~ mean(., na.rm = TRUE))) %>%
  # Lon/Lat is averaged, as some stations have 2 locations (depending on the year).
  ungroup() %>%
  drop_na()
This chunk now calculates the distances between each location in the happiness dataset and each weather station using the Haversine formula, which accounts for the Earth's curvature and so gives accurate distance measurements. These distances are used to compute weights inversely proportional to distance, so that weather stations closer to a given location have a greater influence on the weather data attributed to that location. This ensures that the weather conditions experienced in each area are represented primarily by nearby weather stations.
After computing these weights, the script calculates weighted averages of several weather variables (minimum temperature, maximum temperature, rainfall, sunshine, and days with air frost) for each location in the happiness dataset. Each weather variable is multiplied by the corresponding weight and the products are summed, giving an average that reflects the proximity, and hence the relevance, of each weather station's data. A high power of 5 is used so that stations closer to the survey locations receive exponentially greater weight, as shown in the formula below.
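In the notation used here, if $d_{ij}$ is the Haversine distance between survey location $i$ and weather station $j$, the weights and the weighted value of a station-level weather variable $x$ (for example t_min) computed in the chunk below are

$$
w_{ij} \;=\; \frac{(d_{ij} + \varepsilon)^{-p}}{\sum_{k}(d_{ik} + \varepsilon)^{-p}},
\qquad
\tilde{x}_{i} \;=\; \sum_{j} w_{ij}\, x_{j},
$$

with $p = 5$ and a small constant $\varepsilon = 10^{-4}$ m added to avoid division by zero.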
These weighted averages are then appended to the happiness dataset as new columns. The resulting "weather-enriched" dataset, Happiness_www, includes location-specific weighted weather data and provides a robust basis for further analysis of the relationship between weather conditions and happiness.
suppressPackageStartupMessages(library(geosphere))  # To calculate Haversine distances

happiness_df <- Happiness %>% drop_na(latitude, longitude)
weather_df <- weather_byStation

# Function to calculate distances between locations in 'happiness_df' and 'weather_df'
# using the Haversine formula.
calculate_distances <- function(happiness_df, weather_df) {
  # Initialize an empty matrix with dimensions based on the number of rows in both dataframes.
  distance_matrix <- matrix(nrow = nrow(happiness_df), ncol = nrow(weather_df))
  # Loop through each row of 'happiness_df'
  for (i in 1:nrow(happiness_df)) {
    # And for each row of 'weather_df'
    for (j in 1:nrow(weather_df)) {
      # Calculate the distance using the Haversine formula between the two points.
      distance_matrix[i, j] <- distHaversine(
        c(happiness_df$longitude[i], happiness_df$latitude[i]),
        c(weather_df$longitude[j], weather_df$latitude[j])
      )
    }
  }
  return(distance_matrix)
}

# Calculate the distance matrix using the function defined above.
distance_matrix <- calculate_distances(happiness_df, weather_df)

# Function to calculate and normalize weights based on distances.
# A default weight power of 5 is set here. This can be changed as desired.
calculate_weights <- function(distance_matrix, power = 5) {
  # Calculate weights as the inverse of distance, with a small constant added to avoid division by zero.
  weights_matrix <- 1 / (distance_matrix + 1e-4)^power
  # Normalize the weights for each location in 'happiness_df' such that they sum to 1.
  weights_matrix <- weights_matrix / rowSums(weights_matrix)
  return(weights_matrix)
}

# Calculate the weights matrix using the function defined above.
weights_matrix <- calculate_weights(distance_matrix)

# Function to calculate weighted averages for a given weather variable.
calculate_weighted_average <- function(weather_df, weights_matrix, variable) {
  # Use 'apply' to iterate over each row of the weights matrix.
  weighted_averages <- apply(weights_matrix, 1, function(weights, variable) {
    # Calculate the weighted average of the specified variable using the corresponding weights.
    sum(weights * weather_df[[variable]])
  }, variable = variable)
  return(weighted_averages)
}

# Define a list of weather variables for which to calculate weighted averages.
variables <- c('t_min', 't_max', 'rain', 'sun', 'af_inmth')

# Loop over each variable to calculate weighted averages and add them to 'happiness_df' as new columns.
for (variable in variables) {
  happiness_df[[paste0('weighted_', variable)]] <-
    calculate_weighted_average(weather_df, weights_matrix, variable)
}

# Assign the modified 'happiness_df' with added weighted weather variables to a new variable for further use.
Happiness_www <- happiness_df  # Happiness with Weighted Weather
# The closer the survey long/lat is to a weather station, the more that station is
# weighted in that location's calculated weather data.
(v) Validation of Weighted Weather Values for Eastbourne
This code segment compares the actual weather data for the Eastbourne station with the calculated weighted weather values associated with an Eastbourne Happiness survey address, reporting the percentage difference for each variable.
The objective is to verify that the weather data aligns closely with the true values.
# These are the real values for Eastbourne
weather_data <- weather_byStation %>%
  filter(station == "Eastbourne") %>%
  select(t_min, t_max, rain, sun, af_inmth)

# Given this survey is in Eastbourne, the values should be very close to the above.
weighted_data <- Happiness_www %>%
  filter(areaC == "Eastbourne, United Kingdom") %>%
  select(weighted_t_min, weighted_t_max, weighted_rain, weighted_sun, weighted_af_inmth)

# Calculate the percentage differences between all weather data.
percentage_differences <- sapply(names(weather_data), function(var) {
  weather_value <- weather_data[[var]]
  weighted_value <- weighted_data[[paste0("weighted_", var)]]
  percentage_diff <- ((weighted_value - weather_value) / weather_value) * 100
  return(percentage_diff)
})

percentage_differences_formatted <- sapply(percentage_differences,
                                           function(x) sprintf('%.10f%%', x))

# View the results.
print(percentage_differences_formatted)
t_min t_max rain sun
"-0.0000000003%" "0.0000000000%" "-0.0000000004%" "-0.0000000003%"
af_inmth
"0.0000000016%"
All values fall within a +/- 0.0001% range, indicating that the weighting closely reproduces the actual station data. This is expected: with a distance-weight power of 5, the nearest station (here Eastbourne itself) receives essentially all of the weight. The result supports the accuracy and reliability of the approach.
(vi) Updating the Station Cluster Map with Happiness Data Locations
This refined map overlays the Happiness data across the survey locations; circle sizes represent sample_size, and weather stations are shown in black for clarity.
The color gradient from red to blue illustrates happiness levels: red for lower happiness and blue for higher.
scale_radius <- function(size) {  # Function to scale the sample_size
  scaled_size <- (size)^(1/4)     # Fourth-root scaling
  return(scaled_size)
}

# Group by longitude, latitude and areaC, then summarise by calculating the mean
Happiness_www_grouped <- Happiness_www %>%
  group_by(longitude, latitude, areaC) %>%
  reframe(across(everything(), ~ mean(., na.rm = TRUE))) %>%
  ungroup()

# Find the range of the mean_rating
rating_min <- min(Happiness_www_grouped$mean_rating, na.rm = TRUE)
rating_max <- max(Happiness_www_grouped$mean_rating, na.rm = TRUE)

# Create a function to map mean_rating to colors
# Adjusting the scale to spread colors more across the entire range
customPalette <- colorNumeric(
  palette = c("darkred", "white", "blue", "darkblue"),
  domain = c(rating_min, rating_max)
)

leaflet_map <- leaflet() %>%
  addTiles() %>%
  # Add Happiness_wscore markers with color varying by mean_rating
  addCircleMarkers(
    data = Happiness_www_grouped,
    ~longitude, ~latitude,
    popup = ~paste(
      "<b>", areaC, "</b><br>",
      "Sample Size: ", sample_size, "<br>",
      "Mean Score: ", mean_rating, "<br>",
      "Min Temp: ", sprintf("%.2f", weighted_t_min), "°C<br>",
      "Max Temp: ", sprintf("%.2f", weighted_t_max), "°C<br>",
      "Sunshine: ", sprintf("%.2f", weighted_sun), " hours<br>",
      "Rainfall: ", sprintf("%.2f", weighted_rain), " mm<br>",
      "Air Frost Days: ", sprintf("%.2f", weighted_af_inmth), " days<br>"
    ),
    color = ~customPalette(mean_rating),
    opacity = 0, fillOpacity = 0.8,
    radius = ~scale_radius(sample_size)
  ) %>%
  # Add weather data markers
  addCircleMarkers(
    data = selected_data_wLocation,
    ~longitude, ~latitude,
    color = "black",
    popup = ~paste(
      "<b>", station, "</b><br>",
      "Cluster: ", cluster, "<br>",
      "Min Temp: ", sprintf("%.2f", t_min), "°C<br>",
      "Max Temp: ", sprintf("%.2f", t_max), "°C<br>",
      "Rain: ", sprintf("%.2f", rain), " mm<br>",
      "Sun: ", sprintf("%.2f", sun), " hours<br>",
      "Air Frost Days: ", sprintf("%.2f", af_inmth), " days<br>",
      "Longitude: ", longitude, "<br>",
      "Latitude: ", latitude
    ),
    opacity = 1, fillOpacity = 1, radius = 1
  ) %>%
  addProviderTiles(providers$Stamen.TonerLite,
                   options = providerTileOptions(opacity = 0.5)) %>%
  setView(lng = mean(Happiness_www_grouped$longitude),
          lat = mean(Happiness_www_grouped$latitude),
          zoom = 6)

# Click on the circles to see the calculated weather data for each happiness survey location
leaflet_map
(vii) Restructuring the Dataframe (One Row for Each Location for Each Year)
# First pivot the table for happiness scores
Happiness_cut_score <- Happiness_www %>%
  select(areaA, areaB, areaC, mean_rating1112, mean_rating1213, mean_rating1314,
         mean_rating1415, starts_with("weighted_")) %>%
  pivot_longer(cols = c(mean_rating1112, mean_rating1213, mean_rating1314, mean_rating1415),
               names_to = "year_rating", values_to = "mean_rating_year") %>%
  select(areaA, areaB, areaC, year_rating, mean_rating_year)

# Secondly, pivot the table for happiness score cvs
Happiness_cut_cv <- Happiness_www %>%
  select(areaA, areaB, areaC, mean_cv1112, mean_cv1213, mean_cv1314, mean_cv1415,
         starts_with("weighted_")) %>%
  pivot_longer(cols = c(mean_cv1112, mean_cv1213, mean_cv1314, mean_cv1415),
               names_to = "year_cv", values_to = "mean_cv_year") %>%
  select(areaA, areaB, areaC, year_cv, mean_cv_year)

# Pivot the table for low happiness score as a proportion
Happiness_cut_low <- Happiness_www %>%
  select(areaA, areaB, areaC, low_perc1112, low_perc1213, low_perc1314, low_perc1415,
         starts_with("weighted_")) %>%
  pivot_longer(cols = c(low_perc1112, low_perc1213, low_perc1314, low_perc1415),
               names_to = "year_low", values_to = "low_perc_year") %>%
  select(areaA, areaB, areaC, year_low, low_perc_year)

# Pivot the table for medium happiness score as a proportion
Happiness_cut_med <- Happiness_www %>%
  select(areaA, areaB, areaC, med_perc1112, med_perc1213, med_perc1314, med_perc1415,
         starts_with("weighted_")) %>%
  pivot_longer(cols = c(med_perc1112, med_perc1213, med_perc1314, med_perc1415),
               names_to = "year_med", values_to = "med_perc_year") %>%
  select(areaA, areaB, areaC, year_med, med_perc_year)

# Pivot the table for high happiness score as a proportion
Happiness_cut_high <- Happiness_www %>%
  select(areaA, areaB, areaC, high_perc1112, high_perc1213, high_perc1314, high_perc1415,
         starts_with("weighted_")) %>%
  pivot_longer(cols = c(high_perc1112, high_perc1213, high_perc1314, high_perc1415),
               names_to = "year_high", values_to = "high_perc_year") %>%
  select(areaA, areaB, areaC, year_high, high_perc_year)

# Pivot the table for very high happiness score as a proportion
Happiness_cut_vhigh <- Happiness_www %>%
  select(areaA, areaB, areaC, v_high_perc1112, v_high_perc1213, v_high_perc1314, v_high_perc1415,
         starts_with("weighted_")) %>%
  pivot_longer(cols = c(v_high_perc1112, v_high_perc1213, v_high_perc1314, v_high_perc1415),
               names_to = "year_vhigh", values_to = "v_high_perc_year") %>%
  select(areaA, areaB, areaC, year_vhigh, v_high_perc_year)

# Finally, put it all together
Happiness_cut_pre <- suppressWarnings(left_join(Happiness_cut_score, Happiness_cut_cv,
                                                by = c("areaA", "areaB", "areaC")))
Happiness_cut_pre <- suppressWarnings(left_join(Happiness_cut_pre, Happiness_cut_low,
                                                by = c("areaA", "areaB", "areaC")))
Happiness_cut_pre <- suppressWarnings(left_join(Happiness_cut_pre, Happiness_cut_med,
                                                by = c("areaA", "areaB", "areaC")))
Happiness_cut_pre <- suppressWarnings(left_join(Happiness_cut_pre, Happiness_cut_high,
                                                by = c("areaA", "areaB", "areaC")))
Happiness_cut_pre <- suppressWarnings(left_join(Happiness_cut_pre, Happiness_cut_vhigh,
                                                by = c("areaA", "areaB", "areaC"))) %>%
  filter(str_sub(year_cv, -4, -1) == str_sub(year_rating, -4, -1) &
         str_sub(year_cv, -4, -1) == str_sub(year_low, -4, -1) &
         str_sub(year_cv, -4, -1) == str_sub(year_med, -4, -1) &
         str_sub(year_cv, -4, -1) == str_sub(year_high, -4, -1) &
         str_sub(year_cv, -4, -1) == str_sub(year_vhigh, -4, -1)) %>%
  # This filtering process gets rid of all the duplicate rows
  mutate(year = str_sub(year_cv, -4, -1)) %>%
  select(-year_cv, -year_rating, -year_low, -year_med, -year_high, -year_vhigh)

Happiness_cut <- suppressWarnings(left_join(Happiness_www, Happiness_cut_pre,
                                            by = c("areaA", "areaB", "areaC")))
# Pivot-longed dataframe which makes it easy to run scatter plots
Happiness_cut %>%
  select(areaC, year, mean_rating_year, mean_cv_year, low_perc_year, med_perc_year,
         high_perc_year, v_high_perc_year, starts_with("weighted_")) %>%
  drop_na(areaC) %>%
  reactable(defaultPageSize = 8)
(viii) Creating a “Good Weather Score” and Generating Scatter Plots with Regression Lines to Investigate Relationships
The good_weather_score is an arbitrarily determined score composed of 20% minimum temperature, 20% maximum temperature, 30% sunshine, 20% inverted rainfall and 10% inverted air-frost days in the month; a sketch of how such a score can be constructed is given below.
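The chunk that actually builds Happiness_wscore is not reproduced in this section, so the following is only a minimal sketch of one plausible construction. It assumes each weighted weather column is z-scaled before weighting (so the components are on a comparable scale) and that "inverted" simply flips the sign of rainfall and air-frost days; apart from Happiness_cut and the weighted_* columns, the names are illustrative.

library(dplyr)  # assumed to be loaded earlier in the document

# Minimal sketch (not the original chunk): build a 20/20/30/20/10 weighted
# good_weather_score from the z-scaled weighted weather columns.
Happiness_wscore <- Happiness_cut %>%
  mutate(
    good_weather_score =
      0.20 * as.numeric(scale(weighted_t_min)) +
      0.20 * as.numeric(scale(weighted_t_max)) +
      0.30 * as.numeric(scale(weighted_sun)) -
      0.20 * as.numeric(scale(weighted_rain)) -      # inverted: more rain lowers the score
      0.10 * as.numeric(scale(weighted_af_inmth))    # inverted: more frost days lower the score
  )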
# CLICK THE OTHER TABS TO SEE THE PLOTS FOR EACH OF THE WEATHER VARIABLES.
# Scatter plot and regression line of mean rating against the good weather score.
# The Spearman method is used as it is less susceptible to leveraged points.
ggscatter(Happiness_wscore, x = "mean_rating_year", y = "good_weather_score",
          size = 0.3, color = "year", palette = "jco",
          add = "reg.line", conf.int = TRUE) +
  stat_cor(aes(color = year), method = "spearman", label.x = 5.5)
This plot shows that there is no correlation between the “good weather score” and Happiness ratings for each year.
ggscatter(Happiness_wscore, x = "mean_rating_year", y = "weighted_t_min",
          size = 0.3, color = "year", palette = "jco",
          add = "reg.line", conf.int = TRUE) +
  stat_cor(aes(color = year), method = "spearman", label.x = 5.5)
This plot shows that there is no correlation between minimum temperature and Happiness rating for all years.
ggscatter(Happiness_wscore, x = "mean_rating_year", y = "weighted_t_max",
          size = 0.3, color = "year", palette = "jco",
          add = "reg.line", conf.int = TRUE) +
  stat_cor(aes(color = year), method = "spearman", label.x = 5.5)
This plot shows that there is no correlation between maximum temperature and Happiness rating for all years.
ggscatter(Happiness_wscore, x = "mean_rating_year", y = "weighted_sun",
          size = 0.3, color = "year", palette = "jco",
          add = "reg.line", conf.int = TRUE) +
  stat_cor(aes(color = year), method = "spearman", label.x = 5.5)
This plot shows that there is no correlation between sunshine and Happiness ratings for all years.
ggscatter(Happiness_wscore, x = "mean_rating_year", y = "weighted_rain",
          size = 0.3, color = "year", palette = "jco",
          add = "reg.line", conf.int = TRUE) +
  stat_cor(aes(color = year), method = "spearman", label.x = 5.5)
This plot shows that there is no correlation between rainfall and Happiness rating for all years.
ggscatter(Happiness_wscore, x = "mean_rating_year", y = "weighted_af_inmth",
          size = 0.3, color = "year", palette = "jco",
          add = "reg.line", conf.int = TRUE) +
  stat_cor(aes(color = year), method = "spearman", label.x = 5.5)
This plot shows that there is no correlation between air frost days in a month and Happiness rating for all years.
Upon examining these plots, it appears that the weather variables may not significantly impact happiness. However, a more nuanced analysis is necessary to draw a conclusive inference. It is plausible that the effects of the weather variables are subtle and may not be adequately captured through simple correlation analysis. Further investigation is required to ascertain the true extent of their influence on happiness.
(ix) Which Weather Variables Contribute Most to the Happiness Score? - A Multivariate Analysis
Here I fit a multivariate generalised linear model (GLM) that includes all the weather variables, to see which variable affects happiness ratings the most. I first normalise the variables so that the estimates can be compared directly.
# Standardizing the variables using Z-score standardization
Happiness_wscore_normalized <- Happiness_wscore %>%
  mutate(across(c(weighted_t_max, weighted_t_min, weighted_sun, weighted_rain,
                  weighted_af_inmth, longitude, latitude),
                ~ scale(.) %>% as.vector))
This model only includes the normalised weighted weather variables.
# GLM using normalized variables, without location
rating_weather <- glm(mean_rating ~ weighted_t_max + weighted_t_min + weighted_sun +
                        weighted_rain + weighted_af_inmth,
                      data = Happiness_wscore_normalized)

# Displaying model summary
summary(rating_weather)
Call:
glm(formula = mean_rating ~ weighted_t_max + weighted_t_min +
weighted_sun + weighted_rain + weighted_af_inmth, data = Happiness_wscore_normalized)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.400857 0.004237 1746.775 < 2e-16 ***
weighted_t_max 0.052493 0.014509 3.618 0.000305 ***
weighted_t_min -0.118738 0.024462 -4.854 1.32e-06 ***
weighted_sun 0.005282 0.006961 0.759 0.448094
weighted_rain 0.030960 0.008792 3.522 0.000440 ***
weighted_af_inmth -0.086377 0.018473 -4.676 3.15e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for gaussian family taken to be 0.0313749)
Null deviance: 56.044 on 1747 degrees of freedom
Residual deviance: 54.655 on 1742 degrees of freedom
(4 observations deleted due to missingness)
AIC: -1082.5
Number of Fisher Scoring iterations: 2
This model includes the normalised weighted weather variables and the survey coordinates.
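The chunk that fits this second model is not shown above; a sketch consistent with the printed Call in the output below is given here (the object name rating_weather_location is illustrative).

# GLM using normalized variables, with the survey coordinates added
# (sketch matching the Call shown in the output below).
rating_weather_location <- glm(mean_rating ~ weighted_t_max + weighted_t_min + weighted_sun +
                                 weighted_rain + weighted_af_inmth + longitude + latitude,
                               data = Happiness_wscore_normalized)

# Displaying model summary
summary(rating_weather_location)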
Call:
glm(formula = mean_rating ~ weighted_t_max + weighted_t_min +
weighted_sun + weighted_rain + weighted_af_inmth + longitude +
latitude, data = Happiness_wscore_normalized)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.401137 0.004163 1777.872 < 2e-16 ***
weighted_t_max 0.043031 0.014389 2.991 0.00282 **
weighted_t_min -0.061980 0.028437 -2.180 0.02942 *
weighted_sun 0.023660 0.007331 3.227 0.00127 **
weighted_rain 0.013522 0.008910 1.518 0.12929
weighted_af_inmth -0.046404 0.021574 -2.151 0.03162 *
longitude -0.048291 0.006023 -8.017 1.97e-15 ***
latitude 0.015589 0.009907 1.574 0.11577
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for gaussian family taken to be 0.0302854)
Null deviance: 56.044 on 1747 degrees of freedom
Residual deviance: 52.697 on 1740 degrees of freedom
(4 observations deleted due to missingness)
AIC: -1142.3
Number of Fisher Scoring iterations: 2
(x) Conclusion
Warmer days (t_max) show a positive association with happiness, consistent with a general preference for warmth. On the other hand, the stronger negative effect associated with minimum temperature (t_min) points to a marked aversion to cold among the UK population; the happiness score appears to be influenced more by the discomfort of low temperatures than by the pleasure of warm ones.
Research has already demonstrated sunlight's benefits for mood, through its essential role in vitamin D synthesis and the regulation of circadian rhythm. The fact that sunshine became statistically significant only after location was added to the model suggests that people living in different parts of the UK value sunshine differently as a contributor to their overall happiness.
The initial model suggested a positive link between rainfall and higher happiness scores, hinting at a favourable influence of rainfall on well-being. Yet introducing geographical location into the analysis revealed a critical nuance: location is a significant confounding factor. Individuals residing in the western regions, where rainfall is more frequent due to prevailing weather patterns, report higher levels of happiness. This suggests that the apparent correlation between rainfall and happiness is not what it seems; rather, it is the geographic location, with its particular environmental, social, and cultural context, that appears to play the pivotal role in shaping happiness scores, not rainfall per se.
The negative relationship with longitude exposes an interesting east-west gradient in happiness across the UK, which might reflect climatic, socio-economic, or cultural differences in how weather is experienced. This indicates that, even across a relatively small area such as the UK, regional differences still matter in shaping how weather conditions affect happiness.
The absence of any significant latitude effect is extremely interesting: north-south positioning does not appear to influence mood across the UK. It is intriguing that it is longitude, the east-west positioning of the populace, that exerts a significant influence on happiness, and this signal remains even after correcting for all included weather variables. Assuming no other confounding factors, policies and interventions in well-being should therefore take into account the longitude (east-west position) rather than the latitude of the populace.
While the models contribute usefully to our understanding of UK happiness, the effect sizes of these variables remain comparatively small even with the inclusion of location. The unexplained variance points to a myriad of other factors, such as individual characteristics, indoor environments, and seasonal patterns. This highlights the need for further data to fully understand the drivers of happiness within the UK.
On the whole, this evidence suggests that weather does have a measurable impact on happiness, but the impact is relatively small given the effect sizes, even after taking geographical position into account. The significance of temperature (both maximum and minimum), sunshine (in the second model), and frost days indicates that weather plays a role in determining happiness, but it is far from the dominant factor.
Part 4(b) : Automate the Analysis Workflow
Try to automate parts 1 to 3 as much as possible. Write scripts and/or programs that follow each of the steps you did to find your answers to parts 1 to 3. Try to parameterise the scripts so that the whole process could be easily re-run with a small difference, say, by dividing the UK into four equal regions instead of three.
Submit your programs/scripts in a zip file (or similar) and include in your report a brief description of how they can be run and what they do.
I trust that this document answers this part in full: it contains all the scripts used, which are explained thoroughly in the main text and in the in-line comments of the Quarto file.