Tukey’s Test
Tukey’s test is a post-hoc analysis often employed after performing an analysis of variance (ANOVA). It helps to identify which specific pairs of groups have significant differences in their means.
To perform this test in Python, you can use the code below:
import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd
# One array of 'Processing Time' values per year (assumes `dataset` has a DatetimeIndex and `years` is already defined)
data_for_tukey = [dataset.loc[dataset.index.year == year, 'Processing Time'].values for year in years]
flattened_data = np.concatenate(data_for_tukey)
group_labels = np.concatenate([[year] * len(data) for year, data in zip(years, data_for_tukey)])
tukey_results = pairwise_tukeyhsd(flattened_data, group_labels)
print(tukey_results.summary())
Project-2: Resubmission
Descriptive Statistics of building permits dataset
|  | declared_valuation | total_fees | sq_feet | property_id | lat | long | issued_day | issued_month | issued_year | expiration_day | expiration_month | expiration_year | Processing Time |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| count | 5.559180e+05 | 5.559180e+05 | 5.559180e+05 | 555918.000000 | 555918.000000 | 555918.000000 | 555918.000000 | 555918.000000 | 555918.000000 | 555918.000000 | 555918.000000 | 555918.000000 | 555918.000000 |
| mean | 1.166402e+05 | 1.789933e+03 | 1.931383e+04 | 119792.063490 | 42.327782 | -71.085179 | 15.802295 | 6.611777 | 2016.522924 | 15.775057 | 6.397911 | 2017.041150 | 181.814998 |
| std | 4.087588e+06 | 3.199031e+05 | 1.341221e+07 | 99903.616517 | 0.033817 | 0.035732 | 8.727168 | 3.344205 | 3.881170 | 8.686594 | 3.532444 | 3.914937 | 108.693400 |
| min | -1.000000e+06 | -3.352000e+01 | 0.000000e+00 | 4.000000 | 42.230291 | -71.185400 | 1.000000 | 1.000000 | 2006.000000 | 1.000000 | 1.000000 | 1805.000000 | -79190.000000 |
| 25% | 1.500000e+03 | 3.300000e+01 | 0.000000e+00 | 49350.000000 | 42.299150 | -71.108261 | 8.000000 | 4.000000 | 2013.000000 | 8.000000 | 3.000000 | 2014.000000 | 181.000000 |
| 50% | 5.500000e+03 | 7.000000e+01 | 0.000000e+00 | 101083.000000 | 42.337731 | -71.076729 | 16.000000 | 7.000000 | 2017.000000 | 16.000000 | 6.000000 | 2017.000000 | 182.000000 |
| 75% | 2.000000e+04 | 1.900000e+02 | 0.000000e+00 | 150352.750000 | 42.353086 | -71.059224 | 23.000000 | 9.000000 | 2020.000000 | 23.000000 | 10.000000 | 2020.000000 | 183.000000 |
| max | 2.100000e+09 | 2.250152e+08 | 1.000000e+10 | 459501.000000 | 42.395173 | -70.994860 | 31.000000 | 12.000000 | 2023.000000 | 31.000000 | 12.000000 | 2024.000000 | 3838.000000 |
Interpretations:
- The mean declared valuation is approximately 116,640, with a standard deviation of 4,087,588. The minimum declared valuation is negative, potentially indicating errors or outliers. The 25th, 50th (median), and 75th percentiles are 1,500, 5,500, and 20,000, respectively. The maximum declared valuation is 2,100,000,000.
- The mean total fees are approximately $1,789.93, with a standard deviation of 319,903.10. The minimum total fees are negative, potentially indicating errors or outliers. The 25th, 50th (median), and 75th percentiles are 33, 70, and 190, respectively. The maximum total fees are 225,015,200.
- The mean square footage is approximately 19,313, with a very large standard deviation of approximately 13.4 million. The minimum square footage is 0, and the 25th, 50th (median), and 75th percentiles are also 0.
- The mean property ID is approximately 119,792, with a standard deviation of approximately 99,903. The minimum property ID is 4, and the 25th, 50th (median), and 75th percentiles are 49,350, 101,083, and 150,352.75, respectively. The maximum property ID is 459,501.
- The mean latitude is approximately 42.33, with a small standard deviation of approximately 0.03. The minimum latitude is 42.23, and the 25th, 50th (median), and 75th percentiles are 42.30, 42.34, and 42.35, respectively. The maximum latitude is 42.40.
- The mean longitude is approximately -71.09, with a small standard deviation of approximately 0.04. The minimum longitude is -71.19, and the 25th, 50th (median), and 75th percentiles are -71.11, -71.08, and -71.06, respectively. The maximum longitude is -70.99.
- The dataset includes information on the day, month, and year of permit issuance and expiration.
- The mean processing time is approximately 181.81 days, with a standard deviation of approximately 108.69 days. The minimum and maximum processing times are -79,190 and 3,838 days, respectively; these negative and extremely large values likely indicate errors or outliers.
Chosen dataset: Approved Building Permits
The dataset provides information on building permits issued by the City of Boston, spanning from 2015 to the present. It excludes permits in various states such as those being processed, denied, deleted, voided, or revoked. The dataset encompasses details related to several types of building permits, including but not limited to:
- Short Form Building Permit
- Electrical Permit
- Plumbing Permit
- Gas Permit
- Electrical Low Voltage
- Long Form/Alteration Permit
- Electrical Fire Alarms
- Certificate of Occupancy
- Excavation Permit
- Electrical Temporary Service
- Amendment to a Long Form
- Erect/New Construction
- Use of Premises
- Foundation Permit
The dataset consists of columns such as:
- Identifiers:
- object_id: Unique identifier for each record.
- permitnumber: Identification number assigned to the permit.
- Permit Information:
- worktype: Type of work associated with the permit (e.g., construction, electrical, plumbing).
- permittypedescr: Description of the permit type.
- Project Details:
- description: Description of the project associated with the permit.
- comments: Additional comments or notes related to the permit.
- Applicant and Owner Information:
- applicant: Name or information about the applicant for the permit.
- owner: Name or information about the owner of the property.
- Financial Details:
- declared_valuation: The declared valuation or estimated value of the project.
- total_fees: Total fees associated with the permit.
- Status and Type of Construction:
- status: Current status of the permit (e.g., open, closed).
- occupancytype: Type of occupancy associated with the construction project.
- Location Information:
- address: Street address of the construction project.
- city: City where the construction is taking place.
- state: State where the construction is taking place.
- zip: Zip code of the location.
- Geospatial Information:
- gpsy, gpsx: Geospatial coordinates of the location.
- lat, long: Latitude and longitude of the location.
- geom_2249, geom_4326: Geospatial information in different coordinate systems.
- Temporal Information:
- issued_day, issued_month, issued_year: Components of the issuance date.
- expiration_day, expiration_month, expiration_year: Components of the expiration date.
- Property Details:
- property_id, parcel_id: Identification numbers related to the property.
Link to the dataset: https://data.boston.gov/dataset/approved-building-
Steps and methods commonly used in time series forecasting:
- Data Collection and Exploration:
- Gather historical time-ordered data.
- Explore and visualize the data to understand its patterns and characteristics.
- Stationarity:
- Check for stationarity in the time series data. Stationary time series have constant statistical properties over time, making them easier to model.
- Decomposition:
- Decompose the time series into its components, such as trend, seasonality, and noise, to better understand its structure.
- Model Selection:
- Choose a suitable forecasting model based on the characteristics of the time series data. Common models include:
- ARIMA (AutoRegressive Integrated Moving Average): A popular model that combines autoregression, differencing, and moving averages.
- Exponential Smoothing (ETS): Another approach that considers error, trend, and seasonality components.
- Prophet: Developed by Facebook, it’s designed for forecasting with daily observations that display patterns on different time scales.
- Training the Model:
- Split the data into training and testing sets.
- Train the selected forecasting model using the training set.
- Validation:
- Validate the model’s performance using the testing set.
- Evaluate the model’s accuracy using metrics like Mean Absolute Error (MAE), Mean Squared Error (MSE), or Root Mean Squared Error (RMSE).
- Hyperparameter Tuning:
- Adjust the model’s hyperparameters to improve its performance.
- Forecasting:
- Use the trained model to make predictions for future time points.
- Evaluation:
- Evaluate the forecasting accuracy on new, unseen data.
- Iterate:
- Refine the model and repeat the process if necessary, especially if the characteristics of the time series change over time.
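As a minimal sketch of this workflow, assuming `y` is a monthly pandas Series with a DatetimeIndex (the series, its frequency, and the ARIMA order here are illustrative, not taken from any dataset in this post):

```python
import numpy as np
from statsmodels.tsa.stattools import adfuller
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.tsa.arima.model import ARIMA

# 1. Stationarity check with the Augmented Dickey-Fuller test
adf_stat, p_value, *_ = adfuller(y.dropna())
print(f"ADF p-value: {p_value:.4f}")  # p < 0.05 suggests the series is stationary

# 2. Decompose into trend, seasonal, and residual components
seasonal_decompose(y, model="additive", period=12).plot()

# 3. Split chronologically, fit a simple ARIMA model, and forecast the hold-out period
train, test = y[:-12], y[-12:]
forecast = ARIMA(train, order=(1, 1, 1)).fit().forecast(steps=len(test))

# 4. Validate with Mean Absolute Error
print("MAE:", np.mean(np.abs(np.asarray(test) - np.asarray(forecast))))
```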
SARIMA
The components of a SARIMA model include:
- Seasonal Component (S): Represents the repeating patterns in the data.
- AutoRegressive Component (AR): Captures the relationship between an observation and several lagged observations in the same time series.
- Integrated Component (I): Represents the differencing needed to make the time series stationary.
- Moving Average Component (MA): Captures the relationship between an observation and a residual error from a moving average model.
The notation for a SARIMA model is SARIMA(p, d, q)(P, D, Q)m, where:
- p, d, q are the non-seasonal parameters.
- P, D, Q are the seasonal parameters.
- m is the number of time steps in each season.
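As a small sketch, a SARIMA(1, 1, 1)(1, 1, 1)12 model can be fit with statsmodels' SARIMAX class; `y` is assumed to be a monthly pandas Series, and the orders and seasonal period are illustrative:

```python
from statsmodels.tsa.statespace.sarimax import SARIMAX

sarima_model = SARIMAX(
    y,
    order=(1, 1, 1),              # non-seasonal (p, d, q)
    seasonal_order=(1, 1, 1, 12)  # seasonal (P, D, Q, m) with m = 12 for monthly data
)
sarima_results = sarima_model.fit(disp=False)
print(sarima_results.summary())
print(sarima_results.forecast(steps=12))  # forecast the next 12 periods
```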
Applications of SARIMA models include:
- Demand Forecasting: SARIMA models are commonly used in retail and supply chain management for forecasting demand, especially when there are clear seasonal patterns.
- Financial Time Series Analysis: SARIMA models can be applied to predict stock prices, exchange rates, and other financial indicators that may exhibit seasonality.
- Energy Consumption Forecasting: SARIMA models are used to forecast energy consumption, considering the seasonality in energy usage patterns.
- Weather Forecasting: SARIMA models can be applied to time series data in meteorology to forecast temperature, precipitation, and other weather-related variables with seasonal patterns.
- Economic Indicators: SARIMA models can be used to forecast economic indicators such as unemployment rates, GDP, and inflation, which often exhibit seasonal patterns.
- Healthcare: SARIMA models can be applied to predict patient admission rates or disease outbreaks, as many health-related phenomena show seasonal patterns.
- Web Traffic Prediction: SARIMA models can be used to forecast website traffic, considering daily or weekly patterns in user visits.
ARIMA
- Components of ARIMA: ARIMA is an acronym for AutoRegressive Integrated Moving Average. It consists of three main components: AutoRegressive (AR), Integrated (I), and Moving Average (MA). The AR component represents the autoregressive part, the I component represents differencing to make the time series stationary, and the MA component represents the moving average part.
- Stationarity Requirement: ARIMA models assume that the time series is stationary, meaning that its statistical properties (such as mean and variance) do not change over time. If the time series is not stationary, differencing is applied to make it stationary before fitting the ARIMA model.
- ARIMA(p, d, q): The notation ARIMA(p, d, q) represents the order of the ARIMA model, where “p” is the order of the autoregressive part, “d” is the degree of differencing, and “q” is the order of the moving average part. For example, ARIMA(1, 1, 1) indicates a model with an autoregressive order of 1, a differencing of order 1, and a moving average order of 1.
- Seasonal ARIMA (SARIMA): SARIMA is an extension of ARIMA that incorporates seasonality. It includes additional seasonal terms (P, D, Q) to account for periodic patterns in the time series. The notation for SARIMA is SARIMA(p, d, q)(P, D, Q)s, where “s” represents the seasonal period.
- Model Selection: Choosing the appropriate values for the ARIMA parameters (p, d, q) can be done through model selection techniques such as grid search, where different combinations of parameters are evaluated based on their performance in terms of model fit and prediction accuracy. Additionally, diagnostic tools like ACF (AutoCorrelation Function) and PACF (Partial AutoCorrelation Function) plots are often used to identify the order of the ARIMA model.
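A rough sketch of the two order-selection approaches mentioned above, assuming `y` is a pandas Series (the search ranges are illustrative):

```python
import itertools
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.tsa.arima.model import ARIMA

# Diagnostic plots: significant PACF lags suggest p, significant ACF lags suggest q
fig, axes = plt.subplots(1, 2, figsize=(10, 3))
plot_acf(y.dropna(), ax=axes[0])
plot_pacf(y.dropna(), ax=axes[1])
plt.show()

# Simple grid search over (p, d, q), keeping the fit with the lowest AIC
best_aic, best_order = float("inf"), None
for p, d, q in itertools.product(range(3), range(2), range(3)):
    try:
        aic = ARIMA(y, order=(p, d, q)).fit().aic
        if aic < best_aic:
            best_aic, best_order = aic, (p, d, q)
    except Exception:
        continue  # skip orders that fail to converge
print(f"Best order by AIC: {best_order} (AIC = {best_aic:.1f})")
```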
Tips for analyzing time series data
A few tips on analyzing time-series data:
- Data Preprocessing:
- Clean and preprocess your data thoroughly. Handle missing values and outliers appropriately.
- Ensure that your time series is stationary, as many time series models assume this. If not, consider differencing the data.
- Visualize the Time Series:
- Plot your time series data to visually identify trends, seasonality, and any noticeable patterns.
- Decompose the time series into its components to better understand the underlying structure.
- Autocorrelation Analysis:
- Examine autocorrelation and partial autocorrelation functions to identify lags where the series correlates with itself.
- This analysis helps in choosing the appropriate lag order for autoregressive and moving average components in models like ARIMA.
- Model Selection and Validation:
- Choose an appropriate time series model based on the characteristics of your data. Consider ARIMA, SARIMA, or machine learning models like LSTM.
- Split your data into training and testing sets for model validation. Validate the model’s performance using metrics like MSE, RMSE, and MAE.
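To make the validation step concrete, here is a small self-contained sketch of a chronological train/test split with the error metrics above; the naive last-value forecaster is just a stand-in for whichever model (ARIMA, SARIMA, LSTM) is actually used:

```python
import numpy as np

def naive_forecast(train, horizon):
    # "last value" baseline, used only to keep the sketch self-contained
    return np.repeat(np.asarray(train)[-1], horizon)

def evaluate_forecast(y, horizon=12):
    train, test = np.asarray(y[:-horizon]), np.asarray(y[-horizon:])
    preds = naive_forecast(train, horizon)   # swap in real model predictions here
    errors = test - preds
    mae = np.mean(np.abs(errors))            # Mean Absolute Error
    rmse = np.sqrt(np.mean(errors ** 2))     # Root Mean Squared Error
    return mae, rmse
```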
Understanding the dataset
Tourism
1. Passenger Traffic at Logan:
– Variable: logan_passengers
– Description: Number of domestic and international passengers at Logan Airport
2. International Flights at Logan
– Variable: logan_intl_flights
– Description: Total international flights at Logan Airport
Hotel Market
3. Occupancy Rate
– Variable: hotel_occup_rate
– Description: Hotel occupancy for Boston
4. Average Daily Rate
– Variable: hotel_avg_daily_rate
– Description: Hotel average daily rate for Boston
Labor Market
5. Total Jobs
– Variable: total_jobs
– Description: Total number of jobs
6. Unemployment Rate
– Variable: unemp_rate
– Description: Unemployment rate for Boston
7. Labor Force Participation Rate
– Variable: labor_force_part_rate
– Description: Labor force participation rate for Boston
Real Estate: Board Approved Development Projects (Pipeline)
8. Units
– Variable: pipeline_unit
– Description: Number of units approved
9. Total Development Cost
– Variable: pipeline_total_dev_cost
– Description: Total development cost of approved projects
10. Square Feet
– Variable: pipeline_sqft
– Description: Square feet of approved projects
11. Construction Jobs
– Variable: pipeline_const_jobs
– Description: Number of construction jobs
Real Estate Market: Housing
12. Foreclosure Petitions
– Variable: foreclosure_pet
– Description: Number of foreclosure house petitions
13. Foreclosure Deeds
– Variable: foreclosure_deeds
– Description: Number of foreclosure house deeds
14. Median Housing Sales Price
– Variable: med_housing_price
– Description: Median housing sales price
15. Housing Sales Volume
– Variable: housing_sales_vol
– Description: Number of houses sold
16. New Housing Construction Permits
– Variable: new_housing_const_permits
– Description: Number of new housing construction permits
17. New Affordable Housing Unit Permits
– Variable: new-affordable_housing_permits
– Description: Number of new affordable construction permits
Project 2: The toll of police shootings in the United States
GLMs Model (11-10-23)
GLMs are extensions of linear regression and are commonly used for regression and classification tasks. The main idea behind GLMs is to model the relationship between a response variable and one or more predictor variables while allowing for different probability distributions and link functions.
The link function is used to connect the linear predictor to the mean of the response variable. It can be chosen based on the nature of the data and the relationships that we want to model. For this data, I have chosen “Identity”. IRLS (Iteratively Reweighted Least squares) is the optimization method I used to estimate the coefficients.
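A minimal sketch of such a model in statsmodels, which estimates GLMs by IRLS; the dataframe `df` and the subset of column names used here are assumptions based on the predictors discussed below, not the full 24-predictor design:

```python
import statsmodels.api as sm

predictors = ["age", "latitude", "gender_M", "race_A", "body_camera_False"]  # assumed column names
X = sm.add_constant(df[predictors])  # add the intercept column
# The Gaussian family uses the identity link by default; .fit() runs IRLS
glm_results = sm.GLM(df["yll"], X, family=sm.families.Gaussian()).fit()
print(glm_results.summary())
```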
Interpretations:
Intercept (0.5468): This is the expected value of the response variable “yll” when all predictor variables are zero. In this context, it represents the baseline or starting value of “yll” when all other factors are absent or equal to zero.
age (-1.0039): For each one-unit increase in the “age” of the individual, the expected value of “yll” is expected to decrease by approximately 1.0039 units. This suggests that older individuals tend to have lower values of “yll.”
latitude (0.0007): A one-unit increase in “latitude” is associated with an increase of 0.0007 units in the expected value of “yll.” This means that moving north (increasing latitude) is associated with a slight increase in “yll.”
gender_M (-0.4839): If the gender of the individual is male (“gender_M” is 1), the expected value of “yll” is expected to be approximately 0.4839 units lower compared to a female (“gender_M” is 0), holding all other variables constant.
race_A (0.6247): Individuals of race “A” are expected to have a “yll” value that is approximately 0.6247 units higher than individuals of other races (e.g., compared to “race_W,” “race_H,” and “race_B”), holding all other variables constant.
body_camera_False (0.0010): If “body_camera_False” is true (1), it’s associated with an expected increase of 0.0010 units in “yll” compared to when “body_camera_False” is false (0), holding all other variables constant. This suggests that when body cameras are not used, “yll” is slightly higher.
Degrees of Freedom Residual (df residual = 5169): This represents the number of data points that can vary freely once the model's parameters have been estimated. For this model, 5,169 degrees of freedom remain after fitting, which suggests that the model is relatively simple in terms of parameter estimation, given the large dataset size.
Degrees of Freedom Model (df model = 24): This indicates the number of parameters or predictor variables in the GLM. We have 24 predictor variables included in the model.
Resubmission of final report for project 1
Reducing the cardinality in the variable “armed”
Here’s a quick tip on reducing the categories in the feature “armed”.
Cardinality refers to the number of distinct values in a feature. We can take the unique values of every categorical feature and plot a bar plot to see the cardinality. This is important because we want to avoid curse-of-dimensionality problems while modeling.
Cardinality of Armed is as follows:
From the above plot, it is evident that we have more than 80 categories in the feature armed. One-hot encoding this variable would add more than 80 columns to our dataset, which is very inefficient. We cannot simply drop the feature, because it may carry important information for modeling.
To address this problem, we can group the categories to reduce the dimensionality significantly while preserving the relevant information.
I have grouped the categories in the following manner.
From 80+ unique categories, I have reduced to 8: “Firearms”, “Edged_Weapons”, “Blunct_Objects”, “Tools_and_Construction_Items”, “Improvised_Weapons”, “Non_lethal_Weapons”, “Not_Weapons”, and “Miscellaneous_Weapons”.
Now we can check the cardinality of all our categorical features.
By doing things in this way, we can reduce the dimensions generated while converting categorical to numerical data while maintaining all the important information.
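A sketch of how such a grouping can be implemented, assuming the data is in a dataframe `df` with an "armed" column; the raw values mapped below are illustrative, while the eight group names come from this post:

```python
# Map raw "armed" values to coarser groups; anything unmapped falls into the catch-all group
armed_groups = {
    "gun": "Firearms",
    "knife": "Edged_Weapons",
    "baseball bat": "Blunct_Objects",
    "hammer": "Tools_and_Construction_Items",
    "flashlight": "Improvised_Weapons",
    "taser": "Non_lethal_Weapons",
    "unarmed": "Not_Weapons",
}
df["armed_grouped"] = df["armed"].map(armed_groups).fillna("Miscellaneous_Weapons")
print(df["armed_grouped"].nunique())  # cardinality after grouping
```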
About clustering
- Clustering is a technique in machine learning and data analysis that groups similar data points based on their features without requiring labeled data.
- The goal of clustering is to create distinct clusters where data points within the same cluster are more similar to each other than to those in other clusters.
- Common clustering algorithms include K-Means, Hierarchical Clustering, and DBSCAN, each with its approach to measuring similarity between data points.
- Various evaluation methods and metrics are used to assess clustering results, including Silhouette Score, Davies-Bouldin Index, and Dunn Index for internal evaluation, as well as metrics like Adjusted Rand Index, Normalized Mutual Information, and Fowlkes-Mallows Index for external evaluation if ground truth is available.
- Visual inspection techniques, such as scatter plots, t-SNE, and PCA, can help provide insights into the quality of clusters.
- In some cases, domain-specific evaluation may be necessary to assess the clusters’ utility for solving real-world problems.
- The choice of evaluation metrics should consider the data’s characteristics and the analysis goals, often involving a combination of metrics and visual inspection for a comprehensive assessment.
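A quick, self-contained illustration of K-Means together with two of the internal metrics mentioned above, on synthetic data rather than any dataset from this blog:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score, silhouette_score

X, _ = make_blobs(n_samples=500, centers=4, random_state=42)
labels = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(X)
print("Silhouette score:", silhouette_score(X, labels))          # higher is better
print("Davies-Bouldin index:", davies_bouldin_score(X, labels))  # lower is better
```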
Variance of years of life lost over time for each racial group
Interpretations:
- It is clear that the variance is not consistent over the years.
- The variance for the Other racial group increased significantly from 2017 to 2020 and then dropped sharply in 2022.
- The White, non-Hispanic group shows low fluctuations in years of life lost, and its variance is steadily decreasing.
- We can see repeated patterns in the variance for the Asian racial group from 2015 to 2021.
Means of each racial group over time
Interpretations:
- The mean years of life lost are similar within the Asian-Hispanic pair of racial groups and within the White-Black pair.
- The trend lines for the White and Black racial groups track each other closely from 2015 to 2021 and are significantly lower at the start of 2022.
- The mean years of life lost is significantly higher for the Asian racial group, with the highest peaks recorded in 2017 and 2021.
- There is also a significant drop in the mean years of life lost, reaching the minimum among the groups in 2021 and increasing slightly at the start of 2022.
YLL – Armed Feature
About the plot:
- The variable race_index holds the armed categories, ordered by the total sum of years of life lost in descending order.
- The variable values holds the total YLL for each armed category.
- When plotting, I applied a logarithm to the YLL values to make the plot visually apparent, especially since some categories have much higher YLL than others. This helps in identifying trends and patterns more clearly.
- So the plot represents Armed categories on the x-axis and log transformed values of YLL on the y-axis.
Interpretations:
- The gun category has by far the highest number of years of life lost which indicates that fatal police shootings involving guns result in a significantly higher loss of potential years of life compared to other categories.
- Categories such as knife, machete, etc., also have notable YLL values. This suggests that incidents involving edged weapons, although not as high as firearms, are associated with substantial losses of potential years of life.
- It’s notable that the unarmed category also contributes significantly to the YLL, indicating that even when individuals are unarmed, there can be a considerable loss of potential years of life in fatal police shootings. This finding emphasizes the importance of considering the consequences of unarmed incidents.
YLL – Race
Interpretations:
- The group with the highest “years of life lost” is ‘W’ (White), with a value of 89,715.1. This means that individuals in the White category account for the highest total number of years of life lost due to fatal police shootings.
- The group with the lowest “years of life lost” is ‘O’ (Other), with a value of 766.60004. Individuals in the ‘O’ category account for the lowest total years of life lost.
- The group that is most affected, in terms of experiencing the highest impact of fatal police shootings, is ‘W’ (White). This group has the highest number of years of life lost. Conversely, the ‘O’ (Other) group is the least affected in terms of years of life lost in this context.
- The ‘H’ group, representing Hispanic individuals, follows with a value of 38,728.203. Hispanic individuals experience the third-highest number of years of life lost among the groups represented in the data, making them the third most affected group.
- The ‘B’ group, representing Black individuals, has the second-highest “years of life lost” with a value of 48,043.8. This indicates that Black individuals experience a significant number of years of life lost due to fatal police shootings, making them the second most affected group after ‘W’ (White).
- The ‘A’ group, representing Asian individuals, has one of the lowest values, with 4,637.4 years of life lost. This means that Asian individuals experience relatively little impact in terms of total years of life lost due to fatal police shootings.
YLL – Year
In the earlier posts, I have talked about years of life lost. To summarise:
- I created a new column called “YLL” (Years of Life Lost) using the life expectancy estimates for the year 2021.
- I dropped rows containing NaN or None values in the race, gender, age, armed, flee, longitude, and latitude columns.
- Calculated Years of life lost by matching the gender and race to the life expectancy estimates for the year 2021.
Years of life lost vs year:
- The data is organized by grouping it according to the year.
- The “total years of life lost” is calculated with the “yll” column being used to track the yearly variations in years of life lost.
- Prior to grouping the data by year, the “date” column is removed to ensure smooth “group by” operations. This step is essential for data cleaning and analysis.
- The analysis is visualized through the bar plot, depicting the trend in years of life lost over the years.
- From the above plot, we can see a consistent decrease in years of life lost over the years.
- Notably, in 2022, the years of life lost have dropped to approximately 5000, indicating a positive change in this metric for that year.
- Fatal shootings were reduced by nearly 63.09% by 2022.
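A sketch of the grouping and plot described above, assuming a dataframe `df` with a datetime "date" column and the computed "yll" column:

```python
import matplotlib.pyplot as plt

yll_by_year = df.assign(year=df["date"].dt.year).groupby("year")["yll"].sum()
yll_by_year.plot(kind="bar")
plt.xlabel("Year")
plt.ylabel("Total years of life lost")
plt.show()
```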
YLL – Part 2 (Generalized Linear Mixed Model)
Generalized Linear Mixed Models (GLMMs) are a statistical modeling technique that combines elements of Generalized Linear Models (GLMs) and mixed effects models:
- GLMMs are used to analyze data with non-normal distributions, correlations, or hierarchical structures.
- They incorporate fixed effects for modeling relationships with predictors and random effects to account for unexplained variability, often found in clustered or repeated-measures data.
- GLMMs are effective for data with nested or hierarchical structures, such as repeated measurements within individuals or groups.
- Parameters are estimated using maximum likelihood or restricted maximum likelihood, providing a robust and flexible framework for statistical inference.
- Like GLMs, GLMMs employ link functions to relate the response variable to the predictors and random effects, depending on the type of data being modeled.
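Statsmodels does not expose a single general-purpose GLMM class, but as a simplified, hedged illustration the Gaussian special case (a linear mixed model) can be fit as below; the dataframe `df`, the "age" fixed effect, and the "state" grouping column are assumptions for the sketch, not the model actually used in this project:

```python
import statsmodels.formula.api as smf

# Random intercept for each state, fixed effect for age (illustrative formula)
mixed_results = smf.mixedlm("yll ~ age", data=df, groups=df["state"]).fit()
print(mixed_results.summary())
```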
Applications of Generalized Linear Mixed Models (GLMMs) in police fatal shootings:
- GLMMs can be used to investigate whether there are demographic disparities in police fatal shootings, specifically focusing on factors such as race, gender, age, and socioeconomic status. This analysis can provide insights into potential biases in law enforcement actions.
- They can be employed to examine temporal patterns in police fatal shootings, including the analysis of trends over time, seasonality, and day-of-week effects. Understanding temporal patterns can help identify periods of increased risk and inform law enforcement strategies.
- Spatial GLMMs can be used to analyze the geographic distribution of police fatal shootings, identifying spatial clusters or areas with higher incident rates. This information can guide resource allocation and community policing efforts.
- They can help identify and quantify risk factors and covariates associated with police fatal shootings. This may include factors such as the presence of weapons, mental health conditions, prior criminal history, and the characteristics of the officers involved.
- They can be applied to evaluate the impact of policy changes and reforms within law enforcement agencies on the occurrence of fatal shootings. By comparing data before and after policy changes, it becomes possible to assess the effectiveness of new practices and procedures in reducing fatal incidents.
Years of life lost
YLL (Years of Life Lost) is a metric that is used to quantify the impact of premature deaths on a population. It can be calculated as:
YLL = LE − A
Where:
- YLL represents the Years of Life Lost.
- LE is the life expectancy at the age at which the person died.
- A is the actual age at death.
I took the life expectancy values from the year 2021 for different races and genders.
By using the data frame “life_expectancy_2021_df”, I have calculated the YLL:
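A sketch of that calculation, assuming `df` is the shootings dataframe and `life_expectancy_2021_df` has "race", "gender", and "life_expectancy" columns (the column names are assumptions):

```python
# Match each record to the 2021 life expectancy estimate for its race and gender
merged = df.merge(life_expectancy_2021_df, on=["race", "gender"], how="left")
merged["yll"] = merged["life_expectancy"] - merged["age"]  # YLL = LE - A
```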
I will post my findings in the next post!!!
Gun shot trends by year
Shaping the dataset:
- To see the gunshot trend by year, we group the dataset by year and count the number of rows using the unique identifier “id”.
- We can build interactive plots using the “Plotly” library. Each line is a “trace” that is added to the figure object.
- We can save the plot as an image file or an HTML file.
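A minimal Plotly sketch of those steps, assuming a dataframe `df` with a datetime "date" column and an "id" column:

```python
import plotly.graph_objects as go

counts_by_year = df.groupby(df["date"].dt.year)["id"].count()

fig = go.Figure()
fig.add_trace(go.Scatter(x=counts_by_year.index, y=counts_by_year.values,
                         mode="lines+markers", name="Fatal shootings"))
fig.update_layout(title="Fatal police shootings per year",
                  xaxis_title="Year", yaxis_title="Count")
fig.write_html("shootings_by_year.html")  # saving as an image instead requires fig.write_image()
```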
Interpretations from the above line plot:
- From the above plot, we can see that the number of victims killed increased nearly every year.
- The lowest count was in 2016, at just under 1,000 people, and it rose to nearly 1,050 people in 2021.
- From 2021 onward, we can observe a downward trend.
Descriptive Statistics of the Fatal Police Shootings dataset
Descriptive Statistics of the fatal police shooting dataset:
- id: ID represents a unique identifier for each fatal police shooting incident. It allows us to uniquely reference and track individual incidents. The range of IDs from 3 to 8696 suggests that there have been 8002 unique incidents recorded in the dataset, with no missing or duplicate IDs.
- date: The date column records the date and time of each fatal police shooting incident. The dataset spans from Jan 2, 2015, to Dec 1, 2022. The mean date, around Jan 12, 2019, indicates the central tendency of the incident dates. 25% of the incidents occurred before January 18, 2017, and 75% before January 21, 2021.
- age: The age column represents the age of the victim at the time of the fatal police shooting. The ages of victims in the dataset range from 2 to 92 years old. The mean age is 37.209, which signifies the average age of the victims. The 25th and 75th percentiles provide insight into the age distribution, with 25% of victims being 27 years old or younger and 75% being 45 years old or younger. The variability in victim ages is given by the standard deviation, which is around 12.979.
- longitude: This column contains the longitude coordinates of the locations where the fatal police shootings occurred. The data spans a wide range of longitudes, from approximately -160.007 to -67.867. The mean longitude, around -97.041, represents the central location. 25% of incidents occurred west of -112.028 and 75% west of -83.152. The standard deviation of around 16.525 indicates the dispersion of incident locations along the longitude axis.
- latitude: This column represents the latitude coordinates of the locations where the fatal police shootings occurred. Latitude values vary from approximately 19.498 to 71.301. The mean latitude, around 36.676, represents the central location. 25% of incidents occurred south of 33.480 and 75% south of 40.027. The standard deviation of around 5.380 indicates the dispersion of incident locations along the latitude axis.
About police shootings dataset
The “Fatal Force Database,” initiated by The Washington Post in 2015, is a comprehensive effort that meticulously tracks and documents police-involved killings in the United States. Focusing exclusively on cases where law enforcement officers, while on duty, shoot and kill civilians, it provides essential data, including the racial background of the deceased, the circumstances of the shooting, whether the individual was armed, and if they were experiencing a mental health crisis. Data collection involves sourcing information from various channels such as local news reports, law enforcement websites, social media, and independent databases like Fatal Encounters. Notably, in 2022, the database underwent an update to standardize and publish the names of the involved police agencies, enhancing transparency and accountability at the department level. This dataset, distinct from federal sources like the FBI and CDC, has consistently documented more than double the number of fatal police shootings since 2015, emphasizing a critical data gap and the need for comprehensive tracking. Continually updated, it remains a valuable resource for researchers, policymakers, and the public, offering insight into police-involved shootings, promoting transparency, and contributing to ongoing discussions about police accountability and reform.
Project 1: CDC Diabetes: The effects of social determinants of health on diabetes
Post Hoc Tests
Post hoc tests are follow-up tests conducted after an initial statistical test (typically ANOVA or t-test) when there are multiple groups or conditions being compared. These tests are performed to identify which specific group(s) or condition(s) differ from each other. Post hoc tests are essential because they help avoid making type I errors (false positives) when conducting multiple pairwise comparisons.
Common post hoc tests include:
- Tukey’s Honestly Significant Difference (HSD): This test is commonly used after ANOVA to identify which groups have significantly different means.
- Bonferroni correction: This method adjusts the significance level to account for multiple comparisons. It is often used with t-tests or ANOVA.
- Duncan’s multiple range test: Similar to Tukey’s HSD, this test identifies significantly different groups after ANOVA.
- Holm-Bonferroni method: Another method for controlling the familywise error rate when conducting multiple comparisons.
- Scheffé’s method: A conservative post hoc test used when the assumptions of ANOVA are met.
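For the p-value-adjustment methods in this list, a short sketch with statsmodels (the p-values below are made up for illustration):

```python
from statsmodels.stats.multitest import multipletests

pvalues = [0.001, 0.020, 0.049, 0.300]  # hypothetical pairwise p-values
for method in ("bonferroni", "holm"):
    reject, adjusted, _, _ = multipletests(pvalues, alpha=0.05, method=method)
    print(method, reject, adjusted.round(3))
```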
Types of t-test
Independent Samples t-test:
Also known as the two-sample t-test, this statistical method is employed to ascertain whether there is a significant difference between the means of two separate and unrelated groups. This test assumes that both groups are independent and exhibit a normal distribution, with roughly equivalent variances.
Paired Samples t-test:
Also known as the dependent-samples t-test or paired t-test, this analytical tool is applied when dealing with two sets of interrelated or paired data points. Its purpose is to evaluate whether the mean of the differences between paired observations is significantly different from zero. Typically, this test is utilized in scenarios such as before-and-after studies or when comparing two related treatments or conditions.
One-Sample t-test:
This statistical technique is utilized to compare the mean of a single sample against a known or hypothesized population mean. Its objective is to determine whether the sample mean significantly differs from the hypothesized population mean.
Welch’s t-test:
This method resembles the independent samples t-test; however, it does not make the assumption of equal variances between the two groups. It proves valuable when the assumption of equal variances is violated, and there is an unequal sample size or variance between the two groups.
Student’s t-test for Equal Variance (Pooled t-test):
A variant of the independent samples t-test, this test operates under the assumption of equal variances in the two groups. It is chosen when the assumption of equal variances is reasonable and holds true.
Student’s t-test for Unequal Variance (Unpooled t-test):
Much like Welch’s t-test, this test is employed when dealing with two independent samples without the assumption of equal variances between the groups. It addresses what is known as the Behrens-Fisher problem.
Two-Sample t-test for Proportions:
This statistical approach is used to compare the proportions of two unrelated groups. It finds application in scenarios where there is an interest in comparing the success rates or proportions between the two groups.
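Several of these variants map directly onto SciPy calls; a compact sketch on two made-up samples:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
a, b = rng.normal(10, 2, 50), rng.normal(11, 3, 50)

print(stats.ttest_ind(a, b))                   # independent samples (pooled, equal variances)
print(stats.ttest_ind(a, b, equal_var=False))  # Welch's t-test (unequal variances)
print(stats.ttest_rel(a, b))                   # paired samples t-test
print(stats.ttest_1samp(a, popmean=10))        # one-sample t-test
```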
PCA: Advantages and Disadvantages
Advantages of PCA:
- Dimensionality Reduction: PCA reduces the number of features in a dataset while preserving most of the original variance, making data analysis and visualization easier, especially for high-dimensional data.
- Decorrelation: PCA transforms original variables into uncorrelated principal components, addressing multicollinearity and reducing information redundancy.
- Interpretability: PCA highlights the most important features or dimensions in the data, helping identify key contributors to dataset variance, and making data more interpretable.
- Noise Reduction: PCA focuses on significant principal components, reducing the impact of noise and improving the signal-to-noise ratio.
- Visualization: PCA enables the visualization of high-dimensional data in lower dimensions (e.g., 2D or 3D), simplifying the exploration of data structure.
- Feature Engineering: PCA can be used to create new features that capture essential data patterns, which can be valuable for machine learning tasks.
Disadvantages of PCA:
- Information Loss: PCA may result in a loss of detail as less important dimensions are discarded during dimensionality reduction.
- Linearity Assumption: PCA assumes linear relationships between variables, which may not hold in datasets with nonlinear relationships.
- Interpretability of Components: The principal components generated by PCA can be challenging to interpret, especially when they lack clear physical or domain-specific meanings.
- Sensitivity to Scaling: PCA is sensitive to variable scales, requiring standardization (scaling to mean 0 and standard deviation 1) to avoid disproportionate influence.
- Computational Cost: PCA can be computationally expensive for large datasets with numerous variables, demanding significant time and memory resources.
- Non-Robust to Outliers: PCA is not robust to outliers, meaning that a few extreme values in the data can skew the results, necessitating preprocessing to handle outliers.
- Linear Combination of Variables: PCA components represent linear combinations of original variables, potentially failing to capture complex nonlinear relationships in the data.
Principal Component Analysis
PCA is a dimensionality reduction technique. It takes a dataset with multiple features and transforms it into a new coordinate system, thereby simplifying the data while retaining its most critical aspects. Performing PCA involves a series of steps:
- To perform PCA, we need to center the data by subtracting the mean of each feature from the data points. This ensures that the new coordinate system is centered at the origin.
- Next, we calculate the covariance matrix of the centered data to summarize the relationships between the different features. The covariance matrix can be calculated as C = (1 / (n − 1)) Xcᵀ Xc, where Xc is the centered data and (n − 1) is used for an unbiased estimate of the covariance.
- After obtaining the covariance matrix, the next step is to calculate its eigenvectors and eigenvalues. The eigenvectors represent the directions (principal components) of maximum variance in the data, and the eigenvalues indicate the amount of variance explained by each component: C vᵢ = λᵢ vᵢ, where vᵢ is the ith eigenvector and λᵢ is the ith eigenvalue.
- To reduce the dimensionality of the data, you can select the top k eigenvectors (principal components) based on the corresponding eigenvalues. These k principal components capture the most variance in the data. The choice of k depends on the desired level of dimensionality reduction.
- Finally, we transform the original data into the new coordinate system defined by the selected principal components. This transformation is achieved by multiplying the centered data matrix by the matrix of selected principal components: Z = Xc Wₖ, where Wₖ is a matrix containing the top k eigenvectors and Z is the transformed data matrix.
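A from-scratch NumPy sketch of these steps on a random data matrix X (rows are observations, columns are features):

```python
import numpy as np

X = np.random.default_rng(0).normal(size=(100, 5))
Xc = X - X.mean(axis=0)                   # 1. center the data
C = (Xc.T @ Xc) / (Xc.shape[0] - 1)       # 2. covariance matrix
eigvals, eigvecs = np.linalg.eigh(C)      # 3. eigenvalues/eigenvectors of the symmetric matrix
order = np.argsort(eigvals)[::-1]         #    sort components by explained variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
k = 2
W_k = eigvecs[:, :k]                      # 4. keep the top k principal components
Z = Xc @ W_k                              # 5. project the data into the new coordinates
print(eigvals[:k] / eigvals.sum())        # fraction of variance explained by each kept component
```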
Best Practices and Common mistakes of Cross-validation
Best Practices:
- Cross-validation should be utilized to evaluate your model’s performance. For example, when training a classifier for spam email detection, apply k-fold cross-validation to assess its accuracy on various subsets of the email dataset.
- Selecting the right type of cross-validation for the specific dataset is important. In a medical study examining the effectiveness of a new drug, consider using time series cross-validation to account for changing patient responses over time.
- Shuffle the data to eliminate any potential order bias. For sentiment analysis of product reviews, ensure that the order of reviews is randomized before conducting cross-validation to ensure an equal representation of all sentiments in each fold.
- Evaluate the model’s performance using a range of metrics. For example, a fraud detection system, in addition to accuracy, takes into account precision (to minimize false positives), recall (to catch actual fraud cases), and F1-score (which balances precision and recall) to gauge its effectiveness.
Common Mistakes to Avoid:
- Avoid using information from the test/validation set during training. For instance, in a predictive maintenance scenario, refrain from using future sensor data from the test set to train the model, as this can artificially inflate its performance.
- Be vigilant about data leakage. In a stock price prediction task, steer clear of using financial indicators that would not have been available at the time of prediction. It’s a common mistake to use future stock price data for feature engineering, which can lead to data leakage.
- Don’t overlook class imbalance issues. When developing a model to detect rare diseases in a medical dataset, use stratified k-fold cross-validation to ensure that the model has a balanced representation of both diseased and non-diseased cases in each fold.
- Refrain from adjusting hyperparameters using the test set. For instance, when training a deep learning model for image classification, resist the temptation to modify the learning rate based on test set performance, as it can result in overfitting to the test data.
Resampling Methods: Cross-validation and Bootstrap
Cross-validation is commonly used in statistics and machine learning to assess model performance, build a more robust model, and draw reliable inferences from the data.
- To perform cross-validation:
- Divide the dataset into two parts: a training set and a validation (or test) set.
- Choose the number of folds (k) for cross-validation and split the training set into k equal subsets.
- Select a performance metric to evaluate our model.
- Repeat the following k times.
- Train the model using k-1 folds.
- Validate the model on the remaining fold and calculate the chosen performance metric each time.
- Calculate the average of the k performance metric values to assess the model's overall performance.
- To illustrate this, I have created random data, performed cross-validation, and compared the evaluation metric using boxplots.
- I have plotted the box plot for the evaluation metric to compare before and after applying the cross-validation.
- Getting a lower median after applying the cross-validation is a positive sign. It indicates that on average, the model’s predictions have improved after cross-validation and it helped to reduce bias and resulted in a more accurate model.
- However, a higher IQR implies greater variability in the model’s performance across different subsets of the data. This might indicate that the model’s performance is less consistent, possibly due to the presence of outliers or differences in data quality between folds.
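A small sketch of k-fold cross-validation with scikit-learn on synthetic data, mirroring the steps above:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=200, n_features=5, noise=10, random_state=0)
scores = cross_val_score(LinearRegression(), X, y, cv=5,
                         scoring="neg_mean_squared_error")
print("Per-fold MSE:", -scores)
print("Average MSE:", -scores.mean())
```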
Bootstrap is a resampling technique used for estimating the sampling distribution of a statistic by repeatedly resampling from the observed data. It’s particularly useful when we have a limited dataset and want to make inferences about population parameters or assess the uncertainty of your estimates.
- To perform bootstrap resampling:
- Randomly pick ‘n’ data points (with replacement) from your dataset. Repeat this process many times (usually thousands).
- Compute the statistic you’re interested in (mean, median, etc.) for each of these new samples. This creates a distribution of the statistics.
- Use the distribution of the statistic to draw conclusions about the population parameter of interest or to understand the uncertainty associated with the estimate.
- ‘X’ indicates data points that are generated by bootstrapping.
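A sketch of bootstrapping the mean of a sample and reading off a 95% percentile confidence interval; the data here is randomly generated purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=5, scale=2, size=100)  # placeholder observed sample

# Resample with replacement many times and record the statistic of interest
boot_means = np.array([rng.choice(data, size=len(data), replace=True).mean()
                       for _ in range(5000)])
ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])
print(f"Bootstrap 95% CI for the mean: ({ci_low:.2f}, {ci_high:.2f})")
```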
Mann-Whitney U Test (Wilcoxon Rank-Sum Test)
From the above plots, we can see that the distribution of diabetes from the inactivity dataset is not normally distributed. In order to know whether there is a significant difference between the two groups or not, we can use another statistical test called Mann-Whitney U Test (Wilcoxon Rank-Sum Test).
Wilcoxon Rank-Sum Test:
- The Mann-Whitney U Test does not rely on the assumption of a specific distribution in the data, so we can compare samples with non-normally distributed data.
- The test statistic in the Mann-Whitney U Test is the U statistic, which represents the sum of the ranks of the samples. The value of U depends on whether one sample tends to have larger values than the other or vice versa.
Hypotheses:
- H0 (Null Hypothesis): There is no significant difference between the two groups.
- H1 (Alternative Hypothesis): There is a significant difference between the two groups.
- Mann-Whitney U Statistic is relatively high (174721.0), which suggests that one of the groups has consistently larger values and higher ranks compared to the other group.
- A small p-value (2.5437718749751766e-18 < 0.05) indicates that the difference in rank sums between the two groups is statistically significant. We can conclude that there is a significant difference between the two groups.
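For reference, the test can be run with SciPy as below, assuming `group1` and `group2` hold the two sets of diabetes values being compared:

```python
from scipy.stats import mannwhitneyu

u_stat, p_value = mannwhitneyu(group1, group2, alternative="two-sided")
print(f"U statistic = {u_stat:.1f}, p-value = {p_value:.3g}")
```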
Hypothesis testing between distributions of diabetic values
- From the above plots, we can say that the distribution of diabetes from the obesity_diabetes dataset is approximately normal with a small peak to the right of the mean, whereas the distribution of diabetes from the inactivity_diabetes dataset is slightly right-skewed.
- By assuming that the basic assumptions of the t-test are true, I have performed the hypothesis testing as follows:
The assumptions of the t-test are:
- The observations within each group must be independent of each other.
- The data within each group should follow a normal distribution.
- The variances of the two groups should be approximately equal.
- The samples from the two groups should be randomly selected from the respective populations.
Hypothesis Testing:
- H0 (Null Hypothesis): There is no significant difference in the means of the two datasets.
- H1 (Alternative Hypothesis): There is a significant difference in the means of the two datasets.
Monte Carlo Simulation:
- The observed t-statistic of approximately -8.587 is a measure of how different the means of your two datasets are. The negative sign indicates that the mean of the first group (group1) is significantly lower than the mean of the second group (group2).
- The Monte Carlo estimated p-value of 0.0 represents the probability of obtaining a t-statistic as extreme as or more extreme than the observed t-statistic under the null hypothesis (i.e., no difference in means).
- A p-value of 0.0 indicates that the observed difference in means is highly unlikely to have occurred by random chance alone.
- With the obtained p-value of 0.0, we reject the null hypothesis (H0) and conclude that there is a significant difference in the means of the two datasets. In other words, the difference in means between group1 and group2 is statistically significant, and it is unlikely to be due to random variation.
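A sketch of the Monte Carlo (permutation) procedure: shuffle the group labels many times and count how often a t-statistic at least as extreme as the observed one appears by chance; `group1` and `group2` are assumed to be the two samples:

```python
import numpy as np
from scipy import stats

observed_t, _ = stats.ttest_ind(group1, group2)
pooled, n1 = np.concatenate([group1, group2]), len(group1)

rng = np.random.default_rng(0)
n_sim, count_extreme = 10_000, 0
for _ in range(n_sim):
    permuted = rng.permutation(pooled)
    t_sim, _ = stats.ttest_ind(permuted[:n1], permuted[n1:])
    if abs(t_sim) >= abs(observed_t):
        count_extreme += 1
print("Monte Carlo p-value:", count_extreme / n_sim)
```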
Multiple Linear Regression
Multiple Linear Regression is a statistical method used to model the relationship between a dependent variable and two or more independent variables. The basic form of multiple linear regression:
Y = β₀ + β₁X₁ + β₂X₂ + ... + βₖXₖ + ε
- Y is the dependent variable.
- X₁, X₂, …, Xₖ are the independent variables.
- β₀ is the intercept, representing the expected value of Y when all independent variables are set to 0.
- β₁, β₂, …, βₖ are the coefficients that represent the effect of each independent variable on Y.
- ε is the error term. This term accounts for the variability in the data that is not explained by the model.
The goal of multiple linear regression is to estimate the values of the coefficients that best fit the data. The least squares method aims to estimate the coefficients in the multiple linear regression model by minimizing the sum of squared differences between the predicted values and the actual values of the dependent variable. This is often referred to as the “sum of squared residuals” or “sum of squared errors”.
The function we aim to minimize is: Σ(Yᵢ - Ŷᵢ)², where:
- Yᵢ is the actual observed value of the ith data point.
- Ŷᵢ is the predicted value of the ith data point based on the regression model.
Matrix Notation:
- Matrix notation in multiple linear regression simplifies the mathematical representation of complex relationships between dependent and independent variables.
- It offers computational advantages, making it suitable for handling large datasets and high-dimensional models.
- Its standardized format enhances compatibility with statistical software, enabling seamless implementation and analysis, etc.
The multiple linear regression model can be expressed using matrix notation.
Y = Xβ + ε
Where:
- Y is the vector of observed values of the dependent variable.
- X is the matrix of independent variables (including a column of 1s for the intercept).
- β is the vector of coefficients.
- ε is the vector of errors or residuals.
To estimate the coefficients β:
β = (XᵀX)⁻¹XᵀY
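A small NumPy illustration of this estimator on randomly generated data (in practice `np.linalg.lstsq` or a library routine is preferred for numerical stability):

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(100), rng.normal(size=(100, 2))])  # intercept column + 2 predictors
true_beta = np.array([1.0, 2.0, -0.5])
Y = X @ true_beta + rng.normal(scale=0.1, size=100)

beta_hat = np.linalg.inv(X.T @ X) @ X.T @ Y  # β = (XᵀX)⁻¹XᵀY
print(beta_hat)                              # should be close to [1.0, 2.0, -0.5]
```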
Identifying Heteroscedasticity in Data Analysis – Part 2
Using diagnostic tools and statistical tests for heteroscedasticity, such as residual plots, the Breusch-Pagan test, and the White test, is an important part of regression analysis for assessing the validity of the model assumptions:
- Residual Plot:
- Within the residual plot depicted above, it becomes apparent that the residuals tend to cluster towards the right side of the graph. This pattern signifies that as the predicted values increase, the residuals also exhibit an upward trend. In other words, we can conclude that the spread of residuals widens as the predicted values grow larger.
- This violates one of the key assumptions of linear regression, which is that the variance of the residuals is constant across the plot (i.e., the spread is constant).
- Heteroscedasticity indicates that our model is not learning the true relationship between diabetes and obesity.
- Breusch-Pagan Test:
- It is a formal statistical test for heteroscedasticity.
- Null Hypothesis (H0): There is no heteroscedasticity.
- Alternative Hypothesis (H1): There is heteroscedasticity.
- Both the LM p-value (0.6415) and the F p-value (0.6426) are greater than the common significance level of 0.05.
- This means we do not have enough evidence to reject the null hypothesis.
- Therefore, based on this test, we conclude that there is “no evidence of heteroscedasticity”.
- White Test:
- Both the LM p-value (0.2417) and the F p-value (0.2432) are greater than the common significance level of 0.05.
- This means that we do not have enough evidence to reject the null hypothesis.
- Therefore, based on this test, we conclude that there is “no evidence of heteroscedasticity”.
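For reference, both tests can be run with statsmodels as sketched below, assuming `model` is a fitted OLS results object such as the diabetes-vs-obesity regression from Part 1 of this analysis:

```python
from statsmodels.stats.diagnostic import het_breuschpagan, het_white

bp_lm, bp_lm_p, bp_f, bp_f_p = het_breuschpagan(model.resid, model.model.exog)
w_lm, w_lm_p, w_f, w_f_p = het_white(model.resid, model.model.exog)
print(f"Breusch-Pagan: LM p = {bp_lm_p:.4f}, F p = {bp_f_p:.4f}")
print(f"White:         LM p = {w_lm_p:.4f}, F p = {w_f_p:.4f}")
```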
Identifying Heteroscedasticity in Data Analysis – Part 1
- Linear regression is a statistical method used to model the relationship between a dependent variable and one or more independent variables by fitting a linear equation to the observed data.
- In this case, our regression line will be, Diabetes = β0 + β1 * Obesity + ε.
- where β0 is the intercept representing the predicted value of diabetes when obesity is zero.
- β1 is the change in diabetes for a one-unit change in obesity.
- ε (Residual) is the error term, representing the difference between the predicted and actual values of diabetes.
- The linear model for the above features i.e., diabetes and obesity, captures 14% of the total variation in the data.
- As we can see, from the above plot, the data points are clustered at the right end which indicates the presence of heteroscedasticity. This can be witnessed by plotting a residual plot. The residual plot is shown below.
- The most important thing that is seen in a residual plot is the variation in data points.
- As we can see, the spread or variance of the residuals is not constant across the range of predicted values.
- The variance of residuals seems to be clustered as the predicted values increase and we can also see the presence of outliers which are individual points that deviate significantly from the others.
- Heteroscedasticity can be problematic because it violates one of the assumptions of linear regression; among other consequences, the model estimates can have large standard errors, which reduces the precision of the coefficient estimates.
To check for heteroscedasticity, we can use diagnostic tools and statistical tests, such as residual plots, the Breusch-Pagan test, the White test, or the Goldfeld-Quandt test. I will post them in my next post.
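A sketch of the regression and residual plot described above, assuming the merged dataframe `obesity_diabetes` with "%DIABETIC" and "% OBESE" columns (the names used in the "Distributions of obesity and diabetes" post):

```python
import matplotlib.pyplot as plt
import statsmodels.api as sm

X = sm.add_constant(obesity_diabetes["% OBESE"])
model = sm.OLS(obesity_diabetes["%DIABETIC"], X).fit()
print(model.rsquared)  # share of the variation explained by the linear model

plt.scatter(model.fittedvalues, model.resid, alpha=0.5)
plt.axhline(0, color="red", linestyle="--")
plt.xlabel("Predicted diabetes (%)")
plt.ylabel("Residuals")
plt.show()
```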
Thank you !!!
Distributions of obesity and diabetes
Summarise:
- ‘diabetes‘ and ‘obesity‘ datasets are joined using inner join on the ‘FIPS’ column, creating a new data frame named ‘obesity_diabetes.’ This merge combines data related to diabetes and obesity based on a common identifier.
- Histograms are plotted to visualize the relationship between the ‘%DIABETIC’ and ‘% OBESE’ columns within the ‘obesity_diabetes‘ data frame to provide insights about the distributions.
- Distribution of “% DIABETIC”:
- The statistics shown on the plot (mean, median, mode, skew, and kurtosis) suggest that the distribution of “% DIABETIC” is relatively close to normal.
- Mean (7.14), Median (7.0), and Mode (6.90) are quite close to each other, which is a positive sign for normality.
- Skewness (0.09) indicates a slightly right-skewed distribution, but the value is very close to zero, whereas a normal distribution has zero skewness. There may be some outliers on the right side of the distribution.
- Kurtosis (2.76) is slightly below the kurtosis of a normal distribution (3.0), indicating somewhat lighter tails overall, although we can see a small peak near the skewed region (right side).
- Distribution of “% OBESE”:
- Even though mean, median, and mode are close to each other, skewness and kurtosis give a clearer view of the distribution.
- Skewness (-2.69) indicates a left-skewed distribution. The tail on the left side is longer than on the right, suggesting that there are many outliers on the left side of the distribution.
- Kurtosis (12.32) is an extremely high value, much higher than the kurtosis of normal distribution i.e., 3. It suggests that the distribution has heavier tails compared to the normal distribution which indicates that the distribution has many extreme values, i.e., outliers.
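The plotted statistics can be computed as in the sketch below, using the merged `obesity_diabetes` dataframe described above:

```python
from scipy import stats

for col in ["%DIABETIC", "% OBESE"]:
    values = obesity_diabetes[col].dropna()
    print(col,
          "mean:", round(values.mean(), 2),
          "median:", round(values.median(), 2),
          "skew:", round(stats.skew(values), 2),
          "kurtosis:", round(stats.kurtosis(values, fisher=False), 2))  # fisher=False gives kurtosis with normal = 3
```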