Here’s a quick tip on reducing the categories in the feature “armed”.
Cardinality refers to the number of distinct values in a feature. We can count the unique values in every categorical feature and draw a bar plot to see the cardinality. This is important because we want to avoid curse-of-dimensionality problems while modeling.
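As a rough sketch of that cardinality check, assuming the data sits in a pandas DataFrame: the toy rows and the column names other than "armed" are made up for illustration and are not the real dataset.

```python
import pandas as pd

# Toy stand-in for the real dataset; the values and the extra columns are assumptions.
df = pd.DataFrame({
    "armed": ["gun", "knife", "gun", "unarmed", "hammer", "toy weapon"],
    "gender": ["M", "F", "M", "M", "F", "M"],
    "age": [34, 27, 45, 19, 52, 30],
})

# Keep only the non-numeric (categorical) columns.
categorical = df.select_dtypes(exclude="number")

# Count the distinct values in each categorical column, highest first.
cardinality = categorical.nunique().sort_values(ascending=False)
print(cardinality)

# With matplotlib installed, cardinality.plot(kind="bar") draws the bar plot.
```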
Cardinality of Armed is as follows:
From the above plot, it is evident that we have 80+ categories in the feature armed. One-hot encoding that variable would add 80+ columns to our dataset, which is very inefficient. We cannot simply drop the feature either, because it might carry important information for modeling.
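To make the column growth concrete, here is a small sketch using pandas' get_dummies on a handful of made-up values (not the real data). One-hot encoding produces one column per distinct category, so 80+ categories would mean 80+ columns.

```python
import pandas as pd

# Hypothetical sample of raw values; the real feature has 80+ distinct ones.
armed = pd.Series(["gun", "knife", "unarmed", "hammer", "gun"], name="armed")

# One-hot encode: each distinct category becomes its own column.
encoded = pd.get_dummies(armed, prefix="armed")

print(armed.nunique())   # number of distinct categories
print(encoded.shape[1])  # number of columns after encoding (the same number)
```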
To counter this problem, we can group the categories in the feature into a few broader ones, which reduces the dimensions significantly while keeping the relevant information.
I have grouped the categories in the following manner.
I have reduced the 80+ unique categories to 8: “Firearms”, “Edged_Weapons”, “Blunct_Objects”, “Tools_and_Construction_Items”, “Improvised_Weapons”, “Non_lethal_Weapons”, “Not_Weapons”, and “Miscellaneous_Weapons”.
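One way such a grouping could be written is with a dictionary and pandas' Series.map. This is only a sketch: the raw values and which group each one belongs to are my assumptions for illustration, not the actual mapping used, and sending unmapped values to “Miscellaneous_Weapons” is likewise an assumed fallback.

```python
import pandas as pd

# Hypothetical raw-value -> group mapping; only a few entries are shown.
GROUPS = {
    "gun": "Firearms",
    "knife": "Edged_Weapons",
    "machete": "Edged_Weapons",
    "hammer": "Tools_and_Construction_Items",
    "taser": "Non_lethal_Weapons",
    "unarmed": "Not_Weapons",
}

def group_armed(series: pd.Series) -> pd.Series:
    """Map raw values to groups; anything unmapped (or missing) falls back
    to Miscellaneous_Weapons, which is an assumption of this sketch."""
    return series.map(GROUPS).fillna("Miscellaneous_Weapons")

df = pd.DataFrame(
    {"armed": ["gun", "knife", "machete", "unarmed", "crossbow", "hammer"]}
)
df["armed_grouped"] = group_armed(df["armed"])

print(df["armed"].nunique(), "->", df["armed_grouped"].nunique())
```

Keeping the mapping in a single dictionary makes it easy to review and adjust which raw value goes into which group.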
Now we can check the cardinality of all our categorical features.
Grouping categories in this way reduces the number of columns generated when converting categorical data to numerical data, while retaining the important information.