If you want to see metadata or get more detailed information on the data set, please refer to the link below.
https://s3.us.cloud-object-storage.appdomain.cloud/cf-courses-data/CognitiveClass/DP0701EN/version-2/Metadata.pdf
- [Introduction]
- [Business Problem]
- [Data]
The total number of vehicles worldwide is increasing every year.
This means that car accidents can occur at any time and in any place.
However, there are conditions under which the probability of having an accident rises due to multiple variables.
The purpose of this report is to develop a model for the Seattle government to predict the probability and severity of a car accident, based on different conditions such as weather or road conditions.
The information was provided by the Seattle Police Department and covers the period from 2004 to 2020.
The goal is to identify the conditions that can lead to future car accidents, in order to warn people in advance so that they stay alert and drive more carefully.
In an effort to reduce the frequency of car collisions in a community, an algorithm must be developed to predict the severity of an accident given the current features.
It will give us THREE BIG benefits:
1. Saving lives, the main benefit
2. Reducing the cost of damage to infrastructure
3. Reducing the cost of sending police and paramedics to each accident
The data come from the Seattle Police Department, are recorded by Traffic Records, and include collisions at intersections or mid-block of a segment. The information covers the period from 2004 to May 2020.
The information is organized in a CSV file with 37 attributes and originally 194,673 rows. The data are labeled and unbalanced. Additionally, a document with a description of each column was provided.
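A quick way to confirm the shape and column types described above is to load the CSV with pandas; this is a minimal sketch, and the file name `Data-Collisions.csv` is an assumption about how the export is saved locally.

```python
import pandas as pd

# Load the collision records (the file name is an assumption about the local copy)
df = pd.read_csv("Data-Collisions.csv")

# Confirm the reported dimensions: 37 attributes and 194,673 rows
print(df.shape)

# Most columns are of type 'object' and will need encoding later
print(df.dtypes.value_counts())
```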
Because our information is labeled, we know the outcome of each record, and we have selected the column SEVERITYCODE as the dependent variable. The possible values are:
1 -- Property Damage Only Collision
2 -- Injury Collision
The information is unbalanced because of the difference in the number of samples for each accident type. In our case, there are only two types of accidents. Look at the picture below:
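Besides the figure, the class imbalance can be checked directly on the SEVERITYCODE column; this is a short sketch that continues with the `df` loaded earlier.

```python
# Count the records for each severity class (1 = property damage, 2 = injury)
counts = df["SEVERITYCODE"].value_counts()
print(counts)

# Relative share of each class, which makes the imbalance explicit
print(counts / counts.sum())
```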
In its original form, this data is not fit for analysis. For one, there are many columns that we will not use for this model. Also, most of the features are of type object when they should be of a numerical type.
We must use label encoding to convert the features to our desired data type.
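A minimal sketch of the label-encoding step follows; the column names `WEATHER`, `ROADCOND`, and `LIGHTCOND` are assumptions based on the metadata document, so they should be adjusted to match the actual CSV header.

```python
# Column names for the weather, road, and light conditions are assumed
# from the metadata document; adjust them to match the actual CSV header.
feature_cols = ["WEATHER", "ROADCOND", "LIGHTCOND"]

for col in feature_cols:
    # Map each text category to an integer code; missing values become -1
    df[col + "_CAT"] = df[col].astype("category").cat.codes

print(df[[col + "_CAT" for col in feature_cols]].head())
```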