- Categorical
- Nominal (example: Red, Green, Blue)
- Ordinal (example: Small, Medium, Large)
- Numerical
- Discrete (example: 585 people, 2 dogs)
- Continuous (example: Age, height, temperature)
- Linear Regression
- Logistic Regression
- Simple Logistic Regression
- Multiple Logistic Regression
Simple Linear Regression draws many candidate lines and picks the best-fit line: the one with the minimum sum of squared differences between the Actual Values and the Predicted Values (the smallest sum). This method is called Ordinary Least Squares.
y - ŷ is the difference between the Actual Value and the Predicted Value (the residual).
- Sum of Squares of Residuals (SSres): the sum of the squared differences between the Actual values and the Predicted values.
- Total Sum of Squares (SStot): the sum of the squared differences between the Actual values and the Average line.
R Squared = 1 - (SSres / SStot)
- The R-Squared value tells us how good our best-fit line is compared to the Average line.
- The closer the R-Squared value is to 1, the better our model is.
- However, the R-Squared value can be easily inflated by the number of variables: the more variables we add, the larger R-Squared can become, even if those variables don't help. To avoid this, we need to use the Adjusted R-Squared value, which penalizes adding variables that do not improve the model (see the sketch below).
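A minimal sketch of how R-Squared and Adjusted R-Squared relate. The actual/predicted arrays and the predictor count `p` here are made-up illustrations, not values from the datasets in these notes:

```python
import numpy as np

# Illustrative actual and predicted values (not from any real dataset)
y_actual = np.array([39343.0, 46205.0, 37731.0, 43525.0, 39891.0])
y_pred   = np.array([38000.0, 44000.0, 40000.0, 42000.0, 41000.0])

ss_res = np.sum((y_actual - y_pred) ** 2)           # Sum of Squares of Residuals
ss_tot = np.sum((y_actual - y_actual.mean()) ** 2)  # Total Sum of Squares

r_squared = 1 - ss_res / ss_tot

# Adjusted R-Squared penalizes extra predictors:
# n = number of observations, p = number of predictors
n, p = len(y_actual), 1
adj_r_squared = 1 - (1 - r_squared) * (n - 1) / (n - p - 1)

print(r_squared, adj_r_squared)
```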
- to find the correlation between Salary and Years of Experience.
- Open File in gretl. Model > Ordinary Least Squares
- Dependent Variable: Salary, Independent Variable: Years of Experience
- Coefficient: a 1-unit increase in Years of Experience results in a $9,449.96 increase in Salary.
- p-value: tells us the statistical significance of the relationship between the variables; the smaller the p-value, the stronger the evidence of a real relationship. In this case it is 1.14e-020.
- Graphs > Fitted, Actual Plots > Fitted Vs Actual
- Forecasts >
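The same OLS fit can be sketched outside gretl with statsmodels. The file name `Salary_Data.csv` and the column names `YearsExperience` and `Salary` are assumptions about the CSV layout, not something confirmed in these notes:

```python
import pandas as pd
import statsmodels.api as sm

# Assumed file and column names for the salary dataset
df = pd.read_csv("Salary_Data.csv")

X = sm.add_constant(df["YearsExperience"])  # add the intercept term
y = df["Salary"]

model = sm.OLS(y, X).fit()

# The summary shows the slope coefficient (the ~9,449.96 per year mentioned above),
# its p-value, and R-squared
print(model.summary())
```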
Before we can use Linear Regression, we need to make sure the following assumptions hold. Only then should we proceed with LR.
- Linearity
- Homoscedasticity
- Multivariate normality
- Independence of errors
- Lack of multicollinearity
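A rough way to eyeball the linearity and homoscedasticity assumptions is a residuals-vs-fitted plot. This is only a sketch, reusing the assumed salary file and column names from above:

```python
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt

# Assumed salary data, as in the earlier sketch
df = pd.read_csv("Salary_Data.csv")
model = sm.OLS(df["Salary"], sm.add_constant(df["YearsExperience"])).fit()

# Residuals vs fitted values: an even, patternless scatter around zero is
# consistent with linearity and homoscedasticity; a curve or funnel shape
# suggests one of the assumptions is violated.
plt.scatter(model.fittedvalues, model.resid)
plt.axhline(0, color="red", linestyle="--")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()
```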
- to create a model which analyzes multiple variables (the related spending, etc.) of 50 startups and understand which ones yield the best profit.
- The csv file includes the spending and profit of each startup.
- As `State` is a categorical variable, we need to encode it first.
- Make sure not to fall into the Dummy Variable trap too (omit one of the dummy columns), as sketched below.
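A sketch of encoding `State` with pandas while avoiding the dummy variable trap; `drop_first=True` omits one dummy column. The column name `State` and the file `P12-50-Startups.csv` are taken from these notes, but the exact CSV layout is an assumption:

```python
import pandas as pd

df = pd.read_csv("P12-50-Startups.csv")  # file name from the notes; layout assumed

# One-hot encode State; drop_first=True removes one dummy column so the
# remaining dummies are not perfectly collinear (the dummy variable trap).
df = pd.get_dummies(df, columns=["State"], drop_first=True, dtype=float)

print(df.head())
```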
As there can be multiple features for predicting the label, we can't just use all of them. We need to discard the features which are not useful for prediction. There are 2 main reasons:
- Garbage in, garbage out
- When you have thousands of variables, it is not practical to explain them all to the management level. We might want to keep only the important variables.
- All-in
- Backward Elimination
- Forward Selection
- Bidirectional Elimination
- Score Comparison
Methods 2, 3, and 4 are forms of Stepwise Regression.
- P12-50-Startups.csv
- our significance level is 5% (0.05)
- using backward elimination, we are left with only one variable, `RD Spend`. However, before we eliminated `Marketing Spend`, we took a look at the graph: there is some kind of relationship going on, and its p-value is around 0.05, only a bit higher than our defined significance level of 0.05.
- So how can we fix this kind of problem?
- That's when `Adjusted R-Squared` comes in.
- If we compare the `Adjusted R-Squared` values of all 4 models, we can see that the 3rd model has the highest value (with the variables `RD Spend` and `Marketing Spend`).
- So we can conclude that the 3rd model is the best model.
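For reference, a sketch of automating backward elimination with statsmodels: at each step the predictor with the highest p-value above the 0.05 significance level is dropped, and Adjusted R-Squared is printed so the intermediate models can be compared. The column names `State` and `Profit` are assumptions about the startups CSV:

```python
import pandas as pd
import statsmodels.api as sm

def backward_elimination(X, y, significance_level=0.05):
    """Drop the least significant predictor until all p-values <= level."""
    X = sm.add_constant(X)
    while True:
        model = sm.OLS(y, X).fit()
        p_values = model.pvalues.drop("const")
        if p_values.empty or p_values.max() <= significance_level:
            return model
        worst = p_values.idxmax()
        print(f"dropping {worst} (p={p_values[worst]:.3f}, "
              f"adj R2={model.rsquared_adj:.3f})")
        X = X.drop(columns=[worst])

df = pd.read_csv("P12-50-Startups.csv")                                   # assumed layout
df = pd.get_dummies(df, columns=["State"], drop_first=True, dtype=float)  # encode State
final_model = backward_elimination(df.drop(columns=["Profit"]), df["Profit"])
print(final_model.summary())
```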
- The company has sent out an email to customers promoting an item to buy. The email CSV file includes whether or not each customer clicked the promoted link in the email.
- We want to predict a customer's action (take action or not) based on past information (Age, Gender).
- for binary classification, we can't really use linear regression to fit a best-fit line. We have to use the `Sigmoid` function to separate the classes.
- Let's say we have ages of 20, 30, 40 and 50. We can project those points on the X axis and plot their probabilities of belonging to a class (1 or 0) as below. Example: a 20-year-old has a very low probability (0.7%) of clicking the promotion, whereas a 50-year-old has a very high probability (99.4%).
- Using this intuition, we can draw a threshold of 0.5 (defined by us and can be changed accordingly) in the middle. Any points below this line belong to class 0 (NO) and any points above the line belong to class 1 (YES).
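The sigmoid squashes any value into (0, 1), and the 0.5 threshold turns that probability into a class. The intercept and slope below are made up purely to reproduce the shape described above (roughly 0.7% at age 20 and over 99% at age 50):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Made-up coefficients, chosen only to illustrate that the probability
# of clicking rises with age.
b0, b1 = -12.0, 0.35

for age in [20, 30, 40, 50]:
    p = sigmoid(b0 + b1 * age)
    predicted_class = 1 if p >= 0.5 else 0   # 0.5 threshold, adjustable
    print(f"age {age}: P(click) = {p:.3f} -> class {predicted_class}")
```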
- using Age and Take action data
- => Model > Limited Dependent Variable > Logit
- For graph, => Graphs > Fitted Actual Plot > Against Age
- Do the same thing with Female or Male variable.
- => Analysis > Forecasts > we can see the predictions.
- False Positive (Type I error, Actual 0 => Predicted 1)
- False Negative (Type II error, Actual 1 => Predicted 0)
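A small sketch of reading false positives and false negatives off a confusion matrix with scikit-learn; the label vectors are illustrative only:

```python
from sklearn.metrics import confusion_matrix

# Illustrative actual vs predicted labels (1 = took action, 0 = did not)
y_actual = [0, 0, 1, 1, 0, 1, 0, 1]
y_pred   = [0, 1, 1, 0, 0, 1, 0, 1]

# With labels ordered 0, 1 the matrix is [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_actual, y_pred).ravel()
print(f"False Positives (Type I):  {fp}")   # actual 0, predicted 1
print(f"False Negatives (Type II): {fn}")   # actual 1, predicted 0
```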
- to predict whether the customer of a bank will churn (`Exited`: 0 / 1) or not based on their information, so that the bank can take the necessary action on the highest-risk customers to keep them using the bank's services.
- This can be applied to similar scenarios, such as whether a customer will default on a loan, etc.
- Segmenting data with similar traits into different groups.
- We can apply multiple transformations to the variables:
- square root
- square
- natural log
- Why do we want to use transformations?
- Basically, to make the effect of a change in a variable consistent, regardless of whether its value is large or small.
- Example: a $1,000 change to a $1,000 balance and to a $10,000 balance have very different effects; the second one is barely noticeable. If we use the natural log instead, changes to both can have a consistent effect.
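A quick numeric illustration of that point; the balances are made up. The same $1,000 is a big relative change on a small balance and a tiny one on a large balance, whereas equal *relative* changes (such as doubling) move the natural log by the same amount at any scale:

```python
import numpy as np

# The same $1,000 added to a small vs a large balance is a very different
# relative change...
for balance in [1_000.0, 100_000.0]:
    print(f"+$1000 on {balance:>9.0f}: relative change {1_000.0 / balance:.1%}, "
          f"log change {np.log(balance + 1_000.0) - np.log(balance):.3f}")

# ...but the same relative change (doubling) shifts the log by the same
# amount regardless of scale, which is the consistency we want.
for balance in [1_000.0, 100_000.0]:
    print(f"doubling {balance:>9.0f}: log change "
          f"{np.log(2 * balance) - np.log(balance):.3f}")
```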
- Generally, `Balance` is associated with `Age`, meaning older people tend to have a larger balance in their account. However, sometimes that is not the case: there are young people who manage to accumulate wealth, and older people who haven't accumulated enough even in their 50s or 60s. So we want to create a new feature to separate out those groups. Example: `WealthAccumulation = Balance / Age`
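A one-line sketch of that derived feature in pandas. The file name `Churn_Modelling.csv` and the column names `Balance` and `Age` are assumptions about the churn data:

```python
import pandas as pd

# Assumed churn data with Balance and Age columns
df = pd.read_csv("Churn_Modelling.csv")

# New feature: balance accumulated per year of age, to separate
# "young but wealthy" from "older but not wealthy" customers.
df["WealthAccumulation"] = df["Balance"] / df["Age"]
print(df[["Age", "Balance", "WealthAccumulation"]].head())
```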
- Multicollinearity refers to a situation in which two or more explanatory variables in a multiple regression model are highly linearly related.
- Basically you are putting variables into the model which are very similar in nature. Example: putting both `WealthAccumulation` and `Balance` into the model, which are very similar.
- How can we check it in gretl? => After creating the model > Analysis > Collinearity
- We can check Correlation Matrix in gretl by => View > Correlation Matrix
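Outside gretl, the same checks can be sketched with a correlation matrix and variance inflation factors (VIF). The file and column names are assumptions, and `WealthAccumulation` is added here only to demonstrate a deliberately collinear feature:

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Assumed churn data; keep a few numeric explanatory variables
df = pd.read_csv("Churn_Modelling.csv")
X = df[["Age", "Balance"]].copy()
X["WealthAccumulation"] = X["Balance"] / X["Age"]   # deliberately collinear with Balance

print(X.corr())   # pairwise correlations; values near +/-1 are a warning sign

X_const = sm.add_constant(X)
vif = pd.Series(
    [variance_inflation_factor(X_const.values, i) for i in range(X_const.shape[1])],
    index=X_const.columns,
)
print(vif)   # common rule of thumb: VIF above ~10 signals multicollinearity
```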
- CAP (Cumulative Accuracy Profile) allows you to assess model performance.
- Note: CAP is not the same as ROC (Receiver Operating Characteristic).
- Let's say we have a history of information on customers who churned (exited) the service. Using this information, we trained and created a model which predicts the customers who are likely to churn (exit) the service.
- Out of 1000 customers, we calculated the `P-Hat` value, which is the model's prediction of the likelihood of each person exiting the service, and we compare it against the actual value `Exited`.
- We can draw the CAP below. According to the model's CAP, we can cover 80% of the targeted customers even if we send the promo email to just 50% of the customers, sorted by `P-Hat` in descending order (the people with the highest probability of leaving the service). That way those customers can stay with the service for a longer period of time. This is a value-added service for the company.
- This will tell us how likely the customer is going to churn, etc.
- Let's say the cost of sending an email is 1 cent per email. If we can target the customers who have the highest probability of churning, we can save the company a lot of money, rather than blasting emails to everyone.
- By varying the target % (whether we want to reach 50% or 80% of customers, etc.), we avoid unnecessarily contacting or spamming customers who are unlikely to churn.
- As there are many points where we could set the target, we need to check whether there is a prominent difference between point A and point B. If the difference between them is not significant, we might want to consider choosing point A.
- We want to get as high an ROI as possible.
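A sketch of building the CAP curve by hand: sort customers by `P-Hat` descending, then plot the cumulative share of actual churners captured against the share of customers contacted. The `exited` and `p_hat` arrays are random stand-ins for the actual label and the model output, not real data:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

# Illustrative stand-ins: 1000 customers, actual churn flag and model P-Hat
exited = rng.integers(0, 2, size=1000)
p_hat = np.clip(exited * 0.6 + rng.uniform(0, 0.5, size=1000), 0, 1)

# Sort customers by predicted churn probability, highest first
order = np.argsort(-p_hat)
captured = np.cumsum(exited[order]) / exited.sum()        # share of churners captured
contacted = np.arange(1, len(exited) + 1) / len(exited)   # share of customers contacted

plt.plot(contacted, captured, label="model CAP")
plt.plot([0, 1], [0, 1], "--", label="random selection")
plt.xlabel("Fraction of customers contacted (sorted by P-Hat)")
plt.ylabel("Fraction of churners captured")
plt.legend()
plt.show()
```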
- we can get how much the odds increase by taking the exponent of a variable's coefficient. For example: a 1-unit increase in Age multiplies the odds of the customer churning by 1.075.
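A small sketch of turning a logit coefficient into an odds ratio; the 0.0724 value is a made-up coefficient chosen only because it exponentiates to roughly the 1.075 mentioned above:

```python
import numpy as np

age_coefficient = 0.0724   # hypothetical logit coefficient for Age
odds_ratio = np.exp(age_coefficient)
print(f"each extra year of age multiplies the odds of churning by {odds_ratio:.3f}")
```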
- Additional Factors (which appear only after the model is deployed)
- Changes in behaviour (such as people start to use mobile banking instead of traditional banking)
- Changes in Process
- Changes in Existing Factors (customers getting older and the model doesn't accommodate the new customer base, etc.)
- Competitor
- Changes in Industry
- Changes in Regulations
- Changes in Product
- Depletion
- Spontaneous Changes (e.g., customers from France leave the service due to a boycott, or more customers from France join the service because yours is the only one available, etc.)
- Assess
- Retrain
- Rebuild
- Another way is a `Champion Challenger` setup, where you run the original model and the new model side by side and then compare their performance. Or you can split the population half and half and run the two models, etc.
- Original Data
- Prepared Data
- Uploaded Data
- Analysis
- Insights
- Final