
ACCURACY, RECALL, PRECISION, F1 SCORE

Classifying a single point can result in a true positive (truth = 1, guess = 1), a true negative (truth = 0, guess = 0), a false positive (truth = 0, guess = 1), or a false negative (truth = 1, guess = 0). Accuracy measures how many classifications your algorithm got correct out of every classification it made. Recall measures the percentage of the relevant items your classifier was able to successfully find. Precision measures the percentage of the items your classifier found that were actually relevant. Precision and recall trade off against each other: as one goes up, the other tends to go down. The F1 score combines precision and recall, and it will be low if either precision or recall is low. The decision to use precision, recall, or F1 score ultimately comes down to the context of your classification. Maybe you don't care if your classifier has a lot of false positives; if that's the case, precision doesn't matter as much.
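A minimal sketch of the four metrics using scikit-learn's metrics module; the truth and guess labels below are invented for illustration.

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

truth = [1, 0, 1, 1, 0, 1, 0, 0]   # actual labels (hypothetical data)
guess = [1, 0, 0, 1, 1, 1, 0, 0]   # classifier's predictions (hypothetical data)

print(accuracy_score(truth, guess))   # correct classifications out of all classifications
print(recall_score(truth, guess))     # TP / (TP + FN): relevant items successfully found
print(precision_score(truth, guess))  # TP / (TP + FP): found items that were actually relevant
print(f1_score(truth, guess))         # harmonic mean of precision and recall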

Naive Bayes

Two events are independent if the occurrence of one event does not affect the probability of the second event. If two events are independent, then P(A ∩ B) = P(A) × P(B). A prior is an additional piece of information that tells us how likely an event is. A frequentist approach to statistics does not incorporate a prior; a Bayesian approach incorporates prior knowledge. Bayes' Theorem is the following: P(A|B) = P(B|A) · P(A) / P(B)
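A small worked sketch of Bayes' Theorem with invented numbers, estimating the probability that an email is spam given that it contains the word "free":

# All three inputs are made-up values for illustration.
p_free_given_spam = 0.40   # P(B|A): P("free" appears | email is spam)
p_spam = 0.20              # P(A): the prior probability of spam
p_free = 0.12              # P(B): P("free" appears in any email)

# Bayes' Theorem: P(A|B) = P(B|A) * P(A) / P(B)
p_spam_given_free = (p_free_given_spam * p_spam) / p_free
print(p_spam_given_free)   # 0.666..., much higher than the 0.20 prior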

Random Forest

A random forest is an ensemble machine learning model. It makes a classification by aggregating the classifications of many decision trees. Random forests are used to avoid overfitting: by aggregating the classifications of multiple trees, having overfitted trees in a random forest is less impactful. Every decision tree in a random forest is created using a different subset of data points from the training set. Those data points are chosen at random with replacement, which means a single data point can be chosen more than once. This process is known as bagging. When creating a tree in a random forest, a randomly selected subset of features is considered as candidates for the best splitting feature. If your dataset has n features, it is common practice to randomly select the square root of n features. Boosting steps: draw a random subset of training samples d1 without replacement from the training set D to train a weak learner C1...
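A minimal sketch of a random forest in scikit-learn, using the built-in iris dataset; the parameter values are illustrative, not tuned.

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# n_estimators is the number of trees to aggregate; max_features="sqrt"
# considers the square root of n features at each split, as described above
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt")
forest.fit(X_train, y_train)
print(forest.score(X_test, y_test))   # accuracy on the held-out test set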

Decision Tree

Decision trees are machine learning models that try to find patterns in the features of data points. Decision trees are usually constructed from top to bottom: at each level of the tree, the feature that best splits the training set labels is selected as the "question" of that level. Two different criteria are available to split a node, the Gini index and information gain; which is more convenient depends on the problem. In scikit-learn, the sklearn.tree module contains the DecisionTreeClassifier class. An intuitive interpretation of information gain is that it is a measure of how much information the individual features provide us about the different classes. Gini impurity: when making decision trees, calc...
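A minimal sketch of DecisionTreeClassifier; the criterion parameter switches between the two splitting criteria mentioned above, and the iris dataset stands in for real data.

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# criterion="gini" uses Gini impurity; criterion="entropy" uses information gain
tree = DecisionTreeClassifier(criterion="gini")
tree.fit(X, y)
print(tree.predict(X[:1]))   # class predicted for the first sample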

Support Vector Machines

A Support Vector Machine (SVM) is a powerful supervised machine learning model used for classification. An SVM makes classifications by defining a decision boundary and then seeing which side of the boundary an unclassified point falls on. The decision boundary is defined using a training set of classified points. Decision boundaries exist even when your data has more than two features: if there are three features, the decision boundary is a plane rather than a line. As the number of dimensions grows past 3, it becomes very difficult to visualize these points in space. Nonetheless, SVMs can still find a decision boundary; rather than a separating line or a separating plane, it is called a separating hyperplane. Optimal decision boundaries: in general, we want our decision boundary to be as far away from training points as possible. Maximizing the dis...
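A minimal sketch of an SVM with a linear decision boundary in scikit-learn; the two-feature points and labels below are invented.

from sklearn.svm import SVC

points = [[1, 2], [2, 3], [3, 3], [7, 8], [8, 8], [9, 10]]   # hypothetical training set
labels = [0, 0, 0, 1, 1, 1]

classifier = SVC(kernel="linear")   # defines a linear decision boundary
classifier.fit(points, labels)
print(classifier.predict([[4, 5]]))   # which side of the boundary does this point fall on?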

Logistic Regression

Logistic Regression is used to perform binary classification, predicting whether a data sample belongs to the positive (present) class, labeled 1, or the negative (absent) class, labeled 0. The sigmoid function maps the log-odds, the weighted sum of the feature values and their coefficients, to the range (0, 1), providing the probability of a sample belonging to the positive class. A loss function measures how well a machine learning model makes predictions; the loss function of Logistic Regression is log-loss. A classification threshold is used to determine the probabilistic cutoff for whether a data sample is classified as belonging to the positive or negative class. The standard cutoff for Logistic Regression is 0.5, but the threshold can be higher or lower depending on the nature of the data and the situation. Scikit-learn has a Logistic Regression implementation that allows you to fit a model to your data, find the feature coefficients...
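A minimal sketch of scikit-learn's LogisticRegression; the hours-studied data below is invented for illustration.

import numpy as np
from sklearn.linear_model import LogisticRegression

hours = np.array([1, 2, 3, 4, 5, 6, 7, 8]).reshape(-1, 1)   # hypothetical feature
passed = np.array([0, 0, 0, 0, 1, 1, 1, 1])                 # hypothetical labels

model = LogisticRegression()
model.fit(hours, passed)

print(model.coef_, model.intercept_)   # feature coefficient and intercept (the log-odds weights)
print(model.predict_proba([[4.5]]))    # probability of each class for a new sample
print(model.predict([[4.5]]))          # classification using the default 0.5 threshold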

K-Nearest Neighbor

The K-Nearest Neighbors (KNN) algorithm is a powerful supervised machine learning algorithm typically used for classification, although it can also perform regression. The central idea is that data points with similar attributes tend to fall into similar categories. In scikit-learn: from sklearn.neighbors import KNeighborsClassifier. The three steps of the K-Nearest Neighbors algorithm are: normalize the data, find the k nearest neighbors, and classify the new point based on those neighbors. Euclidean distance: to find the Euclidean distance between two points, we first calculate the squared difference between the points in each dimension; adding up all of these squared differences and taking the square root gives the Euclidean distance: d = √((a₁ − b₁)² + (a₂ − b₂)² + …)
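A minimal sketch of the three steps above in scikit-learn; the two-feature data is invented, and MinMaxScaler stands in for the normalization step.

import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MinMaxScaler

X = np.array([[65.0, 1.2], [72.0, 3.1], [68.0, 1.0], [80.0, 4.5]])   # hypothetical features
y = np.array([0, 1, 0, 1])                                           # hypothetical labels

# Step 1: normalize so one feature's scale doesn't dominate the distance
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)

# Steps 2-3: find the k nearest neighbors and classify by majority vote
classifier = KNeighborsClassifier(n_neighbors=3)
classifier.fit(X_scaled, y)
print(classifier.predict(scaler.transform([[70.0, 2.0]])))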

Regression Vs Classification

Regression is used to predict outputs that are continuous: the outputs are quantities that can be flexibly determined based on the inputs of the model rather than being confined to a set of possible labels. Classification is used to predict a discrete label: the outputs fall under a finite set of possible outcomes. Many situations have only two possible outcomes; this is called binary classification (True/False, 0 or 1). Multi-class classification is when there are more than two possible outcomes. It is useful for customer segmentation, image categorization, and sentiment analysis for understanding text. To perform these classifications, we use models like Naive Bayes, K-Nearest Neighbors, and SVMs.
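A small sketch contrasting the two tasks on invented data: a regressor predicts a continuous quantity, while a classifier predicts a discrete label.

from sklearn.linear_model import LinearRegression, LogisticRegression

X = [[1], [2], [3], [4], [5]]                 # hypothetical single-feature inputs
prices = [100.0, 150.0, 210.0, 250.0, 310.0]  # continuous targets (regression)
sold = [0, 0, 1, 1, 1]                        # discrete labels (classification)

print(LinearRegression().fit(X, prices).predict([[6]]))   # a quantity, roughly 360
print(LogisticRegression().fit(X, sold).predict([[6]]))   # a label, 0 or 1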

Linear Regression

When we are trying to find the line that best fits a set of data, we are performing Linear Regression. LOSS: for each data point, we calculate loss, a number that measures how bad the model's (in this case, the line's) prediction was. This is also referred to as error. GRADIENT DESCENT: as we try to minimize loss, we take each parameter we are changing and move it in the direction that decreases loss. This process is called gradient descent. CONVERGENCE: convergence is when the loss stops changing (or changes very slowly) as parameters are updated. LEARNING RATE: we have to choose a learning rate, which determines how far down the loss curve we move on each step. A small learning rate will take a long time to converge; you might run out of time or cycles before getting an answer. A large learning rate might skip over the best value and might never converge.
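A toy sketch of gradient descent for a one-parameter line y = m * x; the data, starting value, and learning rate are invented for illustration.

xs = [1, 2, 3, 4]
ys = [2, 4, 6, 8]        # data generated by the line y = 2x

m = 0.0                  # initial guess for the slope
learning_rate = 0.01     # how far down the loss curve each step moves

for step in range(1000):
    # gradient of the mean squared loss with respect to m
    gradient = sum(2 * (m * x - y) * x for x, y in zip(xs, ys)) / len(xs)
    m -= learning_rate * gradient   # move m in the direction that decreases loss

print(m)   # converges toward the best-fit slope, 2.0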

Machine Learning

Machine learning can be branched out into the following categories: supervised learning and unsupervised learning. Supervised Learning is where the data is labeled and the program learns to predict the output from the input data. Supervised learning problems can be further grouped into regression and classification problems. Regression: in regression problems, we are trying to predict a continuous-valued output. Classification: in classification problems, we are trying to predict one of a discrete set of values. Unsupervised Learning is a type of machine learning where the program learns the inherent structure of the data based on unlabeled examples. Clustering is a common unsupervised machine learning approach that finds patterns and structure in unlabeled data by grouping it into clusters.
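A minimal sketch of clustering with scikit-learn's KMeans; the unlabeled points below are invented, with two visually obvious groups.

from sklearn.cluster import KMeans

points = [[1.0, 1.1], [1.2, 0.9], [0.8, 1.0],
          [8.0, 8.2], [7.9, 8.1], [8.3, 7.8]]   # hypothetical unlabeled data

# no labels are given; KMeans groups the points into k = 2 clusters
kmeans = KMeans(n_clusters=2, n_init=10)
kmeans.fit(points)
print(kmeans.labels_)   # e.g. [0 0 0 1 1 1] (cluster ids are arbitrary)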