Abstract: The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew.
This sensational tragedy shocked the international community and led to better safety regulations for ships. In this paper we analyze what sorts of people were likely to survive and use machine learning tools to predict, as accurately as possible, which passengers survived the tragedy.
Index Terms - Machine learning.

1. Introduction

Machine learning means the application of any computer-enabled algorithm that can be applied against a data set to find a pattern in the data. This encompasses basically all types of data science algorithms: supervised, unsupervised, segmentation, classification, or regression. An algorithm in data mining (or machine learning) is a set of heuristics and calculations that creates a model from data.
To create a model, the algorithm first analyzes the data you provide, looking for specific types of patterns or trends. The algorithm uses the results of this analysis over many iterations to find the optimal parameters for creating the mining model. These parameters are then applied across the entire data set to extract actionable patterns and detailed statistics.

The mining model that an algorithm creates from your data can take various forms, including:
· A set of clusters that describe how the cases in a dataset are related.
· A decision tree that predicts an outcome, and describes how different criteria affect that outcome.
· A mathematical model that forecasts sales.
· A set of rules that describe how products are grouped together in a transaction, and the probabilities that products are purchased together.

Choosing an Algorithm by Type
· Classification algorithms predict one or more discrete variables, based on the other attributes in the dataset.
· Regression algorithms predict one or more continuous numeric variables, such as profit or loss, based on other attributes in the dataset.
· Segmentation algorithms divide data into groups, or clusters, of items that have similar properties.
· Association algorithms find correlations between different attributes in a dataset. The most common application of this kind of algorithm is for creating association rules, which can be used in a market basket analysis.
· Sequence analysis algorithms summarize frequent sequences or episodes in data, such as a series of clicks in a web site, or a series of log events preceding machine maintenance.

2. Literature survey

Every machine learning algorithm works best under a given set of conditions. Making sure your algorithm fits the assumptions and requirements ensures superior performance.
Not every algorithm can be used in every situation. When the outcome to be predicted is categorical, as survival is here, algorithms such as Logistic Regression, Decision Trees, SVM, and Random Forest are appropriate choices.

Logistic Regression

Logistic Regression is a classification algorithm. It is used to predict a binary outcome (1 / 0, Yes / No, True / False) given a set of independent variables.
To represent a binary or categorical outcome, we use dummy variables. Logistic regression can also be thought of as a special case of linear regression for a categorical outcome variable, where the log of odds is used as the dependent variable. In simple words, it predicts the probability of occurrence of an event by fitting data to a logit function.

Performance of a logistic regression model:

AIC (Akaike Information Criterion): The analogue of adjusted R² in logistic regression is AIC. AIC is a measure of fit that penalizes the model for the number of model coefficients. Therefore, among candidate models we prefer the one with the minimum AIC value.

Null Deviance and Residual Deviance: Null deviance indicates how well the response is predicted by a model with nothing but an intercept. Residual deviance indicates how well the response is predicted by the model once the independent variables are added. In both cases, the lower the value, the better the model.

Confusion Matrix: A tabular representation of actual vs. predicted values. It is used to compute the accuracy of the model, and comparing the matrices for training and test data helps to detect overfitting.

McFadden R²: Also called pseudo R². When analyzing data with a logistic regression, an exact equivalent of R-squared does not exist. However, several pseudo R-squared statistics have been developed to evaluate the goodness of fit of logistic models.
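The fit measures listed above can be computed directly from the true labels and the probabilities predicted by a fitted model. The following is a minimal sketch using only the Python standard library; the function names, the toy labels, and the toy probabilities are illustrative assumptions and are not taken from the paper's experiments.

```python
import math

def log_likelihood(y, p, eps=1e-12):
    """Bernoulli log-likelihood of 0/1 labels y under predicted probabilities p."""
    total = 0.0
    for yi, pi in zip(y, p):
        pi = min(max(pi, eps), 1.0 - eps)  # clip to avoid log(0)
        total += yi * math.log(pi) + (1 - yi) * math.log(1.0 - pi)
    return total

def fit_metrics(y, p, n_coefficients):
    """Deviances, AIC and McFadden's pseudo R-squared for a binary model."""
    ll_model = log_likelihood(y, p)
    # The intercept-only (null) model predicts the overall event rate for everyone.
    base_rate = sum(y) / len(y)
    ll_null = log_likelihood(y, [base_rate] * len(y))
    return {
        "null_deviance": -2.0 * ll_null,
        "residual_deviance": -2.0 * ll_model,
        "aic": 2.0 * n_coefficients - 2.0 * ll_model,
        "mcfadden_r2": 1.0 - ll_model / ll_null,
    }

# Hypothetical toy data: six outcomes and the probabilities a model assigned to them.
y = [1, 0, 1, 1, 0, 0]
p = [0.9, 0.2, 0.7, 0.6, 0.4, 0.1]
metrics = fit_metrics(y, p, n_coefficients=2)  # intercept + one predictor (assumed)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

The sketch assumes ungrouped binary data, for which the deviance is simply minus twice the log-likelihood. It also assumes that both outcomes occur in the data; otherwise the null log-likelihood is close to zero and the pseudo R-squared is not meaningful.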
The accuracy of a model is computed from the confusion matrix as:

accuracy = (true positives + true negatives) / (true positives + true negatives + false positives + false negatives)

Decision Trees

A decision tree is a hierarchical tree structure that can be used to divide a large collection of records into smaller sets of classes by applying a sequence of simple decision rules. A decision tree model consists of a set of rules for dividing a large heterogeneous population into smaller, more homogeneous (mutually exclusive) classes. The attributes can be variables of any type (binary, nominal, ordinal, or quantitative), while the classes must be qualitative (categorical, binary, or ordinal). In short, given data consisting of attributes together with their classes, a decision tree produces a sequence of rules (or series of questions) that can be used to recognize the class. One rule is applied after another, resulting in a hierarchy of segments within segments. The hierarchy is called a tree, and each segment is called a node.
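A single decision rule of the kind described above can be found by trying every candidate split and keeping the one that leaves the two resulting segments most homogeneous. The sketch below is illustrative only: it assumes Gini impurity as the homogeneity measure (the paper does not state which criterion was used), and the feature layout and toy records are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 class labels (0.0 means perfectly pure)."""
    if not labels:
        return 0.0
    p1 = sum(labels) / len(labels)
    return 2.0 * p1 * (1.0 - p1)

def best_split(rows, labels):
    """Return (impurity, feature index, threshold) for the rule
    'value <= threshold' with the lowest weighted impurity of the two children."""
    best = None
    n = len(rows)
    for f in range(len(rows[0])):
        for threshold in sorted({row[f] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[f] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[f] > threshold]
            if not left or not right:
                continue  # a rule that separates nothing is not a split
            impurity = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or impurity < best[0]:
                best = (impurity, f, threshold)
    return best

# Hypothetical toy records: [is_female, pclass], with 1 = survived.
rows = [[1, 1], [1, 2], [1, 3], [0, 1], [0, 2], [0, 3]]
labels = [1, 1, 1, 0, 0, 0]
impurity, feature, threshold = best_split(rows, labels)
print(impurity, feature, threshold)
```

A full tree would be grown by applying `best_split` again to each of the two child segments until the segments are pure or too small to divide further.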
With each successive division, the members of the resulting sets become more and more similar to each other. Hence, the algorithm used to construct a decision tree is referred to as recursive partitioning.

Decision tree applications:
· predicting tumor cells as benign or malignant
· classifying credit card transactions as legitimate or fraudulent
· classifying buyers from non-buyers
· deciding whether or not to approve a loan
· diagnosing various diseases based on symptoms and profiles

3. Methodology

Our approach to solving the problem:
1. Collect the raw data needed to solve the problem.
2. Import the dataset into the working environment.
3. Preprocess the data, which includes data wrangling and feature engineering.
4. Explore the data and prepare a model for analysis using machine learning algorithms.
5. Evaluate the model and iterate until we get satisfactory model performance.
6. Compare the results and select the model that gives the more accurate result.

The data we collected is still raw data, which is very likely to contain mistakes, missing values, and corrupt values. Before drawing any conclusions from the data we need to do some data preprocessing, which involves data wrangling and feature engineering. Data wrangling is the process of cleaning and unifying messy and complex data sets for easy access and analysis. Feature engineering attempts to create additional relevant features from the existing raw features in the data and to increase the predictive power of learning algorithms.

4. Experimental Analysis and Discussion

a) Data set description:
The original data has been split into two groups: a training dataset (70%) and a test dataset (30%). The training set should be used to build the machine learning models.
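The preprocessing and the 70/30 split described above could look like the following minimal sketch. It is not the procedure used in the paper: median imputation of missing ages, the `is_female` encoding, the `family_size` feature, the function names, and the ten toy records are all assumptions made for illustration.

```python
import random
import statistics

def preprocess(passengers):
    """Fill missing ages with the median age and add simple derived features."""
    known_ages = [p["age"] for p in passengers if p["age"] is not None]
    median_age = statistics.median(known_ages)
    cleaned = []
    for p in passengers:
        row = dict(p)  # copy, so the raw record is left untouched
        if row["age"] is None:
            row["age"] = median_age  # data wrangling: impute a missing value
        row["is_female"] = 1 if row["sex"] == "female" else 0  # encode a category
        row["family_size"] = row["sibsp"] + row["parch"] + 1   # engineered feature
        cleaned.append(row)
    return cleaned

def train_test_split(rows, train_fraction=0.7, seed=0):
    """Shuffle the rows reproducibly and cut them into training and test sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Hypothetical records; the values are invented for this example.
passengers = [
    {"survival": 1, "pclass": 1, "sex": "female", "age": 29.0, "sibsp": 0, "parch": 0},
    {"survival": 0, "pclass": 3, "sex": "male", "age": None, "sibsp": 1, "parch": 0},
    {"survival": 1, "pclass": 2, "sex": "female", "age": 35.0, "sibsp": 1, "parch": 2},
    {"survival": 0, "pclass": 3, "sex": "male", "age": 22.0, "sibsp": 0, "parch": 0},
    {"survival": 0, "pclass": 1, "sex": "male", "age": 54.0, "sibsp": 0, "parch": 0},
    {"survival": 1, "pclass": 3, "sex": "female", "age": None, "sibsp": 0, "parch": 1},
    {"survival": 0, "pclass": 2, "sex": "male", "age": 41.0, "sibsp": 0, "parch": 0},
    {"survival": 1, "pclass": 1, "sex": "female", "age": 18.0, "sibsp": 1, "parch": 0},
    {"survival": 0, "pclass": 3, "sex": "male", "age": 30.0, "sibsp": 4, "parch": 1},
    {"survival": 1, "pclass": 2, "sex": "male", "age": 3.0, "sibsp": 1, "parch": 1},
]

cleaned = preprocess(passengers)
train_rows, test_rows = train_test_split(cleaned, train_fraction=0.7)
print(len(train_rows), len(test_rows))
```

For simplicity the median is computed over all records before splitting; in practice it would usually be computed on the training portion only, so that no information from the test set leaks into the model.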
The test set should be used to see how well the model performs on unseen data. For the test set, the ground truth for each passenger is not provided; the task is to predict these outcomes.
For each passenger in the test set, the trained model is used to predict whether or not they survived the sinking of the Titanic.

b) Measures

Data Dictionary

Variable   Definition                                    Key
survival   Survival                                      0 = No, 1 = Yes
pclass     Ticket class                                  1 = 1st, 2 = 2nd, 3 = 3rd
sex        Sex
Age        Age in years
sibsp      # of siblings / spouses aboard the Titanic
parch      # of parents / children aboard the Titanic
ticket     Ticket number
fare       Passenger fare
cabin      Cabin number
embarked   Port of Embarkation                           C = Cherbourg, Q = Queenstown, S = Southampton

Variable Notes

pclass: A proxy for socio-economic status (SES)
1st = Upper
2nd = Middle
3rd = Lower

age: Age is fractional if less than 1. If the age is estimated, it is in the form of xx.5

sibsp: The dataset defines family relations in this way:
Sibling = brother, sister, stepbrother, stepsister
Spouse = husband, wife (mistresses and fiancés were ignored)

parch: The dataset defines family relations in this way:
Parent = mother, father
Child = daughter, son, stepdaughter, stepson
Some children travelled only with a nanny, therefore parch=0 for them.

c) Results

After training the algorithms, we validate the trained models with the test data set and measure their goodness of fit using confusion matrices. 70% of the data is used as the training data set and 30% as the test data set.

Confusion matrix for decision tree

Training data set
             Reference
Prediction     0     1
         0   395    71
         1    45   203

Test data set
             Reference
Prediction     0     1
         0    97    20
         1    12    48

Confusion matrix for logistic regression

Training data set
             Reference
Prediction     0     1
         0   395    12
         1    21   204

Test data set
             Reference
Prediction     0     1
         0    97    12
         1    21    47

d) Enhancements and reasoning

Predicting survival with other machine learning algorithms, such as random forests and various support vector machines, may improve the accuracy of prediction for the given data set.
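As a check on the results, accuracy can be recomputed from the four confusion matrices exactly as they are printed above. This is a minimal sketch; the function name and dictionary keys are illustrative, and the row/column orientation (rows as predictions, columns as reference) is an assumption that does not affect accuracy.

```python
def accuracy(matrix):
    """Accuracy of a 2x2 confusion matrix given as [[n00, n01], [n10, n11]]."""
    correct = matrix[0][0] + matrix[1][1]  # diagonal: predictions that match
    total = sum(sum(row) for row in matrix)
    return correct / total

# Counts copied from the reported confusion matrices.
reported = {
    "decision tree, training": [[395, 71], [45, 203]],
    "decision tree, test": [[97, 20], [12, 48]],
    "logistic regression, training": [[395, 12], [21, 204]],
    "logistic regression, test": [[97, 12], [21, 47]],
}

for name, matrix in reported.items():
    print(f"{name}: {accuracy(matrix):.4f}")
```

With the matrices as printed, the decision tree training matrix gives an accuracy of about 0.8375 and the logistic regression test matrix about 0.8136, close to the 83.7 and 81.3 figures quoted in the conclusion.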
5. Conclusion

The analyses revealed interesting patterns across individual-level features. Factors such as socioeconomic status, social norms, and family composition appeared to have an impact on the likelihood of survival. These conclusions, however, were derived only from findings in the data. For the given data set, the accuracy of predicting survival using the decision tree algorithm (83.7%) is higher than that of logistic regression (81.3%).

Many stories and oral histories have been collected from both survivors and relatives of the passengers in the past century, and these qualitative data sets may help to elucidate what really happened that fateful night.