
[Project 004] Spaceship Titanic: Predict Which Passengers Are Transported to an Alternate Dimension

Introduction

The Spaceship Titanic competition on Kaggle is a binary classification problem: the task is to predict whether a passenger was transported to an alternate dimension when the ship collided with a spacetime anomaly. In this blog post, I will discuss my approach to solving this problem and how I achieved a leaderboard score of 0.79471.


Data

The dataset provided by Kaggle consists of two CSV files: train.csv and test.csv. The train.csv file contains information about each passenger, such as their home planet, age, whether they elected to go into cryosleep, their cabin, and how much they spent on the ship's amenities, along with the target column, Transported, which indicates whether they were transported to an alternate dimension. The test.csv file contains the same information except for the Transported column, which is what we need to predict.
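To give a sense of the setup, here is a minimal sketch of loading the two files with pandas. The file paths are an assumption; adjust them to wherever the CSVs live (on Kaggle they sit under the competition's input directory).

```python
import pandas as pd

# Load the competition files (adjust the paths to your local layout)
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

print(train.shape, test.shape)               # quick sanity check on sizes
print(train["Transported"].value_counts())   # the binary target we need to predict
```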


Data Preprocessing

Before we can train our model, we need to preprocess the data. This involves cleaning the data, handling missing values, and converting categorical variables into numerical ones.


Cleaning the Data

The Cabin column contains each passenger's cabin identifier but has a lot of missing values, so after extracting the deck and side features described below, I dropped the raw Cabin column. Similarly, I dropped the Name and PassengerId columns, as they are not useful as model features (though PassengerId is still needed later to build the submission file).
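A rough sketch of this step, continuing from the DataFrames loaded above (an illustration, not my exact code):

```python
# Keep the test PassengerIds for the submission file before dropping the column
test_ids = test["PassengerId"]

# Name and PassengerId are not useful as model features; the raw Cabin column
# is dropped later, after the deck/side features are extracted
for df in (train, test):
    df.drop(columns=["Name", "PassengerId"], inplace=True)
```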


Handling Missing Values

There are missing values in the Age and Destination columns. For the Age column, I filled in the missing values with the median age. For the Destination column, I filled in the missing values with the most common destination for passengers who did not go into cryosleep and were older than 12 years old.
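Sketched in pandas, the imputation looks roughly like the following. The rule for Destination matches the description above, but treat the snippet as an illustration rather than my exact implementation:

```python
# Impute Age with the training-set median
median_age = train["Age"].median()
train["Age"] = train["Age"].fillna(median_age)
test["Age"] = test["Age"].fillna(median_age)

# Impute Destination with the most common destination among awake passengers
# older than 12, computed on the training set
mask = (train["CryoSleep"] == False) & (train["Age"] > 12)
most_common_dest = train.loc[mask, "Destination"].mode()[0]
train["Destination"] = train["Destination"].fillna(most_common_dest)
test["Destination"] = test["Destination"].fillna(most_common_dest)
```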


Converting Categorical Variables

There are several categorical variables in the dataset, such as HomePlanet, CryoSleep, and Destination. To convert these variables into numerical ones, I used one-hot encoding.
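With pandas this can be done with get_dummies. The column list below reflects the dataset's categorical columns, but the snippet is only a sketch of the idea:

```python
# One-hot encode the categorical columns (dummy_na keeps a column for missing values)
cat_cols = ["HomePlanet", "CryoSleep", "Destination", "VIP"]
train = pd.get_dummies(train, columns=cat_cols, dummy_na=True)
test = pd.get_dummies(test, columns=cat_cols, dummy_na=True)

# Make sure test ends up with exactly the same feature columns as train
test = test.reindex(columns=train.columns.drop("Transported"), fill_value=0)
```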


Feature Engineering

I created new features from the Cabin column, which follows the format deck/num/side: I extracted the deck and the side of the ship as separate features. The dataset also includes a VIP column indicating whether the passenger paid for special VIP service, which I kept as a feature.
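Since Cabin values look like "B/12/P", the deck and side can be pulled out with a simple string split; a sketch:

```python
# Split Cabin ("deck/num/side") into its parts and keep deck and side as features
for df in (train, test):
    cabin_parts = df["Cabin"].str.split("/", expand=True)
    df["Deck"] = cabin_parts[0]
    df["Side"] = cabin_parts[2]
    df.drop(columns="Cabin", inplace=True)

# Deck and Side are categorical, so they get one-hot encoded like the other columns
train = pd.get_dummies(train, columns=["Deck", "Side"], dummy_na=True)
test = pd.get_dummies(test, columns=["Deck", "Side"], dummy_na=True)
test = test.reindex(columns=train.columns.drop("Transported"), fill_value=0)
```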


Model Selection

For this problem, I decided to use a random forest classifier. I split the training data into a training set and a validation set, and used the validation set to tune the hyperparameters of the model. I used the accuracy metric to evaluate the performance of the model.
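Below is a minimal sketch of this setup with scikit-learn. The hyperparameter values shown are placeholders rather than the tuned values I ended up with, and the fillna(0) fallback for any remaining missing values (such as the amenity spending columns) is an assumption, not necessarily how my original notebook handled them:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Separate features and target; fill any remaining missing values as a simple fallback
X = train.drop(columns="Transported").fillna(0)
y = train["Transported"].astype(int)

# Hold out 20% of the training data as a validation set
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Placeholder hyperparameters; in practice these were tuned against the validation set
model = RandomForestClassifier(n_estimators=300, random_state=42)
model.fit(X_train, y_train)

print("Validation accuracy:", accuracy_score(y_val, model.predict(X_val)))
```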


Results

After training the model on the training data and validating it on the validation set, I achieved an accuracy of about 0.78 on the validation set. I then used the trained model to predict which passengers in the test set were transported and submitted my predictions to Kaggle, where my submission scored 0.79471 on the public leaderboard.
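Generating the submission file looks roughly like this. Kaggle expects a CSV with a PassengerId column and a True/False Transported column; test_ids is the column saved back in the cleaning step:

```python
# Align the test features with the training features and predict
X_test = test.reindex(columns=X.columns, fill_value=0).fillna(0)
predictions = model.predict(X_test).astype(bool)

# Build the submission file in the format Kaggle expects
submission = pd.DataFrame({"PassengerId": test_ids, "Transported": predictions})
submission.to_csv("submission.csv", index=False)
```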


Conclusion

In this blog post, I discussed my approach to the Spaceship Titanic competition on Kaggle. I preprocessed the data by cleaning it, handling missing values, one-hot encoding the categorical variables, and engineering new features from the Cabin column. I then trained a random forest classifier on the training data and achieved an accuracy of about 0.78 on the validation set. Finally, I used the trained model to predict which passengers in the test set were transported, which scored 0.79471 on Kaggle.


Citation

A. Howard, A. Chow, and R. Holbrook, "Spaceship Titanic," Kaggle, 2022. [Online]. Available: https://kaggle.com/competitions/spaceship-titanic. [Accessed: March 27, 2023].


GitHub Repository



