Data Science Portfolio

This portfolio is a compilation of my data science projects that focus on data analysis, applied machine learning, statistics, and software engineering.

Stand-alone Projects

Each is an end-to-end project that addresses a real-world problem with the aid of data visualisations.

Predict Sentiment in Yelp Reviews

This is a comprehensive NLP project on Yelp’s user reviews, covering insights, trends, and sentiment analysis to offer recommendations to business owners. Using logistic regression with N-gram feature engineering, I trained a model that predicts sentiment (negative or positive) with 95% accuracy. You can find the details of the full project here.
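As a rough illustration of what N-gram feature engineering means (this is a minimal sketch, not the project's actual code), the snippet below counts word unigrams and bigrams in a review. The function name `ngram_counts` and the lowercase, whitespace-based tokenisation are assumptions made for this example.

```python
from collections import Counter


def ngram_counts(text, n_max=2):
    """Count word n-grams of length 1..n_max in a lowercased, whitespace-split text."""
    tokens = text.lower().split()
    counts = Counter()
    for n in range(1, n_max + 1):
        # Slide a window of n tokens across the text.
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


features = ngram_counts("not good at all")
print(features)
```

Bigrams such as "not good" keep negation together, which unigrams alone would lose. Counts like these can then be fed to a classifier such as logistic regression.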

The Faces of Reddit

This project applies computer vision methods such as facial detection and facial landmark construction. It sets out to create an average face from a set of images belonging to a community. For example, the images from each subreddit were used to generate an average face that represents the typical facial features of that subreddit.
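To give a sense of the final averaging step only, here is a minimal sketch that assumes the faces have already been detected, aligned on their landmarks, and resized to the same dimensions. Representing grayscale images as nested lists and the name `average_image` are assumptions for this example, not the project's code.

```python
def average_image(images):
    """Pixel-wise mean of equally sized grayscale images given as nested lists."""
    height, width = len(images[0]), len(images[0][0])
    return [
        [sum(img[r][c] for img in images) / len(images) for c in range(width)]
        for r in range(height)
    ]


# Two tiny 1x2 "images" standing in for aligned face crops.
faces = [[[0, 100]], [[100, 200]]]
print(average_image(faces))
```

Without the alignment step, a plain pixel average would blur the features, which is why landmark construction matters.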

My YouTube presentation video can be found here. The code for this project can be found here.

IoT Monitor Environment Data

An interactive dashboard application that monitors the condition of my room in real time. The main functions of this application include:

Monitor the temperature, pressure, light, color and accelerometer data
Detect anomalies in the system’s temperature
Provide analytics about energy usage using historical data
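One simple way to flag temperature anomalies is a trailing z-score check. The sketch below is an assumption about how such a check could look, not the method the dashboard actually uses; the function name, window size, and threshold are all hypothetical.

```python
from statistics import mean, stdev


def flag_anomalies(readings, window=5, threshold=3.0):
    """Return indices of readings that deviate from the trailing-window mean
    by more than `threshold` standard deviations."""
    anomalies = []
    for i in range(window, len(readings)):
        history = readings[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        # Skip flat windows, where the standard deviation is zero.
        if sigma > 0 and abs(readings[i] - mu) > threshold * sigma:
            anomalies.append(i)
    return anomalies


temps = [21.0, 21.2, 20.9, 21.1, 21.0, 21.1, 27.5, 21.0]
print(flag_anomalies(temps))
```

Here only the spike at index 6 is flagged. A trailing window keeps the check usable on streaming sensor data, since it never looks at future readings.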

You can access the web application here. Username and password are needed to access this page:

Username: pi

Password: GrandBudapest2014

An exploratory data analysis of the sensor data can be found here.

Enterprise Project

Industry: Telecom

I was responsible for designing an automated system that detects and evaluates last-mile network incidents affecting internet subscribers, with the goal of enhancing service quality and reducing customer churn. Some results from this project include:

• Outputs from the detection system supported 8,000 to 10,000 subscribers monthly, resulting in a 43% to 50% reduction in churn rate for subscribers affected by incidents.

• Implemented univariate clustering and statistical methods, leading to a 60% improvement in detection accuracy.

• Automated workflows and data pipelines that handle data cleansing, ETL, machine learning implementation, detection processes and error analysis of 40 million rows of data daily.

• Engaged with stakeholders to understand their use cases and define appropriate KPIs for each party involved.

I also created a Power BI dashboard for monitoring, testing, and communicating the results to the project manager and executives.

Classification Problems

These projects aim to solve classification problems involving structured data. Some were part of Kaggle competitions.

Detect Fraud from Customer Transactions

Researchers from the IEEE Computational Intelligence Society (IEEE-CIS) presented a challenge to improve the accuracy of fraud prevention systems and, with it, the customer experience in banking services. In this competition, I benchmarked machine learning models on a challenging large-scale dataset. The data contains a lot of information about user behavior and potential patterns for detecting fraud. Here’s my kernel.

MNIST Digit Recognition

MNIST Digit Recognition is a knowledge competition on Kaggle. Many people start practicing machine learning with this competition, and so did I. This is a multiclass classification problem: given the image of a handwritten digit, predict which digit it shows. The dataset provides an interesting opportunity to learn about model comparison and feature engineering. Here’s my kernel.

Predicting Diabetes using Ensemble Methods

The objective of this project is to predict whether or not a patient has diabetes, based on the diagnostic measurements included in the dataset. The project focuses on handling missing values by using information from other variables in the dataset to predict and impute them. I used ensemble methods such as bagging and boosting, as well as k-fold cross-validation, to predict diabetes with an accuracy of 76%. Here’s my analysis in an R notebook.
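The original analysis is in R. Purely to illustrate how k-fold cross-validation partitions the data, here is a small Python sketch. The function name `k_fold_indices` and the choice of contiguous, unshuffled folds are assumptions for this example.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k contiguous, unshuffled folds."""
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


folds = list(k_fold_indices(10, 3))
for train_idx, test_idx in folds:
    print(len(train_idx), test_idx)
```

Each sample lands in exactly one test fold, so every observation is used for both training and evaluation across the k rounds. In practice the rows would usually be shuffled before splitting.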