petersalmon.github.io

Data Science Portfolio by Peter Salamon

This portfolio is a collection of data analysis and machine learning projects I have created in an effort to hone my skills as a data scientist and machine learning engineer. Projects are divided into several broad categories below. Where both options are offered, Jupyter nbviewer is the recommended way to view a project; links are provided below each project title.

Data Processing and Exploration

College Scorecard Analysis

View this project on Jupyter nbviewer or GitHub

This project involves the analysis of colleges across all 50 states between the years 2006 and 2016. The analysis focuses on the financial status and employment status of students after graduation, as well as the distribution of degree types, based on geography.

National Youth Tobacco Survey Analysis

View this project on Jupyter nbviewer or GitHub

This project involves the analysis of responses to the National Youth Tobacco Survey (NYTS) distributed by the Centers for Disease Control and Prevention (CDC). The analysis focuses on the rate of smoking among young adults in the United States and seeks to uncover whether the introduction of electronic cigarettes has really sparked a health crisis among the youth of our nation.

Analysis of Pharmaceutical Product Reviews (NLP)

View this project on Jupyter nbviewer or GitHub

This project involves the analysis of raw text data in the form of reviews written by users of www.drugs.com. The analysis explores various characteristics of the brand names of the medications as well as the reviews of the drugs themselves. Its main goal is to find ways in which drug manufacturers may improve the production quality and marketing of the medications they produce.

Machine Learning Projects

Superconductor Critical Temperature Regression

View this project on Jupyter nbviewer

This project involves the prediction of the critical temperature of various potential superconducting compounds using a Random Forest Regression model. The critical temperature of a material is notoriously difficult to predict. The model created here may aid researchers in inferring critical temperatures from a compound’s chemical structure and characteristics.

Start-Up Company Business Analysis

View this project on Rpubs

Investors often rely on intuition and past experience in order to judge whether a start-up company will be worth investing in. This project aims to put forth a rigorous analysis that can be used to identify relevant factors indicative of a successful start-up company. In addition, a machine learning model is created to predict whether or not a company is likely to succeed based on an array of company characteristics and economic measures.

Autism Spectrum Disorder Classification

View this project on Jupyter nbviewer or GitHub

In this project, a classification model is built with the goal of predicting whether a child will be diagnosed with Autism Spectrum Disorder (ASD). The process for diagnosing ASD is currently both lengthy and costly. A statistical model may relieve some of the shortcomings of the current process, making it more efficient and easier to implement.

Bank Marketing Data Analysis & Classification Using Random Forests

View this project on Rpubs

Built a random forest model to predict likely customers from bank marketing data with 87% accuracy using over 30,000 rows of data collected by a Portuguese banking institution.

Natural Language Processing of Amazon Instant Video Data

View this project on Rpubs

In this analysis, part-of-speech tagging and sentiment analysis using a Logistic Regression classification model are performed on Amazon Instant Video data in an effort to determine whether marketing of the Amazon Instant Video service can be further improved, leading to an increase in revenue.

Cancer Gene Expression Clustering

View this project on Jupyter nbviewer or GitHub

In this project, a clustering analysis is performed on the gene expression data of over 20,000 genes and across five different types of cancers. In addition, a classification model is built with the goal of predicting the type of cancer present in a patient based on patterns in the expression of their genes. Advances in biology and medicine seem to stagnate as a result of overwhelming amounts of data needing to be analyzed. Machine learning can be implemented to compress this data down to smaller and more meaningful portions of information.

Customer Segmentation of an E-Commerce Website

View this project on Jupyter nbviewer

In this project, the customers of a UK-based e-commerce retailer are clustered into groups based on their shopping behaviors. A Logistic Regression model is then created to classify new customers based on their purchase history. The data consists of over 500,000 user transactions spanning 2 years.

Time Series Analysis

Forecasting Wikipedia Web Traffic

View this project on Jupyter nbviewer

In this project, an ARIMA model and Facebook’s Prophet model are used to forecast the future views of some of Wikipedia’s most popular web pages with 95% accuracy.
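The description above does not say how forecast accuracy was measured. One common convention, shown here purely as an assumption, is to report 100% minus the mean absolute percentage error (MAPE). The function name and the toy numbers below are illustrative and are not taken from the project.

```python
def mape(actual, forecast):
    """Mean absolute percentage error, in percent.

    Assumes both sequences have the same length and no actual value is zero.
    """
    errors = [abs(a - f) / abs(a) for a, f in zip(actual, forecast)]
    return 100.0 * sum(errors) / len(errors)


# Made-up daily page views and forecasts, for illustration only.
actual_views = [100, 200, 400]
forecast_views = [90, 210, 400]

error_pct = mape(actual_views, forecast_views)   # roughly 5 percent
accuracy_pct = 100.0 - error_pct                 # roughly 95 percent
print(f"MAPE: {error_pct:.1f}%  accuracy: {accuracy_pct:.1f}%")
```

Under this convention, a forecast that misses by 5% on average would be described as 95% accurate. Other definitions exist, so the figure is best read alongside the notebook itself.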

The Technological Hype Cycle: Is it All Hype?

View this project on my blog or Jupyter nbviewer

Used time series analysis and predictive modeling to forecast the future trajectory of several emerging tech trends, with 95% accuracy, using an ARIMA model and over 10 years of Google Trends time series data.

Deep Learning

Artificial Neural Networks (ANN)

View this project on Jupyter nbviewer or GitHub

This project explores the inner workings of Artificial Neural Networks. A simple neural network model is built both from scratch and with the Keras package. Finally, it is turned into a classifier for recognizing handwritten digits using the famed MNIST dataset.
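To give a flavor of what "from scratch" means here, the following is a minimal sketch of the smallest possible case: a single sigmoid neuron trained with gradient descent to learn the logical AND function. It is not the project's code; the function names, learning rate, and epoch count are all illustrative assumptions, and the real project works with MNIST images rather than a four-row truth table.

```python
import math


def sigmoid(z):
    """Logistic activation, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def train_neuron(samples, epochs=5000, lr=0.5):
    """Train one two-input sigmoid neuron with batch gradient descent
    on the cross-entropy loss. Returns the learned (weights, bias)."""
    w = [0.0, 0.0]
    b = 0.0
    n = len(samples)
    for _ in range(epochs):
        grad_w = [0.0, 0.0]
        grad_b = 0.0
        for x, y in samples:
            p = sigmoid(w[0] * x[0] + w[1] * x[1] + b)
            err = p - y  # gradient of the loss w.r.t. the pre-activation
            grad_w[0] += err * x[0]
            grad_w[1] += err * x[1]
            grad_b += err
        # Step against the average gradient.
        w[0] -= lr * grad_w[0] / n
        w[1] -= lr * grad_w[1] / n
        b -= lr * grad_b / n
    return w, b


def predict(w, b, x):
    """Threshold the neuron's output at 0.5 to get a class label."""
    return 1 if sigmoid(w[0] * x[0] + w[1] * x[1] + b) >= 0.5 else 0


# Truth table for logical AND: ((input_1, input_2), label).
AND_DATA = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]

weights, bias = train_neuron(AND_DATA)
predictions = [predict(weights, bias, x) for x, _ in AND_DATA]
print(predictions)
```

A full network stacks many such units into layers and propagates the same kind of gradient backward through them, which is the part a library such as Keras automates.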

Convolutional Neural Networks (CNN)

View this project on Jupyter nbviewer or GitHub

This project probes the inner workings of Convolutional Neural Networks. The structure and function of CNNs are discussed first, followed by the creation of two simple CNNs. The first model attempts to classify images of dogs and cats, while the second distinguishes between three different types of tumors using brain MRI scans.
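The core operation inside a CNN layer can be sketched in a few lines. The example below slides a small kernel over a toy image with no padding and a stride of 1 (what deep learning libraries call a convolution, strictly a cross-correlation). The function name, the image, and the kernel are illustrative assumptions rather than anything from the project, where the kernels are learned during training instead of written by hand.

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and return
    the resulting feature map as a list of rows."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    feature_map = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            total = 0
            # Multiply the kernel with the image patch beneath it and sum.
            for di in range(kh):
                for dj in range(kw):
                    total += image[i + di][j + dj] * kernel[di][dj]
            row.append(total)
        feature_map.append(row)
    return feature_map


# Toy 4x4 image: dark on the left, bright on the right.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]

# Hand-written kernel that responds to a dark-to-bright vertical edge.
vertical_edge = [
    [-1, 1],
    [-1, 1],
]

feature_map = conv2d_valid(image, vertical_edge)
print(feature_map)  # [[0, 2, 0], [0, 2, 0], [0, 2, 0]]
```

The feature map is nonzero only in the column where the edge sits, which illustrates how a single filter detects one kind of local pattern wherever it appears in the image.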

Predicting Google Stock Prices With Recurrent Neural Networks (RNN)

View this project on Jupyter nbviewer or GitHub

This post uncovers the secrets of the Recurrent Neural Network. The structure and functions of the model are discussed first, followed by a working example of an RNN in action. The goal of the RNN is to predict the future open prices of Google’s stock for January 2017, using the daily open price, close price, high, low, and volume of the stock between 2012 and 2017.
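Before an RNN can be trained on a price history like this, the daily rows are typically reshaped into fixed-length sequences, each paired with the value to predict next. The sketch below shows one way to do that, assuming a sliding window and next-day open price as the target; the helper name, the window length, and the toy rows are hypothetical and do not come from the project or from real stock data.

```python
def make_windows(rows, n_steps, target_col=0):
    """Build supervised samples from a time-ordered list of rows.

    X[k] holds `n_steps` consecutive rows; y[k] is column `target_col`
    of the row immediately after that window.
    """
    X, y = [], []
    for start in range(len(rows) - n_steps):
        X.append(rows[start:start + n_steps])
        y.append(rows[start + n_steps][target_col])
    return X, y


# Made-up rows in the order (open, close, high, low, volume).
toy_rows = [
    (10.0, 10.5, 10.8, 9.9, 1000),
    (10.5, 10.2, 10.9, 10.1, 1200),
    (10.2, 10.7, 10.8, 10.0, 900),
    (10.7, 11.0, 11.2, 10.6, 1500),
    (11.0, 10.9, 11.3, 10.8, 1100),
]

# Three days of history per sample, predicting the next day's open price.
X, y = make_windows(toy_rows, n_steps=3)
print(len(X), y)  # 2 [10.7, 11.0]
```

In practice the features would usually be scaled first (volume is orders of magnitude larger than the prices), and the resulting arrays would have the shape (samples, time steps, features) that recurrent layers expect.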