If you’re looking to choose and execute the perfect (for you) data science project, look no further. This guide walks through the entire process, from coming up with an idea and gathering and preparing data to extracting valuable information and sharing it. There are 7 main steps to choosing and executing a data science project, discussed here in the context of a personal project of mine: a ski run recommender. The ski run recommender is a web application where users can select a run they’ve skied before and get recommendations of similar runs to ski at a different resort.
1. Think of what you are interested in.
You really want to choose something that matters to you, because if you don’t, you’ll get burnt out and won’t want to work on it. If you are interested, though, you’ll be motivated to put in the time, it won’t even feel like work, and the quality will be better, which will show when you talk about it. Having domain knowledge is also incredibly helpful in data science projects. Even better, a project you’re genuinely interested in gives you something to talk about with future employers or clients, and that can be a real competitive edge.
Prior to becoming a data scientist, I was a ski instructor. I decided to work on a project that had something to do with skiing, since it was something that I was really interested in and passionate about. I also spent a lot of time doing it and had quite a bit of domain knowledge. Bonus: it was great to talk about in job interviews (like my interview with M&K), and I even used the data I collected in a project for my interview process!
2. Think of problems that you want to solve.
You’ll want to choose a problem whose solution you would personally use or that would help you (or others). A good problem makes some process more efficient or uncovers something previously unknown. The more applicable your project is, the better.
I chose to build a ski run recommender system. The problem I chose to solve was one that I had struggled with a lot as a skier. I was at Vail, where I had never skied before, and I had no idea what trails to ski. I was there for a very important ski instructor exam, and I knew that part of the exam took place on one double black diamond mogul trail, but I had no idea what it was like, and three different people gave me three different descriptions. I had also threatened to break up with my boyfriend at the time, who sent me down a black diamond trail that he said I would love and that actually almost killed me. From those experiences, I wanted to know which trails at a ski area I had never visited were similar to trails I already knew I enjoyed. I knew other people would find this useful too, and as far as I knew, nothing existed that could easily tell you that. This project would make it easier to know where you should ski without skiing the whole mountain to find out (make a process more efficient) and tell you which runs you would probably like (uncover something previously unknown).
3. Decide how you want to approach the problem.
When looking at your problem, you’ll have to decide what’s important. You might be interested in just the answer, or in the probabilities, feature importances, or path to the answer. You could use publicly available data or collect your own. There might already be something out there that does what you want to do, and if so, you could use it as a “template.” You’ll also need to consider what tools you want or need to use, and whether you are looking for experience with a specific tech stack.
4. Clean and explore your data.
Cleaning and exploring your data are arguably the most important parts of executing your data science project. They are also the most time-consuming. There are many things to consider, such as how to treat null values: you could drop them, fill them in with a mean, a median, a sequential value, or something else. You should also think about what to do with outliers, such as dropping them or bringing them into the interquartile range. You’ll have to decide whether to normalize or standardize your data. When looking at your features, consider whether your data already has everything you need, or whether you can engineer new features or scrape more data from other sources. You can decide whether to one-hot encode your categorical features so they can be used numerically. Finally, check that your data has the proper data types.
I scraped my data together from multiple sources, since no one source had every feature I was looking for. I used one-hot encoding on features like whether or not a trail was groomed: instead of simply “groomed” or “ungroomed”, there would be a column with a 1 if groomed and a 0 if not. If there were too many null values, I dropped the row or column, since there was no good way for me to estimate a fill value – other options would have included filling in with the mean or median.
As for outliers, none of my data could really be discarded for that reason, since some trails just had wildly different stats than others. In some cases a data entry error was possible, like a green trail whose pitch seemed far too steep for an easy trail; for those, I used my domain knowledge to make a judgment call. Normalizing or standardizing data is useful when the features are on different scales, as mine were: slope was an angle (between 0 and 90 degrees), while run length was on a scale of thousands of feet. And as in a lot of scraped data, some of my numeric values came through as strings, so I had to convert data types to make them usable in my calculations.
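The cleaning steps above can be sketched with pandas and scikit-learn. The trail names, columns, and values here are invented for illustration, not my actual dataset:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical scraped trail data: a string-typed numeric column,
# a null, and a categorical "groomed" feature.
trails = pd.DataFrame({
    "trail": ["Ramble On", "Prima", "Lovers Leap", "Riva Ridge"],
    "groomed": ["groomed", "ungroomed", "ungroomed", "groomed"],
    "slope_deg": ["12", "28", None, "24"],   # scraped as strings
    "length_ft": [5200, 8300, 4100, 13000],
})

# Convert string numbers to numeric, then drop rows with nulls
# rather than guessing a fill value.
trails["slope_deg"] = pd.to_numeric(trails["slope_deg"])
trails = trails.dropna(subset=["slope_deg"]).reset_index(drop=True)

# One-hot encode the categorical feature: 1 if groomed, 0 if not.
trails["groomed"] = (trails["groomed"] == "groomed").astype(int)

# Standardize the numeric features so slope (degrees) and
# length (thousands of feet) end up on comparable scales.
scaler = StandardScaler()
trails[["slope_deg", "length_ft"]] = scaler.fit_transform(
    trails[["slope_deg", "length_ft"]]
)
print(trails)
```

After this, every feature is numeric and roughly on the same scale, which matters for any distance- or similarity-based model downstream.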
5. Try different things.
Every problem can be looked at from different angles. Maybe you want to frame your problem as supervised learning. Maybe you can use unsupervised learning to gain more knowledge, like clustering before predicting. Do you want to use natural language processing on text features, or would you prefer to stick to numeric features only? Maybe deep learning will work, or maybe it’s more complicated than necessary. Maybe you want to come up with something totally new and different, or maybe something simple and off the shelf will get you the best results.
For my project, I chose to make a recommendation system. I decided that on my first pass, I would keep it simple and not use any text (I’d keep that in my back pocket for v2). I didn’t find deep learning to be necessary or useful in this case, so I skipped that.
As far as modeling went, I used an algorithm that comes with scikit-learn in Python. It did exactly what I wanted: compare multiple trails and give them a similarity score based on all of the features. I just needed to prep my data the right way, figure out how to get everything into the proper format, and decide how best to return the results.
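One common way to do this kind of feature-based comparison in scikit-learn is cosine similarity; here is a minimal sketch, assuming the trails have already been cleaned and scaled into a numeric feature matrix (the names and values are made up):

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical trail feature matrix: each row is one trail,
# e.g. scaled slope, scaled length, and a groomed flag.
trail_names = ["Ramble On", "Prima", "Riva Ridge", "Born Free"]
features = np.array([
    [0.2, 0.3, 1.0],
    [0.9, 0.8, 0.0],
    [0.8, 0.9, 0.0],
    [0.1, 0.4, 1.0],
])

# Pairwise similarity scores between every pair of trails.
scores = cosine_similarity(features)

def recommend(trail, top_n=2):
    """Return other trails, ranked from most to least similar."""
    i = trail_names.index(trail)
    ranked = np.argsort(scores[i])[::-1]  # highest score first
    return [trail_names[j] for j in ranked if j != i][:top_n]

print(recommend("Prima"))
```

Given an input trail, the recommender simply sorts all other trails by their similarity score and returns the top few, which is essentially what the web app does with the user’s selection.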
6. Consider how you want to validate and evaluate your results.
Different problems call for different metrics. For some supervised learning projects, accuracy is relevant, but for others, precision or recall might be more appropriate. Choosing the best metric really depends on the context of your problem (use that domain knowledge!). For unsupervised learning or recommender systems like mine, you may have to survey people or do some sort of A/B testing to validate. Another metric to consider is runtime – if a model gets great results but takes a week to run, is that worth it?
For my project, the best way I could validate results was by surveying friends, family members, and fellow ski instructors who were familiar with the trails. I wanted to see if the recommendations they were getting seemed accurate, based on trails they were familiar with as the input and output. That is subjective, of course, but the important part of the ski run recommender was to make sure that people were happy with the recommendations they received.
7. Figure out how to visualize your results.
The most important part of the project is telling the story of why you did it and what someone can get out of it. To do that, you need to figure out how you’re going to share it. Will you put it into a presentation? Some sort of user interface? A web or mobile app? You want to not only give the results but also make sure your audience knows what they mean and why they’re significant. This relates back to #3 – what part of the results do you care about? Would a graph or plot help convey the results, or would a table be more useful?
For my ski run recommender, I made a web app and hosted it on AWS (Amazon Web Services). You can find it at skirunrecommender.com. The website outputs the top trails in order of most similar to least similar, based on the inputs given by the user. The results also show the features, so the user can infer why the trails are similar to the one they used as input.
If you can follow these steps, you’ll be well on your way to choosing and executing a great data science project. From here, your project will probably take some fine-tuning and iterating. Nothing is ever perfect on the first try. You can also expand on your original project idea. From my original ski run recommender, I decided that in the next version, I could use reviews of trails or text descriptions as another feature and implement Natural Language Processing. Another idea that I had was to turn it into a mobile app instead of just a web app. The possibilities are nearly endless.
Looking back at this project years later, I also see so many ways that I can improve upon my code and streamline my process. I’d love to go back and include classes or create pipelines so that the work is more automated and less manual. As you go along with your project, you’ll see things like that, and you’ll continue to grow as a data scientist through the process.