Probably the hardest part of any Data Science project is coming up with an original but feasible idea. Anyone looking for such an idea can easily fall into the "dataset trap": spending many hours browsing existing datasets and trying to come up with interesting new ideas. This approach has one problem. Someone who only looks at existing datasets (on Kaggle, Google Datasets, or FiveThirtyEight, for example) limits their creativity, because they see only the narrow set of tasks those datasets were built for.
How to Create Ideas for Data Science Projects in 5 Easy Steps
Sometimes we do explore a dataset simply because it interests us. Building a successful model on Kaggle data for which countless models already exist has little practical value, although it does let us learn something new. But data scientists are people who strive to create something new and unique, something that can bring real benefit to the world.
How do you generate new ideas? To answer this question, we combined our own experience with the results of creativity research. This gave us 5 questions whose answers help in finding new ideas. As you work through them, you will walk the path of idea generation and make full use of your creative potential. As a result, you will have new, unique ideas that you can implement in your Data Science projects.
1. Why do I want to start working on a new project?
When you think about starting a new project, you have an intention or purpose in mind. So first, work out why you want to create another data science project. Having a rough outline of the goal you are aiming for will help you focus your search for an idea. Here are some options:
- This is a portfolio project that you are going to showcase to potential employers.
- This is a draft for an article about concepts, models, or exploratory data analysis.
- This is a project that will allow you to practice something, such as natural language processing, data visualization, data preprocessing, or a specific machine learning algorithm.
- This is something else entirely that is not covered by this list.
2. What are my areas of interest and experience?
There are three main reasons to think about this question:
- First, domain knowledge is an important asset for every data scientist. You can solve a problem by processing data only if you understand the subject area the data comes from. Otherwise you will apply algorithms and produce visualizations and predictions that look unsound to any practitioner in that field. And if what you are doing doesn't make sense, why bother doing it at all?
- Second, it is important that both the project idea and the dataset you are working with interest you. You will hardly want to force yourself to spend your free time on a project you do not care about. If you are interested in a field, you do not need to be an expert in it, but you must be prepared to invest time in additional research and in understanding the problems behind the data.
- Third, researchers have found that constraining the creative process leads to better results. This means that focusing on a specific subject area, or on a combination of several areas, will yield better ideas than searching without any restrictions.
3. Where can I find inspiration?
The main source of inspiration is reading. As you search for an idea, you can find interesting topics by reading various materials:
- News, articles, blog posts. Reading about events or phenomena that authors have observed is a great way to generate ideas. For example, WIRED published an article arguing that the autocomplete function in Google search shows political bias. Inspired by this, you could investigate systematic biases in language models. Or you might wonder whether a person's geographic location can be predicted from the search queries they enter into Google.
- Scientific literature. Scientific publications often describe unresolved issues related to the topic under study. For example, a publication listed on Semantic Scholar discusses the GPT-2 language model. It notes that, without fine-tuning, the model performs no better than random guessing on certain tasks, such as question answering. Why not write something about the nuances of fine-tuning this model?
- Materials from the Data Science field. Reading about Data Science topics and overviews of related projects can lead to new ideas. For example, after reading about an NLP study of The Office, you might think about your own study of a series. Maybe you could analyze a few films and try to identify language patterns? You could also try using the GPT-2 model to write scripts for your favorite TV series.
Inspiration can also be found in everyday life. Whenever a question catches your interest, think about whether you can answer it with data. If something interests you, take a moment and study the relevant data.
How do you generate project ideas based on the above sources of inspiration? Neuroscientists have identified three distinct psychological processes associated with generating ideas:
- You can combine existing ideas to create new ones (combinatorial creativity). For example, various projects have analyzed rental listings posted on Airbnb, and other projects analyze the real estate market. If you combine these ideas, you can ask whether Airbnb is driving up housing prices in a certain city.
- You can explore an existing idea and look for a problem within its framework that you can try to solve (exploratory creativity). For example, you may come across a comparison between data scientists who received a formal education in the field and those who are self-taught. Exploring this further, you can try to find out which group of data scientists is more successful.
- You can take an existing idea and change something in it that completely changes its meaning (transformational creativity). This is the rarest form of creativity. It operates outside the existing conceptual space, so it is hard to understand and even harder to describe. One example: instead of predicting the occurrence of an event, try to predict its non-occurrence.
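The combinatorial example above (Airbnb listings plus housing prices) can be tried out in a few lines before committing to a full project. The following is only a minimal sketch under stated assumptions: the neighborhoods, years, and all numbers are invented for illustration, and the names airbnb_listings, median_price_per_m2, join_on_key, and pearson are hypothetical helpers, not part of any library. It joins two small tables on a shared key and computes a Pearson correlation.

```python
from math import sqrt

# Hypothetical toy records, invented purely for illustration (not real data).
# Keys are (neighborhood, year).
airbnb_listings = {
    ("Center", 2018): 120,
    ("Center", 2019): 180,
    ("Harbor", 2018): 40,
    ("Harbor", 2019): 70,
    ("Suburb", 2019): 10,  # has no matching price record below
}
median_price_per_m2 = {
    ("Center", 2018): 3100.0,
    ("Center", 2019): 3400.0,
    ("Harbor", 2018): 2200.0,
    ("Harbor", 2019): 2350.0,
}


def join_on_key(left, right):
    """Inner-join two dicts on shared keys; return sorted (key, left, right) rows."""
    shared = sorted(set(left) & set(right))
    return [(key, left[key], right[key]) for key in shared]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Combine the two "ideas": keep only the keys present in both sources.
rows = join_on_key(airbnb_listings, median_price_per_m2)
listings = [row[1] for row in rows]
prices = [row[2] for row in rows]
correlation = pearson(listings, prices)
print(f"{len(rows)} matched rows, correlation = {correlation:.2f}")
```

A correlation on joined data like this would not show that Airbnb causes price increases. At best it suggests whether the combined question is worth a closer look with real data.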
4. Where can I find the relevant data?
Once you have decided on the general direction of your research, you need to look for data that will let you work out how to turn the idea into a Data Science project. This step is extremely important in determining whether the idea will succeed. To answer the question in the title of this section, first check whether what you need is already available in existing data stores. If it is not, you may have to collect the data yourself, which complicates the task. Here is an overview of data sources:
- Existing data stores: Kaggle, Google Datasets, FiveThirtyEight, BuzzFeed, AWS, UCI Machine Learning Repository, data.world, Data.gov, and many more that can be found using Google.
- Data sources used by other data scientists. Search Google and Google Scholar for information on a topic of interest. Find out if anyone has already tried to find an answer to a question similar to yours. What data were used in similar studies? For example, Our World in Data presents academic and non-academic data sources that you may not be aware of.
- Data you need to collect yourself. To collect such data, you can resort to web scraping, text analysis, various APIs, event tracking, and working with log files.
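Of the collection methods listed above, working with log files is the easiest to illustrate without any network access. The sketch below is a minimal example under stated assumptions: the log line format, the sample lines, and the names LOG_LINE, parse_log, and hits_per_path are all hypothetical and chosen for illustration. Real logs will need a pattern adapted to their actual format.

```python
import re
from collections import Counter

# Hypothetical log format, assumed for this sketch:
#   2021-03-01 12:00:05 GET /pricing 200
LOG_LINE = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) "
    r"(?P<method>[A-Z]+) (?P<path>\S+) (?P<status>\d{3})$"
)


def parse_log(lines):
    """Yield one dict per well-formed line; skip lines that do not match."""
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match:
            record = match.groupdict()
            record["status"] = int(record["status"])
            yield record


# Invented sample lines, for illustration only.
sample_lines = [
    "2021-03-01 12:00:05 GET /pricing 200",
    "2021-03-01 12:00:09 GET /pricing 200",
    "2021-03-01 12:01:17 POST /signup 500",
    "this line is malformed and will be skipped",
]

# Turn raw text into structured records, then aggregate them.
records = list(parse_log(sample_lines))
hits_per_path = Counter(record["path"] for record in records)
print(hits_per_path.most_common())
```

Skipping malformed lines keeps the sketch short. In a real project it is usually worth counting them as well, since a high skip rate often means the assumed format is wrong.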
If you cannot find data that would let you implement your project idea, reformulate the idea. Try to derive from the original idea one that can be implemented with the data you have. At the same time, ask yourself why you cannot find the data you need. What is wrong with the area you are interested in? What can you do about it? The answers to these questions alone can lead to a new Data Science project.
5. Is the idea I found feasible?
So you have a fantastic idea! But can it be implemented? Go through the earlier questions again. Think about what you want to achieve (question 1), whether you are interested in the chosen area and have experience in it (question 2), and whether you have the data you need to implement the idea (question 4). Now you need to determine one more thing: do you have the skills necessary to implement the idea and achieve your goal?
It is also important to consider how much time you plan to spend on the project. You are probably not going to write a doctoral dissertation on your chosen topic. So the project you build around the idea may cover only part of it. It may even consist only of learning something new that you will need in order to implement the idea later.
After working through the 5 questions above, you should have a question that you can and want to answer, and that fits the amount of time you are willing to spend on achieving your goal.
Summary
- Align your expectations with reality. Finding an original idea that can be implemented will take more than a few hours. It is a continuous process driven by inspiration, so write down everything that comes to mind, for example as notes on your phone. In the end, several of these ideas can be combined into an interesting project.
- Discuss your idea with someone. Talking through your project idea can be a great help. During the conversation, questions may come up that are more interesting than the original idea. You may get a hint about additional data sources. Or maybe you just need a good listener to share your thoughts with, so that you understand better whether the idea is worth pursuing.
- Don't be afraid to start over. Whatever you do, you always learn something new. Every time you write a line of code, you practice and expand your knowledge and skills. If you realize that implementing the idea will not bring you closer to your goal, or that the idea is not feasible, do not be afraid to drop it and move on. The time you spent looking for it is not wasted. What matters is a sober assessment of the benefits that implementing the idea can bring.
We hope that this material will inspire you to create cool new data science projects. Good luck with your projects!