Building a Data Science project/research paper from scratch – a case study

Keeping aside the mental preparation, motivation and strong determination required to make any good project successful, which I sincerely believe you possess, I would like to point out the technical know-how needed to get started. You will find the codebases of a number of projects that I have worked on in my GitHub profile. Do check out my first research paper, titled “Understanding Email Interactivity and Predicting User Response to Email”.

Points to remember

  1. What problem or issue will your solution or project address?
  2. What are the competing solutions, i.e., how have people addressed the issue until now? What is the state of the art, and what alternatives are available?
  3. A thorough literature survey or background research needs to be done.
  4. Given the state of the art, what extra does your solution achieve, or what gap does it fill? (This is your research contribution.)
  5. The utility of the solution: does the problem really exist, does it really affect your target audience, and does the solution satisfy their current needs?

Developing end-to-end systems

Data Science Project pipeline – a rough overview

Regarding the technologies needed for developing an end-to-end system in Data Science, I found the following to be very useful.

1. Programming language, R or Python or both

If you are new to programming, I recommend learning only Python and forgetting about R for now. As you progress further into Deep Learning, you will also need to learn at least one of the Deep Learning frameworks like PyTorch, Keras, TensorFlow or Theano. For this too, Python is the best choice to invest in. I have covered how to get started with R and Python in separate articles. [Python tutorial, R tutorial]

2. Benchmark and clean datasets

You may have already come across the saying that 80% of the time goes into collecting data, cleaning it and performing feature engineering. I do not completely attest to this claim, but there is some truth in it.

To put it more fairly, I would say 80% of the time goes into formulating the problem, identifying the available data sources, and understanding what type of questions or patterns you are looking for. All of the above requires a certain amount of literature review, which gives you some domain knowledge, and also going through the data, which helps you understand and plan these steps.

Believe me, it’s not that hard. You just need patience and must know where to look.


However, that again comes only from experience. What a conundrum!

Benchmark datasets are used by state-of-the-art systems for evaluation and for comparison against their baseline experiments. I have covered them in a separate blog article.
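To make the role of a benchmark concrete, here is a minimal sketch of a baseline comparison on a fixed train/test split. Everything in it is an assumption made for illustration: the tiny email dataset, the `majority_baseline` and `keyword_model` functions, and the "free" keyword rule are hypothetical and are not taken from any real benchmark or from my paper.

```python
from collections import Counter


def majority_baseline(train_labels):
    """Return the most frequent label seen in the training split."""
    return Counter(train_labels).most_common(1)[0][0]


def keyword_model(text):
    """Hypothetical 'proposed' model: flag any text that mentions 'free'."""
    return "spam" if "free" in text.lower() else "ham"


def accuracy(predicted, gold):
    """Fraction of predictions that match the gold labels."""
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)


# Toy stand-in for a benchmark: the split is fixed so results stay comparable.
train_labels = ["ham", "ham", "ham", "spam", "spam"]
test_set = [
    ("Free prize inside", "spam"),
    ("Meeting at noon", "ham"),
    ("Lunch tomorrow?", "ham"),
    ("Claim your free gift", "spam"),
]

texts = [text for text, _ in test_set]
gold = [label for _, label in test_set]

# The baseline predicts the majority training label for every test example.
majority_label = majority_baseline(train_labels)
baseline_acc = accuracy([majority_label] * len(gold), gold)

# The candidate model is scored on exactly the same test examples.
model_acc = accuracy([keyword_model(t) for t in texts], gold)

print(f"baseline accuracy: {baseline_acc:.2f}")
print(f"model accuracy:    {model_acc:.2f}")
```

The point of the sketch is only the protocol: both systems are scored on the same held-out examples with the same metric, which is what makes numbers reported on a shared benchmark comparable.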

3. Front-end and back-end of your Web application

The Flask framework is easy to get started with. I recommend following Miguel Grinberg’s blog, which is in fact a small course that teaches you, in great detail and in a very practical and organized fashion, how to build a web application using the Flask framework.

Personally, I found it to be very helpful, and developed my first research demo using it. I hope it will get published someday 😛
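As a rough idea of how little code a Flask demo needs, here is a minimal sketch with one page route and one JSON endpoint. The route names, the `predict_label` placeholder and its keyword rule are assumptions made for illustration; they are not the code of my research demo, and a real application would load a trained model instead.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)


def predict_label(text):
    """Hypothetical placeholder for a trained model."""
    return "spam" if "free" in text.lower() else "ham"


@app.route("/")
def index():
    # Front-end side: a real application would render an HTML template here.
    return "<h1>Research demo</h1>"


@app.route("/predict", methods=["POST"])
def predict():
    # Back-end side: read JSON from the request and answer with JSON.
    payload = request.get_json(silent=True) or {}
    text = payload.get("text", "")
    if not text:
        return jsonify(error="Field 'text' is required."), 400
    return jsonify(text=text, label=predict_label(text))
```

Assuming the file is saved as `app.py` and a recent Flask version is installed, it can be served locally with `flask --app app run`. Flask's built-in `app.test_client()` is a convenient way to exercise the routes without starting a server.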

4. How a relational database works.

Learn one relational database, such as PostgreSQL, and one NoSQL database, such as MongoDB. These usually make up the back-end part of our pipeline.
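The difference between the two kinds of database is easiest to see side by side. The sketch below is only an approximation under stated assumptions: it uses Python's built-in `sqlite3` module as a stand-in for PostgreSQL and a plain dictionary serialized to JSON as a stand-in for a MongoDB document, and the `users` and `emails` tables are hypothetical.

```python
import json
import sqlite3

# Relational shape: data is split into tables linked by keys.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE users (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL
    );
    CREATE TABLE emails (
        id      INTEGER PRIMARY KEY,
        user_id INTEGER NOT NULL REFERENCES users(id),
        subject TEXT NOT NULL
    );
    INSERT INTO users (id, name) VALUES (1, 'Asha');
    INSERT INTO emails (user_id, subject)
    VALUES (1, 'Project update'), (1, 'Paper draft');
    """
)

# A JOIN reassembles the related rows at query time.
rows = conn.execute(
    """
    SELECT u.name, e.subject
    FROM users AS u
    JOIN emails AS e ON e.user_id = u.id
    ORDER BY e.id
    """
).fetchall()
print(rows)

# Document shape: the related data is nested inside a single record.
user_document = {
    "name": "Asha",
    "emails": [
        {"subject": "Project update"},
        {"subject": "Paper draft"},
    ],
}
document_json = json.dumps(user_document)
print(document_json)
```

In the relational version the link between a user and their emails lives in the `user_id` key, while in the document version it is expressed by nesting.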

5. How to run SQL queries

Structured Query Language (SQL) is another must-have skill, and it is pretty simple to learn. A more advanced step is to learn a procedural extension of SQL, such as PL/SQL (Procedural Language for SQL, used in Oracle) or PL/pgSQL in PostgreSQL. I will mention a good resource to learn from if I come across something useful.
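A few queries cover most day-to-day needs: filtering, sorting and aggregating. The sketch below runs them through Python's built-in `sqlite3` module so it works without installing a database server; the `experiments` table and its numbers are made up for illustration, and small syntax details may differ in PostgreSQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE experiments (model TEXT, dataset TEXT, accuracy REAL)"
)

# Placeholders (?) keep values separate from the SQL text.
conn.executemany(
    "INSERT INTO experiments (model, dataset, accuracy) VALUES (?, ?, ?)",
    [
        ("baseline", "emails", 0.50),
        ("baseline", "reviews", 0.60),
        ("logreg", "emails", 0.80),
        ("logreg", "reviews", 0.70),
    ],
)

# Filter with WHERE, sort with ORDER BY, keep only the top row with LIMIT.
best_on_emails = conn.execute(
    """
    SELECT model, accuracy
    FROM experiments
    WHERE dataset = ?
    ORDER BY accuracy DESC
    LIMIT 1
    """,
    ("emails",),
).fetchone()

# Aggregate with GROUP BY: one average accuracy per model.
average_per_model = conn.execute(
    """
    SELECT model, AVG(accuracy) AS mean_accuracy
    FROM experiments
    GROUP BY model
    ORDER BY model
    """
).fetchall()

print(best_on_emails)
print(average_per_model)
```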

6. Basic understanding of ML algorithms

There are many good courses and Quora posts covering this topic. My Quora answer:

I would recommend reading this answer on Quora titled “What is the best MOOC to get started in Machine Learning?”

I personally completed the edX course named “The Analytics Edge” by MIT. I would like to add two more courses that involve neural networks and Deep Learning, which I myself found to be very helpful.

Natural Language Processing : CS224n: Natural Language Processing with Deep Learning
Computer Vision : CS231n: Convolutional Neural Networks for Visual Recognition

Source: My answer on Quora

If you have more of a developer inclination, I sincerely hope these YouTube playlists by Jeremy Howard will make your day.

  1. Introduction to Machine Learning
  2. Deep Learning for Coders

If you find any of the above resources to be useful, please mention them in the Comments section. I would love to hear from you.

If there is any particular course you want me to review, do let me know. I hope, but cannot promise, to give some feedback within 2 weeks. As you know, Research Life 🙂


Hello everyone. I am Soumyadeep. I have been working on Machine Learning projects for the last 4 years. I am now pursuing a Ph.D. in the Computer Science Department at IIT Kharagpur. I completed my M.S. (Research) from the same department in November 2019. My research interests involve applying Machine Learning, NLP and Deep Learning to solve Online Reputation Monitoring and Consumer Health Search problems.
