How to build an end-to-end Data Science project from scratch

This article covers the five steps required for building an end-to-end data science, artificial intelligence, or machine learning project from scratch, followed by the technologies I found useful for developing such a system.

The steps are defining a problem statement, identifying the competing solutions, conducting a literature survey, identifying research gaps with a possible proof of concept, and finally, establishing the practical utility of the solution.

Welcome.

I hope you are now mentally prepared to start a long, end-to-end journey of building a data science or artificial intelligence (AI) application or project from scratch.

Setting aside the mental preparation, motivation, and strong determination that any successful project requires, and which I sincerely believe you possess, I would like to point out the technical know-how needed to get started.

You will find the codebases of several projects I have worked on in my GitHub profile.

How to build an end-to-end data science project from scratch

  1. What problem or issue will your solution or project address?
  2. What are the competing solutions, i.e., how have people addressed the issue so far (the state of the art), and what alternatives are available?
  3. A thorough literature survey or background research.
  4. Given the state of the art, what extra does your solution achieve, or what gap does it fill? (Your research contribution)
  5. The utility of the solution: does the problem exist, does it affect your target audience, and does the solution satisfy their current needs?

Developing end-to-end systems

Data Science Project pipeline – a rough overview

Regarding the technologies needed for developing an end-to-end system in Data Science, I found the following to be very useful.

1. Programming language, R or Python or both

If you are new to programming, I recommend learning only Python and setting R aside. As you progress into Deep Learning, you will also need to learn at least one of the Deep Learning frameworks such as PyTorch, Keras, TensorFlow, or Theano. For this too, Python is the best choice to invest in. I have covered how to get started with R and Python in separate articles. [Python tutorial, R tutorial]

2. Benchmark and clean datasets

You may have already come across the saying that 80% of the time in a data science project goes into collecting and cleaning data and performing feature engineering. I do not completely agree with this claim, but there is some truth to it.

To put it more fairly, I would say 80% of the time goes into formulating the problem, identifying the available data sources, and understanding what type of questions or patterns you are looking for. All of this requires a certain amount of literature review, which gives you some domain knowledge, and going through the data itself, which helps you understand and plan these steps.

Believe me, it’s not that hard. You just need patience and must know where to look.

However, that again comes only from experience. What a conundrum!

State-of-the-art systems use benchmark datasets for evaluation and for comparison against their baseline experiments. I have covered them in a separate blog article.
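To give a feel for what the cleaning and feature engineering mentioned above can look like, here is a minimal sketch using only the Python standard library. The column names, the sample values, and the `clean_rows` helper are all made up for illustration; a real project would more likely use a library such as pandas and a real dataset.

```python
import csv
import io

# Hypothetical raw data: a duplicate row, a missing value, and stray whitespace.
RAW_CSV = """patient_id,age,weight_kg,height_cm
1, 34 ,70.5,175
2,,82.0,180
1, 34 ,70.5,175
3,51,64.2,160
"""


def clean_rows(raw_text):
    """Return de-duplicated, fully populated rows with numeric types and a BMI feature."""
    cleaned = []
    seen_ids = set()
    for row in csv.DictReader(io.StringIO(raw_text)):
        # Strip whitespace around every value.
        row = {key: value.strip() for key, value in row.items()}
        # Drop rows with any missing field (imputation is another option).
        if any(value == "" for value in row.values()):
            continue
        # Keep only the first occurrence of each patient_id.
        if row["patient_id"] in seen_ids:
            continue
        seen_ids.add(row["patient_id"])
        weight = float(row["weight_kg"])
        height_m = float(row["height_cm"]) / 100
        cleaned.append({
            "patient_id": int(row["patient_id"]),
            "age": int(row["age"]),
            "weight_kg": weight,
            "height_cm": float(row["height_cm"]),
            # Feature engineering: body mass index from weight and height.
            "bmi": round(weight / height_m ** 2, 1),
        })
    return cleaned


rows = clean_rows(RAW_CSV)
for row in rows:
    print(row)
```

Dropping incomplete rows is only one possible choice here; depending on the problem, filling in missing values may be more appropriate.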

3. Front-end and back-end of your Web application

The Flask framework is easy to get started with. I recommend following Miguel Grinberg’s blog, which is in fact a small course that teaches you, in great detail and in a very practical and organized fashion, how to build a web application using the Flask framework.

I found it to be very helpful and developed my first research demo using it. I hope it will get published someday 😛
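As a rough idea of how small a Flask demo can be, the sketch below serves a bare HTML form (the front-end) and a JSON endpoint (the back-end). It assumes Flask is installed. The routes, the form field, and the `predict_sentiment` function are hypothetical placeholders standing in for a real trained model; this is not the research demo mentioned above.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)


def predict_sentiment(text):
    """Hypothetical placeholder for a trained model: looks for a few positive words."""
    positive_words = {"good", "great", "helpful"}
    score = sum(word in positive_words for word in text.lower().split())
    return "positive" if score > 0 else "neutral"


@app.route("/")
def index():
    # Front-end: a bare HTML form; a real demo would render a template.
    return (
        '<form action="/predict" method="post">'
        '<input name="text"><button type="submit">Predict</button>'
        "</form>"
    )


@app.route("/predict", methods=["POST"])
def predict():
    # Back-end: read the submitted text and return the prediction as JSON.
    text = request.form.get("text", "")
    return jsonify({"text": text, "label": predict_sentiment(text)})
```

The app can be started with Flask's development server (for example by calling `app.run()`), after which the form is available at the root URL.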

4. How a relational database works

Learn a relational database such as PostgreSQL and a NoSQL database such as MongoDB. These usually make up the back-end part of our pipeline.

5. How to run SQL queries

Structured Query Language (SQL) is another must-know. It is pretty simple to learn. A more advanced step is to learn a procedural extension such as PL/SQL, which stands for Procedural Language for SQL, or its PostgreSQL counterpart, PL/pgSQL. I will mention a good resource to learn from if I come across something useful.
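To make the idea of relational tables and SQL queries concrete, here is a small sketch run from Python. It uses the built-in `sqlite3` module with an in-memory database purely as a stand-in so that it runs without any setup; with PostgreSQL you would connect through a PostgreSQL driver instead, and the SQL may need small adjustments. The `patients` and `visits` tables and their contents are invented for illustration.

```python
import sqlite3

# In-memory SQLite database as a stand-in for PostgreSQL (assumption for portability).
conn = sqlite3.connect(":memory:")

# Two related tables: each visit points at a patient through a foreign key.
conn.execute("CREATE TABLE patients (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute(
    "CREATE TABLE visits ("
    "id INTEGER PRIMARY KEY, "
    "patient_id INTEGER NOT NULL REFERENCES patients(id), "
    "cost REAL NOT NULL)"
)

# Parameterized inserts keep values separate from the SQL text.
conn.executemany(
    "INSERT INTO patients (id, name) VALUES (?, ?)",
    [(1, "Asha"), (2, "Ben")],
)
conn.executemany(
    "INSERT INTO visits (patient_id, cost) VALUES (?, ?)",
    [(1, 40.0), (1, 60.0), (2, 25.0)],
)

# Join the tables and aggregate per patient.
query = """
    SELECT p.name, COUNT(v.id) AS n_visits, SUM(v.cost) AS total_cost
    FROM patients AS p
    JOIN visits AS v ON v.patient_id = p.id
    GROUP BY p.id, p.name
    ORDER BY total_cost DESC
"""
results = conn.execute(query).fetchall()
conn.close()

for name, n_visits, total_cost in results:
    print(name, n_visits, total_cost)
```

The join-and-aggregate pattern shown here (`JOIN`, `GROUP BY`, `ORDER BY`) covers a large share of the queries needed in a typical data science back-end.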

6. A basic understanding of ML algorithms

There are many good courses and Quora posts covering this topic. I recommend reading my answer on Quora to the question “What is the best MOOC to get started in Machine Learning?”

I personally completed the edX course named “The Analytics Edge” by MIT. I would like to add two more courses that involve neural networks and Deep Learning, which I myself found to be very helpful.

Natural Language Processing: CS224n: Natural Language Processing with Deep Learning
Computer Vision: CS231n: Convolutional Neural Networks for Visual Recognition
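As a small taste of the kind of algorithm that introductory machine learning material typically covers, below is a from-scratch sketch of k-nearest-neighbours classification in plain Python. The toy data points and labels are made up, and the `knn_predict` helper is my own illustration rather than something taken from any of the courses above.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Pair every training point's Euclidean distance to the query with its label.
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Toy, made-up data: two well-separated clusters in 2-D.
train_points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
train_labels = ["low", "low", "low", "high", "high", "high"]

# Held-out points used only for evaluation, never for training.
test_points = [(1.1, 1.0), (5.1, 5.0)]
test_labels = ["low", "high"]

predictions = [knn_predict(train_points, train_labels, p) for p in test_points]
accuracy = sum(p == t for p, t in zip(predictions, test_labels)) / len(test_labels)
print(predictions, accuracy)
```

Keeping the evaluation points separate from the training points is the same train/test discipline that larger projects follow, just at toy scale.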

If you have more of a developer inclination, I sincerely recommend the YouTube playlists by Jeremy Howard, which I am sure will make your day.

  1. Introduction to Machine Learning
  2. Deep Learning for Coders

If you find any of the above resources useful, I am sure you will also enjoy at least some of the 30+ FREE research-related articles available on the website.

So, happy exploring!


💚 30+ free articles already available at datanalytics101.com

💚 Your feedback is critical to improving the content, so please feel free to share your take on this topic

💚 Follow me on Twitter @roysoumya1 for getting updates on “AI in Healthcare”

💚 I plan to write one post a month on Medium. To get updates directly to your email, please subscribe at https://medium.com/subscribe/@soumyadeeproy

What is your take on this topic?