Introduction
As part of DataFest 2017, we launched a new initiative – DataHack Hour. DataHack Hour was inspired by the numerous queries we get about learning Data Science. Questions like “How to learn analytics?” or “How to become a data scientist?” are put to us multiple times every day.
While we had written several articles on this subject on Analytics Vidhya, we needed something more definitive to answer these queries. So we decided to create an experience that shows people how to learn Data Science. This version of DataHack Hour is our answer to the above questions and to many more that come to us.
DataHack Hour is completely free for the Analytics Vidhya community and is created with the aim of helping more and more people learn data science. This article describes the journey the participants of DataHack Hour are undergoing. If you are one of the people struggling to learn Data Science – join DataHack Hour today and become part of this awesome experience.
Table of Contents
- What is DataHack Hour?
- Progress till date
- Day 1: How to convert a business problem to analytics problem and hypothesis generation.
- Day 2: Getting started with Jupyter Notebook and writing your first program in Python.
- Day 3: Data exploration and visualization methods.
- Day 4: Detecting and treating outliers & missing values.
- Day 5: Build your first linear regression model.
- Next steps and upcoming sessions
- End Notes
What is DataHack Hour?
DataHack Hour is based on a very simple concept – “Daily small improvements or learning in small steps can make a huge difference over time.” It is the same principle which Jeff Olson describes in his book “The Slight Edge”. Let me explain this in a bit more detail.
Most of the queries which we receive on Analytics Vidhya about challenges in learning fall into one of the following categories:
- Not having access to a structured learning path and resources – this is a relatively simple problem to solve. We have learning paths on Analytics Vidhya for precisely this. There are also other courses and specialisations which you can look at. But for some reason, the problem still persists.
- Not having access to mentors and an ecosystem – This is comparatively more difficult to solve. A lot of people start their journey only to come across a problem. They do not have access to mentors or people who have taken that journey before. So, if a newbie gets stuck installing conflicting Python libraries, solving the problem and moving forward might overwhelm him / her.
- Not being able to learn things on a daily basis – This is probably the most difficult of all the problems to solve. You might have work or family responsibilities which make it difficult to set aside dedicated time for learning on a daily basis.
- Lack of motivation – this is hard to solve for!
We believe that DataHack Hour is the solution to the first 3 problems mentioned here. We believe that going over one chapter a day, with help from volunteers and mentors from the community, is the most powerful way to learn Data Science. You learn by solving hands-on problems, the content has been curated by the Analytics Vidhya team and there are mentors to help you out on a daily basis. Honestly, I can’t think of a better way to learn!
Launch of DataHack Hour
We launched DataHack Hour on 16th April 2017 as part of DataFest 2017. We got an outstanding response from the community members and from people who really want to learn the subject.
We came across various users like EspyM, who said he does not have access to such resources in his country. Within 5 days, we have seen him devote time to building his first model and submitting a solution on the DataHack platform! I am pretty sure that by the end of this DataHack Hour, we will have multiple people like EspyM who will enable learning in their own communities later on.
In order to raise awareness about DataHack Hour further, we are releasing the content of the first 5 days on our blog. The idea is to put the content out to a wider audience and invite people who have missed out on 5 awesome days. You can still join today by learning the content below. You can register for DataHack here.
If you registered for DataHack Hour and missed a particular day, you can go through the content below and get back on track.
Day 1 – Webinar
We kicked off DataHack Hour with this awesome session by Tavish Srivastava. The agenda of the webinar was “How to convert a business problem to analytics problem? and Importance of hypothesis generation”. This is the best place to start your journey of learning analytics. It also touches on a point that gets ignored in a lot of tool-focussed courses today.
Here is the webinar recording from the session:
Hopefully you are all geared up for the hands-on exercises to come!
Day 2 – Jupyter Notebook and Python Scripts
From Day 2 onwards we started our 1-hour challenges. The agenda for day 2 included the following:
- Why learn Python for data analysis?
- Python Data Structures
- Conditional and iterative statements
- Loading data
- Understand pandas dataframes
Let us cover them one by one. You can download all Day 2 resources here after signing up and logging in. By the end of the day you would have installed Anaconda, become comfortable with the Jupyter Notebook interface and written a few simple programs in Python and pandas. We also cover different data structures in Python, iterative and conditional statements and ways to load and access data.
Our mentor for the day was none other than me.
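To give a flavour of the day's topics, here is a minimal sketch that touches a data structure (a dictionary), a loop, a conditional statement and loading data. It is not the session's own material: the sample data, the column names and the helper functions `load_rows` and `age_group` are made up for illustration, and it sticks to Python's standard library so that it runs without any extra installs. In the session itself you would typically do the loading step with a pandas dataframe.

```python
import csv
import io

# Hypothetical sample data; in practice this would be a CSV file on disk.
RAW_CSV = """name,age,city
Asha,31,Pune
Ravi,24,Delhi
Meera,,Mumbai
"""


def load_rows(text):
    """Loading data: parse CSV text into a list of dictionaries, one per row."""
    return list(csv.DictReader(io.StringIO(text)))


def age_group(age_text):
    """Conditional statement: turn an age string into a simple label."""
    if age_text == "":
        return "unknown"
    elif int(age_text) >= 30:
        return "30 and over"
    else:
        return "under 30"


rows = load_rows(RAW_CSV)

# Iterative statement: loop over the rows and fill a dictionary.
groups = {}
for row in rows:
    groups[row["name"]] = age_group(row["age"])

print(groups)
```

Note that the empty age for the third row is kept as an empty string rather than silently dropped, which is a first glimpse of the missing value problem.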
Day 3 – Data Exploration and Visualization
For a beginner who has gone through the first 2 days, this was probably the most important session. On day 3, we got our hands dirty by doing data exploration and visualization. The topics covered in this session included the following:
- Variable Identification
- Univariate Analysis
- Bi-variate Analysis
People started working on a dataset which included various types of variables – continuous, nominal and ordinal. They plotted distributions and became equipped with the tools to uncover the hidden insights in data. By the end of the day, people had plotted box plots and histograms and were comfortable looking for correlations.
Mentor for the Day 3 session was our community member – Ziron
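As a rough illustration of univariate and bi-variate analysis, the sketch below summarises single variables and then measures the relationship between two continuous ones. The toy values, the variable names and the `pearson` helper are assumptions made for this example, not the dataset used in the session, and the code uses only the standard library instead of a plotting package.

```python
import statistics
from collections import Counter

# Hypothetical toy dataset (not the session's data).
area = [50, 60, 80, 100, 120]          # continuous
price = [100, 120, 160, 200, 240]      # continuous
city_tier = ["A", "B", "A", "C", "A"]  # nominal


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


# Univariate analysis: centre and spread for a continuous variable,
# frequency counts for a nominal one.
print("mean area:", statistics.mean(area))
print("median area:", statistics.median(area))
print("tier counts:", Counter(city_tier))

# Bi-variate analysis: how strongly do two continuous variables move together?
print("correlation(area, price):", pearson(area, price))
```

In this made-up data the price is exactly twice the area, so the correlation comes out at (numerically) 1. Real data would rarely be this clean.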
Day 4 – Missing Values & Outliers
This session focussed on some of the practical challenges people face while doing exploratory analysis. Irrespective of how good your data is, you will come across missing values and outliers. This session aimed to help people deal with missing values and outliers in the data. Again, you can access the content here after registering for DataHack Hour and logging in.
Topics covered in missing values
- What is a missing value?
- But why do missing values occur?
- Impact of missing values
- How to detect missing values?
- Which are the methods to treat missing values?
Outlier detection
- What is an Outlier?
- Why do outliers occur?
- Impact of Outliers
- How to detect Outliers?
- How to deal with Outliers?
By now, participants would have understood missing values, imputed them and handled the outliers in the data. Let us progress to building our first predictive model.
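The two ideas from this session can be sketched in a few lines. This is only one common choice for each, not necessarily the method taught in the session: missing values (represented here as `None`) are filled with the median of the observed values, and outliers are flagged with the 1.5 × IQR rule. The sample `incomes` list and the helper names `impute_median` and `iqr_outliers` are hypothetical.

```python
import statistics


def impute_median(values):
    """Replace missing entries (None) with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = statistics.median(observed)
    return [fill if v is None else v for v in values]


def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical sample with two missing values and one extreme value.
incomes = [30, 32, None, 35, 31, 33, 400, None, 34]

filled = impute_median(incomes)
print("after imputation:", filled)
print("flagged outliers:", iqr_outliers(filled))
```

The median is used for imputation here because, unlike the mean, it is barely affected by the extreme value of 400. What to do with a flagged outlier (drop it, cap it or keep it) depends on why it occurred.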
Mentor for the day was our volunteer Aman Kapoor
Day 5 – Building a Linear Regression model
On day 5, people start building simple predictive models. The session starts by explaining what a predictive model is and enables you to build a simple and a multivariate regression model by the end of the session. You can download the resources here.
- What is Predictive Modeling?
- Building the first model
- How to find the best regression line?
- Performance Evaluation Metrics in Regression
- Multivariate Regression
- Hands-on practice problem
- A few points of caution when applying Linear Regression
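For readers who want to see what finding the best regression line and evaluating it look like in code, here is a minimal sketch of simple (single predictor) linear regression fitted by ordinary least squares, with RMSE and R-squared as evaluation metrics. The data points and the function names `fit_line`, `rmse` and `r_squared` are made up for this example. In practice you would usually reach for a library, and the multivariate case is not covered here.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one predictor: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def rmse(actual, predicted):
    """Root mean squared error: typical size of a prediction error."""
    n = len(actual)
    return (sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n) ** 0.5


def r_squared(actual, predicted):
    """Share of the variance in the target explained by the model."""
    mean_a = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_a) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Hypothetical data scattered around the line y = 2x + 1.
xs = [1, 2, 3, 4, 5]
ys = [3.1, 4.9, 7.2, 8.8, 11.0]

slope, intercept = fit_line(xs, ys)
predictions = [slope * x + intercept for x in xs]

print("slope:", slope, "intercept:", intercept)
print("RMSE:", rmse(ys, predictions))
print("R-squared:", r_squared(ys, predictions))
```

Both metrics here are computed on the same points used for fitting, so they say nothing about how the line would perform on unseen data.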
Mentor for the day was our volunteer – Madhur Modi
Next Steps:
Here is the agenda for the coming days. If you think you got stuck in learning analytics and data science in the past, come and join us in these DataHack Hour sessions. By the end of this DataHack Hour, you will be able to work on data science problems independently and will have interacted with 10+ mentors and a few hundred peers. All of this is available for free and can be absorbed as long as you are motivated!
Day 6: Feature engineering and transformation to improve model performance
Day 7: Validating and measuring model performance
Day 8: Building logistic regression model
Day 9: Building naive bayes model
Day 10: Building a decision tree model
Day 11: Building a k-NN model
Day 12: Ensembles, methods to combine model outcomes
Day 13: Apply your learnings in a 6-hour hackathon
See you around in DataHack Hour.
End Notes:
Thanks a lot to our volunteers and mentors, who are helping the community learn day by day. We are truly on our way to creating thousands of real data scientists!