Creating a repository for Data Science 101 Workshop

Stopwolf/Data-Science-101

Data Science 101

This repository contains all materials shown and used during the S2S event held in Belgrade at the Faculty of Organizational Sciences.
This workshop introduces participants to the world of data and presents the foundations needed for further development in this industry. The topics covered in this repository and workshop are:

  1. Intro to Python
  2. Meet Pandas
  3. Visualizations
  4. Machine Learning
  5. Advanced Practices
  6. Ensemble Algorithms
  7. Clustering

Intro to Python

The first lecture in the series is about learning Python. There are a lot of arguments about whether you should pick R or Python, but in my opinion, Python is more versatile and can do much more by itself. Another great thing about Python is that it has an amazingly large community which develops new libraries almost every day!
R shouldn't be neglected though. It's still used in the industry, but you should know that its capabilities are mostly statistics based. So if you need to perform a specific statistical task, maybe R is the way to go.
I should emphasize that this lecture is absolutely introductory. It doesn't cover a big part of Python as a language, such as creating classes, object-oriented programming and more advanced practices. It covers only the strict basics needed for this workshop. It is recommended that the participants of the workshop learn the rest on their own.

Meet Pandas

In this lecture, participants will learn how to use Pandas, the most widely used library for loading a dataset and performing initial analysis and preprocessing. This is arguably the most important lecture in the whole workshop because participants will use these skills the most when they get a job in data science sometime in the future (yes, when, not if. I'm confident they'll achieve their dreams if they work hard enough after the workshop :)).
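As a rough illustration of loading, initial analysis and preprocessing, here is a minimal sketch. The customer data, the column names and the `load_and_clean` helper are all hypothetical and are not taken from the workshop notebooks; filling gaps with the median is just one common choice.

```python
import io

import pandas as pd

# Hypothetical customer data, inlined so the example runs without any files.
CSV_TEXT = """customer_id,age,plan,monthly_spend
1,34,basic,20.0
2,,premium,55.5
3,45,basic,
4,29,premium,60.0
"""


def load_and_clean(csv_text: str) -> pd.DataFrame:
    """Load CSV text and fill missing numeric values with the column median."""
    df = pd.read_csv(io.StringIO(csv_text))
    for col in ["age", "monthly_spend"]:
        df[col] = df[col].fillna(df[col].median())
    return df


df = load_and_clean(CSV_TEXT)

# Initial analysis: count remaining missing values and compare plans.
missing_values = int(df.isna().sum().sum())
summary = df.groupby("plan")["monthly_spend"].mean()

print(df.head())
print("Missing values after cleaning:", missing_values)
print(summary)
```

With a real dataset you would usually pass a file path to `pd.read_csv` instead of an in-memory string.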

Visualizations

This is the part of Data Science that newcomers underrate the most. In the real world, your managers and non-technical colleagues need to understand your findings. They're not interested in formulas or in which Machine Learning algorithm produced the best results. What they need to know is how they can use the information you found and how your findings can impact the whole company. In the end, it's their job to actually make decisions. You need to be assertive. And what better way to do that than showing them pretty visuals in your presentation that they can all easily understand?

Machine Learning

Machine Learning was traditionally an academic discipline. It was one of the fields where math and statistics could be applied. But, when combined with Data Science, ML came to be used to solve real-world problems like churn prediction, market segmentation, weather prediction and so on. ML gives a system the ability to learn without explicit instructions from the programmer. As you'll see, ML algorithms learn rules from the available data in order to predict the final outcome.
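To make "learning rules from data" concrete, here is a minimal sketch of about the simplest possible learner: it searches for the single threshold that best separates two classes. The churn data and the `fit_stump` helper are hypothetical and only meant as an illustration, not as the algorithm used in the workshop.

```python
def fit_stump(xs, ys):
    """Find the threshold t that maximizes accuracy of the rule: predict 1 if x > t."""
    best_t, best_acc = None, -1.0
    sorted_xs = sorted(set(xs))
    # Candidate thresholds are the midpoints between neighbouring distinct values.
    candidates = [(a + b) / 2 for a, b in zip(sorted_xs, sorted_xs[1:])]
    for t in candidates:
        correct = sum((x > t) == bool(y) for x, y in zip(xs, ys))
        acc = correct / len(xs)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t, best_acc


# Hypothetical data: months of inactivity and whether the customer churned.
inactive_months = [0, 1, 1, 2, 5, 6, 7, 9]
churned = [0, 0, 0, 0, 1, 1, 1, 1]

threshold, accuracy = fit_stump(inactive_months, churned)
print("Learned rule: churn if inactive_months >", threshold)
print("Training accuracy:", accuracy)

# The learned rule can now be applied to a customer the model has not seen.
new_customer_churns = 4 > threshold
```

Nobody wrote the rule "more than 3.5 months means churn" by hand; it was derived from the examples.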

Advanced Practices

This section was created so that participants can learn some slightly more advanced practices for handling data. These include handling categorical features (one-hot and label encoding), parameter tuning (grid search), validation techniques (cross-validation), classification thresholds and finally the ROC curve and its AUC.
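A minimal sketch of three of these ideas follows: label and one-hot encoding with pandas, the effect of a classification threshold, and ROC AUC computed by hand. The data, the `apply_threshold` and `roc_auc` helpers, and the choice of pandas for encoding are assumptions made for illustration; in practice a library such as scikit-learn would normally handle the metric, grid search and cross-validation.

```python
import pandas as pd

# Hypothetical categorical feature.
df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# Label encoding: each category becomes an integer (alphabetical order here).
label_encoded = df["color"].astype("category").cat.codes.tolist()

# One-hot encoding: one indicator column per category.
one_hot = pd.get_dummies(df["color"], prefix="color")


def apply_threshold(scores, threshold):
    """Turn predicted probabilities into 0/1 labels."""
    return [1 if s >= threshold else 0 for s in scores]


def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical true labels and model scores.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

strict = apply_threshold(scores, 0.5)   # fewer positives predicted
lenient = apply_threshold(scores, 0.3)  # more positives predicted
auc = roc_auc(y_true, scores)

print(label_encoded)
print(one_hot)
print(strict, lenient, auc)
```

Note that the AUC does not depend on any single threshold, which is why it is useful for comparing models before a threshold is chosen.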

Ensemble Algorithms

When one model isn't enough, why not use many? Ensemble algorithms provide different ways to use multiple models to make final predictions. The term ensemble comes from the entertainment world and means a group of musicians, actors, or dancers who perform together.
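One simple way to combine models is majority ("hard") voting, sketched below. The three prediction lists are made-up outputs of hypothetical models, and voting is only one of several ensemble strategies; this is not necessarily the one the lecture focuses on.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-model prediction lists by taking the most common label per sample."""
    combined = []
    for votes in zip(*predictions):
        combined.append(Counter(votes).most_common(1)[0][0])
    return combined


# Hypothetical predictions of three models on the same four samples.
model_a = [1, 0, 1, 1]
model_b = [1, 1, 0, 1]
model_c = [0, 0, 1, 1]

ensemble = majority_vote([model_a, model_b, model_c])
print(ensemble)
```

Each model makes a different mistake, but the vote can still come out right as long as most models agree on the correct label.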

Clustering

This term is widely used in the industry, and almost everybody knows what it means (even if you don't, don't worry, I'll still explain :)). Clustering is the process of grouping similar instances of your dataset. More often than not, clustering is used to split the market into reasonable segments of clients, for example in order to create better marketing campaigns. We'll work through the k-means and hierarchical clustering algorithms, and we'll briefly mention t-SNE.
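The core loop of k-means can be sketched in a few lines of plain Python (3.8 or newer, for `math.dist`). The points, the starting centroids and the `kmeans` helper are hypothetical; a real project would more likely use a library implementation with smarter initialization.

```python
import math


def kmeans(points, centroids, n_iter=10):
    """Basic k-means: alternate between assigning points and moving centroids."""
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for c in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                new_centroids.append(
                    tuple(sum(coord) / len(members) for coord in zip(*members))
                )
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster in place
        if new_centroids == centroids:
            break  # converged: nothing moved
        centroids = new_centroids
    return labels, centroids


# Hypothetical 2D data with two visible groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
initial_centroids = [points[0], points[3]]

labels, centroids = kmeans(points, initial_centroids)
print(labels)
print(centroids)
```

The result depends on the starting centroids, which is why library implementations typically run the algorithm several times from different starting points.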

What now?

If you have finished this course, I suggest reading up on some topics we didn't cover, like linear regression, dimensionality reduction and support vector machines. Of course, try to code and practice these newly learned skills as much as you can. You can even try out your skills in some Kaggle competitions. Thank you and good luck!
