- Atabey Kaygun ([email protected])
- Lectures: Mondays 14:30-17:30 (OLS3)
Data science is a broad interdisciplinary field. It lies at the intersection of mathematics, statistics, machine learning, and computer science, and uses their methods and tools to extract information and insight from data. This is a course on the mathematical foundations of standard statistical and machine learning models used in the field. The class aims to teach students majoring in fundamental sciences to effectively use and deploy these algorithms in applications.
- T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning. (Available on the web)
- C.M. Bishop. Pattern Recognition and Machine Learning.
- E. Alpaydin. Machine Learning.
The books listed above are mostly theoretical. For the computational homeworks you may need the following:
- M. Kirk, Thoughtful Machine Learning with Python.
- C. O'Neil, R. Schutt, Doing Data Science.
- J. VanderPlas, Python Data Science Handbook.
Also, there are excellent resources on the web. I would recommend:
- edX
- MIT-X
- Kaggle Courses on Python, Pandas, Visualization, Data cleaning, and GIS Data.
Enroll in any of the data, machine learning, statistics, R, or Python classes that catch your fancy or that you think might be useful for you.
- UCI datasets
- Google dataset explorer
- Registry of open datasets on AWS
- Open MRI, MEG, EEG, iEEG, and ECoG data
- NCBI datasets
- Open GIS data
- NASDAQ data
The course is an applied data analysis class. This means it requires a degree of proficiency with the following computational tools, and acquiring that proficiency is your responsibility:
- git and GitHub
- Python programming language (version 3.10 or higher)
- Anaconda or Pip package managers
- Jupyter notebook system
- Markdown markup language
Installing and maintaining these systems on your machine is your responsibility. I can't help you if something doesn't work. You will need to figure it out on your own. If you can't install these systems on your machine you may try to use an online service:
I will make all course-related announcements on İTÜ's course management system NINOVA. I will post the grades on NINOVA as well. So, do check it regularly.
I receive approximately 50 e-mails per day. So, if you need to contact me, use the subject "MAT388E" in your e-mails. Spend some time structuring your e-mail with grammatically correct sentences in Turkish or in English. Be polite, direct, and concise. State what you need in the first two sentences. Sign your e-mails with your name and student number. If I can't figure out who you are and what you need within 30 seconds of opening your message, I will delete your e-mail with no response. You are hereby warned.
Your performance is going to be judged via 4 homework assignments posted on the course GitHub page and one final project that you need to write from scratch. Each homework is worth 15 points, and the final project is worth 40 points. Your total assessment for the course will be evaluated as follows:
If you receive 0 on any 2 of the homeworks (missing homeworks are graded as 0), or if your homework total is less than 35%, you'll get a VF. If your final is less than 25%, or your total is less than 35%, you'll receive an F. Note that the conditions for receiving a VF are both necessary and sufficient, while the conditions for receiving an F are only sufficient. This means you may still get an F with a score higher than 35%, depending on the distribution of the scores.
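The VF and F rules above can be sketched as a small Python check. This is only an illustration under stated assumptions, not the official grading procedure: the function name `grade_flag` is hypothetical, "any 2 of the homeworks" is read as "at least two", the percentages are taken relative to the maximum homework total (60 points), the maximum final score (40 points), and the maximum course total (100 points), and the VF rule is checked before the F rule.

```python
from typing import Sequence

HOMEWORK_MAX = 15  # points per homework, as stated in the syllabus
FINAL_MAX = 40     # points for the final project, as stated in the syllabus


def grade_flag(homeworks: Sequence[int], final: int) -> str:
    """Return "VF", "F", or "undetermined" for one student (hypothetical helper)."""
    if len(homeworks) != 4:
        raise ValueError("expected exactly 4 homework scores")

    hw_total = sum(homeworks)
    hw_max = HOMEWORK_MAX * len(homeworks)  # 60 points
    zeros = sum(1 for score in homeworks if score == 0)

    # VF: at least two zeros (assumed reading of "any 2"),
    # or a homework total below 35% of the 60 homework points.
    if zeros >= 2 or hw_total * 100 < 35 * hw_max:
        return "VF"

    total = hw_total + final
    # F: final below 25% of 40 points, or course total below 35% of 100 points.
    if final * 100 < 25 * FINAL_MAX or total * 100 < 35 * (hw_max + FINAL_MAX):
        return "F"

    # The F conditions are only sufficient, so a pass cannot be concluded here.
    return "undetermined"


print(grade_flag([15, 15, 0, 0], 40))    # two zeros
print(grade_flag([10, 10, 10, 10], 5))   # low final
print(grade_flag([10, 10, 10, 10], 30))  # neither rule triggers
```

The function deliberately returns "undetermined" rather than "pass", because the syllabus says a student may still receive an F above the 35% threshold.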
Assessment | Deadline |
---|---|
GitHub link | Sep 26 |
Homework 1 | Oct 10 |
Homework 2 | Oct 31 |
Final Project Proposal | Nov 14 |
Homework 3 | Nov 21 |
Homework 4 | Dec 5 |
Final Project | Dec 30 |
There is no make-up for the homeworks. If you miss a homework deadline because of an emergency, contact me as soon as you can to make an arrangement.
I will take written attendance in each lecture. I will use the attendance records to push borderline grades up or down.
For the homeworks, you are going to need to open a GitHub account and create a repository for this class. I am going to pull your homeworks and final project from your GitHub repositories at 11:59PM on each deadline date. You must open a private GitHub repository and share it with my hotmail address: [email protected]. Then send to my itu address ([email protected]) your name, student number, and your private GitHub repository link. Your deadline is September 26, 11:59PM. If you do not follow these instructions, I will deduct up to 15 points from your final grade.
I am going to post the homework assignments on the course GitHub page. You'll need to fill in the answers and post them on your own GitHub account by the deadline.
The final project is worth 40 points and will be evaluated on your final project notebook. You may work in a team of no more than 3 students. You must open a separate repository with your team and submit the link via e-mail with the subject "MAT388E Final Project Link" by November 14th. In that proposal git repository, put a Jupyter notebook with:
- The title of the project
- The list of team members (names and student numbers)
- Project summary
The project summary must contain a description of the data set you are going to work with, what you want to do with it, and a clear plan for how you are going to accomplish your goals. I will grade your proposals (15 points) and might make adjustments to your data set, your hypothesis, and your approach.
At the end of the semester when you submit your final project, I also want a short description of who did what for the final project as a supplement.
By regulation I must give a final exam. But in the exam I will only ask you to explain your final project.
Passing someone else's code or text as your own is cheating, or worse yet, theft. Copying code with variable names changed is another lazy form of cheating. Cheaters will receive 0 and be reported to the university. In short, don't do it.
The following is a tentative schedule of topics I am going to cover. I may go faster or slower depending on the week. I may even add new subjects, or even drop subjects depending on requests and participation.
Week | Subject |
---|---|
Sep 19 | Data Science, Machine Learning, Statistics, Computer Science: Similarities and Differences. |
Sep 26 | Deadline for GitHub link submission. Crash Course in Python and its Library Ecosystem. |
Oct 3 | Data types, data APIs, popular data sources, and how to use them. Post HW1. |
Oct 10 | Deadline for HW1. Supervised and unsupervised learning. Cross-validation. Clustering vs. classification. k-means clustering. k-nearest neighbor classification. |
Oct 17 | Regression: OLS, regularization, lasso, elastic net. |
Oct 24 | Logistic regression. Decision tree regression. Post HW2. |
Oct 31 | Deadline for HW2. |
Nov 14 | Hierarchical clustering. Density-based clustering. Deadline for final project proposals. |
Nov 21 | Entropy and Gini. Decision trees. Random forests. Post HW3. |
Nov 28 | Deadline for HW3. Support Vector Machines. |
Dec 5 | Dimensionality reduction. PCA, kernel PCA, LDA, NNMD. Dimensionality reduction applications for image and natural language processing. Post HW4. |
Dec 5 | Deadline for HW4. Newton-Raphson. Gradient Descent. Perceptron. Neural Networks. |
Dec 19 | A taxonomy of neural networks. Applications. |
Dec 29 | Autoencoders. |