Information processing has always been at the heart of Computer Science. Traditionally, professionals in the field understood the data, and that allowed them to design algorithms that made the data useful in numerous applications. However, the exponential growth in the amount of information that is generated continuously in the world makes it difficult, or outright impossible, to use the traditional approach. The problem lies in the inability to comprehend the data and, as a consequence, the inability to design algorithms for processing the information carried by the data. Data Mining is the study of techniques and approaches that make it possible to gain an understanding of otherwise unknown data. At its most basic level, it encompasses processing and visualization techniques and tools for data exploration, but at its core it attempts to provide mechanisms for the automated generation of algorithms that can be used to build predictive models based on the recognition of (often hidden) patterns in the data.
Data mining can be described as learning from data, and it quite often goes by another term, “machine learning”. While the term “data mining” is often used to emphasize the focus on data, “machine learning” usually refers more specifically to the theory and practice of extracting models from available sample data and using these models to project hypotheses about the complete data set.
In this course, we will focus on data mining techniques that have their roots in the mathematical field of statistics. The course will explore the theory, but will put stress on its practical applications. In the theoretical part, students will study the concepts underlying the fundamental issues in data mining. In the practical part, students will work on hands-on projects in which they will explore sample data using the principles and techniques learned in the theoretical part.
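The idea of extracting a model from sample data and using it to make statements about the complete data set can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the synthetic "complete" data set (a noisy line), the sample size of 50, and the helper names `fit_line` and `mean_squared_error`. It uses only the Python standard library rather than the tools adopted later in the course.

```python
import random

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

def mean_squared_error(model, xs, ys):
    """Average squared prediction error of a fitted line on (xs, ys)."""
    intercept, slope = model
    return sum((intercept + slope * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Hypothetical "complete" data set: y = 2x + 1 plus Gaussian noise.
rng = random.Random(0)
population_x = [i / 10 for i in range(1000)]
population_y = [2 * x + 1 + rng.gauss(0, 1) for x in population_x]

# Only a small random sample is assumed to be available for learning.
sample_idx = rng.sample(range(len(population_x)), 50)
sample_x = [population_x[i] for i in sample_idx]
sample_y = [population_y[i] for i in sample_idx]

# Extract a model from the sample, then check it against the complete data.
model = fit_line(sample_x, sample_y)
population_mse = mean_squared_error(model, population_x, population_y)
print(f"intercept={model[0]:.2f}, slope={model[1]:.2f}")
print(f"MSE on the complete data set: {population_mse:.2f}")
```

In practice the complete data set is not available for checking, which is why estimating how well a model generalizes is a topic in its own right.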
- Data Mining Pipeline
- Python and NumPy Tutorial
- Overview of Statistical Learning
- Matplotlib and Pandas Tutorial
- Linear Regression
- Logistic Regression
- Linear Discriminant Analysis
- k-Nearest Neighbors
- Resampling Methods
- The Bootstrap
- Linear Model Selection
- Ridge Regression
- Non-Linear Methods
- Polynomial Regression
- Splines and Generalized Additive Models
- Decision Trees
- Support Vector Machines
- Principal Component Analysis
- k-Means Clustering
After completing the course, students will be able to:
- organize and express ideas concerning the foundations of data mining clearly and convincingly in oral and written forms,
- recognize applicability of machine learning principles to real-world data mining problems,
- identify appropriate analytical techniques to solve specific real-life problems,
- express real-life problems in terms suitable for data mining,
- implement data mining applications that analyze data from real-life problems,
- evaluate and compare various approaches to data mining,
- further their knowledge of the field by applying the foundations studied in this course, and
- gain practical knowledge of several Python-based data mining tools.
Instructor: AJ Bieszczad
Office: Sierra Hall 3315
Phone: (805) 437-2773
Title: An Introduction to Statistical Learning, 1st ed. 2013 (6th printing 2016)
Authors: Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani
Free PDF: http://www-bcf.usc.edu/~gareth/ISL/ISLR%20Sixth%20Printing.pdf
Title: The Elements of Statistical Learning, 2nd ed. (10th printing 2013)
Authors: Trevor Hastie, Robert Tibshirani, and Jerome Friedman
Free PDF: http://statweb.stanford.edu/~tibs/ElemStatLearn/printings/ESLII_print10.pdf
The instructors will respond to your inquiry within 24 hours, except on weekends (from 5 pm on Friday to 8 am on Monday) and holidays. If we do not reply within this timeframe, please assume that we did not receive your email and contact us again.
Please manage your communication diligently.
There will be three components of the course: theory, practice, and research.
The theoretical part will utilize a series of video lectures created by the authors of the textbook. Each class meeting will start with a lecture introducing a new topic. That will be followed by a discussion of the concepts with the purpose of clarifying the presented material.
The practical part of the class will involve the presentation of applications of the theory that will attempt to follow the labs from the textbook. However, instead of R, which the authors of the textbook prefer, Python-based tools (Jupyter Notebook, NumPy, Matplotlib, Pandas, and Scikit-learn) will be used. At the end of the presentation, an exercise will be introduced on which students will work individually in the time remaining in the class and at home. A submission with solutions to the assigned tasks will be due by the start of the following class.
In the research part of the course, students will work on applying the learned principles and techniques to solve a larger data mining problem of their choice. Determining the problem to work on will be part of the research. The final report will be submitted in the form of an extended conference paper that must utilize Jupyter Notebook and the other Python-based tools studied in the course. The paper will have to present the problem, propose a solution, describe the experiments used to solve the problem, and end with comprehensive conclusions.
The final grade in the course will be based on the evaluation of the lab assignments and the quality of the research project, its presentation, and the supporting material (code, auxiliary files, etc.).
There will be no exams in this course.
Lab assignments: 75%
Research Project: 25%
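As a rough illustration of how the two weights combine, the following sketch computes a final score as a weighted average. The component scores and the 0-100 scale are hypothetical, and the syllabus does not state how individual lab scores are aggregated, so this is only one plausible reading of the breakdown above.

```python
# Weights taken from the grading breakdown above.
WEIGHTS = {"lab_assignments": 0.75, "research_project": 0.25}

def final_grade(scores):
    """Weighted average of component scores, each assumed to be on a 0-100 scale."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)

# Hypothetical component scores, for illustration only.
example = {"lab_assignments": 88.0, "research_project": 92.0}
print(final_grade(example))  # 0.75 * 88 + 0.25 * 92 = 89.0
```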
Class attendance is mandatory. A student who misses a class session must submit formal documentation showing that the student was unable to attend the class. Any undocumented absence from a lecture or a lab will result in zero points for the corresponding test or lab assignment.
If a student is justifiably absent from a class session, it is the student’s responsibility to study the material presented in the class on her or his own and to check on any announcements made while the student was absent. Should a student justifiably miss a lab session, the student must complete the lab assignments on her or his own and submit them by the due date.
Restrictions on Lab Activities
- The use of the lab resources including time is restricted to the activities directly related to the lab. Any other use is not allowed.
- No solid or liquid food is allowed in the lab. If you need to drink or eat, please go outside. Closed water bottles are allowed.
- This is an interactive educational lab, so activities and equipment that disrupt interaction and learning are not allowed. Examples include headsets, goggles of any kind, phones, consoles, watching videos, listening to music, playing games, and holding distracting conversations.
Violators of these restrictions will be asked to leave the classroom. Any contention will be reported to Judicial Affairs.
The university, the course, the labs and the instructors are here for the students so they can acquire sufficient knowledge to open a window of opportunity for them in their future careers. Any academic dishonesty limits the student’s chances to succeed.
Please consult the Academic Catalog (http://www.csuci.edu/academics/scheduleandcatalog.htm/) for details on CSUCI’s academic code of honor.
Students with Disabilities
Students needing special accommodations must make formal requests to the CSUCI Disability Accommodation Services (http://www.csuci.edu/disability/index.htm/). No informal requests for any accommodation will be granted.
Subject to Change
This syllabus and schedule are subject to change in the event of extenuating circumstances.