PSTAT 5A: Lecture 00

Introduction to Data Science

Mallory Wang

2024-06-24

Welcome!

Course Staff

  • Instructor:
    • Mallory (she/her)
    • mallorywang@ucsb.edu
    • T 9:30 - 10:30a (zoom) and 4 - 5p (zoom)

  • Teaching Assistants:
    • Daniel Silva
      • dcsilva@ucsb.edu
      • OH: TBD
    • Hezhong Zhang
      • hzhang586@umail.ucsb.edu
      • OH: TBD

Course Resources

  • Canvas: for grades
  • Gradescope: for homework, quizzes, and labs
  • Course Website: https://pstat5a.github.io/Pages/syllabus.html
    • All relevant course material will be posted to the website!
    • One exception: quizzes, which will be administered through Gradescope
  • Please read the syllabus fully and carefully!

Any Questions about the syllabus?

What is Data Science

What is Data Science?

  • Not a bad definition from ChatGPT! To be fair, data science means different things to different people and companies.

  • There isn’t a single agreed-upon definition of what data science is.

  • Most people agree that Data Science is cross-disciplinary, drawing on experience and expertise from a wide variety of fields.

    • Perhaps the two main fields from which Data Science draws are Statistics and Computer Science.
  • As ChatGPT suggested, computation is an integral part of Data Science.

    • As we will soon see, the datasets being analyzed these days are huge, certainly too large to work through with pen and paper.
  • However, Data Science is not just running things through computer programs.
  • An equally integral part of Data Science is the theory that surrounds data, modeling, and randomness, which comes from the field of Statistics.

  • Even if you are planning on going into industry right after university, you will still need to know some of the theory.

The Path Forward

  • So, how does this course factor into the discourse surrounding Data Science?

  • From the course description:

Introduction to data science. Concepts of statistical thinking. Topics include random variables, sampling distributions, hypothesis testing, correlation and regression. Visualizing, analyzing and interpreting real world data using Python. Computing labs required.

  • Indeed, this course will serve as a sort of “table of contents” of Data Science, touching on many (but still not all) of the wonderful subfields and subtopics that comprise the field.
  • We will start with Descriptive Statistics, a branch of statistics concerned with describing and summarizing data.
  • We will then devote some time to talking about Probability, which is in many ways the theory behind randomness and uncertainty.

  • Next, we will use Inferential Statistics to discuss how we can use data to draw conclusions (i.e. inferences) about the world around us.

    • This will include both Confidence Intervals as well as Hypothesis Testing.
  • Then, we will discuss a topic known as Regression, which will be our first (and only, for this class) foray into statistical modeling.

  • Finally, we will take a closer look at how data is collected, and at the various strategies we can use when collecting data of our own.
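As a small preview of the first topic above, here is a minimal sketch of what "summarizing data" can look like in Python. The scores are invented purely for illustration, and the helper name `summarize` is hypothetical rather than something from the course materials. It uses only Python's built-in `statistics` module.

```python
import statistics

# Hypothetical quiz scores, invented purely for illustration.
scores = [72, 85, 90, 66, 85, 78, 94]


def summarize(values):
    """Return a few common descriptive statistics for a list of numbers."""
    return {
        "n": len(values),                      # number of observations
        "mean": statistics.mean(values),       # arithmetic average
        "median": statistics.median(values),   # middle value when sorted
        "stdev": statistics.stdev(values),     # sample standard deviation
    }


summary = summarize(scores)
for name, value in summary.items():
    print(f"{name}: {round(value, 2)}")
```

Each number condenses the whole list into a single value, which is the basic idea behind Descriptive Statistics. The later topics (Probability, Inferential Statistics, Regression) build on summaries like these.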

Why Should I Care?

  • I suspect not all of you are pursuing a degree in Statistics or Data Science. However, in this day and age, data is truly everywhere, and strong mathematical thinking will give you a leg up in any role you want.
  • Wherever there is data, there is a need for a Data Scientist (or, at least, for some of the principles of Data Science).

    • So, even if you are working in (what you might think is) a field that is far removed from Statistics, the minute you start dealing with data is the minute you start needing to know Data Science!
  • Here’s a perhaps more pragmatic answer: even if you think you want to go straight into industry right after this course, no company wants to hire someone to just mindlessly crunch numbers. Computing experience is absolutely crucial in making yourself a good candidate, but employers would much rather have someone who is both skilled at running code and understands why they are running it!

So, without further ado… Let’s Get Started!