
Data cleaning: missing values and outliers detection

403 Forbidden
Published 2021-05-19 13:32:22
Included in the column: hsdoifh biuwedsy

Lectures 4 and 5: Data cleaning: missing values and outliers detection

-be able to explain the need for and the motivation behind data preprocessing and data cleaning

  • Measuring data quality
    • Accuracy
      • Correct or wrong, accurate or not
    • Completeness
      • Not recorded, unavailable
    • Consistency
      • E.g. discrepancies in representation
    • Timeliness
      • Updated in a timely way
    • Believability
      • Do I trust the data is correct?
    • Interpretability
      • How easily can I understand the data?

-be able to explain the need for and what is involved in each of the major data pre-processing activities (data cleaning, integration, reduction and transformation)

  • Data Cleaning
    • Many tools exist
      • Data scrubbing
      • Data discrepancy detection
      • Data auditing
      • ETL (Extract Transform Load) tools: users specify transformations via a graphical interface
    • Noisy data
      • Truncated fields (exceeded 80 character limit)
      • Text incorrectly split across cells (e.g. separator issues)
      • Salary=“-5”
      • Some causes:
        • Imprecise instruments
        • Data entry issues
        • Data transmission issues
    • Inconsistent data
      • Different naming representations (“Melbourne University” versus “University of Melbourne”) or (“three” versus “3”)
      • Different date formats (“3/4/2016” versus “3rd April 2016”)
      • Age=20, Birthdate=“1/1/2002”
      • Two students with the same student id
      • Outliers
        • E.g. 62,72,75,75,78,80,82,84,86,87,87,89,89,90,999
          • Not plausible if this is a list of ages of hospital patients
          • Might be fine, though, for a list of people's numbers of contacts on LinkedIn
        • Can use automated techniques, but also need domain knowledge
    • Intentionally disguised data
      • Everyone’s birthday is January 1st?
      • Email address is xx@xx.com
      • Adriaans and Zantinge
        • “Recently, a colleague rented a car in the USA. Since he was Dutch, his post-code did not fit the fields of the computer program. The car hire representative suggested that she use the zip code of the rental office instead.”
      • How to handle
        • Look for “unusual” or suspicious values in the dataset, using knowledge about the domain
    • Incomplete (missing data)
      • Lacking feature values
        • Name=“”
        • Age=null
      • Some types of missing data (Rubin 1976)
        • Missing completely at random: Data are missing independently of observed and unobserved data.
          • E.g. flipping a coin to decide whether or not to answer an exam question.
        • Missing not at random
          • I create a dataset by surveying the class about how healthy they feel. What is the meaning of missing values for those who don’t respond?
  • Data Integration
    • Bringing data from multiple sources together
      • Resolve conflicts
      • Detect duplicates
  • Data Reduction
    • Decrease the number of features / instances in a dataset
      • Modifying sampling strategies
      • Removing irrelevant features & reducing noise
    • Makes data easier to visualise & faster to analyse
  • Data Transformation
    • Standardisation
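
Some of the discrepancy checks listed under data cleaning can be sketched in plain Python. This is only an illustration: the record layout, the field names and the reference date are hypothetical, chosen to mirror the examples in these notes (a negative salary, two students with the same id, an age that does not match the birthdate).

```python
from collections import Counter
from datetime import date

# Hypothetical records; field names and values are made up for illustration.
records = [
    {"student_id": 1, "salary": 52000, "age": 20, "birthdate": date(2002, 1, 1)},
    {"student_id": 2, "salary": -5, "age": 30, "birthdate": date(1990, 6, 15)},
    {"student_id": 2, "salary": 61000, "age": 25, "birthdate": date(1996, 3, 4)},
]

def age_on(birthdate, today):
    """Age in whole years on the given reference date."""
    had_birthday = (today.month, today.day) >= (birthdate.month, birthdate.day)
    return today.year - birthdate.year - (0 if had_birthday else 1)

def find_discrepancies(rows, today):
    """Return (row_index, description) pairs for suspicious rows."""
    problems = []
    id_counts = Counter(row["student_id"] for row in rows)
    for i, row in enumerate(rows):
        if row["salary"] < 0:
            problems.append((i, "negative salary"))
        if id_counts[row["student_id"]] > 1:
            problems.append((i, "duplicate student_id"))
        if age_on(row["birthdate"], today) != row["age"]:
            problems.append((i, "age does not match birthdate"))
    return problems

# The reference date is an assumption; in practice it would be the collection date.
issues = find_discrepancies(records, today=date(2021, 5, 19))
for index, description in issues:
    print(index, description)
```

Rules like these only flag candidates. Deciding whether a flagged value is really wrong still needs domain knowledge.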

-understand the terminologies: features, attributes, instances, objects.

  • Columns = features / attributes
  • Data rows = instances / objects

-understand the difference between categorical/discrete features versus continuous features

-be able to explain the reasons why data might be missing, what are the possible causes?

  • Causes:
    • Malfunction of equipment (e.g. sensors)
    • Not recorded due to misunderstanding
    • May not be considered important at time of entry
    • Deliberate

-understand the difference between data missing completely at random versus data missing not completely at random

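  • Missing completely at random: whether a value is missing is independent of both the observed and the unobserved data
  • Missing not at random: whether a value is missing on a variable is related to the value of that variable itself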
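
The practical difference between the two cases can be seen in a small simulation. The sketch below is an illustration under made-up assumptions: the values, the 30% missing rate and the rule that larger values go missing more often are all hypothetical.

```python
import random
from statistics import mean

rng = random.Random(0)          # fixed seed so the run is repeatable
scores = list(range(1, 10001))  # hypothetical "true" values, 1..10000
true_mean = mean(scores)

# Missing completely at random: every value has the same 30% chance of
# being missing, regardless of what the value is.
mcar_observed = [s for s in scores if rng.random() >= 0.3]

# Missing not at random: the chance of being missing grows with the value
# itself (from about 0% for the smallest value up to 60% for the largest).
mnar_observed = [s for s in scores if rng.random() >= 0.6 * s / 10000]

print("true mean:      ", true_mean)
print("mean under MCAR:", round(mean(mcar_observed), 1))
print("mean under MNAR:", round(mean(mnar_observed), 1))
```

Under the first mechanism the mean of the observed values stays close to the true mean. Under the second it is pulled clearly downwards, because large values are under-represented among the observed ones.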

-understand what is meant by noise in data or “noisy” data

  • Any data that has been received, stored, or changed in such a manner that it cannot be read or used by the program that originally created it can be described as noisy

-understand the following strategies for handling missing data and their relative advantages/disadvantages (delete all instances with a missing value, manual correction, imputation)

  • Delete all instances with a missing value (case deletion)
    • Easy to analyse the new complete data
    • May bias the analysis if the remaining sample is small or if there is structure in which values are missing
  • Manually correct
    • Human finds the missing value and fills it in using expert knowledge
  • Imputation
    • Replace the missing value with a substitute one
    • After imputing all missing values, can use standard analysis techniques for complete datasets
    • Fill in with 0
    • Fill in with mean
    • Fill in with median value (if skewed distribution)
    • Fill in with mode value (categorical)
      • Use mode (most frequent value) imputation for categorical features
    • Fill in with category mean
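
A minimal sketch of case deletion and the simple imputation strategies, using only the Python standard library. The two feature columns and the use of `None` to mark a missing value are assumptions made for the example.

```python
from statistics import mean, median, mode

# Hypothetical feature columns; None marks a missing value.
ages = [23, 25, None, 31, 90, None, 27]
colours = ["red", "blue", None, "red", "green", "red", None]

def observed(values):
    """The values that are actually present."""
    return [v for v in values if v is not None]

def impute(values, fill):
    """Replace every missing value with the given substitute."""
    return [fill if v is None else v for v in values]

# Case deletion: keep only the rows where every feature is present.
rows = list(zip(ages, colours))
complete_rows = [row for row in rows if None not in row]

# Imputation: each strategy is applied separately to one feature.
ages_zero = impute(ages, 0)
ages_mean = impute(ages, mean(observed(ages)))
ages_median = impute(ages, median(observed(ages)))
colours_mode = impute(colours, mode(observed(colours)))

print(len(rows), "rows,", len(complete_rows), "complete")
print("mean fill:", ages_mean[2], " median fill:", ages_median[2])
print("mode fill:", colours_mode[2])
```

In this made-up column the single large age (90) pulls the mean well above the median. That is why the median is the safer fill for skewed distributions.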

-understand the following strategies for imputation of missing values and their relative advantages/disadvantages (fill in with zeros, fill in with mean/median value, fill in with category mean)

  • Fill in with 0
    • Simple
    • Won’t break application programs
    • Limited utility for analysis
  • Fill in with mean
    • Popular method
      • Can be good for supervised classification
      • Apply separately to each attribute
    • Drawbacks
      • Reduces the variance of the feature
      • Incorrect view of the distribution of that attribute
      • Relationships to other features change
  • Fill in with median value (if skewed distribution)
  • Fill in with category mean (the mean of the feature over the instances in the same category)
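
The variance drawback of mean imputation, and how a category mean differs from the overall mean, can be checked on a toy example. The categories and values below are made up, and population variance (`statistics.pvariance`) is used only to keep the arithmetic simple.

```python
from collections import defaultdict
from statistics import mean, pvariance

# Hypothetical data: (category, value); None marks a missing value.
rows = [
    ("junior", 40), ("junior", 44), ("junior", None),
    ("senior", 90), ("senior", 94), ("senior", None),
]

present = [v for _, v in rows if v is not None]
overall_mean = mean(present)

# Overall-mean imputation: every gap gets the same value.
mean_filled = [overall_mean if v is None else v for _, v in rows]

# Category-mean imputation: each gap gets the mean of its own category.
by_category = defaultdict(list)
for category, value in rows:
    if value is not None:
        by_category[category].append(value)
category_means = {c: mean(vs) for c, vs in by_category.items()}
category_filled = [category_means[c] if v is None else v for c, v in rows]

print("variance of observed values:  ", pvariance(present))
print("variance after overall mean:  ", round(pvariance(mean_filled), 2))
print("variance after category mean: ", round(pvariance(category_filled), 2))
print("overall-mean fill:", mean_filled)
print("category fill:    ", category_filled)
```

With these numbers the overall mean (67) lies between the two groups and matches neither, so it shrinks the variance noticeably. The category means (42 and 92) keep each filled value close to its own group.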

-be able to explain the importance of finding outliers and give concrete examples where this would be useful

  • Normal objects are assumed to come from some generating process; outlier objects deviate from this process
  • Outliers are different from noisy data
    • Noise is random error or variance in a measured variable
    • Noise should be removed before outlier detection
  • Outliers are interesting because they violate the mechanism that generates the normal data

-be able to explain what is an outlier

  • Outlier: A data object that deviates significantly from the normal objects as if it were generated by a different mechanism
  • From a statistics perspective:
    • Normal (non-outlier) objects are generated using some statistical process
  • How to detect Outliers
    • 1-D data
      • Boxplot
      • Histogram
      • Statistical tests
    • 2-D Data: Scatter plot and eyeball
    • 3-D data: Can also use scatter plot and eyeball
    • >3-D data: Statistical or algorithmic methods

-be able to explain the difference between a global outlier and a contextual outlier

  • Global outlier (or point anomaly)
    • An object is a global outlier (Og) if it significantly deviates from the rest of the data set
    • Ex. Intrusion detection in computer networks
    • Issue: Find an appropriate measurement of deviation
  • Contextual outlier (or conditional outlier)
    • An object is a contextual outlier (Oc) if it deviates significantly based on a selected context
    • Is 5°C in Melbourne an outlier? (It depends on whether it is summer or winter.)
      • Attributes of data should be divided into two groups
      • Contextual attributes: defines the context, e.g., time & location
      • Behavioral attributes: characteristics of the object, used in outlier evaluation, e.g., temperature
    • Issue: How to define or formulate meaningful context?
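
The Melbourne temperature question can be turned into a small sketch. Everything here is an assumption made for illustration: the readings are invented, season is taken as the only contextual attribute, and "deviates significantly" is measured with a simple rule (more than 2.5 standard deviations from the mean) that these notes do not prescribe.

```python
from statistics import mean, pstdev

# Made-up daily temperatures in degrees Celsius: (season, temperature).
readings = (
    [("summer", t) for t in [24, 26, 28, 25, 27, 30, 29, 26, 25, 5]]
    + [("winter", t) for t in [8, 10, 6, 9, 7, 11, 5, 9, 8, 7]]
)

def z_outliers(values, threshold=2.5):
    """Indices of values more than `threshold` standard deviations from the mean."""
    mu, sigma = mean(values), pstdev(values)
    return [i for i, v in enumerate(values) if abs(v - mu) > threshold * sigma]

# Global check: ignore the context and look at all temperatures together.
temperatures = [t for _, t in readings]
global_outliers = [readings[i] for i in z_outliers(temperatures)]

# Contextual check: season is the contextual attribute,
# temperature is the behavioural attribute.
contextual_outliers = []
for season in ("summer", "winter"):
    group = [r for r in readings if r[0] == season]
    for i in z_outliers([t for _, t in group]):
        contextual_outliers.append(group[i])

print("global:    ", global_outliers)
print("contextual:", contextual_outliers)
```

With these invented readings the global check flags nothing, because 5 degrees is ordinary once winter days are included. The contextual check flags only the 5 degree reading in summer.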

-be able to explain how a histogram can be used to detect outliers, their relative advantages/disadvantages for this task and be able to construct and interpret a histogram for data understanding and outlier detection

  • Outlier detection using histogram:
    • Figure shows the histogram of purchase amounts in transactions
    • A transaction in the amount of $7,500 is an outlier, since only 0.2% of transactions have an amount higher than $5,000
  • Problem: Hard to choose an appropriate bin size for the histogram
    • Bin size too small → normal objects fall in empty/rare bins (false positives)
    • Bin size too big → outliers fall in some frequent bins (false negatives)
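
The bin-size problem can be reproduced with a few lines of Python. This is a sketch under assumptions: the purchase amounts are made up, and the rule "flag a value if its bin holds less than a given fraction of all values" is one simple way to read a histogram for outliers, not the only one.

```python
from collections import Counter

def histogram_outliers(values, bin_width, min_fraction):
    """Flag values whose histogram bin holds less than `min_fraction` of all values."""
    bins = Counter(int(v // bin_width) for v in values)
    n = len(values)
    return [v for v in values if bins[int(v // bin_width)] / n < min_fraction]

# Made-up purchase amounts in dollars.
amounts = [20, 35, 40, 55, 60, 75, 80, 95, 110, 120,
           130, 150, 160, 180, 210, 240, 260, 300, 340, 7500]

reasonable = histogram_outliers(amounts, bin_width=500, min_fraction=0.1)
too_small = histogram_outliers(amounts, bin_width=10, min_fraction=0.1)
too_big = histogram_outliers(amounts, bin_width=10000, min_fraction=0.1)

print("bin width 500:  ", reasonable)
print("bin width 10:   ", len(too_small), "values flagged")
print("bin width 10000:", too_big)
```

With a bin width of 500 only the 7,500 purchase is flagged. With a width of 10 every value sits alone in its bin and all 20 are flagged (false positives). With a width of 10,000 everything shares one bin and nothing is flagged (a false negative).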

-be able to draw and read a 2-D scatter plot and visually identify outliers from it

-be able to construct and interpret a Tukey Boxplot and explain why it is a useful tool for data understanding and outlier detection
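
The boxplot rule can be sketched numerically on the list of values used earlier in these notes. The sketch assumes the usual Tukey convention that points beyond 1.5 × IQR from the quartiles are drawn as outliers. Quartiles come from `statistics.quantiles` with its default method; other quartile conventions can give slightly different fences.

```python
from statistics import quantiles

def tukey_fences(values, k=1.5):
    """Lower and upper fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _median, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

data = [62, 72, 75, 75, 78, 80, 82, 84, 86, 87, 87, 89, 89, 90, 999]

low, high = tukey_fences(data)
outliers = [v for v in data if v < low or v > high]

print("fences:", low, high)
print("outliers:", outliers)
```

For this list the quartiles are 75 and 89, so the fences are 54 and 110, and only 999 falls outside them. Because the rule is built on quartiles rather than the mean, the extreme value itself barely affects where the fences are placed.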

-it is not necessary to know Grubbs' test for outlier detection

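Original content statement: this article is published on the Tencent Cloud Developer Community with the author's authorization and may not be reproduced without permission.

In case of infringement, please contact cloudcommunity@tencent.com to request removal.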
