## Outlier removal

Outlier removal is another common data pre-processing task. An outlier is an observation that differs considerably from the other instances. Some machine learning techniques, such as logistic regression, are sensitive to outliers, i.e., outliers might seriously distort the result. For instance, if we want to know the average number of Facebook friends of Facebook users, we might want to remove prominent people such as politicians or movie stars from the data set, since they typically have many more friends than most other individuals. However, whether they should be removed depends on the aim of the application, since outliers can also contain useful information.

Outliers can also appear in a data set by chance or through a measurement error. In this case, outliers are a data quality problem, like noise. However, in a large data set outliers are to be expected, and if their number is small, they are usually not a real problem. Clustering is often used for outlier removal. Outliers can also be detected and removed visually, for instance, through a scatter plot, or mathematically, for instance, by determining the $z$-score, i.e., the number of standard deviations by which an observation lies above or below the mean of the data set.
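The $z$-score approach can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed method: the function names, the sample friend counts, and the cut-off of three standard deviations are assumptions chosen for the example (the text does not fix a threshold).

```python
from statistics import mean, pstdev


def z_scores(values):
    """Return the z-score of every value: (value - mean) / standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        # All values are identical, so no value deviates from the mean.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def remove_outliers(values, threshold=3.0):
    """Keep only values whose absolute z-score is at most `threshold`.

    The default of 3.0 is an assumed convention, not taken from the text.
    """
    return [v for v, z in zip(values, z_scores(values)) if abs(z) <= threshold]


# Hypothetical friend counts: eleven ordinary users and one celebrity.
friend_counts = [120, 150, 130, 170, 160, 140, 155, 145, 135, 165, 150, 5_000_000]
cleaned = remove_outliers(friend_counts)
print(cleaned)
```

Note that a single extreme value also inflates the mean and the standard deviation, so in very small samples even a gross outlier may not exceed a fixed threshold; the cut-off should be chosen with the size of the data set in mind.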

## Data deduplication

Duplicates are instances with exactly the same feature values. Most machine learning tools will produce different results if some of the instances in the data files are duplicated, because repetition gives them more influence on the result [40]. For example, a Retweet is a Tweet posted by a user who is not the author of the original Tweet; it has exactly the same content as the original except for metadata such as the timestamp and the user who retweeted it. As with outliers, whether duplicates should be removed depends on the context of the application. Duplicates are usually easy to detect by simple comparison of the instances, especially if the values are numeric, and machine learning frameworks often offer data deduplication functionality out of the box. We can also use clustering for data deduplication, since many clustering techniques rely on similarity metrics that can be used to match instances based on their similarity.
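Detection by simple comparison can be sketched as follows. The example assumes instances are stored as dictionaries and that the caller decides which fields define identity; the field names (`text`, `user`, `timestamp`) and the sample records are hypothetical, chosen to mirror the Retweet example, where the content is identical but the metadata differs.

```python
def deduplicate(instances, key_fields):
    """Return the instances with duplicates removed, keeping the first occurrence.

    Two instances count as duplicates when they agree on every field in
    `key_fields`; all other fields (e.g. metadata) are ignored.
    """
    seen = set()
    unique = []
    for instance in instances:
        key = tuple(instance[field] for field in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(instance)
    return unique


# Hypothetical records: the second one is a retweet of the first.
tweets = [
    {"text": "Machine learning is fun", "user": "alice", "timestamp": 1},
    {"text": "Machine learning is fun", "user": "bob", "timestamp": 2},
    {"text": "Data cleaning takes time", "user": "carol", "timestamp": 3},
]
unique_tweets = deduplicate(tweets, key_fields=["text"])
print(len(unique_tweets))
```

Passing all fields as `key_fields` removes only exact duplicates, whereas passing only the content fields also removes copies that differ in metadata alone. Which of the two is appropriate depends on the application.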

## Relevance filtering

Relevance filtering typically happens at different stages of a machine learning project. Data deduplication can be considered a relevance filtering step if every instance has to be unique. Feature selection can also be considered relevance filtering, since relevant features are separated from irrelevant ones. Stop word removal in text analysis is a relevance filtering procedure, since irrelevant words or signs such as smileys are removed. Many natural language processing frameworks offer stop word removal functionality. Stop words are usually the most common words in a language, such as “the”, “a”, or “that”. However, the list often needs to be adjusted, since a stop word might be relevant, for instance, in a name such as “The Beatles”.
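The following sketch illustrates stop word removal with an adjustable exception list. It is a simplified illustration rather than the behaviour of any particular framework: the function name, the tiny stop word set, and the idea of passing protected phrases are assumptions made for the example, and tokenization is reduced to splitting on whitespace.

```python
def remove_stop_words(text, stop_words, protected_phrases=()):
    """Remove stop words from `text`, leaving protected phrases untouched.

    `stop_words` is a set of lowercase words. `protected_phrases` lists
    phrases (such as names) in which stop words must be preserved.
    """
    tokens = text.split()
    lowered = [token.lower() for token in tokens]
    phrases = [tuple(phrase.lower().split()) for phrase in protected_phrases]

    kept = []
    i = 0
    while i < len(tokens):
        # Check whether a protected phrase starts at position i.
        match = next(
            (p for p in phrases if p and tuple(lowered[i:i + len(p)]) == p),
            None,
        )
        if match:
            kept.extend(tokens[i:i + len(match)])
            i += len(match)
        elif lowered[i] in stop_words:
            i += 1  # drop the stop word
        else:
            kept.append(tokens[i])
            i += 1
    return " ".join(kept)


stop_words = {"the", "a", "that"}
sentence = "The Beatles released a record that changed the world"
print(remove_stop_words(sentence, stop_words, protected_phrases=["The Beatles"]))
```

Without the protected phrase, the leading “The” would be dropped and the band name would be reduced to “Beatles”.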

Since feature selection can be considered a search problem, different search filters can be used to combat noise. For instance, people often enter fake details, such as fake addresses or phone numbers, when providing personal data because they do not want to be contacted by a call center. These fake profiles need to be filtered out; otherwise they can negatively influence the predictive performance of a learner. Often this already happens at collection time, by using queries that omit irrelevant or fake data.

Relevance filtering can also happen after the features have been selected. Different features often do not contribute equally to the result. Some features might not contribute at all and can be filtered out. Data mining tools usually provide filter functionality at the feature level so learners can be trained on different feature sets.
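As one possible illustration of filtering at the feature level, the sketch below drops features whose values never vary, since a constant feature cannot help a learner distinguish between instances. Variance is only one simple proxy for contribution, and it is an assumption of this example rather than a criterion named in the text; the function name, feature names, and data are likewise hypothetical.

```python
from statistics import pvariance


def drop_low_variance_features(rows, feature_names, min_variance=0.0):
    """Drop every feature whose variance is not greater than `min_variance`.

    `rows` is a list of equally long lists of numeric feature values.
    Returns the names of the kept features and the filtered rows.
    """
    columns = list(zip(*rows))  # transpose rows into per-feature columns
    keep = [i for i, column in enumerate(columns) if pvariance(column) > min_variance]
    kept_names = [feature_names[i] for i in keep]
    filtered_rows = [[row[i] for i in keep] for row in rows]
    return kept_names, filtered_rows


# Hypothetical data set: the second feature is constant across all instances.
feature_names = ["age", "country_code", "score"]
rows = [
    [1.0, 5.0, 0.2],
    [2.0, 5.0, 0.4],
    [3.0, 5.0, 0.1],
]
kept_names, filtered_rows = drop_low_variance_features(rows, feature_names)
print(kept_names)
```

A learner can then be trained once on the full feature set and once on the filtered one to compare the results.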

