### PREPARING DATASETS

## Discrete Data Versus Continuous Data

As a simple rule of thumb: discrete data is a set of values that can be counted, whereas continuous data must be measured. Discrete data can reasonably fit in a drop-down list of values, but there is no exact cutoff for how many values are too many for such a list. One person might think that a list of 500 values is discrete, whereas another person might think it’s continuous.

For example, the list of provinces of Canada and the list of states of the United States are discrete data values, but is the same true for the number of countries in the world (roughly 200) or for the number of languages in the world (more than 7,000)?

Values for temperature, humidity, and barometric pressure are considered continuous. Currency is also treated as continuous, even though there is a measurable difference between two consecutive values. The smallest unit of U.S. currency is one penny, which is 1/100th of a dollar (accounting-based measurements use the “mil,” which is 1/1,000th of a dollar).

Continuous data types can have subtle differences. For example, someone who is 200 centimeters tall is twice as tall as someone who is 100 centimeters tall; the same is true for 100 kilograms versus 50 kilograms. However, temperature is different: 80 degrees Fahrenheit is not twice as hot as 40 degrees Fahrenheit, because zero on the Fahrenheit scale is an arbitrary reference point rather than a true zero.
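One way to see why the Fahrenheit comparison fails is to convert both temperatures to Kelvin, an absolute scale, and compare the ratios. The following is a minimal sketch; the helper name `fahrenheit_to_kelvin` is introduced here for illustration only.

```python
def fahrenheit_to_kelvin(deg_f: float) -> float:
    """Convert degrees Fahrenheit to Kelvin (an absolute temperature scale)."""
    return (deg_f - 32.0) * 5.0 / 9.0 + 273.15


# Ratio of the raw Fahrenheit readings: suggests "twice as hot".
naive_ratio = 80 / 40

# Ratio of the same two temperatures on the absolute scale.
absolute_ratio = fahrenheit_to_kelvin(80) / fahrenheit_to_kelvin(40)

print(f"ratio of Fahrenheit readings: {naive_ratio:.2f}")    # 2.00
print(f"ratio on the Kelvin scale:    {absolute_ratio:.2f}")  # 1.08
```

Heights in centimeters and weights in kilograms already start at a true zero, which is why their ratios are meaningful without any conversion.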

Furthermore, keep in mind that the meaning of the word “continuous” in mathematics is not necessarily the same as continuous in machine learning. In mathematics, a continuous variable (let’s say in the 2D Euclidean plane) can have an uncountably infinite number of values. In machine learning, by contrast, a feature in a dataset that can have more values than can be reasonably displayed in a drop-down list is treated as though it’s a continuous variable.

For instance, values for stock prices are discrete: they must differ by at least a penny (or some other minimal unit of currency), which is to say, it’s meaningless to say that a stock price changes by one-millionth of a penny. However, since there are so many possible values, a stock price is treated as a continuous variable. The same comments apply to car mileage, ambient temperature, and barometric pressure.
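The drop-down rule of thumb can be sketched as a simple heuristic that counts distinct values. This is only an illustration: the function name `looks_continuous` and the threshold of 50 are assumptions made here, since the text stresses that no exact cutoff exists.

```python
def looks_continuous(values, max_distinct=50):
    """Heuristic: treat a feature as continuous when it has more distinct
    values than would comfortably fit in a drop-down list.

    max_distinct is an arbitrary, hypothetical threshold.
    """
    return len(set(values)) > max_distinct


# A handful of repeated category labels.
provinces = ["Ontario", "Quebec", "Alberta", "Ontario", "Quebec"]

# Stock prices stored as integer cents: discrete, but with many distinct values.
stock_prices_cents = [10_000 + i for i in range(1_000)]

print(looks_continuous(provinces))           # False
print(looks_continuous(stock_prices_cents))  # True
```

Changing `max_distinct` changes the answer for borderline features, which mirrors the point that two people can reasonably disagree about a list of 500 values.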

## “Binning” Continuous Data

Binning refers to subdividing a set of values into multiple intervals, and then treating all the numbers in the same interval as though they had the same value.

As a simple example, suppose that a feature in a dataset contains people’s ages. The range of values is approximately between 0 and 120, and we could bin them into 12 equal intervals, where each consists of 10 values: 0 through 9, 10 through 19, 20 through 29, and so forth.

However, partitioning the values of people’s ages as described in the preceding paragraph can be problematic. Suppose that person A, person B, and person C are 29, 30, and 39, respectively. Then person A and person B are probably more similar to each other than person B and person C, but because of the way in which the ages are partitioned, B is classified as closer to C than to A. In fact, binning can increase Type I errors (false positives) and Type II errors (false negatives), as discussed in a blog post that also covers some alternatives to binning.

As another example, using quartiles is even more coarse-grained than the earlier age-related binning example. The issue with binning pertains to the consequences of classifying people in different bins, even though they are in close proximity to each other. For instance, some people struggle financially because they earn a meager wage, and they are disqualified from financial assistance because their salary is higher than the cutoff point for receiving any assistance.
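The ten-year age bins and the boundary problem described above can be sketched as follows. The helper `age_bin` is a hypothetical name used only for this illustration.

```python
def age_bin(age: int, width: int = 10) -> str:
    """Return the label of the fixed-width interval that contains `age`."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


ages = {"A": 29, "B": 30, "C": 39}
bins = {person: age_bin(age) for person, age in ages.items()}
print(bins)  # {'A': '20-29', 'B': '30-39', 'C': '30-39'}

# A and B are 1 year apart but fall in different bins,
# whereas B and C are 9 years apart and share a bin.
print(abs(ages["A"] - ages["B"]), bins["A"] == bins["B"])  # 1 False
print(abs(ages["B"] - ages["C"]), bins["B"] == bins["C"])  # 9 True
```

Widening the bins (for example, `width=30`, or quartiles) makes each interval cover more values, so people who are far apart in age are more likely to be treated as identical.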

## Scaling Numeric Data via Normalization

The range of values in a feature can vary significantly, and the values often need to be scaled to a smaller range, such as $[-1,1]$ or $[0,1]$, which you can do via the tanh function or the sigmoid function, respectively. (Strictly speaking, tanh and sigmoid never reach the endpoints: their outputs lie in the open intervals $(-1,1)$ and $(0,1)$.)
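As a rough illustration of this kind of squashing, the sketch below applies both functions to a few sample inputs using only the standard library. The `sigmoid` helper is defined here for the example; the sample values are arbitrary.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic sigmoid: maps any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


for x in [-5.0, -1.0, 0.0, 1.0, 5.0]:
    # math.tanh maps any real number into the open interval (-1, 1).
    print(f"x={x:5.1f}  tanh={math.tanh(x):7.4f}  sigmoid={sigmoid(x):6.4f}")
```

Both functions are nonlinear, so they compress large magnitudes much more than small ones; inputs far from zero all end up close to the ends of the output range.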

For example, measuring a person’s height in meters involves a range of values between 0.50 meters and 2.5 meters (in the vast majority of cases), whereas measuring height in centimeters gives values between 50 centimeters and 250 centimeters: these two units differ by a factor of 100. A person’s weight in kilograms generally varies between 5 kilograms and 200 kilograms, whereas the same weights in grams are larger by a factor of 1,000. Distances between objects can be measured in meters or in kilometers, which also differ by a factor of 1,000.

In general, use units of measure so that the data values in multiple features belong to a similar range of values. In fact, some machine learning algorithms require scaled data, often in the range of $[0,1]$ or $[-1,1]$. In addition to the tanh and sigmoid functions, there are other techniques for scaling data, such as standardizing data (subtracting the mean and dividing by the standard deviation; think Gaussian distribution) and normalizing data (linearly scaled so that the new range of values is in $[0,1]$).
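Standardization can be sketched with the standard library as shown below. This assumes the usual z-score definition (subtract the mean, divide by the population standard deviation); the function name `standardize` and the sample heights are made up for the example.

```python
import math
import statistics


def standardize(values):
    """Return z-scores: (value - mean) / population standard deviation.

    Assumes the values are not all identical (nonzero standard deviation).
    """
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


heights_cm = [150.0, 160.0, 170.0, 180.0, 190.0]
heights_m = [h / 100 for h in heights_cm]

z_cm = standardize(heights_cm)
z_m = standardize(heights_m)

print([round(z, 3) for z in z_cm])
# The choice of unit no longer matters after standardizing.
print(all(math.isclose(a, b, abs_tol=1e-9) for a, b in zip(z_cm, z_m)))
```

Unlike min-max normalization, standardized values are not confined to a fixed interval; they are centered on 0 with a standard deviation of 1.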

The following examples involve a floating point variable $x$ with different ranges of values that will be scaled so that the new values are in the interval $[0,1]$.

• Example 1: If the values of $x$ are in the range $[0,2]$, then $x / 2$ is in the range $[0,1]$.
• Example 2: If the values of $x$ are in the range $[3,6]$, then $x-3$ is in the range $[0,3]$, and $(x-3) / 3$ is in the range $[0,1]$.
• Example 3: If the values of $x$ are in the range $[-10,20]$, then $x+10$ is in the range $[0,30]$, and $(x+10) / 30$ is in the range $[0,1]$.
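All three examples follow the same pattern: subtract the lower bound, then divide by the width of the range, that is, $(x - a) / (b - a)$ for values in $[a, b]$. A minimal sketch follows; the name `min_max_scale` and the sample inputs are chosen here for illustration.

```python
def min_max_scale(x: float, low: float, high: float) -> float:
    """Linearly map x from the range [low, high] onto [0, 1]."""
    if high == low:
        raise ValueError("low and high must differ")
    return (x - low) / (high - low)


# Midpoints of the three ranges from the examples all map to 0.5.
print(min_max_scale(1.0, 0, 2))     # 0.5
print(min_max_scale(4.5, 3, 6))     # 0.5
print(min_max_scale(5.0, -10, 20))  # 0.5

# The endpoints of a range map to 0 and 1.
print(min_max_scale(-10, -10, 20), min_max_scale(20, -10, 20))  # 0.0 1.0
```

In practice the bounds are usually taken from the observed minimum and maximum of the feature, so a value outside that observed range would map outside $[0,1]$.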


