What Is Special About Learning from Text

Most machine learning applications in the text domain work with the bag-of-words representation, in which the words are treated as dimensions with values corresponding to word frequencies. A data set corresponds to a collection of documents, which is also referred to as a corpus. The complete and distinct set of words used to define the corpus is also referred to as the lexicon. Dimensions are also referred to as terms or features. Some applications of text work with a binary representation in which the presence of a term in a document corresponds to a value of 1, and its absence to a value of 0. Other applications use a normalized function of the word frequencies as the values of the dimensions. In each of these cases, the dimensionality of the data is very large, and may be on the order of $10^5$ or even $10^6$. Furthermore, most values of the dimensions are 0s, and only a few dimensions take on positive values. In other words, text is a high-dimensional, sparse, and non-negative representation.
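The representation described above can be illustrated with a minimal sketch. The toy corpus, the whitespace tokenization, and the helper names `build_lexicon` and `bag_of_words` are assumptions made for illustration only; they are not taken from the text, and a real lexicon would be far larger.

```python
from collections import Counter

def build_lexicon(corpus):
    """Return the sorted list of distinct words (the lexicon) in the corpus."""
    return sorted({word for doc in corpus for word in doc.lower().split()})

def bag_of_words(doc, lexicon, binary=False):
    """Map one document to a vector with one dimension per lexicon term."""
    counts = Counter(doc.lower().split())
    if binary:
        # Binary representation: 1 if the term is present, 0 otherwise.
        return [1 if counts[term] > 0 else 0 for term in lexicon]
    # Frequency representation: raw count of each term.
    return [counts[term] for term in lexicon]

# Hypothetical toy corpus (an assumption for this example).
corpus = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "stock prices fell sharply",
]

lexicon = build_lexicon(corpus)
freq_rows = [bag_of_words(doc, lexicon) for doc in corpus]
binary_rows = [bag_of_words(doc, lexicon, binary=True) for doc in corpus]

# Fraction of entries that are zero: a simple measure of sparsity.
total = len(corpus) * len(lexicon)
zeros = sum(row.count(0) for row in freq_rows)
sparsity = zeros / total
print(f"{len(lexicon)} terms, sparsity = {sparsity:.2f}")
```

Even in this tiny corpus most entries are zero and none are negative; the proportion of zeros grows quickly as the lexicon grows, which is why practical implementations store only the nonzero entries.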

These properties of text create both challenges and opportunities. The sparsity of text implies that the positive word frequencies are more informative than the zeros. There is also wide variation in the relative frequencies of words, which leads to differential importance of the different words in mining applications. For example, a commonly occurring word like “the” is often less significant and needs to be down-weighted (or completely removed) with normalization. In other words, statistically normalizing the relative importance of the dimensions (based on frequency of presence) is often more important for text than it is for traditional multidimensional data. One also needs to normalize for the varying lengths of different documents while computing distances between them. Furthermore, although most multidimensional mining methods can be generalized to text, the sparsity of the representation has an impact on the relative effectiveness of different types of mining and learning methods. For example, linear support-vector machines are relatively effective on sparse representations, whereas methods like decision trees need to be designed and tuned with some caution to enable their accurate use. All these observations suggest that the sparsity of text can be either a blessing or a curse, depending on the methodology at hand. In fact, some techniques such as sparse coding sometimes convert non-textual data to text-like representations in order to enable efficient and effective learning methods like support-vector machines [405].
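The two kinds of normalization mentioned above can be sketched as follows. The text does not prescribe specific formulas, so this example assumes one common choice: inverse document frequency weighting of the form log(n / df) to down-weight frequent words, and cosine similarity to remove the effect of document length. The function names and the toy corpus are hypothetical.

```python
import math
from collections import Counter

def tf_idf_vectors(corpus):
    """Return one {term: weight} dict per document, storing only observed terms."""
    tokenized = [doc.lower().split() for doc in corpus]
    n = len(tokenized)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Assumed weighting: count * log(n / df); it is 0 for a term in every document.
        vectors.append({t: c * math.log(n / doc_freq[t]) for t, c in counts.items()})
    return vectors

def cosine(u, v):
    """Cosine similarity of two sparse vectors; dividing by the norms
    removes the effect of document length."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical toy corpus (an assumption for this example).
corpus = [
    "the cat sat on the mat",
    "the cat sat on the mat the cat sat on the mat",  # same text, twice as long
    "the stock prices fell",
]
vectors = tf_idf_vectors(corpus)

print(vectors[0]["the"])                 # weight of a word found in every document
print(cosine(vectors[0], vectors[1]))    # same content, different length
print(cosine(vectors[0], vectors[2]))    # only "the" in common
```

Under these assumptions, “the” receives zero weight because it appears in every document, the doubled document is judged essentially identical to the original despite its length, and two documents that share only “the” have zero similarity.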

Analytical Models for Text

This section will provide a comprehensive overview of text mining algorithms and applications. The next chapter of this book primarily focuses on data preparation and similarity computation. Issues related to preprocessing and data representation are also discussed in that chapter. Aside from the first two introductory chapters, the topics covered in this book fall into three primary categories:

  1. Fundamental mining applications: Many data mining applications like matrix factorization, clustering, and classification can be used for any type of multidimensional data. Nevertheless, the use of these methods in the text domain has specialized characteristics. These represent the core building blocks of the vast majority of text mining applications. Chapters 3 through 8 will discuss core data mining methods. The interaction of text with other data types will be covered in Chapter 8.
  2. Information retrieval and ranking: Many aspects of information retrieval and ranking are closely related to text mining. For example, ranking methods like ranking SVM and link-based ranking are often used in text mining applications. Chapter 9 will provide an overview of information retrieval methods from the point of view of text mining.
  3. Sequence- and natural language-centric text mining: Although multidimensional mining methods can be used for basic applications, the true power of mining text can be leveraged in more complex applications by treating text as sequences. Chapters 10 through 16 will discuss advanced topics like sequence embedding, neural learning, information extraction, summarization, opinion mining, text segmentation, and event extraction. Many of these methods are closely related to natural language processing. Although this book is not focused on natural language processing, the basic building blocks of natural language processing will be used as off-the-shelf tools for text mining applications.

In the following, we will provide an overview of the different text mining models covered in this book. In cases where the multidimensional representation of text is used for mining purposes, it is relatively easy to use a consistent notation. In such cases, we assume that a document corpus with $n$ documents and $d$ different terms can be represented as an $n \times d$ document-term matrix $D$, which is typically very sparse. The $i$th row of $D$ is represented by the $d$-dimensional row vector $\overline{X_i}$. One can also represent a document corpus as a set of these $d$-dimensional vectors, which is denoted by $\mathcal{D}=\{\overline{X_1} \ldots \overline{X_n}\}$. This terminology will be used consistently throughout the book. Many information retrieval books prefer the use of a term-document matrix, which is the transpose of the document-term matrix, so that its rows correspond to the frequencies of terms. However, using a document-term matrix, in which data instances are rows, is consistent with the notations used in books on multidimensional data mining and machine learning. Therefore, we have chosen to use a document-term matrix in order to be consistent with the broader literature on machine learning.
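The relationship between the two conventions can be made concrete with a small sketch. The storage format (one dictionary of nonzero entries per row), the function name `transpose_sparse`, and the matrix values are assumptions chosen for illustration; they are not the notation or implementation used in the book.

```python
def transpose_sparse(rows, num_columns):
    """Transpose a matrix stored as a list of {column_index: value} dicts."""
    columns = [dict() for _ in range(num_columns)]
    for i, row in enumerate(rows):
        for j, value in row.items():
            columns[j][i] = value
    return columns

# Hypothetical document-term matrix D with n = 3 documents and d = 5 terms.
# Row i stores only the nonzero entries of the row vector for document i.
D = [
    {0: 2, 3: 1},   # first document: term 0 twice, term 3 once
    {1: 1},         # second document: term 1 once
    {0: 1, 4: 3},   # third document: term 0 once, term 4 three times
]
n, d = len(D), 5

# The term-document matrix is d x n: one row per term, one column per document.
term_document = transpose_sparse(D, d)

print(term_document[0])   # documents containing term 0, with their frequencies
print(term_document[2])   # term 2 occurs in no document, so its row is empty
```

Transposing a second time, with `transpose_sparse(term_document, n)`, recovers the original rows, which reflects that the two matrices carry the same information and differ only in whether documents or terms are treated as rows.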

Much of the book will be devoted to data mining and machine learning rather than the database management issues of information retrieval. Nevertheless, there is some overlap between the two areas, as they are both related to problems of ranking and search engines. Therefore, a comprehensive chapter is devoted to information retrieval and search engines. Throughout this book, we will use the term “learning algorithm” as a broad umbrella term to describe any algorithm that discovers patterns from the data or discovers how such patterns may be used for predicting specific values in the data.



