SQL Query Structure

SQL queries have common clauses and syntax, although these can be combined in a nearly infinite number of ways to achieve analysis goals. This book assumes you have some prior knowledge of SQL, but I’ll review the basics here so that we have a common foundation for the code examples to come.

The SELECT clause determines the columns that will be returned by the query. One column will be returned for each expression within the SELECT clause, and expressions are separated by commas. An expression can be a field from the table, an aggregation such as a sum, or any number of calculations, such as CASE statements, type conversions, and various functions that will be discussed later in this chapter and throughout the book.
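As a minimal sketch of these kinds of expressions, the snippet below runs a SELECT through Python's built-in sqlite3 module so that it is self-contained. The orders table, its columns, and its rows are hypothetical and exist only for illustration; other databases may spell type conversions differently.

```python
import sqlite3

# Hypothetical orders table, invented purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 10, 25.0), (2, 10, 140.0), (3, 11, 60.5)],
)

# Each comma-separated expression yields one output column:
# a plain field, a type conversion, and a CASE statement.
query = """
SELECT order_id,
       CAST(amount AS INTEGER) AS amount_whole,
       CASE WHEN amount >= 100 THEN 'large' ELSE 'small' END AS order_size
FROM orders
ORDER BY order_id
"""
select_rows = conn.execute(query).fetchall()
for row in select_rows:
    print(row)

# An aggregation is also an expression; this query returns a single column.
total_amount = conn.execute("SELECT sum(amount) AS total FROM orders").fetchone()[0]
print(total_amount)
```

The first query returns three columns because its SELECT clause holds three expressions; the second returns one.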

The FROM clause determines the tables from which the expressions in the SELECT clause are derived. A “table” can be a database table, a view (a type of saved query that otherwise functions like a table), or a subquery. A subquery is itself a query, wrapped in parentheses, and the result is treated like any other table by the query that references it. A query can reference multiple tables in the FROM clause, though they must use one of the JOIN types along with a condition that specifies how the tables relate. The JOIN condition usually specifies an equality between fields in each table, such as orders.customer_id = customers.customer_id. JOIN conditions can include multiple fields and can also specify inequalities or ranges of values, such as ranges of dates. We’ll see a variety of JOIN conditions that achieve specific analysis goals throughout the book.

An INNER JOIN returns only the records that match in both tables. A LEFT JOIN returns all records from the first table, but only those records from the second table that match. A RIGHT JOIN returns all records from the second table, but only those records from the first table that match. A FULL OUTER JOIN returns all records from both tables, whether or not they match. A Cartesian JOIN can result when each record in the first table matches more than one record in the second table, so that the number of rows returned multiplies. Cartesian JOINs should generally be avoided, though there are some specific use cases, such as generating data to fill in a time series, in which we will use them intentionally.

Finally, tables in the FROM clause can be aliased, or given a shorter name of one or more letters that can be referenced in other clauses in the query. Aliases save query writers from having to type out long table names repeatedly, and they make queries easier to read.
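The JOIN behavior described above can be sketched with two tiny tables. This is an illustrative example only: the customers and orders rows are made up, and it runs on Python's built-in sqlite3 module rather than on any particular production database. RIGHT JOIN and FULL OUTER JOIN are left out because older SQLite builds do not support them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical data: customer 3 ('Cy') has no orders.
conn.executescript("""
CREATE TABLE customers (customer_id INTEGER, name TEXT);
CREATE TABLE orders (order_id INTEGER, customer_id INTEGER);
INSERT INTO customers VALUES (1, 'Ana'), (2, 'Ben'), (3, 'Cy');
INSERT INTO orders VALUES (100, 1), (101, 1), (102, 2);
""")

# INNER JOIN: only customers with a matching order. c and o are table aliases.
inner_rows = conn.execute("""
SELECT c.name, o.order_id
FROM customers c
INNER JOIN orders o ON o.customer_id = c.customer_id
ORDER BY c.customer_id, o.order_id
""").fetchall()

# LEFT JOIN: every customer; order_id is NULL (None in Python) when nothing matches.
left_rows = conn.execute("""
SELECT c.name, o.order_id
FROM customers c
LEFT JOIN orders o ON o.customer_id = c.customer_id
ORDER BY c.customer_id, o.order_id
""").fetchall()

# A subquery in parentheses is treated like any other table (aliased here as t).
per_customer = conn.execute("""
SELECT c.name, t.order_count
FROM customers c
INNER JOIN (
    SELECT customer_id, count(*) AS order_count
    FROM orders
    GROUP BY customer_id
) t ON t.customer_id = c.customer_id
ORDER BY c.customer_id
""").fetchall()

# A deliberate Cartesian JOIN pairs every customer with every order: 3 x 3 rows.
cross_count = conn.execute(
    "SELECT count(*) FROM customers CROSS JOIN orders"
).fetchone()[0]

print(inner_rows)
print(left_rows)
print(per_customer)
print(cross_count)
```

Comparing inner_rows with left_rows shows the practical difference: the customer without orders appears only in the LEFT JOIN result, paired with a null order_id.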

Profiling: Distributions

Profiling is the first thing I do when I start working with any new data set. I look at how the data is arranged into schemas and tables. I look at the table names to get familiar with the topics covered, such as customers, orders, or visits. I check out the column names in a few tables and start to construct a mental model of how the tables relate to one another. For example, the tables might include an order_detail table with line-item breakouts that relate to the order table via an order_id, while the order table relates to the customer table via a customer_id. If there is a data dictionary, I review that and compare it to the data I see in a sample of rows.

The tables generally represent the operations of an organization, or some subset of the operations, so I think about what domain or domains are covered, such as ecommerce, marketing, or product interactions. Working with data is easier when we have knowledge of how the data was generated. Profiling can provide clues about this, or about what questions to ask of the source, or of people inside or outside the organization responsible for the collection or generation of the data. Even when you collect the data yourself, profiling is useful.

Another detail I check for is how history is represented, if at all. Data sets that are replicas of production databases may not contain previous values for customer addresses or order statuses, for example, whereas a well-constructed data warehouse may have daily snapshots of changing data fields.

Profiling data is related to the concept of exploratory data analysis, or EDA, named by John Tukey. In his book of that name,¹ Tukey describes how to analyze data sets by computing various summaries and visualizing the results. He includes techniques for looking at distributions of data, including stem-and-leaf plots, box plots, and histograms.

After checking a few samples of data, I start looking at distributions. Distributions allow me to understand the range of values that exist in the data and how often they occur, whether there are nulls, and whether negative values exist alongside positive ones. Distributions can be created with continuous or categorical data and are also called frequencies. In this section, we’ll look at how to create histograms, how binning can help us understand the distribution of continuous values, and how to use n-tiles to get more precise about distributions.
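Before building a full distribution, a quick summary query can answer the questions raised above about range, nulls, and negative values. The following is a minimal sketch under stated assumptions: the payments table and its values are hypothetical, and it runs on Python's built-in sqlite3 module for convenience.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical payments table with one null and one negative amount.
conn.execute("CREATE TABLE payments (payment_id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO payments VALUES (?, ?)",
    [(1, 20.0), (2, -5.0), (3, None), (4, 75.5), (5, 20.0)],
)

# count(*) counts all rows, while count(amount) skips nulls,
# so the difference is the number of null amounts.
summary = conn.execute("""
SELECT min(amount) AS min_amount,
       max(amount) AS max_amount,
       count(*) - count(amount) AS null_count,
       sum(CASE WHEN amount < 0 THEN 1 ELSE 0 END) AS negative_count
FROM payments
""").fetchone()
print(summary)
```

A single row like this is only a first look; it shows the extremes and the presence of nulls or negatives, but not how often each value occurs.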

Histograms and Frequencies

One of the best ways to get to know a data set, and to know particular fields within the data set, is to check the frequency of values in each field. Frequency checks are also useful whenever you have a question about whether certain values are possible or if you spot an unexpected value and want to know how commonly it occurs. Frequency checks can be done on any data type, including strings, numerics, dates, and booleans. Frequency queries are a great way to detect sparse data as well.

The query is straightforward. The number of rows can be found with count(*), and the profiled field is in the GROUP BY. For example, we can check the frequency of each type of fruit in a fictional fruit_inventory table by grouping on the field that holds the fruit name and counting the rows in each group.
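One way such a frequency query might look is sketched below. The column name fruit and the sample rows are assumptions made for illustration (the text names only the fruit_inventory table), and the query is run through Python's built-in sqlite3 module so the example is self-contained.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical contents for the fictional fruit_inventory table.
conn.execute("CREATE TABLE fruit_inventory (fruit TEXT)")
conn.executemany(
    "INSERT INTO fruit_inventory VALUES (?)",
    [("apple",), ("banana",), ("apple",), ("cherry",), ("apple",), ("banana",)],
)

# The profiled field goes in the GROUP BY; count(*) gives the rows per value.
frequencies = conn.execute("""
SELECT fruit, count(*) AS quantity
FROM fruit_inventory
GROUP BY fruit
ORDER BY quantity DESC, fruit
""").fetchall()

for fruit, quantity in frequencies:
    print(fruit, quantity)
```

Sorting by the count is optional, but it puts the most common values first, which makes rare or unexpected values easy to spot at the bottom of the result.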

A frequency plot is a way to visualize the number of times something occurs in the data set. The field being profiled is usually plotted on the $x$-axis, with the count of observations on the $y$-axis. Figure 2-1 shows an example of plotting the frequency of fruit from our query. Frequency graphs can also be drawn horizontally, which accommodates long value names well. Notice that this is categorical data without any inherent order.
