Detecting Duplicates


A duplicate is when you have two (or more) rows with the same information. Duplicates can exist for any number of reasons. A mistake might have been made during data entry, if there is some manual step. A tracking call might have fired twice. A processing step might have run multiple times. You might have created it accidentally with a hidden many-to-many JOIN. However they come to be, duplicates can really throw a wrench in your analysis. I can recall times early in my career when I thought I had a great finding, only to have a product manager point out that my sales figure was twice the actual sales. It’s embarrassing, it erodes trust, and it requires rework and sometimes painstaking reviews of the code to find the problem. I’ve learned to check for duplicates as I go.

Fortunately, it’s relatively easy to find duplicates in our data. One way is to inspect a sample, with all columns ordered:
SELECT column_a, column_b, column_c…
FROM table
ORDER BY 1,2,3…
;

This will reveal whether the data is full of duplicates, for example, when looking at a brand-new data set, when you suspect that a process is generating duplicates, or after a possible Cartesian JOIN. If there are only a few duplicates, they might not show up in the sample. And scrolling through data to try to spot duplicates is taxing on your eyes and brain. A more systematic way to find duplicates is to SELECT the columns and then count the rows (this might look familiar from the discussion of histograms!):
SELECT count(*)
FROM
(
    SELECT column_a, column_b, column_c…
    , count(*) as records
    FROM…
    GROUP BY 1,2,3…
) a
WHERE records > 1
;

This will tell you whether there are any cases of duplicates. If the query returns 0, you’re good to go. For more detail, you can list out the number of records (2, 3, 4, etc.) and how many rows fall into each:
SELECT records, count(*)
FROM
(
    SELECT column_a, column_b, column_c…, count(*) as records
    FROM…
    GROUP BY 1,2,3…
) a
WHERE records > 1
GROUP BY 1
;
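The two counting queries above can be tried end to end with a small sketch. It runs the SQL through Python's built-in sqlite3 module purely for convenience; the `orders` table, its columns, and the sample rows are hypothetical and not part of the original text. Column names are spelled out in the GROUP BY instead of positions, which is equivalent here.

```python
import sqlite3

# Hypothetical table and rows, used only to illustrate the technique.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (customer_id INTEGER, order_date TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        (1, "2020-01-01", 10.0),
        (1, "2020-01-01", 10.0),  # same row loaded twice
        (2, "2020-01-02", 25.0),  # unique row
        (3, "2020-01-03", 40.0),
        (3, "2020-01-03", 40.0),  # same row loaded three times
        (3, "2020-01-03", 40.0),
    ],
)

# Inner query: one row per distinct combination, with its occurrence count.
counted = """
    SELECT customer_id, order_date, amount, count(*) AS records
    FROM orders
    GROUP BY customer_id, order_date, amount
"""

# How many distinct combinations appear more than once?
dup_groups = conn.execute(
    f"SELECT count(*) FROM ({counted}) a WHERE records > 1"
).fetchone()[0]

# Breakdown: how many combinations appear 2 times, 3 times, and so on.
dup_distribution = conn.execute(
    f"""
    SELECT records, count(*)
    FROM ({counted}) a
    WHERE records > 1
    GROUP BY records
    ORDER BY records
    """
).fetchall()

print(dup_groups)
print(dup_distribution)
```

With this sample data, one combination appears twice and one appears three times, so the first query reports two duplicated combinations and the second lists one entry per record count.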

Deduplication with GROUP BY and DISTINCT

Duplicates happen, and they’re not always a result of bad data. For example, imagine we want to find a list of all the customers who have successfully completed a transaction so we can send them a coupon for their next order. We might JOIN the customers table to the transactions table, which would restrict the records returned to only those customers that appear in the transactions table:
SELECT a.customer_id, a.customer_name, a.customer_email
FROM customers a
JOIN transactions b on a.customer_id = b.customer_id
This will return a row for each customer for each transaction, however, and there are hopefully at least a few customers who have transacted more than once. We have accidentally created duplicates, not because there is any underlying data quality problem but because we haven’t taken care to avoid duplication in the results. Fortunately, there are several ways to avoid this with SQL. One way to remove duplicates is to use the keyword DISTINCT:
SELECT distinct a.customer_id, a.customer_name, a.customer_email
FROM customers a
JOIN transactions b on a.customer_id = b.customer_id
;
Another option is to use a GROUP BY, which, although typically seen in connection with an aggregation, will also deduplicate in the same way as DISTINCT. I remember the first time I saw a colleague use GROUP BY without an aggregation to dedupe; I didn’t even realize it was possible. I find it somewhat less intuitive than DISTINCT, but the result is the same:
SELECT a.customer_id, a.customer_name, a.customer_email
FROM customers a
JOIN transactions b on a.customer_id = b.customer_id
GROUP BY 1,2,3
;
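To see that DISTINCT and a bare GROUP BY remove the same JOIN-induced duplicates, here is a minimal sketch using Python's sqlite3 module. The table names follow the text, but the column layout of `transactions` and all sample rows are assumptions made for illustration.

```python
import sqlite3

# Hypothetical data: customer 1 has two transactions, customer 2 has one,
# and customer 3 has none.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers "
    "(customer_id INTEGER, customer_name TEXT, customer_email TEXT)"
)
conn.execute(
    "CREATE TABLE transactions "
    "(transaction_id INTEGER, customer_id INTEGER, transaction_date TEXT)"
)
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [
        (1, "Ana", "ana@example.com"),
        (2, "Ben", "ben@example.com"),
        (3, "Cy", "cy@example.com"),
    ],
)
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?, ?)",
    [(101, 1, "2020-01-05"), (102, 1, "2020-02-10"), (103, 2, "2020-01-20")],
)

# Plain JOIN: one row per customer per transaction, so Ana appears twice.
joined_rows = conn.execute(
    """
    SELECT a.customer_id, a.customer_name, a.customer_email
    FROM customers a
    JOIN transactions b ON a.customer_id = b.customer_id
    ORDER BY a.customer_id
    """
).fetchall()

# DISTINCT collapses identical result rows.
distinct_rows = conn.execute(
    """
    SELECT DISTINCT a.customer_id, a.customer_name, a.customer_email
    FROM customers a
    JOIN transactions b ON a.customer_id = b.customer_id
    ORDER BY a.customer_id
    """
).fetchall()

# GROUP BY on every selected column, with no aggregate, does the same.
grouped_rows = conn.execute(
    """
    SELECT a.customer_id, a.customer_name, a.customer_email
    FROM customers a
    JOIN transactions b ON a.customer_id = b.customer_id
    GROUP BY a.customer_id, a.customer_name, a.customer_email
    ORDER BY a.customer_id
    """
).fetchall()

print(len(joined_rows), len(distinct_rows), len(grouped_rows))
```

In this sample the plain JOIN returns three rows while both deduplicated queries return the same two customers; the customer without transactions is excluded by the inner JOIN in all three cases.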
Another useful technique is to perform an aggregation that returns one row per entity. Although technically not deduping, it has a similar effect. For example, if we have a number of transactions by the same customer and need to return one record per customer, we could find the min (first) and/or the max (most recent) transaction_date:
SELECT customer_id
,min(transaction_date) as first_transaction_date
,max(transaction_date) as last_transaction_date
,count(*) as total_orders
FROM table
GROUP BY customer_id
;
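The aggregation approach can be sketched the same way. This example again uses Python's sqlite3 module with a hypothetical `transactions` table; dates are stored as ISO-formatted text, an assumption that lets min and max order them correctly in SQLite.

```python
import sqlite3

# Hypothetical transactions: customer 1 ordered twice, customer 2 once.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE transactions "
    "(transaction_id INTEGER, customer_id INTEGER, transaction_date TEXT)"
)
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?, ?)",
    [(101, 1, "2020-01-05"), (102, 1, "2020-02-10"), (103, 2, "2020-01-20")],
)

# One output row per customer, summarizing all of that customer's rows.
per_customer = conn.execute(
    """
    SELECT customer_id
    ,min(transaction_date) AS first_transaction_date
    ,max(transaction_date) AS last_transaction_date
    ,count(*) AS total_orders
    FROM transactions
    GROUP BY customer_id
    ORDER BY customer_id
    """
).fetchall()

for row in per_customer:
    print(row)
```

Each customer appears exactly once in the output regardless of how many transactions they have, which is why this pattern is useful before joining to other tables.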
Duplicate data, or data that contains multiple records per entity even if they technically are not duplicates, is one of the most common reasons for incorrect query results. You can suspect duplicates as the cause if all of a sudden the number of customers or total sales returned by a query is many times greater than what you were expecting. Fortunately, there are several techniques that can be applied to prevent this from occurring.
Another common problem is missing data, which we’ll turn to next.

Cleaning Data with CASE Transformations

CASE statements can be used to perform a variety of cleaning, enrichment, and summarization tasks. Sometimes the data exists and is accurate, but it would be more useful for analysis if values were standardized or grouped into categories. The structure of CASE statements was presented earlier in this chapter, in the section on binning.
Nonstandard values occur for a variety of reasons. Values might come from different systems with slightly different lists of choices, system code might have changed, options might have been presented to the customer in different languages, or the customer might have been able to fill out the value rather than pick from a list.

Imagine a field containing information about the gender of a person. Values indicating a female person exist as “F,” “female,” and “femme.” We can standardize the values like this:
CASE when gender = 'F' then 'Female'
     when gender = 'female' then 'Female'
     when gender = 'femme' then 'Female'
     else gender
     end as gender_cleaned
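As a minimal sketch of the standardization in context, the CASE expression can be used inside a grouped query. The `people` table and its rows are hypothetical, and SQLite (through Python's sqlite3 module) is used only to make the example runnable.

```python
import sqlite3

# Hypothetical table holding the nonstandard values described in the text.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (person_id INTEGER, gender TEXT)")
conn.executemany(
    "INSERT INTO people VALUES (?, ?)",
    [(1, "F"), (2, "female"), (3, "femme"), (4, "M")],
)

# Standardize the three spellings, pass other values through unchanged,
# then count rows per cleaned value.
cleaned_counts = conn.execute(
    """
    SELECT
    CASE when gender = 'F' then 'Female'
         when gender = 'female' then 'Female'
         when gender = 'femme' then 'Female'
         else gender
         end AS gender_cleaned
    ,count(*) AS records
    FROM people
    GROUP BY gender_cleaned
    ORDER BY gender_cleaned
    """
).fetchall()

print(cleaned_counts)
```

Note that the comparisons are exact matches, so a value such as "Femme" with different capitalization would fall through to the else branch unless it is listed or normalized first.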
CASE statements can also be used to add categorization or enrichment that does not exist in the original data. As an example, many organizations use a Net Promoter Score, or NPS, to monitor customer sentiment. NPS surveys ask respondents to rate, on a scale of 0 to 10, how likely they are to recommend a company or product to a friend or colleague. Scores of 0 to 6 are considered detractors, 7 and 8 are passive, and 9 and 10 are promoters. The final score is calculated by subtracting the percentage of detractors from the percentage of promoters. Survey result data sets usually include optional free text comments and are sometimes enriched with information the organization knows about the person surveyed. Given a data set of NPS survey responses, the first step is to group the responses into the categories of detractor, passive, and promoter:
SELECT response_id
,likelihood
,case when likelihood <= 6 then 'Detractor'
      when likelihood <= 8 then 'Passive'
      else 'Promoter'
      end as response_type
FROM nps_responses
;
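The categorization can be carried through to the final score described above (percentage of promoters minus percentage of detractors). The following is a sketch under stated assumptions: the `nps_responses` rows are invented sample data, and the scoring query is one possible way to express the calculation, not necessarily the one the author goes on to use.

```python
import sqlite3

# Invented sample responses: six promoters, two passives, two detractors.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE nps_responses (response_id INTEGER, likelihood INTEGER)")
scores = [10, 9, 9, 10, 8, 7, 6, 3, 10, 9]
conn.executemany(
    "INSERT INTO nps_responses VALUES (?, ?)",
    list(enumerate(scores, start=1)),
)

# Same CASE logic as in the text, reused as a subquery.
categorized = """
    SELECT response_id
    ,likelihood
    ,case when likelihood <= 6 then 'Detractor'
          when likelihood <= 8 then 'Passive'
          else 'Promoter'
          end AS response_type
    FROM nps_responses
"""

# Number of responses in each category.
type_counts = conn.execute(
    f"""
    SELECT response_type, count(*)
    FROM ({categorized}) a
    GROUP BY response_type
    ORDER BY response_type
    """
).fetchall()

# NPS = percent promoters minus percent detractors.
nps = conn.execute(
    f"""
    SELECT 100.0 * sum(case when response_type = 'Promoter' then 1 else 0 end) / count(*)
         - 100.0 * sum(case when response_type = 'Detractor' then 1 else 0 end) / count(*)
    FROM ({categorized}) a
    """
).fetchone()[0]

print(type_counts)
print(nps)
```

Because the CASE branches are evaluated in order, the second condition only needs to check `likelihood <= 8`; anything at or below 6 has already been labeled a detractor.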

