Human resource management (HRM) is the part of an organization that focuses on recruiting, training, and retaining employees. With the increased use of analytics in business, HRM has become much more data-driven. Indeed, HRM is sometimes now referred to as “people analytics.” HRM professionals use data and analytical models to form high-performing teams, monitor productivity and employee performance, and ensure diversity of the workforce. Data visualization is an important component of HRM, as HRM professionals use data dashboards to monitor relevant data supporting their goal of having a high-performing workforce.

A key interest of HRM professionals is employee churn, or turnover in an organization’s workforce. When employees leave and others are hired, there is often a loss of productivity as positions go unfilled. Also, new employees typically have a training period and then must gain experience, which means employees will not be fully productive at the beginning of their tenure with the company. Figure 1.8, a stacked column chart, is an example of a visual display of employee turnover. It shows gains and losses of employees by month. A stacked column chart is a column chart that shows part-to-whole comparisons, either over time or across categories. Different colors or shades of color are used to denote the different parts of the whole within a column. In Figure 1.8, gains in employees (new hires) are represented by positive numbers in darker blue, and losses (people leaving the company) are represented by negative numbers in lighter blue. We see that January and July through October are the months during which the greatest numbers of employees left the company, and the months with the highest numbers of new hires are April through June. Visualizations like Figure 1.8 can be helpful in better understanding and managing workforce fluctuations.

## Marketing

Marketing is one of the most popular application areas of analytics. Analytics is used for optimal pricing, markdown pricing for seasonal goods, and optimal allocation of marketing budget. Sentiment analysis of text data such as tweets, social network analysis to determine influence, and website analytics for understanding website traffic and sales are just a few examples of how data visualization can be used to support more effective marketing.

Let us consider a software company’s website effectiveness. Figure 1.9 shows a funnel chart of the conversion of website visitors to subscribers and then to renewal customers. A funnel chart is a chart that shows the progression of a numerical variable for various categories from larger to smaller values. In Figure 1.9, at the top of the funnel, we track 100% of the first-time visitors to the website over some period of time, for example, a six-month period. The funnel chart shows that of those original visitors, 74% returned to the website one or more times after their initial visit. Sixty-one percent of the first-time visitors downloaded a 30-day trial version of the software, 47% eventually contacted support services, 28% purchased a one-year subscription to the software, and 17% eventually renewed their subscription. This type of funnel chart can be used to compare the conversion effectiveness of different website configurations, the use of bots, or changes in support services.
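
The percentages in the funnel are all expressed relative to first-time visitors. One way to read such a chart is to ask what share of each stage survives to the next. The following is a minimal sketch of that arithmetic using the figures quoted above; the function name `step_conversion` is our own, and the calculation assumes each stage is a subset of the one before it, which the text does not state explicitly.

```python
# Funnel stages and the share of first-time visitors reaching each one,
# taken from the percentages quoted for Figure 1.9.
FUNNEL = [
    ("First-time visitors", 100),
    ("Returned to the website", 74),
    ("Downloaded 30-day trial", 61),
    ("Contacted support", 47),
    ("Purchased subscription", 28),
    ("Renewed subscription", 17),
]


def step_conversion(funnel):
    """Return (stage, share of the previous stage retained) for every stage after the first.

    Assumption: each stage is a subset of the previous stage.
    """
    rates = []
    for (_, prev_pct), (stage, pct) in zip(funnel, funnel[1:]):
        rates.append((stage, pct / prev_pct))
    return rates


if __name__ == "__main__":
    for stage, rate in step_conversion(FUNNEL):
        print(f"{stage}: {rate:.1%} of the previous stage")
```

Under that assumption, the step with the lowest retention is the most natural candidate for a closer look when comparing website configurations.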

## Big Data

There is no universally accepted definition of big data. However, probably the most general definition of big data is any set of data that is too large or too complex to be handled by standard data-processing techniques using a typical desktop computer. People refer to the four Vs of big data:

• volume: the amount of data generated
• velocity: the speed at which the data are generated
• variety: the diversity in types and structures of data generated
• veracity: the reliability of the data generated

Volume and velocity can pose a challenge for processing analytics, including data visualization. Special data management software such as Hadoop and higher capacity hardware (increased server or cloud computing) may be required. The variety of the data is handled by converting video, voice, and text data to numerical data, to which we can then apply standard data visualization techniques.

In summary, the type of data you have will influence the type of graph you should use to convey your message. The zoo attendance data in Figure 1.1 are time series data. We used a column chart in Figure 1.1 because the numbers are the total attendance for each month, and we wanted to compare the attendance by month. The height of the columns allows us to easily compare attendance by month. Contrast Figure 1.1 with Figure 1.4, which is also time series data. Here we have the value of the Dow Jones Index. These data are a snapshot of the current value of the DJI on the first trading day of each month. They provide what is essentially a time path of the value, and so we use a line graph to emphasize the continuity of time.

## Data Visualization in Practice

Data visualization is used to explore and explain data and to guide decision making in all areas of business and science. Even the most analytically advanced companies such as Google, Uber, and Amazon rely heavily on data visualization. Consumer goods giant Procter & Gamble (P&G), the maker of household brands such as Tide, Pampers, Crest, and Swiffer, has invested heavily in analytics, including data visualization. P&G has built what it calls the Business Sphere™ in more than 50 of its sites around the world. The Business Sphere is a conference room with technology for displaying data visualizations on its walls. The Business Sphere displays data and information P&G executives and managers can use to make better-informed decisions. Let us briefly discuss some ways in which the functional areas of business, engineering, science, and sports use data visualization.

Accounting is a data-driven profession. Accountants prepare financial statements and examine financial statements for accuracy and conformance to legal regulations and best practices, including reporting required for tax purposes. Data visualization is a part of every accountant’s tool kit. Data visualization is used to detect outliers that could be an indication of a data error or fraud. As an example of data visualization in accounting, let us consider Benford’s Law.
Benford’s Law, also known as the First-Digit Law, gives the expected probability that the first digit of a reported number takes on the values one through nine, based on many real-life numerical data sets such as company expense accounts. A column chart displaying Benford’s Law is shown in Figure 1.5. We have rounded the probabilities to four digits. We see, for example, that the probability of the first digit being a 1 is 0.3010. The probability of the first digit being a 2 is 0.1761, and so forth.
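
The text does not state the formula behind these probabilities. The standard closed form for Benford’s Law is $P(d) = \log_{10}(1 + 1/d)$ for a first digit $d$ from 1 to 9. As a minimal sketch, the snippet below computes the nine values; the function name is our own.

```python
import math


def benford_probability(digit):
    """Expected probability that a number's first digit equals `digit` (1 through 9)."""
    if digit not in range(1, 10):
        raise ValueError("first digit must be between 1 and 9")
    return math.log10(1 + 1 / digit)


# Probabilities for all nine possible first digits.
BENFORD = {d: benford_probability(d) for d in range(1, 10)}

if __name__ == "__main__":
    for d, p in BENFORD.items():
        print(d, f"{p:.4f}")
```

Rounded to four digits, the values for 1 and 2 match the 0.3010 and 0.1761 quoted above, and the nine probabilities sum to 1.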

Benford’s Law can be used to detect fraud. If the first digits of numbers in a data set do not conform to Benford’s Law, then further investigation of fraud may be warranted. Consider the accounts payable (money the company owes) for Tucker Software. Figure 1.6 is a clustered column chart (also known as a side-by-side column chart). A clustered column chart is a column chart that shows multiple variables of interest on the same chart, with the different variables usually denoted by different colors or shades of a color. In Figure 1.6, the two variables are Benford’s Law probability and the first digit data for a random sample of 500 of Tucker’s accounts payable entries. The frequency of occurrence in the data is used to estimate the probability of the first digit for all of Tucker’s accounts payable entries. It appears that there are an inordinate number of first digits of 5 and 9 and a lower than expected number of first digits of 1. These might warrant further investigation by Tucker’s auditors.
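
The comparison plotted in a chart like Figure 1.6 can be sketched in a few lines: extract the first digit of each entry, tabulate the observed shares, and set them beside the Benford shares. The amounts below are hypothetical and are not Tucker Software’s data; the function names are ours, and the Benford formula $\log_{10}(1 + 1/d)$ is the standard one rather than something given in the text.

```python
import math
from collections import Counter

# Hypothetical ledger amounts, for illustration only (not Tucker's data).
HYPOTHETICAL_AMOUNTS = [1200.50, 530.00, 98.75, 5100.00, 912.30, 15.20, 560.10, 9400.00]


def first_digit(amount):
    """Return the leading nonzero digit of a nonzero number."""
    # Scientific notation puts the leading digit first, e.g. 0.0042 -> '4.2...e-03'.
    return int(f"{abs(amount):.15e}"[0])


def compare_to_benford(amounts):
    """Return {digit: (observed share, Benford share)} for digits 1 through 9."""
    digits = [first_digit(a) for a in amounts if a != 0]
    if not digits:
        raise ValueError("need at least one nonzero amount")
    counts = Counter(digits)
    n = len(digits)
    return {d: (counts[d] / n, math.log10(1 + 1 / d)) for d in range(1, 10)}


if __name__ == "__main__":
    for d, (observed, expected) in compare_to_benford(HYPOTHETICAL_AMOUNTS).items():
        print(d, f"observed={observed:.3f}", f"expected={expected:.3f}")
```

A large gap between the two columns for a digit is a prompt for further investigation, not proof of fraud. With a sample as small as this hypothetical one, large gaps are expected by chance.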


## Data Visualization for Exploration

Data visualization is a powerful tool for exploring data to more easily identify patterns, recognize anomalies or irregularities in the data, and better understand the relationships between variables. Our ability to spot these types of characteristics of data is much stronger and quicker when we look at a visual display of the data rather than a simple listing.
As an example of data visualization for exploration, let us consider the zoo attendance data shown in Table 1.1 and Figure 1.1. These data on monthly attendance to a zoo can be found in the file Zoo. Comparing Table 1.1 and Figure 1.1, observe that the pattern in the data is more detectable in the column chart of Figure 1.1 than in a table of numbers. A column chart shows numerical data by the height of the column for a variety of categories or time periods. In the case of Figure 1.1, the time periods are the different months of the year.

Our intuition and experience tell us that we would expect zoo attendance to be highest in the summer months when many school-aged children are out of school for summer break. Figure 1.1 confirms this, as the attendance at the zoo is highest in the summer months of June, July, and August. Furthermore, we see that attendance increases gradually each month from February through May as the average temperature increases, and attendance gradually decreases each month from September through November as the average temperature decreases. But why does the zoo attendance in December and January not follow these patterns? It turns out that the zoo has an event known as the “Festival of Lights” that runs from the end of November through early January. Children are out of school during the last half of December and early January for the holiday season, and this leads to increased attendance in the evenings at the zoo despite the colder winter temperatures.
Visual data exploration is an important part of descriptive analytics. Data visualization can also be used directly to monitor key performance metrics, that is, measure how an organization is performing relative to its goals. A data dashboard is a data visualization tool that gives multiple outputs and may update in real time. Just as the dashboard in your car measures the speed, engine temperature, and other important performance data as you drive, corporate data dashboards measure performance metrics such as sales, inventory levels, and service levels relative to the goals set by the company. These data dashboards alert management when performances deviate from goals so that corrective actions can be taken.
Visual data exploration is also critical for ensuring that model assumptions hold in predictive and prescriptive analytics. Understanding the data before using that data in modeling builds trust and can be important in determining and explaining which type of model is appropriate.

## Data Visualization for Explanation

Data visualization is also important for explaining relationships found in data and for explaining the results of predictive and prescriptive models. More generally, data visualization is helpful in communicating with your audience and ensuring that your audience understands and focuses on your intended message.

Let us consider the article, “Check Out the Culture Before a New Job,” which appeared in The Wall Street Journal.$^{3}$ The article discusses the importance of finding a good cultural fit when seeking a new job. Difficulty in understanding a corporate culture or misalignment with that culture can lead to job dissatisfaction. Figure 1.3 is a re-creation of a bar chart that appeared in this article. A bar chart shows a summary of categorical data using the length of horizontal bars to display the magnitude of a quantitative variable.

The chart in Figure 1.3 shows the percentage of the 10,002 survey respondents who listed a factor as the most important in seeking a job. Notice that our attention is drawn to the dark blue bar, which is “Company culture” (the focus of the article). We immediately see that only “Salary and bonus” is more frequently cited than “Company culture.” When you first glance at the chart, the message that is communicated is that corporate culture is the second most important factor cited by job seekers. And as a reader, based on that message, you then decide whether the article is worth reading.

## Another Asymmetry

There is still one more small, but nagging, problem with this description of Galton’s development of regression and the idea of correlation. In Figure 6.13, which shows Galton’s sweet pea data, we were careful to plot the size of child seeds on the vertical $y$ axis against that of their parent seeds on the horizontal $x$ axis, as is the modern custom for a scatterplot, whose goal is to show how $y$ depends on, or varies with, $x$. Modern statistical methods that flow from Galton and Pearson are all about directional relationships, and they try to predict $y$ from $x$, not vice-versa. It makes sense to ask how a child’s height is related to that of its parents, but it stretches the imagination to go in the reverse direction and contemplate how a child’s height might influence that of its parents.

So, why didn’t Galton put child height on the $y$ axis and parent height on the $x$ axis in Figure 6.16, as one would do today? One suggestion is that such graphs were in their infancy, so the convention of plotting the outcome variable on the ordinate had not yet been established. Yet in Playfair’s time-series graphs (Plate 10) and in all other not-quite-scatterplots such as Halley’s (Figure 6.2), the outcome variable was always shown on the $y$ axis.

The answer is surely that Galton’s Figure 6.16 started out as a table, listing mid-parent heights in the rows and heights of children in the columns. Parent height was the first grouping variable, and he tallied the heights of their children in the columns.

In a table, the rows are typically displayed in increasing order (of $y$) from top to bottom; a plot does the reverse, showing increasing values of $y$ from bottom to top. Hence, it seems clear that Galton constructed his Table I (Figure 6.14) and figures based on it (Figure 6.15 and Figure 6.16) as if he thought of them as plots.

## Some Remarkable Scatterplots

As Galton’s work shows, scatterplots had advantages over earlier graphic forms: the ability to see clusters, patterns, trends, and relations in a cloud of points. Perhaps most importantly, they allowed the addition of visual annotations (point symbols, lines, curves, enclosing contours, etc.) to make those relationships more coherent and tell more nuanced stories. This 2D form of the scatterplot allows these higher-level visual explanations to be placed firmly in the foreground. John Tukey later expressed this as, “The greatest value of a picture is when it forces us to notice what we never expected to see” (1977, p. vi).

In the first half of the twentieth century, data graphics entered the mainstream of science, and the scatterplot soon became an important tool in new discoveries. Two short examples must serve to illustrate applications in physical science and economics.

One key feature was the idea that discovery of something interesting could come from the perception, and understanding, of classifications of objects based on clusters, groupings, and patterns of similarity, rather than direct relations, linear or nonlinear. Observations shown in a scatterplot could belong to different groups, revealing other laws. The most famous example concerns the Hertzsprung-Russell (HR) diagram, which revolutionized astrophysics.
The original version of the Hertzsprung-Russell diagram, shown here in Figure 6.17, is not a graph of great beauty, but nonetheless it radically changed thinking in astrophysics by showing that scatterplots of measurements of stars could lead to a new understanding of stellar evolution.

Astronomers had long noted that stars varied, not only in brightness (luminosity), but also in color, from blue-white to orange, yellow, and red. But until the early 1900s, they had no general way to classify them or interpret variations in color. In 1905, the Danish astronomer Ejnar Hertzsprung presented tables of luminosity and star color. He noted some apparent correlations and trends, but the big picture (an interpretable classification, leading to theory) was lacking, probably because his data were displayed in tables.


## Francis Galton and the Idea of Correlation

Francis Galton [1822-1911] was among the first to show a purely empirical bivariate relation in graphical form using actual data with his work on questions of heritability of traits. He began with plots showing the relationship between physical characteristics of people (head circumference and height) or between parents and their offspring, as a means to study the association and inheritance of traits: Do tall people have larger heads than average? Do tall parents have taller than average children?

Inspecting and calculating from his graphs, he discovered a phenomenon he called “regression toward the mean,” and his work on these problems can be considered to be the foundation of modern statistical methods. His insight from these diagrams led to much more: the ideas of correlation and linear regression; the bivariate normal distribution; and eventually to nearly all of classical statistical linear models (analysis of variance, multiple regression, etc.).
The earliest known example is a chart of head circumference compared to stature from Galton’s notebook (circa 1874) “Special Peculiarities,” shown in Figure 6.11.$^{16}$ In this hand-drawn chart, the intervals of height are shown horizontally against head circumference vertically. The entries in the body are the tallies in the pairs of class intervals. Galton included the total counts of height and head circumference in the right and bottom margins and drew smooth curves to represent their frequency distributions. The conceptual origin of this chart as a table rather than a graph can be seen in the fact that the smallest values of the two variables are shown at the top left (first row and first column), rather than in the bottom left, as would be more natural in a graph.

One may argue that Galton’s graphic representations of bivariate relations were both less and more than true scatterplots of data, as these are used today. They are less because at first glance they look like little more than tables with some graphic annotations. They are more because he used these as devices to calculate and reason with.$^{17}$ He did this because the line of regression he sought was initially defined as the trace of the mean of the vertical variable $y$ as the horizontal variable $x$ varied$^{18}$ (what we now think of as the conditional mean function, $\mathcal{E}(y \mid x)$), and so required grouping at least the $x$ variable into class intervals. Galton’s displays of these data were essentially graphic transcriptions of these tables, using count-symbols (/, //, ///, ...) or numbers to represent the frequency in each cell. These are what the Princeton polymath John Tukey called “semi-graphic displays” in 1972: a visual chimera, part table, part plot.
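
The “trace of the mean” described above can be sketched directly: group $x$ into class intervals and average $y$ within each interval. This is only an illustrative sketch; the function name, the interval width, and the sample pairs are hypothetical and are not Galton’s measurements.

```python
from collections import defaultdict


def conditional_means(pairs, width):
    """Group x into class intervals of the given width and average y within each.

    Returns a sorted list of (interval midpoint, mean of y) points,
    an empirical trace of the conditional mean of y given x.
    """
    groups = defaultdict(list)
    for x, y in pairs:
        bin_index = int(x // width)  # index of the class interval containing x
        groups[bin_index].append(y)
    return [
        ((k + 0.5) * width, sum(ys) / len(ys))
        for k, ys in sorted(groups.items())
    ]


if __name__ == "__main__":
    # Hypothetical (x, y) pairs, for illustration only.
    sample = [(64.2, 65.0), (64.8, 66.0), (66.1, 66.5),
              (67.9, 67.5), (68.5, 68.0), (69.5, 68.4)]
    for midpoint, mean_y in conditional_means(sample, width=2):
        print(midpoint, round(mean_y, 2))
```

Connecting the resulting points in order of $x$ gives the kind of line Galton was after, which is why at least the $x$ variable had to be grouped.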

## Galton’s Elliptical Insight

Galton’s next step on the problem of filial correlation and regression turned out to be one of the most important in the history of statistics. In 1886, he published a paper titled “Regression Towards Mediocrity in Hereditary Stature” containing the table shown in Figure 6.14. The table records the frequencies of the heights of 928 adult children born to 205 pairs of fathers and mothers, classified by the average height of their father and mother (“mid-parent” height).$^{22}$

If you look at this table, you may see only a table of numbers with larger values in the middle and some dashes (meaning 0) in the upper left and bottom right corners. But for Galton, it was something he could compute with, both in his mind’s eye and on paper.

> I found it hard at first to catch the full significance of the entries in the table, which had curious relations that were very interesting to investigate. They came out distinctly when I “smoothed” the entries by writing at each intersection of a horizontal column with a vertical one, the sum of the entries in the four adjacent squares, and using these to work upon. (Galton, 1886, p. 254)

Consequently, Galton first smoothed the numbers in this table, which he did by the simple step of summing (or averaging) each set of four adjacent cells. We can imagine that he wrote that average number larger in red ink, exactly at the intersection of these four cells. When he had completed this task, we can imagine him standing above the table with a different pen and trying to connect the dots, drawing curves that join the points of approximately equal frequency. We tried to reproduce these steps in Figure 6.15, except that we did the last step mechanically, using a computer algorithm, whereas Galton probably did it by eye and brain, in the manner of Herschel, with the aim that the resulting curves should be gracefully smooth.
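
Galton’s smoothing step, as quoted above, amounts to replacing each interior intersection of the table with the sum of its four surrounding cells. A minimal sketch follows; the function name and the small table of counts are hypothetical and are not taken from Galton’s Table I.

```python
def smooth_four_cells(table):
    """Sum each 2x2 block of adjacent cells.

    For an r x c table this returns an (r-1) x (c-1) grid whose entry [i][j]
    belongs at the corner shared by rows i, i+1 and columns j, j+1.
    """
    rows, cols = len(table), len(table[0])
    return [
        [
            table[i][j] + table[i][j + 1] + table[i + 1][j] + table[i + 1][j + 1]
            for j in range(cols - 1)
        ]
        for i in range(rows - 1)
    ]


if __name__ == "__main__":
    # Hypothetical frequency counts, for illustration only.
    counts = [
        [1, 2, 0],
        [3, 5, 2],
        [0, 4, 1],
    ]
    for row in smooth_four_cells(counts):
        print(row)
```

Dividing each sum by four gives the averaged version. Either way the smoothed grid has one fewer row and column than the table, because its values sit at the intersections rather than in the cells.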


## John Herschel and the Orbits of Twin Stars

In the hundred years from 1750 to 1850, during which most of the modern graphic forms were invented, fundamentally important problems of measurement attracted the best mathematical minds, including Euler, Laplace, Legendre, Newton, and Gauss, and led to the inventions of calculus, least squares, curve fitting, and interpolation.$^{8}$ In these scientific and mathematical domains, graphs had begun to play an increasing role in the explanation of scientific phenomena, as we described earlier in the case of Johann Lambert.

Among this work, we find the remarkable paper of Sir John Frederick W. Herschel [1792-1871], On the Investigation of the Orbits of Revolving Double Stars, which he read to the Royal Astronomical Society on January 13, 1832, and published the next year. Double stars had long played a particularly important role in astrophysics because they provided the best means to measure stellar masses and sizes, and this paper was prepared as an addendum to another 1833 paper, in which Herschel had meticulously cataloged observations on the orbits of 364 double stars.

The printed paper refers to four figures, presented at the meeting. Alas, the version printed in the Memoirs of the Royal Astronomical Society did not include them, presumably owing to the cost of engraving. Herschel noted, “The original charts and figures which accompanied the paper being all deposited with the Society.” These might have been lost to historians, but Thomas Hankins discovered copies of them in research for his insightful 2006 paper on Herschel’s graphical method.

To see why Herschel’s paper is remarkable, we must follow his exposition of the goals, the construction of scatterplots, and the idea of visual smoothing to produce a more satisfactory solution for parameters of the orbits of double stars than was available by analytic methods.

## Herschel’s Graphical Impact Factor

The critical reader may object, thinking that Herschel’s graphical method, as ingenious as it might be, did not produce true scatterplots in the modern sense because the horizontal axis in Figure 6.9 is time rather than a separate variable. Thus one might argue that all we have is another time-series graph, so priority really belongs to Playfair, or further back, to Lambert, who stated the essential ideas. On the surface this is true.

But it’s only true on the surface. We argue that a close and appreciative reading of Herschel’s description of his graphical method can, at the very least, be considered a true innovation in visual thinking, worthy of note in the present account. More importantly, Herschel’s true objective was to calculate the parameters of the orbits of twin stars based on the relation between position angle and separation distance; the use of time appears in the graph as a proxy or indirect means to overcome the scant observations and perhaps extravagant errors in the data on separation distance.

Yet Herschel’s graphical development of calculation based on a scatterplot attracted little attention outside the field of astronomy, where his results were widely hailed as groundbreaking in the Royal Astronomical Society. But this notice was for his scientific achievement rather than for his contribution of a graphical method, which scientists probably rightly considered just a means to an end.

It took another 30-50 years for graphical methods to be fully welcomed into the family of data-based scientific explanation, and seen as something more than mere illustrations. This change is best recorded in presentations at the Silver Jubilee of the Royal Statistical Society in 1885. Even at that time, most British statisticians still considered themselves “statists,” mere recorders of statistical facts in numerical tables; but “graphists” had finally been invited to the party.

On June 23, the influential British economist Alfred Marshall [1842-1924] addressed the attendees on the benefits of the graphic method, a radical departure for a statist. His French counterpart Émile Levasseur [1828-1911] presented a survey of the wide variety of graphs and statistical maps then in use. Yet even then, the scientific work of Lambert and Herschel, and the concept of the scatterplot as a new graphical form remained largely unknown. This would soon change with Francis Galton.


## The Broad Street Pump

Snow’s opportunity to test his theory came with the new eruption that began toward the end of August in 1854. His celebrated 1855 report, On the Mode of Communication of Cholera,$^{16}$ describes it dramatically:

> The most terrible outbreak of cholera which ever occurred in this kingdom, is probably that which took place in Broad Street, Golden Square, and the adjoining streets, a few weeks ago. Within two hundred and fifty yards of the spot where Cambridge Street joins Broad Street, there were upwards of five hundred fatal attacks of cholera in ten days. The mortality in this limited area probably equals any that was ever caused in this country, even by the plague; and it was much more sudden, as the greater number of cases terminated in a few hours. (p. 38)

The full story of Snow’s discovery of the waterborne cause of cholera has been told in rich detail many times, by medical historians$^{17}$ and cartographers,$^{18}$ and it was brought to the attention of statisticians and those interested in the history of data visualization by Edward Tufte.$^{19}$

The short, if slightly apocryphal, version of this story is that, during the outbreak of cholera in Soho in 1854, Snow created a dot map of the locations of deaths and immediately noticed that they clustered on Broad Street, near the site of one of the public pumps from which residents drew their water. This narrative continues: Snow recognized that cases of death were strongly associated with drinking water from this pump. He petitioned the Board of Guardians of St. James Parish to remove the pump handle, whereupon the cholera epidemic subsided.

## The Neighborhoods Map

The version of Snow’s map shown in Figure 4.7 is the most famous, but a second version is more interesting graphically and scientifically. The Cholera Inquiry Committee, appointed by the Vestry of St. James, submitted its report on July 25, 1855.$^{22}$ The section titled “Dr. Snow’s Report” contained a new map that attempted a more detailed and direct visual analysis of the association of death with the Broad Street pump.

This new map, shown in Figure 4.8, states and tests a geospatial hypothesis: people are most likely to draw their water from the nearest pump (by walking distance). The outlined region in this map “shews the various points which have been found by careful measurement to be at an equal distance by the nearest road from the pump in Broad Street and the surrounding pumps.”$^{23}$

If the source of the outbreak was indeed the Broad Street pump, one should expect to find the highest concentration of deaths within this area, and also a low prevalence outside it. Snow stated his conclusion as “it will be observed that the deaths either very much diminish, or cease altogether, at every point where it becomes decidedly nearer to send to another pump than to the one in Broad street.”$^{24}$
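
The nearest-pump hypothesis can be sketched as a simple classification: assign each death to whichever pump is closest, then count. The sketch below makes two loud assumptions. It uses straight-line distance as a stand-in for Snow’s walking distance along roads, and the coordinates and the pump names other than Broad Street are hypothetical, not taken from his map.

```python
import math


def nearest_pump(location, pumps):
    """Return the name of the pump closest to `location`.

    Assumption: straight-line distance stands in for walking distance by road.
    """
    return min(pumps, key=lambda name: math.dist(location, pumps[name]))


def deaths_by_nearest_pump(deaths, pumps):
    """Count deaths according to which pump is nearest to each death's location."""
    counts = {name: 0 for name in pumps}
    for location in deaths:
        counts[nearest_pump(location, pumps)] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical coordinates, for illustration only.
    pumps = {"Broad Street": (0, 0), "Pump B": (10, 0), "Pump C": (0, 10)}
    deaths = [(1, 1), (2, 0), (0, 3), (9, 1), (4, 4)]
    print(deaths_by_nearest_pump(deaths, pumps))
```

If the hypothesis holds, the count for the suspected pump should dominate, and deaths should thin out in the regions nearer to other pumps. Reproducing Snow’s boundary faithfully would require distances measured along the street network rather than the shortcut used here.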

The final explanation for the source of the outbreak came slightly later, through the work of Reverend Henry Whitehead, the curate at a local church and a member of the Cholera Inquiry Committee. He identified the first (“index”) case as the death of a five-month-old infant, Frances Lewis, whose family lived at 40 Broad Street, immediately adjacent to the pump. When severe diarrhea struck the child, her mother, Sarah Lewis, soaked the diapers and emptied the pails into the cesspool at the front of their house, only three feet from the well. Unfortunately, the cesspool walls had decayed and the effluent flowed directly into the pump well. Thomas Lewis, the baby’s father and a local constable, suffered a fatal attack of the disease on September 8, the same day that the pump handle was removed.

## The Transcendent Effect of Water

Farr was certainly meticulous in evaluating the impact of potential causes on mortality from cholera. But he lacked an effective method for doing so, even for one potential cause, and the idea of accounting for the combination of several causes stretched him to the limit. His general method was to prepare tables of cholera mortality in the districts of London, broken down and averaged over classes of a possible explanatory variable.

For example, Farr divided the 38 districts into the 19 highest and 19 lowest values on other variables and calculated the ratio of cholera mortality for each; elevation had the largest ratio (3:1), while all other variables showed smaller ratios (e.g., 2.1:1 for house values). Having hit on elevation above the Thames as his principal cause, he prepared many other tables showing mortality by districts also in relation to density of the population, value of houses and shops, relief to the poor, and geographical features.
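
Farr’s comparison can be sketched as follows: rank the districts on a candidate variable, split them into a lower and an upper half, and take the ratio of average cholera mortality between the halves. The text does not say exactly how Farr averaged or which half he put in the numerator, so the unweighted mean and the lower-over-upper ratio below are assumptions, and the four districts are hypothetical.

```python
def mean_mortality(group):
    """Unweighted mean of the "mortality" field (an assumption about Farr's averaging)."""
    return sum(d["mortality"] for d in group) / len(group)


def high_low_mortality_ratio(districts, variable):
    """Split districts into lower and upper halves on `variable`.

    Returns mean mortality in the lower half divided by mean mortality in the
    upper half. With an odd number of districts the middle one is left out.
    """
    ranked = sorted(districts, key=lambda d: d[variable])
    half = len(ranked) // 2
    lower, upper = ranked[:half], ranked[-half:]
    return mean_mortality(lower) / mean_mortality(upper)


if __name__ == "__main__":
    # Hypothetical districts, for illustration only (not Farr's figures).
    districts = [
        {"elevation": 2, "mortality": 120},
        {"elevation": 10, "mortality": 90},
        {"elevation": 50, "mortality": 40},
        {"elevation": 100, "mortality": 30},
    ]
    print(high_low_mortality_ratio(districts, "elevation"))
```

Repeating the calculation for each candidate variable and comparing the ratios mirrors the one-variable-at-a-time reasoning described above, which also shows its limit: it says nothing about how the variables act in combination.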

Figure 4.5 illustrates the depth of this inquiry, in an ingenious semigraphic combination of small tables for each district overlaid on a schematic map of their spatial arrangement along the Thames. The tables show the numbers for elevation, cholera deaths, deaths from all causes, and population density, and identify the water companies supplying each district. Unfortunately, this lovely diagram concealed more than it revealed: the signal was there, but the wealth of detail provided too much noise.

It would later turn out that the direct cause of cholera could be traced to contamination of the water supply from which people drew. It was probably confusing that water was provided by nine water companies, as Farr shows in Figure 4.5, so he divided the registration districts into three groups based on the region along the Thames for their water supply: Thames, between Kew and Hammersmith bridges (western London), between Battersea and Waterloo bridges (central London), and districts that obtained their water from tributaries of the Thames (New River, Lea River, and Ravensbourne River).

## John Snow on Cholera

Another terrible wave of cholera struck London toward the end of summer 1854, concentrated in the parish of St. James, Westminster (the present-day district of Soho). This time, a correct explanation of the cause would eventually be found with the aid of meticulous data collection, a map of disease incidence, keen medical detective work, and logical reasoning to rule out alternative explanations. It is useful to understand why John Snow succeeded while William Farr did not.

The physician John Snow [1813-1858] lived in the Soho district at the time of this new outbreak. He had been an eighteen-year-old medical assistant in Newcastle upon Tyne in 1831 when cholera first struck there with great loss of life. At the time of the second great epidemic, in 1848-1849, Snow observed the severity of the disease in his district. In 1849, in a two-part paper in the Medical Gazette and Times ${ }^{11}$ and a longer monograph ${ }^{12}$ he proposed that cholera was transmitted by water rather than through the air and passed from person to person through the intestinal discharges of the sick, either transmitted directly or entering the water supply.

Snow’s reasoning was entirely that of a clinician based on the form of pathology of the disease, rather than that of a statistician seeking associations with potential causal factors. Had cholera been an airborne disease, one would expect to see its effects in the lungs and then perhaps spread to others by respiratory discharge. But the disease clearly acted mainly in the gut, causing vomiting, intense diarrhea, and the massive dehydration that led to death. Whatever causal agent was responsible, it must have been something ingested rather than something inhaled.

William Farr was well aware of Snow’s theory when he wrote his 1852 report.${ }^{13}$ He described it quite politely but rejected Snow’s theory of the pathology of cholera. He could not understand any mechanism whereby something ingested by one individual could be passed to a larger community. To Farr, who was then considered the foremost authority on the outbreak and contagion of cholera, Snow’s contention of a single causal agent (some unknown poisonous matter, materies morbi) and a limited vector of transmission (water) was too circumscribed, too restrictive. Snow presented his argument and the evidence to support it as if ingestion and waterborne transmission could be the only causes; he also lacked the crucial data, either from a natural experiment or from direct knowledge of the water that cholera victims drank.

## Vital Statistics

In the previous chapter we explained how concerns in France about crime led to the systematic collection of social data. This combination of important social issues and available data led Guerry to new developments involving data display in graphs, maps, and tables.

A short time later, an analogous effort began in the United Kingdom, in the context of social welfare, poverty, public health, and sanitation. These efforts produced two new heroes of data visualization, William Farr and John Snow, who were influential in the attempt to understand the causes of several epidemics of cholera and how the disease could be mitigated.

In the United Kingdom the Age of Data can be said to have begun with the creation of the General Register Office (GRO) by an Act of Parliament in 1836. ${ }^1$ The initial intent was simply to track births and deaths in England and Wales as the means of ensuring the lawful transfer of property rights between generations of the landed gentry.

But the 1836 act did much more. It required that every single child of an English parent, even those born at sea, have the particulars reported to a local registrar on standard forms within fifteen days. It also required that every marriage and death be reported and that no dead body could be buried without a certificate of registration, and it imposed substantial fines (£10-£50) for failure in this reporting duty. The effect was to create a complete database of the entire population of England, which is still maintained by the GRO today.

The following year, William Farr [1807-1883], a 30-year-old physician, was hired, initially to handle the vital registration of live births, deaths, marriages, and divorces for the upcoming Census of 1841. After he wrote a chapter ${ }^2$ on “Vital statistics; or, the statistics of health, sickness, diseases, and death,” he was given a new post as the “compiler of scientific abstracts”, becoming the first official statistician of the UK.

Like Guerry at the Ministry of Justice in France, Farr had access to, and had to make sense of, a huge mountain of data. Farr quickly realized that these data could serve a far greater purpose: saving lives. Life expectancy could be broken down and compared over geographic regions, down to the county level. Information about the occupations of deceased persons was also recorded, so Farr could also begin to tabulate life expectancy according to economic and social station. Information about the cause of death was lacking, and Farr probably exceeded his initial authority by adding instructions to list the cause(s) of death on the standard form. This simple addition opened a vast new world of medical statistics and public health that would eventually be called epidemiology, involving the study of patterns of incidence, causes, and control of disease conditions in a population.

## Farr’s Diagrams

Figure 4.1 is one of five lithographed plates (three in color) that appear in Farr’s report. Farr takes many liberties with the vertical scales (we would now call these graphical sins) to try to show any relation between the daily numbers of deaths from cholera and diarrhea and the meteorological data on those days. Most apparent are the spikes of cholera deaths in August and September. Temperature was also elevated, but perhaps no more than in the adjacent months. The weather didn’t seem to be a sufficient causal factor in 1849. Or was it? Plate 2 takes a longer view, showing the possible relationship between temperature and mortality for every week over the eleven years from 1840 to 1850. This is a remarkable chart, a new invention in the language of statistical graphs. This graphical form, now called a radial diagram (or windrose), is ideally suited to showing and comparing several related series of events having a cyclical structure, such as weeks or months of the year or compass directions. The radial lines in Plate 2 serve as axes for the fifty-two weeks of each year. The outer circles show the average weekly number of deaths (corrected for increase in population) in relation to the mean number of deaths over all years. When these exceed the average, the area is shaded black (excess mortality); they are shaded yellow when they are below the average (salubrity).
Similarly, the inner circles show average weekly temperature against a baseline of the mean temperature ($48^{\circ}\mathrm{F}$) of the seventy-nine years from 1771 to 1849. Weeks exceeding this average are outside the baseline circle and shaded red, while those weeks that were colder than average are said to be shaded blue (but appear as gray).

In this graph we can immediately see that something very bad happened in London in summer 1849 (row 3, column 2), leading to a huge spike in deaths from July through September, and the winter months in 1847 (row 2, column 3) also stand out. This larger view, using the idea later called “small multiples” by Tufte,${ }^6$ does something more, which might not be noticed in a series of separate charts: it shows a general pattern across years of fewer deaths on average in the warmer months of April (at 9:00) through September (at 3:00), but the dramatic spikes point to something huge that cannot be explained by temperature.

## Reconstruct Test Data

We try to reconstruct the test data using the lower dimensional representation $y$ (of the point $x$) such that
$$x^{\prime}=U y$$
We know that $y=U^{T} x$. Substituting $U^{T} x$ for $y$ in the equation above, we get
$$x^{\prime}=U U^{T} x$$
We also know that $U=X V D^{-1}$. Substituting $X V D^{-1}$ for $U$ and $D^{-1} V^{T} X^{T}$ for $U^{T}$ gives
$$\begin{gathered} x^{\prime}=X V D^{-1} D^{-1} V^{T} X^{T} x \\ x^{\prime}=X V D^{-2} V^{T} X^{T} x \end{gathered}$$
Hence, the test data $x^{\prime}$ can be reconstructed from the lower dimensional representation $y$.
Dual PCA is a variant of PCA used when the number of features is greater than the number of data points. Since it is just a variant of PCA, it shares all the advantages and limitations of PCA.
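
The reconstruction formula can be checked numerically. The following is a minimal sketch, not a reference implementation: it assumes NumPy, synthetic random data, columns of $X$ as training points, and a test point that is already centered. The sizes `d`, `n`, and `k` are illustrative choices, not values from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions): more features than data points.
d, n, k = 50, 10, 3

# Columns of X are training points; center each feature (row).
X = rng.normal(size=(d, n))
X = X - X.mean(axis=1, keepdims=True)

# Dual PCA works with the small n x n matrix X^T X = V D^2 V^T.
eigvals, eigvecs = np.linalg.eigh(X.T @ X)
order = np.argsort(eigvals)[::-1][:k]   # indices of the k largest eigenvalues
V = eigvecs[:, order]                   # n x k
D2 = eigvals[order]                     # squared singular values, length k

# A synthetic test point, assumed to be centered already.
x = rng.normal(size=d)

# Dual form: x' = X V D^{-2} V^T X^T x (U is never formed).
x_rec_dual = X @ (V @ ((V.T @ (X.T @ x)) / D2))

# Primal form for comparison: U = X V D^{-1}, then x' = U U^T x.
U = X @ V / np.sqrt(D2)
x_rec_primal = U @ (U.T @ x)

print(np.allclose(x_rec_dual, x_rec_primal))
```

Both routes give the same reconstruction, but the dual form only needs the eigendecomposition of an $n \times n$ matrix, which is the cheaper option when the number of features is much larger than the number of data points.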

## Explanation and Working

After understanding the concept of dimensionality reduction and a few algorithms for the same, let us now examine some plots given in Figure 4.1. What is the true dimensionality of these plots?

All dimensionality reduction techniques are based on the implicit assumption that the data lie along some low dimensional manifold. This is the case for the first three examples in Figure 4.1, which lie along a one-dimensional manifold even though they are plotted in a two-dimensional plane. In the fourth example in Figure 4.1, the data have been randomly plotted on a two-dimensional plane, so dimensionality reduction without losing information is not possible.

For the first two examples, we can use Principal Component Analysis (PCA) to find the approximate lower dimensional linear subspace. However, PCA will not help in the third example, because the structure is nonlinear and PCA only aims at finding a linear subspace, nor in the fourth, where there is no lower dimensional structure to find. There are, however, ways to find nonlinear lower dimensional manifolds.

Any form of linear projection to one dimension on this nonlinear data will result in linear principal components and we might lose information about the original dataset. This is because we need to consider nonlinear projection to one dimension to obtain the manifold on which the data points lie. So, how do we modify the PCA algorithm to solve for the nonlinear subspace in which the data points lie? In short, how do we make PCA nonlinear?

This is done using an idea similar to Support Vector Machines. Instead of projecting the original two-dimensional data points onto one dimension using linear projections, we first write the data points as points in a higher dimensional space. For example, say we write every two-dimensional point $x_{t}=\left(X_{t}, Y_{t}\right)$ as a 3-dimensional point given by the mapping $\Phi$ as
$$\Phi\left(x_{t}\right)=\left(X_{t}, Y_{t}, X_{t}^{2}+Y_{t}^{2}\right)$$
After this, instead of doing PCA on the original dataset, we perform PCA on $\Phi\left(x_{1}\right)$, $\Phi\left(x_{2}\right), \ldots, \Phi\left(x_{n}\right)$. This process is known as Kernel PCA. So, the basic idea of Kernel PCA is to take the original data set and implicitly map it to a higher dimensional space using the mapping $\Phi$. Then we perform PCA in this space, which is a linear projection in the higher dimensional space that already captures the nonlinearities in the original dataset $[1,2,3]$.
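
To make the mapping concrete, here is a small sketch that applies $\Phi$ explicitly and then runs ordinary PCA on the mapped points. It is only an illustration under stated assumptions: it uses NumPy, a synthetic dataset of two concentric rings (not a dataset from the text), and an explicit feature map rather than the kernel trick that a full Kernel PCA implementation would use. The helper names `phi`, `first_pc_scores`, and `separated` are hypothetical.

```python
import numpy as np

def phi(points):
    """Map each 2-D point (X, Y) to the 3-D point (X, Y, X^2 + Y^2)."""
    x, y = points[:, 0], points[:, 1]
    return np.column_stack([x, y, x**2 + y**2])

def first_pc_scores(data):
    """Project centered data onto its first principal direction."""
    centered = data - data.mean(axis=0)
    # Rows of vt are principal directions, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[0]

# Synthetic data (an assumption): two concentric rings of radius 1 and 3.
m = 100
theta = np.linspace(0.0, 2.0 * np.pi, m, endpoint=False)
inner = np.column_stack([np.cos(theta), np.sin(theta)])
outer = 3.0 * inner
points = np.vstack([inner, outer])

def separated(scores):
    """True if the 1-D scores of the two rings do not overlap."""
    a, b = scores[:m], scores[m:]
    return a.max() < b.min() or b.max() < a.min()

linear_scores = first_pc_scores(points)        # PCA on the original 2-D points
mapped_scores = first_pc_scores(phi(points))   # PCA after applying phi

print("linear PCA separates rings:", separated(linear_scores))
print("PCA after phi separates rings:", separated(mapped_scores))
```

In this sketch the third coordinate $X^{2}+Y^{2}$ is the squared radius, so after the mapping the two rings differ along a single linear direction and the first principal component picks it up. No linear projection of the original two-dimensional points can do this, because the rings overlap in every direction.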


