### Exploratory Data Analysis and Data Summarization

The purpose of this chapter is to introduce you to working with and manipulating data in R, exploring data, and plotting. I strongly believe in learning by doing, so let’s start doing some things so you can start learning!

In this book, we will use data from an experiment that I conducted when I was a postdoc at the Smithsonian Tropical Research Institute in Panama in 2010 and that was published in the journal Ecology in 2013 (https://www.jstor.org/stable/23436298). The experiment was part of a National Science Foundation (NSF) funded project awarded to Drs. Karen Warkentin (Boston University) and James Vonesh (Virginia Commonwealth University) studying the effects of flexible hatching timing by red-eyed treefrog (Agalychnis callidryas) embryos on interactions with predators and food levels and on the subsequent phenotype development of tadpoles.

To follow along with the examples in this chapter and the rest of the book, you should download a .csv file titled “RxP.csv” from the GitHub page for this book (https://github.com/jtouchon/Applied-Statistics-with-R). The data are called by the short name “RxP,” which stands for “Resource-by-Predation,” the nature of the experiment (we were studying the interaction of resources and predators). This brings up a chance to reiterate a small but important point: since R is entirely based on typing commands by hand, you should give your datasets and variables short names so that they are quick and easy to type.

First, let’s get a handle on what the data are.

## READING IN THE DATA FILE

If you are loading a data file from somewhere on your computer, you will read it into the active workspace with the command read.csv(). If your dataset has categorical variables, you will want to include the argument stringsAsFactors = TRUE, which tells the function to automatically convert any columns that contain character data into factors. If the data are in your working directory, you simply pass the file name, in quotes, to read.csv() (and remember, you should be working in a script window!). Note that you have to assign the data to an object. What happens if you do not? What is the working directory, you ask? It is the directory that R will look in by default. To find out where R is looking, type getwd() at the prompt.
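As a minimal sketch of the import step, the code below writes a tiny stand-in file and reads it back. The file name demo.csv and the columns treatment and mass_g are invented for illustration and are not the contents of RxP.csv; the commented line at the end assumes you have downloaded RxP.csv into your working directory.

```r
# Hypothetical stand-in data file, written to a temporary folder.
# The column names and values are made up for this illustration.
demo_path <- file.path(tempdir(), "demo.csv")
writeLines(c("treatment,mass_g",
             "control,0.41",
             "predator,0.28",
             "nonlethal,0.35"),
           demo_path)

# Where is R looking by default?
getwd()

# Read the file and assign the result to an object.
# Without the assignment, R only prints the data to the console
# and nothing is stored in the workspace.
demo <- read.csv(demo_path, stringsAsFactors = TRUE)

# treatment is now a factor, mass_g is numeric
str(demo)

# For the real dataset, assuming RxP.csv sits in the working directory:
# RxP <- read.csv("RxP.csv", stringsAsFactors = TRUE)
```

Writing TRUE in full rather than T is the safer habit, because T is an ordinary variable that can be overwritten, whereas TRUE is a reserved word.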

You can set your working directory with the function setwd(), where you put a path to a folder on your computer inside the parentheses. Make sure to put the path in quotes. Some folks like to set a working directory for each project they have. Others prefer to keep the working directory in a single place and just code the path to certain files. Choose whichever works for you. If your data are not in your working directory, you will need to specify exactly where to find the file on your computer.
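The two approaches can be compared in a short sketch. The folder here is R's temporary directory, standing in for a project folder, and the file name wd_demo.csv is made up for the example.

```r
# Remember the current working directory so it can be restored later.
old_wd <- getwd()

# Hypothetical project folder; tempdir() stands in for a real path
# such as a folder in your Documents directory.
project_dir <- tempdir()

# Approach 1: set the working directory, then use a bare file name.
setwd(project_dir)
write.csv(data.frame(x = 1:3), "wd_demo.csv", row.names = FALSE)
a <- read.csv("wd_demo.csv")

# Approach 2: leave the working directory alone and give the full path.
setwd(old_wd)
b <- read.csv(file.path(project_dir, "wd_demo.csv"))

# Both approaches read the same file.
identical(a, b)
```

Building the path with file.path() avoids having to type the separator by hand, which differs between operating systems.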

## DATA EXPLORATION AND ERROR CHECKING

Whenever you start working with a dataset in R, you should first devote substantial time to checking it for errors. Questions you should ask yourself include:

• Did the data import correctly?
• Are the column names correct?
• Are the types of data appropriate? (e.g., factor vs numerical)
• Are the numbers of columns and rows appropriate?
• Are there typos?

If, for example, a column that is supposed to be numerical shows up as a factor, that likely indicates a typo where you accidentally have text in place of a number (remember, each column in a data frame is a vector, and vectors can only have one mode, so a vector with both numbers and characters is treated as if it is all characters). Similarly, if you have a factor that should have 3 categories but imports with 4, you likely have a typo (e.g., “predator” vs “predtaor”), and the misspelled version is showing up as a separate category. These sorts of mistakes are very common!
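The checks in this list can be sketched on a small, deliberately broken dataset. Everything in it is invented for illustration (the columns treatment and mass_g, the values, and the two planted typos); it is not the RxP data.

```r
# Hypothetical raw data with two planted errors:
# a misspelled category and a letter "O" typed in place of a zero.
raw <- paste(c("treatment,mass_g",
               "control,0.41",
               "predator,0.28",
               "predtaor,0.35",
               "nonlethal,O.30"),
             collapse = "\n")
demo <- read.csv(text = raw, stringsAsFactors = TRUE)

# Did the data import correctly, and are the types appropriate?
str(demo)      # mass_g shows up as a factor, not a number
names(demo)    # column names
dim(demo)      # number of rows and columns

# Are there typos in the categories?
levels(demo$treatment)    # the misspelling appears as its own level
nlevels(demo$treatment)   # one more level than expected

# Which rows hold entries that cannot be converted to numbers?
mass_chr <- as.character(demo$mass_g)
bad <- is.na(suppressWarnings(as.numeric(mass_chr)))
demo[bad, ]

# Fix the misspelled category and drop the now-empty level.
demo$treatment[demo$treatment == "predtaor"] <- "predator"
demo$treatment <- droplevels(demo$treatment)
levels(demo$treatment)
```

In practice it is usually better to correct typos like these in the original data file and re-import, so that the fix does not have to be repeated in every script.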

Because this dataset has been thoroughly examined (very thoroughly!), these types of errors are not present. However, you might want to change the names of columns or remove outliers, which we will cover in the subsequent sections.


