Biomolecular interaction network database

The biomolecular interaction network database (BIND) [2] contains protein interactions annotated with molecular function information extracted from the literature. It is based on three main types of data records: interaction, molecular complex, and pathway. An interaction record stores a description of the reaction event between two objects. A molecular complex is stored as the temporally ordered set of interactions that produce it. When the reactions generating a complex are unknown, the complex is defined more loosely. A pathway, defined as a network of interactions that usually mediates some cellular function, is described as a series of reactions together with information such as cell cycle and associated phenotypes.

The database permits different modes of search: using identifiers from other biological databases, or by using specific fields, such as literature information, molecule structure, and gene information, including functions. The extracted information can be displayed with a BIND interaction viewer. Networks are rendered as graphs, where nodes, representing molecules, are labeled with some ontological information.

The IntAct database [12] is an interaction database built entirely on open-source software. It contains not only protein interaction data, but also DNA and molecular interaction data. IntAct uses a set of controlled vocabularies and ontologies to provide a semantically consistent annotation method. A researcher can submit an interaction in the PSI-MI format [13] by sending an e-mail to the database curators.

The human connectome project

The human connectome project (HCP) [27] is a large project that aims to give the community insight into brain connectivity, function, and variability among individuals. The HCP is an effort of more than 5 years, based on a data acquisition plan and a subsequent analysis pipeline run by a consortium of investigators. It focuses on a cohort of 1200 subjects (twins and their nontwin siblings) studied with multiple modalities (i.e., diffusion imaging, functional MRI, weighted MRI, electroencephalography, and behavioral and genetic data).

Bringing together multiple magnetic resonance imaging modalities from different laboratories has been one of the significant challenges of the HCP. The consortium therefore developed a template pipeline for acquiring and storing data, described in [9]. It is based on a set of minimal preprocessing pipelines that all participants must follow to accomplish many low-level tasks. This allows data interchange and, more importantly, an easy comparison among different connectomes, while reducing both storage and processing requirements.

Starting from the data of the human connectome project, Kerepesi et al. [18] computed structural connectomes of 426 human subjects. For each individual, they used five different resolution scales (83, 129, 234, 463, and 1015 nodes) and many edge weights. All data are available for download in the GraphML language, and the authors also provide anatomically relevant annotations. For a subset of subjects, the authors also offer the anatomical classification of subgraphs for some regions of interest of the brain.
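As a rough illustration of what working with such GraphML files involves, the following minimal Python sketch parses a small GraphML string with the standard library. The three-node graph, the key id `w`, and the helper name `read_graphml` are made up for this example; they are not taken from the published connectome files, whose attribute names may differ.

```python
import xml.etree.ElementTree as ET

# A tiny, made-up GraphML document (not real connectome data).
GRAPHML = """<graphml xmlns="http://graphml.graphdrawing.org/xmlns">
  <key id="w" for="edge" attr.name="weight" attr.type="double"/>
  <graph id="toy" edgedefault="undirected">
    <node id="n0"/>
    <node id="n1"/>
    <node id="n2"/>
    <edge source="n0" target="n1"><data key="w">12.0</data></edge>
    <edge source="n1" target="n2"><data key="w">3.5</data></edge>
  </graph>
</graphml>"""

# GraphML elements live in this XML namespace.
NS = {"g": "http://graphml.graphdrawing.org/xmlns"}


def read_graphml(text):
    """Return (node ids, list of (source, target, weight)) from a GraphML string."""
    root = ET.fromstring(text)
    graph = root.find("g:graph", NS)
    nodes = [n.get("id") for n in graph.findall("g:node", NS)]
    edges = []
    for e in graph.findall("g:edge", NS):
        data = e.find("g:data[@key='w']", NS)
        # Edges without a weight entry default to 1.0 (an assumption of this sketch).
        weight = float(data.text) if data is not None else 1.0
        edges.append((e.get("source"), e.get("target"), weight))
    return nodes, edges


nodes, edges = read_graphml(GRAPHML)
print(len(nodes), "nodes;", len(edges), "edges")
```

For real files, a graph library with a GraphML reader is usually more convenient; the hand-written parser is only meant to show the structure of the format.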

Genetic interaction network databases

The number of reported genetic interactions is relatively small in comparison to other biological networks. This may be due to the involvement of various indirect factors that determine true physical interactions, which makes it impossible to elucidate the true relationship from a single source of information. The majority of the interactions are predicted in silico and then reported in the databases. Microarray data and RNA sequence reads are the most popular data sources for predicting such interactions. However, the predictions are sensitive to the quality, reliability, and availability of the data. The interactions also depend largely on the merit of the inference method used. Below, we discuss a few databases dealing with gene-gene relationship networks.

Transcriptional regulatory relationships unraveled by sentence-based text mining (TRRUST) [10] is a database of human and mouse transcriptional regulatory networks. It comprises 8,444 and 6,552 transcription factor (TF)-target regulatory relationships for 800 human TFs and 828 mouse TFs, respectively. The transcriptional regulatory element database [29] consists of a number of promoters and genes of human, mouse, and rat. This database focuses on gene regulatory networks (GRNs) for the TF-target gene pairs involved in cancer. It also offers genome-wide promoter annotation and gene transcriptional regulation information, and it provides a user-friendly interface for extracting data for all three species.

The biological general repository for interaction datasets (BioGRID) [26] contains interactions from 70 different organisms, such as the horse, tomato, and castor bean. This repository searches 71,178 publications for 1,753,686 protein and genetic interactions, 28,093 chemical associations, and 874,796 posttranslational modifications from major model organism species.

Protein-protein network databases

The management of protein-protein interaction (PPI) data presents issues similar to those faced in other domains, i.e., PPI data need to be stored, exchanged, queried, and analyzed. PPI data are the constitutive building blocks of protein interaction networks (PINs). This section discusses the main phases and issues of PPI data management [6].

Regarding PPI data storage, the main efforts have been devoted to defining standards for data exchange, such as HUPO PSI-MI. Currently, however, PPI data are stored as large sets of binary interactions, without taking advantage of XML-based languages and related XML databases. The storage of PPI data could exploit storage systems already developed for other graph-based data, such as the triple stores used for RDF data or the emerging graph databases, in which data are modeled as graphs and data manipulation is expressed by graph-oriented operations. A graph database proposal for genomics has been reported, and a project for biochemical pathways is reported in [7].
Moreover, a naming mechanism that identifies interactions uniquely has not yet been developed, and (binary) interactions are named by naming the interacting proteins.

PPI data querying could also benefit from semi-structured or graph databases, as summarized below; existing PPI databases offer only very simple retrieval mechanisms, which return the proteins interacting with a target protein. The PPI databases surveyed in this paper do not offer sophisticated query mechanisms based on graph manipulation, but, on the other hand, they constitute the only available structured repository for interaction data and allow easy sharing and annotation of such data. Moreover, all the existing databases go beyond storing the interaction itself and integrate it with functional annotations, sequence information, and references to the corresponding genes. Finally, they generally provide visualization tools that present a subset of interactions as a comprehensive graph.

No-SQL and graph databases

Relational databases (RDBs) were developed in the early 1970s, and they rapidly became the standard for database management systems. RDBs are currently the best choice for modeling data with relational properties. More recently, however, the production of data with little structure and large size (e.g., biological network data, social network data, and, in general, big data) has been growing.

Consequently, the use of traditional RDB systems has some drawbacks: to obtain complex information from multiple relations, an RDB sometimes needs to perform expensive SQL (Structured Query Language) join operations that merge two or more relations. To mitigate this, other data storage formats have been proposed alongside the traditional one, often referred to as No-SQL (not only SQL) databases. There exist many different kinds of No-SQL databases, such as key-value, document-oriented, and time-series stores [3].
Among these, we focus here on graph databases (GdB), i.e., databases that use a graph structure for expressing queries, based on nodes and edges, with properties storing the attributes related to nodes and edges. The core of the graph database model is the graph itself, which associates data items stored as nodes through edges representing the relationships among them. Relationships link data together directly, which results in faster data retrieval (in constant time in many cases).

In a GdB, nodes represent entities, such as proteins, biological molecules, people, or patients. Each node may be seen as the translation of a row (or record) of a relational database. Similarly, edges connecting nodes represent relationships between two records, and they can be either directed or undirected. When a graph is directed, the direction of an edge generally carries meaning, so that an edge and its reverse represent different relationships. In a GdB, edges constitute the key concept, since they represent an abstraction that is not easily representable in the relational model. Each node may have a set of associated properties; e.g., in a GdB that represents a protein interaction network, a node may store the name of the protein, a cross-reference to an external database, and other biological information.
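The node, edge, and property concepts described above can be sketched in a few lines of Python. This is only an in-memory illustration under simplifying assumptions, not how any particular graph database is implemented; the class names, the protein identifiers `P1` to `P3`, and the cross-reference `EXT:0001` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    node_id: str
    properties: dict = field(default_factory=dict)


@dataclass
class Edge:
    source: str
    target: str
    properties: dict = field(default_factory=dict)
    directed: bool = False


class PropertyGraph:
    """Toy property graph: nodes and edges both carry key-value properties."""

    def __init__(self):
        self.nodes = {}      # node id -> Node
        self.adjacent = {}   # node id -> list of (neighbor id, Edge)

    def add_node(self, node_id, **properties):
        self.nodes[node_id] = Node(node_id, properties)
        self.adjacent.setdefault(node_id, [])

    def add_edge(self, source, target, directed=False, **properties):
        edge = Edge(source, target, properties, directed)
        self.adjacent[source].append((target, edge))
        if not directed:
            # An undirected edge is reachable from both endpoints.
            self.adjacent[target].append((source, edge))
        return edge

    def neighbors(self, node_id):
        # Following stored references avoids a join over a table of pairs.
        return [neighbor for neighbor, _ in self.adjacent[node_id]]


# Hypothetical protein interaction data.
ppi = PropertyGraph()
ppi.add_node("P1", name="protein 1", xref="EXT:0001")
ppi.add_node("P2", name="protein 2")
ppi.add_node("P3", name="protein 3")
ppi.add_edge("P1", "P2", confidence=0.9)                  # undirected interaction
ppi.add_edge("P2", "P3", directed=True, kind="regulates")  # directed relationship
print(ppi.neighbors("P2"))
```

Keeping the neighbor list on each node is what makes a single traversal step independent of the total number of edges in this sketch.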

Pros and cons of using No-SQL databases

The pros and cons of using a graph database instead of a relational database are thus an important research topic. Have and Jensen [11] compared a Neo4J database with PostgreSQL on the human interaction network imported from the STRING database. The network used in the experiment has 20,140 proteins and 2.2 million interactions.

Neo4J stores the edges as pointers between two nodes, thus enabling the traversal of nodes in constant time. Properties associated with nodes and edges (such as node name, confidence scores of interactions, source of communications, etc.) are stored together with the nodes and edges, since Neo4J uses the property graph model. In such a model, data are organized as nodes, relationships, and properties (data stored on the nodes or relationships). The authors [11] stored the graph in PostgreSQL as a table of node pairs, which can be searched in logarithmic or constant time, depending on the index used.

The two databases were compared by measuring the speed of Cypher and SQL queries on three problems:

• finding immediate neighbors and their interactions,
• finding the best scoring path between two proteins,
• finding the shortest path between them.
The authors measured a large speedup of the No-SQL database over the relational one. This does not necessarily imply that nonrelational databases are always the best choice. They note that when queries are formulated in terms of paths, graph databases are more concise and clear; conversely, relational databases are clearer when set operations are needed.
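To make the contrast concrete, the sketch below answers the first problem (immediate neighbors) with SQL over a table of node pairs, and the third problem (shortest path) with a traversal over an adjacency structure. It uses Python's built-in `sqlite3` and a plain dictionary as stand-ins, so it is not the PostgreSQL or Neo4J setup of [11], and the protein names and scores are made up.

```python
import sqlite3
from collections import deque

# Toy interaction list (hypothetical proteins and confidence scores).
INTERACTIONS = [("A", "B", 0.9), ("B", "C", 0.7), ("C", "D", 0.8), ("A", "E", 0.4)]

# Relational view: one row per node pair, with an index on each column.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE interaction (p1 TEXT, p2 TEXT, score REAL)")
conn.executemany("INSERT INTO interaction VALUES (?, ?, ?)", INTERACTIONS)
conn.execute("CREATE INDEX idx_p1 ON interaction (p1)")
conn.execute("CREATE INDEX idx_p2 ON interaction (p2)")


def sql_neighbors(protein):
    """Immediate neighbors as a set operation (UNION) over the pair table."""
    rows = conn.execute(
        "SELECT p2 FROM interaction WHERE p1 = ? "
        "UNION SELECT p1 FROM interaction WHERE p2 = ?",
        (protein, protein),
    )
    return sorted(row[0] for row in rows)


# Graph view: each node keeps direct references to its neighbors.
adjacency = {}
for p1, p2, _score in INTERACTIONS:
    adjacency.setdefault(p1, set()).add(p2)
    adjacency.setdefault(p2, set()).add(p1)


def shortest_path(start, goal):
    """Breadth-first search; returns one shortest path as a list, or None."""
    parent = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for nxt in sorted(adjacency.get(node, ())):
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    return None


print(sql_neighbors("A"))
print(shortest_path("E", "D"))
```

The neighbor lookup is a natural fit for SQL, whereas the path query is expressed far more directly as a traversal; writing it in plain SQL would need a join per hop or a recursive query.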

A plethora of databases, both public and private, store extensive biological experimental data maintained in various database formats. With the advent of high-throughput experimental setups and advanced database technologies, it is now possible to conveniently generate, store, and access a high volume of experimental data in various repositories. By applying data analytic and inference tools to the stored data, practical data analysis can now elucidate previously unknown biological facts.

Next, we discuss a few popular data sources for three biological networks: gene interactions, protein interactions, and brain connectomes.

Array representation

The sequential representation of a graph using an array data structure uses a two-dimensional array, or matrix, called an adjacency matrix.

Definition 2.2.1 (Adjacency matrix). Given a graph $\mathcal{G}=(\mathcal{V}, \mathcal{E})$, an adjacency matrix, say $\operatorname{Adj}$, is a square matrix of size $|\mathcal{V}| \times|\mathcal{V}|$. Each cell of $\operatorname{Adj}$ indicates whether an edge exists between a pair of vertices or nodes:
$$\operatorname{Adj}[i][j]= \begin{cases}\omega, & \text { if }\left(v_i, v_j\right) \in \mathcal{G}(\mathcal{E}) \\ 0, & \text { otherwise }\end{cases}$$
where $\omega$ is the weight of the edge between the nodes $v_i$ and $v_j$. In the case of an unweighted graph, $\omega$ is taken to be 1, whereas for a weighted graph it may be any value, according to the problem at hand. See Fig. 2.10.

Adjacency matrices of undirected graphs are symmetric, i.e., $\operatorname{Adj}[i][j]=\operatorname{Adj}[j][i]$ for all $i, j$. In other words, $\operatorname{Adj}$ and its transpose $\operatorname{Adj}^{\prime}$ are the same. Unlike an undirected graph, a digraph in general produces an asymmetric matrix.

Finding degree of a node

One of the important operations on a graph is finding the degree of a given node. From the adjacency matrix, it is easy to determine the connections of any node. The degree of a node in an undirected graph (with $\omega=1$ for every edge) can be calculated as follows:
$$\operatorname{deg}\left(v_i\right)=\sum_{j=1}^n \operatorname{Adj}[i][j],$$

where the values in the $i^{th}$ row of the adjacency matrix indicate the connections from node $v_i$ to the $n$ nodes of the graph. Similarly, in the case of a digraph, the indegree and outdegree of a node can be calculated as follows:
$$\operatorname{indeg}\left(v_i\right)=\sum_{j=1}^n \operatorname{Adj}[j][i] \quad \text { and } \quad \operatorname{outdeg}\left(v_i\right)=\sum_{j=1}^n \operatorname{Adj}[i][j].$$
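The definitions above translate almost directly into code. The following Python sketch builds an adjacency matrix from an edge list and computes degrees as row and column sums. The function names and the small example graphs are illustrative choices (they are not Fig. 2.10), indices are 0-based rather than 1-based, and the degree sums assume unweighted edges ($\omega=1$).

```python
def adjacency_matrix(n, edges, directed=False):
    """Build an n x n adjacency matrix from (i, j) or (i, j, weight) tuples."""
    adj = [[0] * n for _ in range(n)]
    for i, j, *rest in edges:
        weight = rest[0] if rest else 1  # unweighted edges get weight 1
        adj[i][j] = weight
        if not directed:
            adj[j][i] = weight  # undirected graphs give a symmetric matrix
    return adj


def degree(adj, i):
    """Degree in an unweighted undirected graph: sum of row i."""
    return sum(adj[i])


def indegree(adj, i):
    """Indegree in an unweighted digraph: sum of column i."""
    return sum(row[i] for row in adj)


def outdegree(adj, i):
    """Outdegree in an unweighted digraph: sum of row i."""
    return sum(adj[i])


# Undirected toy graph: 0-1, 1-2, 1-3.
undirected = adjacency_matrix(4, [(0, 1), (1, 2), (1, 3)])
# Directed toy graph: 0->1, 2->1, 1->3.
digraph = adjacency_matrix(4, [(0, 1), (2, 1), (1, 3)], directed=True)

print(degree(undirected, 1))                        # → 3
print(indegree(digraph, 1), outdegree(digraph, 1))  # → 2 1
```

Each degree query costs $O(n)$ time and the matrix itself needs $O(n^2)$ memory regardless of how many edges the graph has.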

List representation

Array data structures are easy to access and fast to traverse. However, for a large graph, it is not always feasible to use the adjacency matrix representation, because of its large memory requirements. It is even less efficient if the graph contains many nodes with relatively few connections or edges (a sparse graph), since this leads to a sparse matrix. In such situations, list representation is an effective alternative for the memory representation of large and sparse graphs. Another advantage of list representation is that it can be used for dynamic graphs, where vertices and edges are added and removed over time. It is commonly implemented in any programming language as an array of singly linked lists. The size of the array is the number of vertices in the graph, and each singly linked list keeps track of the neighbors of one vertex. In the case of a weighted graph, the weight of the edge between a pair of vertices is stored in the node of the singly linked list itself, as a separate entry alongside the vertex label. The degree of a vertex is easily calculated by counting the nodes in the list of that vertex. For example, the degree of the vertex $\mathbf{C}$, which is four, can be calculated by finding the length of the list headed by $\mathrm{C}$, as given in Fig. 2.11.
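A minimal Python sketch of this representation follows. As a simplification, a dictionary of Python lists stands in for the array of singly linked lists, and the class name and the small graph are hypothetical: the graph is not Fig. 2.11, although it is arranged so that vertex C also has degree four.

```python
class AdjacencyList:
    """Undirected graph stored as one neighbor list per vertex."""

    def __init__(self):
        self.lists = {}  # vertex label -> list of (neighbor label, weight)

    def add_vertex(self, v):
        self.lists.setdefault(v, [])

    def add_edge(self, u, v, weight=1):
        self.add_vertex(u)
        self.add_vertex(v)
        self.lists[u].append((v, weight))
        if u != v:
            self.lists[v].append((u, weight))

    def remove_edge(self, u, v):
        # Dynamic graphs: edges can be dropped without resizing anything.
        self.lists[u] = [(n, w) for n, w in self.lists[u] if n != v]
        if u != v:
            self.lists[v] = [(n, w) for n, w in self.lists[v] if n != u]

    def degree(self, v):
        # The degree is simply the length of the list headed by v.
        return len(self.lists[v])


g = AdjacencyList()
for neighbor in ["A", "B", "D", "E"]:
    g.add_edge("C", neighbor)

print(g.degree("C"))  # → 4
g.remove_edge("C", "E")
print(g.degree("C"))  # → 3
```

Memory use grows with the number of vertices plus the number of edges, which is why this layout suits sparse graphs better than a full matrix.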

Organization of the book

We organize our book into the following chapters:

• Chapter 2: We introduce mathematical graphs and their properties. The graph is the basis of the entire graph-theoretic modeling and analysis of biological networks. We also briefly discuss R scripting for handling graph data structures.
• Chapter 3: We discuss various algorithms popularly studied in graph theory, such as graph traversal algorithms. Power graph analysis is an important analysis method for biological networks, which we discuss with examples. Various node centrality measures are also introduced and demonstrated with the help of R scripts.
• Chapter 4: Real-world networks follow certain special topological properties, which make them different from ordinary graphs. Accordingly, they are classified into various network models. We present different models and their properties, and implement them using R packages.
• Chapter 5: We discuss publicly available databases for three kinds of biological networks. The chapter starts with a basic introduction to popular and recently used database formats. It is a useful resource for research related to biological networks.
• Chapter 6: Gene expression networks are introduced, along with the data generation sources for such networks. The discussion is divided into two parts: in silico network inference and post-inference analysis. We discuss, in the light of various algorithms, how gene network modules can be identified and how important genes in an expression network can be ranked. We also discuss various online and offline software tools for gene expression network inference and analysis.
• Chapter 7: Proteins and their physical interaction networks are vital for establishing true macromolecular connectivity in biological systems. We highlight how such interactions can be generated experimentally and predicted computationally. Protein network alignment has recently gained importance in comparative network analysis for finding evolutionarily conserved proteins, so we include it in this chapter. A few of the algorithms for functional protein complex detection are also discussed.
• Chapter 8: Finally, we introduce brain connectome networks, together with their input data sources and the present trends in brain connectome network analysis.

Basic concepts

A graph [3] is a pictorial representation of a set of objects and their associations with each other. The objects are popularly termed nodes or vertices, and the associations are depicted using interconnections between pairs of nodes, called edges. Mathematically, a graph is represented as a set of vertices and a set of edges.

Definition 2.1.1 (Graph). A graph $\mathcal{G}$ is a pair of finite sets of vertices and edges, $\mathcal{G}=(\mathcal{V}, \mathcal{E})$, such that $\mathcal{V}=\left\{v_1, v_2, \cdots, v_n\right\}$ and $\mathcal{E}=\left\{e_1, e_2, \cdots, e_m\right\}$. An edge $e_k=\left(v_i, v_j\right)$ connects vertices $v_i$ and $v_j$.

In the graph (Fig. 2.1), $\mathcal{V}=\{A, B, C, D, E, F\}$ and $\mathcal{E}=\{(A, B),(B, C),(C, D),(C, E),(E, E),(E, F),(E, D),(F, B)\}$, where each edge is an unordered pair of connected nodes. Such a graph $\mathcal{G}$ is termed an undirected graph. The node $E$ is connected with itself through a loop edge. A graph without any loop structure is called a simple graph.

A graph whose edges are ordered pairs of nodes, i.e., edges associated with directions, is called a directed graph or digraph.

Definition 2.1.2 (Directed graph). A directed graph $\mathcal{G}=(\mathcal{V}, \mathcal{E})$ is a set of vertices $\mathcal{V}$ and edges $\mathcal{E}$ such that every edge $\left(v_i, v_j\right)$ possesses a direction, denoted by an arrow. Unlike in an undirected graph, for any edge $v_i \rightarrow v_j$, $\left(v_i, v_j\right) \neq\left(v_j, v_i\right)$. The node $v_i$ is called the tail, and $v_j$ is referred to as the head, of the edge $v_i \rightarrow v_j$. For example, see Fig. 2.2.

Definition 2.1.3 (Path). A path is a sequence of distinct vertices that are connected by edges. In other words, a sequence of vertices $v_1, v_2, \cdots, v_k \in \mathcal{G}(\mathcal{V})$ is a path if every pair of consecutive vertices $v_i$ and $v_{i+1}$ has an edge $\left(v_i, v_{i+1}\right) \in \mathcal{G}(\mathcal{E})$. In the case of a directed graph, a directed path connects the sequence of vertices with the added restriction that all edges are oriented in the same direction.

If the vertices in such a sequence are not distinct, the sequence is referred to as a walk.

Two nodes, $v_i$ and $v_j$, are reachable from each other if a path exists between $v_i$ and $v_j$.

A path is called a closed path or cycle if its two terminal nodes, $v_1$ and $v_k$, are also connected by an edge, i.e., $\left(v_k, v_1\right) \in \mathcal{G}(\mathcal{E})$.
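These definitions can be checked mechanically on the example edge set given above for Fig. 2.1. The Python sketch below is a minimal illustration for the undirected case; the helper names (`has_edge`, `is_walk`, `is_path`, `is_cycle`, `is_simple`) are chosen here and are not notation from the text.

```python
# Edge set of the undirected example graph (Fig. 2.1), as listed in the text.
EDGES = [("A", "B"), ("B", "C"), ("C", "D"), ("C", "E"),
         ("E", "E"), ("E", "F"), ("E", "D"), ("F", "B")]
EDGE_SET = set(EDGES)


def has_edge(u, v):
    """Edges are unordered pairs, so check both orientations."""
    return (u, v) in EDGE_SET or (v, u) in EDGE_SET


def is_walk(seq):
    """Every pair of consecutive vertices must be joined by an edge."""
    return all(has_edge(u, v) for u, v in zip(seq, seq[1:]))


def is_path(seq):
    """A path is a walk whose vertices are all distinct."""
    return is_walk(seq) and len(set(seq)) == len(seq)


def is_cycle(seq):
    """A closed path: the last vertex is also joined to the first one."""
    return len(seq) >= 3 and is_path(seq) and has_edge(seq[-1], seq[0])


def is_simple():
    """A simple graph has no loop edge such as (E, E)."""
    return all(u != v for u, v in EDGES)


print(is_path(["A", "B", "C", "E", "F"]))   # → True
print(is_path(["A", "B", "C", "B"]))        # → False (B repeats, so only a walk)
print(is_cycle(["B", "C", "E", "F"]))       # → True (F is joined back to B)
print(is_simple())                          # → False (loop at E)
```

For a directed graph, `has_edge` would check only the orientation `(u, v)`, which enforces the restriction that all edges of a directed path point the same way.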

Technologies for network data production

The first pillar produces a large amount of experimental data to gain insight into the systems, their properties, and their dynamics. For instance, primary PPI data are produced in a wet lab by using different technological platforms. Technologies that enable the determination of protein interactions can be divided into experiments investigating the presence of physical interactions and experiments investigating the kinetic constants of the reactions. Moreover, based on the number of interacting partners revealed in a single assay, we can distinguish between technologies that characterize binary relations, such as yeast two-hybrid, and technologies that elucidate multiple relations, such as mass spectrometry.

The experiments based on these technologies share a general schema, in which a so-called bait protein is used as a test to demonstrate its relations with one or more prey proteins. Both single interactions and exhaustive screenings have been realized following this schema. An important aspect, however, is the reliability of the discovered interactions. In particular, each assay can be evaluated on the basis of some ad hoc quality measurement.

For the human brain connectome of neural cells, the main technologies for data production are brain imaging techniques, such as magnetic resonance imaging (MRI). Once images have been captured, a set of post-processing techniques is applied to analyze their content and derive brain graphs representing both static and dynamic aspects of the brain.

Network analysis models

The second pillar has introduced novel tools to build models starting from raw data, and to analyze such models to understand complex systems. Consequently, from a computational science point of view, the need for methods and tools for data storage, representation, exchange, and analysis has driven research in this area.

Independent of any network-specific application, the flow of data and analysis in this area of research follows a standard structure. The process starts with the accumulation of a significant amount of data using high-throughput technologies, such as microarrays or next-generation sequencing in molecular biology, or nuclear magnetic resonance in brain research. Starting from these experimental data, network identification methods are then used to build static or dynamic networks. The networks are mined to elucidate the organization of the biological elements on a system-level scale. Scientists thus investigate both the global and the local organizational principles, aiming to discover the differences between subjects or between the healthy and diseased states. Once the networks have been obtained, the need to analyze and compare the networks of different subjects has led to the development of novel comparison algorithms based on graph and subgraph isomorphism [9].

In the case of molecular interactions, after the wet-lab experiments, data are usually collected and preserved in databases [6]. Currently, there exist many publicly available databases that let the user retrieve data easily. Querying interfaces enable the retrieval of both simple information and a particular subnetwork (see Chapter 5 for a more detailed discussion). Many databases can be searched by entering one or more protein identifiers; the output of such a query is a list of related proteins. Some recent databases offer a semantically more expressive language than simple interaction retrieval, whereas recent research directions express queries in a high-level language (e.g., using graph formalism), map them onto suitable graph structures, and search for those structures by applying appropriate algorithms. The main challenges in this area are (i) the expressiveness of the query language, which should be able to capture biologically meaningful queries, (ii) the efficiency and coverage of the retrieval method, and (iii) the simplicity of capturing and using the results.


