# Formal definitions

## Directed graph definitions

The components of a directed graph $$G(V,E)$$, and the attributes of these components relevant for modeling intracellular networks, are defined as follows.

Notation

Definition

$$V=\{v_1,v_2,...,v_N\}$$

a set of $$N$$ nodes $$v_i (i=1,...,N)$$, each assigned to one model element, where each element represents a component of the system being modeled

$$E=\{e_1,e_2,...,e_M\}$$

a set of $$M$$ directed edges $$e_j (j=1,...,M)$$, each assigned to an interaction between elements

$$v_i=v(\mathbf{a}_i^v)$$

each node $$v_i$$ has an attribute vector $$\mathbf{a}^v≡(a^{\mathrm{name}},a^{\mathrm{type}},a^{\mathrm{subtype}},a^{\mathrm{HGNCsymbol}},a^{\mathrm{database}},a^{\mathrm{ID}},a^{\mathrm{compartment}},a^{\mathrm{compartmentID}})$$

$$e_j=e(V_{s_j},v_{t_j},\mathbf{a}_j^e)$$

each edge $$e_j$$ has one or more source nodes forming a set $$V_{s_j}$$, a target node $$v_{t_j}$$, and an attribute vector $$\mathbf{a}^e≡(a^{\mathrm{sign}},a^{\mathrm{connectiontype}},a^{\mathrm{mechanism}},a^{\mathrm{site}},a^{\mathrm{cellline}},$$ $$a^{\mathrm{celltype}},a^{\mathrm{tissuetype}},a^{\mathrm{organism}},a^{\mathrm{score}},a^{\mathrm{source}},a^{\mathrm{statements}},a^{\mathrm{paperIDs}})$$

We note here that the BioRECIPE representation format can also be used for undirected graphs - in that case, the distinction between source and target nodes will not be relevant. An undirected graph can be represented in the BioRECIPE format as a list of interactions (edges).

## Interaction definitions

In the following, we provide formal definitions of the components of an interaction, and the attributes of these components. We also include additional details about attributes and examples of their values.

Definition 1 - Element (node)

An element(node), $$v=v(\mathbf{a}^v)$$, is defined by its name, type, and unique identifier (ID) and these attributes are written as a vector $$\mathbf{a}^v=(a^{\mathrm{name}},a^{\mathrm{type}},a^{\mathrm{database}},a^{\mathrm{ID}})$$. These are required element attributes in the BioRECIPE format.

The attribute $$a^{\mathrm{name}}$$ is an element name, usually following the standard nomenclature used by biologists and in the literature (e.g., acronym ERK1 is used instead of a longer name “extracellular signal-regulated kinase 1”). The attribute $$a^{\mathrm{type}}$$ represents element type, usually genes, RNAs, proteins, chemicals, or biological processes. Biological entity names often have multiple synonyms (e.g., ERK1 may also be referred to as MAPK3), and therefore, unique identifiers (IDs) are used, which are stored in attribute $$a^{\mathrm{ID}}$$. These IDs can be obtained from standard databases such as UniProt, PubChem, or the Gene Ontology Databases (GO). The unique ID attribute is often written as two attributes, the name of the database from which the ID is retrieved, $$a^{\mathrm{database}}$$, and the ID, $$a^{\mathrm{ID}}$$. In addition to the required attributes, we include an optional ID attribute, $$a^{\mathrm{HGNCsymbol}}$$, the gene symbol from the HGNC database, as this is recognized by experts, in contrast to e.g., numbers used by UniProt, and therefore, it can assist in human-driven curation.

The node attribute vector $$a^v$$ may also include other attributes that help describe the element. For example, attributes $$a^{\mathrm{location}}$$ and $$a^{\mathrm{locationID}}$$ hold information about the cellular compartment, where the element is found, and the compartment ID, respectively. We use the GO database to obtain these location IDs. A subtype attribute, $$a^{\mathrm{subtype}}$$, may be used to indicate additional type of an element, such as $$a^{\mathrm{subtype}}$$ = receptor for an element with $$a^{\mathrm{type}}$$ = protein.

An element usually represents a biomolecular species, a chemical, or a biological process.

Definition 2 - Interaction (edge)

A directed signed interaction (also referred to as a directed edge) $$e=e(v_s,v_t,\mathbf{a}^e)$$ is defined with its source element $$v_s$$, target element $$v_t$$, and vector of attributes $$a^e$$. The interaction attribute vector always includes at least the sign $$a^{\mathrm{sign}}$$ and connection type $$a^{\mathrm{connectiontype}}$$ attributes: $$\mathbf{a}^e=(a^{\mathrm{sign}},a^{\mathrm{connectiontype}})$$. The direction of an interaction is implicitly defined with source and target nodes, and therefore, not explicitly listed among its attributes.

The $$a^{\mathrm{sign}}$$ attribute indicates the sign (also referred to as polarity) of the influences, and it can take two values, $$a^{\mathrm{sign}}$$ = positive (e.g., activation) or $$a^{\mathrm{sign}}$$ = negative (e.g., inhibition). Sometimes, only the information about indirect influences on pathways of interest is known, and therefore, the attribute $$a^{\mathrm{connectiontype}}$$ is used to indicate whether the interaction $$e$$ is a direct physical interaction ($$a^{\mathrm{connectiontype}}$$ = direct) or an indirect influence from the source node to the target node ($$a^{\mathrm{connectiontype}}$$ = indirect). Since the interaction definition allows for indirect interactions, it is possible that source and target node are not in the same compartment, and this is the reason we assign the location attribute to nodes and not to the interaction.

The list of other attributes is not necessarily fixed; the components in it may vary, dependent on the goals of the analysis. A more specific information about the biological mechanism and the molecular site of an interaction can be included in the $$a^{\mathrm{mechanism}}$$ and the $$a^{\mathrm{site}}$$ attributes, respectively. We note here that, in some cases, $$a^{\mathrm{sign}}$$ is not explicitly stated in statements about influences that describe mechanisms (e.g., A phosphorylates B). In this case, it would be up to the user to either fill in this information from other sources or accept a default attribute assignment. For example, the default assignment could be positive for phosphorylation, although this may not always be the case, and would require curation.

The edge attribute vector can also include the $$a^{\mathrm{cellline}}$$, $$a^{\mathrm{celltype}}$$, $$a^{\mathrm{tissuetype}}$$, $$a^{\mathrm{organism}}$$ attributes, which hold the context information about the cell line, cell type, tissue type, and organism where the interaction is observed, respectively.

Finally, provenance attributes can be used. The $$a^{\mathrm{score}}$$ attribute provides a summary score for confidence in the interaction, or the amount of available evidence for the interaction. The $$a^{\mathrm{source}}$$ attribute indicates the source of evidence, which can be literature, expert knowledge, databases, or data. The $$a^{\mathrm{statements}}$$ attribute is used to store the statements, parts of sentences or sentences where the interaction is mentioned. The $$a^{\mathrm{paperIDs}}$$ attribute holds paper IDs (e.g., PMCID) where the sentences mentioning the interaction are found.

Whenever the information about the non-essential attributes is not available, these attributes are assigned an “empty” value.

## Model definitions

The BioRECIPE format is most often used to represent models that have a directed graph as their underlying structure, although undirected graphs can be represented as well.

Definition 3 - Model structure (static)

Models that have a directed graph, $$G(V,E)$$, as their underlying structure, include a set of nodes $$V=\{v_1,v_2,...,v_N\}$$, where each node $$v_i=v(\mathbf{a}_i^v) (i=1,...,N)$$ is one model element, and a set of directed edges $$E=\{e_1,e_2,...,e_M\}$$, where an edge $$e_j=e(v_{s_j},v_{t_j},\mathbf{a}_j^e), (v_{s_j},v_{t_j}\in V,j=1,...,M)$$ indicates a directed interaction between elements $$v_{s_j}$$ and $$v_{t_j}$$, in which source node $$v_{s_j}$$ influences target node $$v_{t_j}$$. Vectors $$\mathbf{a}_i^v$$ and $$\mathbf{a}_j^e$$ are formed following the definitions of node and edge attribute vectors.

Definition 4 - Input and output nodes

An input node is a node that is not a target node of any edge in the model, and an output node is a node that is not a source node of any edge in the model. In the graph, input and output nodes are “hanging” from the rest of the model.

Definition 5 - Path

We define a path in a model as $$n>1$$ connected edges: $$p(v_{s_p},v_{t_p},a^{\mathrm{sign}_p})=(e(v_{k_1}=v_{s_p},v_{k_2},\mathbf{a}_{k_1}^e),e(v_{k_2},v_{k_3},\mathbf{a}_{k_2}^e),...,e(v_{k_n},v_{k_{n+1}}=v_{t_p},\mathbf{a}_{k_n}^e))$$. The direction of the path is implicitly defined with the source node $$v_{s_p}$$ and target node $$v_{t_p}$$. The regulation sign $$a^{\mathrm{sign}_p}$$ is considered positive when the number of negative signs in the set $$\{a_{k_1}^{\mathrm{sign}},a_{k_2}^{\mathrm{sign}},...,a_{k_n}^{\mathrm{sign}}\}$$ is even, and negative when this number is odd. Cycles and feedback loops may be defined in cases where the path source is also the path target, i.e., $$p(v_{s_p},v_{s_p},a^{\mathrm{sign}_p})$$.

For example, in the following figure, on the path from source node $$v_6$$ to target node $$v_{13}$$, the number of negative regulations is odd, due to only one negative regulation from node $$v_8$$ to $$v_9$$, and so the sign of this overall path is negative.

Definition 6 - Element-based executable model

An element-based executable model is a triple $$\mathcal{M}(G,\mathcal{X},\mathcal{F})$$, where $$G(V,E)$$ is a static network structure of the model (defined earlier in Definition 3), $$\mathcal{X}=\{x_1,x_2,...,x_N\}$$ is a set of $$N$$ state variables corresponding to nodes in $$V=\{v_1,v_2,...,v_N\}$$, and $$\mathcal{F}=\{f_1,f_2,...,f_N\}$$ is a set of $$N$$ regulatory (update) functions such that each element $$v_i \in V$$ has a corresponding function $$f_i \in \mathcal{F}$$.

Definition 7 - Hybrid element-based executable model

When element update functions $$f_i \in \mathcal{F}$$ have different mathematical form across elements $$v_i \in V$$ within the same model, for example, logical, discrete, or continuous functions, we refer to these models as hybrid element-based executable models.

Definition 8 - Regulator

A source node $$v_j$$ of an edge in graph $$G(V,E)$$ that has $$v_i$$ as a target node is called a regulator of $$v_i$$. In other words, for each element $$v_i$$, any element $$v_j$$ that influences the state of $$v_i$$ such that the function $$f_i$$ is sensitive to the value of $$x_j$$ is called a regulator of $$v_i$$.

Definition 9 - Influence set

For each element $$v_i$$, an influence set, denoted as $$V_i^{\mathrm{influence}} \in V$$, consists of all regulators of $$v_i$$. The state variables that correspond to the elements in $$V_i^{\mathrm{influence}}$$ form set $$\mathcal{X}_i^{\mathrm{influence}}$$

Definition 10 - Positive and negative regulators

Any element $$v_j \in V_i^{\mathrm{influence}}$$, for which the edge $$e(v_j,v_i,\mathbf{a}^e)$$ has a positive sign, $$a_e^{\mathrm{sign}}$$ = positive, belongs to the positive regulator list for element $$v_i$$, denoted as $$v_j \in V_i^{\mathrm{influence},+} \subset V_i^{\mathrm{influence}}$$, represented with attribute $$a^{\mathrm{posreglist}}$$. Any element $$v_j \in V_i^{\mathrm{influence}}$$, for which the edge $$e(v_j,v_i,\mathbf{a}^e)$$ has a negative sign, $$a_e^{\mathrm{sign}}$$ = negative, belongs to the negative regulator list for element $$v_i$$, denoted as $$v_j \in V_i^{\mathrm{influence},-} \subset V_i^{\mathrm{influence}}$$, represented with attribute $$a^{\mathrm{negreglist}}$$.

Definition 11 - Element state variable

For each element $$v_i \in V$$, its state variable $$x_i \in \mathcal{X}$$ can take any value from a set or an interval of values $$X_i$$. The state variable $$x_i$$ is represented with attribute $$a^{\mathrm{variable}}$$, and is assigned either the amount or activity value of $$v_i$$, represented with attribute $$a^{\mathrm{valuetype}}$$.

Definition 12 - Positive and negative regulation rules

The state variables $$x_j$$ that correspond to elements in $$V_i^{\mathrm{influence},+}$$ form set $$X_i^{\mathrm{influence},+} \subset X_i^{\mathrm{influence}}$$, and are used for creating a positive regulation rule for $$v_i$$, represented with attribute $$a^{\mathrm{posregrule}}$$. The state variables $$x_j$$ that correspond to elements in $$V_i^{\mathrm{influence},-}$$ form set $$X_i^{\mathrm{influence},-} \subset X_i^{\mathrm{influence}}$$, and are used for creating a negative regulation rule for $$v_i$$, represented with attribute $$a^{\mathrm{negregrule}}$$.

Definition 13 - Number of levels

When $$X_i$$ is a set of discrete values, $$|X_i|$$ is referred to as the number of levels of $$v_i$$, represented with attribute $$a^{\mathrm{levels}}$$.

Definition 14 - State list

An array of $$k$$ state values $$X_i^{t_0},X_i^{t_1},X_i^{t_2},...,X_i^{t_{k-1}}$$ that are assigned to $$v_i$$ at $$\{t_0,t_1,t_2,...,t_{k-1}\}$$ time steps during simulation, where $$t_0$$ is the initial time step, and $$t_0<t_1<t_2<...<t_{k-1}$$, is called “state list” and is represented with attribute $$a^{\mathrm{statelist}}$$.

Multiple state lists are allowed within the BioRECIPE table, in consecutive columns. Columns header include “State list #” where #=0,1,2,…

Definition 15 - Constant OFF state

When the state variable $$x_i$$ has a constant 0 value throughout the entire simulation, this is referred to as a constant OFF state, and represented with attribute $$a^{\mathrm{a^constOFF}}$$.

Definition 16 - Constant ON state

When the state variable $$x_i$$ has a constant non-0 value (e.g., the highest value from $$X_i$$) throughout the entire simulation, this is referred to as a constant ON state, and represented with attribute $$a^{\mathrm{constON}}$$.

Definition 17 - Next state

The next state of element $$v_i$$, denoted as $$x_i^{*}$$, is computed using the element update rule f_i and current states of all elements in its influence set, that is, current values of all variables in $$X_i^{\mathrm{influence}}$$: $$x_i^{*}=f_i(X_i^{\mathrm{influence}})$$.

## Further notes on element-based models

As stated earlier, functions in $$\mathcal{F}$$ can have different types, discrete or continuous, and moreover, individual elements within the same model could have very different update functions, thus forming hybrid models. The set or interval of possible values, $$X_i$$, assigned to each model element $$x_i$$ can also vary. The function and element types are usually decided based on the knowledge or the information available about the modeled system and its components. In other words, the element-based modeling approach can represent indirect influences between elements, and it can model systems where the knowledge about element interaction mechanisms is incomplete.

Using such hybrid collection of element update rules within a single model enables model simulation and studies of cell dynamics, state transitions, and feedback loops, while utilizing the available information, in the absence of complete knowledge of interaction mechanisms. These hybrid element-based models enable integration of both prior knowledge and data and analysis of hybrid networks (systems involving protein-protein interactions, gene regulations, and/or metabolic pathways).

An example of element-based models are discrete models, where each element state variable $$x_i$$ is assigned a discrete set of values. Following Definition 11, $$x_i$$ can take any value from the set $$X_i:\{0,1,2,…,n_{i-1}\}$$, where $$n_i$$ is the number of different states that element, $$v_i$$ can have. Often, these different states represent different levels of activity or concentration for element $$v_i$$. Element update functions in discrete models can be of different type, some examples are min and max functions, and (rounded) weighted sums.

Boolean models are a subset of discrete models, where elements can have only two values, 0 (also referred to as OFF or False) and 1 (also referred to as ON or True). In Boolean models, value 0 represents states such as “inactive”, “absent”, or “low concentration” and value 1 represents states such as “active”, “present”, or “high concentration”. Element update functions in these models are Boolean functions where logic operators such as AND, OR, and NOT are used. As an extension of Boolean networks, in the Probabilistic Boolean Network (PBN), randomness is introduced by assigning multiple candidate Boolean functions to the variables. At each time step during simulation, one of element’s candidate functions is chosen at random to determine its state.

Other examples of commonly used element-based models are Bayesian Networks and Dynamic Bayesian Networks. Bayesian networks introduce probability distributions into the governing rules of elements, increasing the freedom in updating element states. Similar to Bayesian Networks are structural equation models (SEMs).

Given that the element-based modeling approach can be used for indirect influences and it can abstract away from detailed reaction mechanisms, additional methods have been introduced to account for the timing in biological systems, rates at which elements change, or delays in element updating and delays in pathways.