Commit 9221a87

Move spec back into this repo (#766)
* Move spec back into this repo. * Fix images.
1 parent 500274d commit 9221a87

7 files changed, +124 -0 lines changed

README.md (+124)
@@ -142,6 +142,130 @@ cpg.method.name("getAccountList").definingTypeDecl.toList.head
// TypeDecl(Some(v[464]),AccountController,io.shiftleft.controller.AccountController,false,List(java.lang.Object))
```

### Base Schema Specification

The base schema provides the minimum requirements that all valid CPGs must satisfy. The base specification is concerned with three aspects of the program:

* Program structure
* Type declarations
* Method declarations

where a declaration comprises a formal signature along with defining content, such as a method body for methods or a literal value in the definition of a variable.

Property graphs alone are comparable in generality to hash tables and linked lists. To tailor them towards storing, transmitting, and analyzing code, the main challenge is to specify a suitable graph schema. In particular, a schema must define the valid node and edge types and the valid node and edge keys, together with each of their domains. Finally, a schema puts constraints on the edges that may connect nodes, depending on their type.

The base schema of the CPG is specified in the JSON file `base.json`. The file contains a JSON object with the following members:

* `nodeKeys/edgeKeys`. List of all valid node/edge attributes. Each list element is an object specifying the attribute's ID, name, type, and a comment.

* `nodeTypes/edgeTypes`. List of all node/edge types (edge types correspond to edge labels in the original property graph definition). Each node/edge type is given by an object that specifies an ID, name, keys, comment, and, for node types, the valid outgoing edge types. Each node is required to have a node type, which is represented by a node attribute.

There are 19 node types across five categories:

| **Category** | **Names** |
| - | - |
| Program structure | FILE, NAMESPACE_BLOCK |
| Type declarations | TYPE_DECL, TYPE_PARAMETER, MEMBER, TYPE, TYPE_ARGUMENT |
| Method header | METHOD, METHOD_PARAMETER_IN, METHOD_RETURN, LOCAL, BLOCK, MODIFIER |
| Method body | LITERAL, IDENTIFIER, CALL, RETURN, METHOD_REF |
| Meta data | META_DATA |

There are eight edge types:

| **Name** | **Usage** |
| - | - |
| AST | Syntax tree edge - structure |
| CFG | Control flow edge - execution order and conditions |
| REF | Reference edge - references to type/method/identifier declarations |
| EVAL_TYPE | Type edge - attach known types to expressions |
| CALL | Method invocation edge - caller/callee relationship |
| VTABLE | Virtual method table edge - represents vtables |
| INHERITS_FROM | Type inheritance edge - models OOP inheritance |
| BINDS_TO | Binding edge - provides type parameters |

There are 17 node keys across four categories:

| **Category** | **Names** |
| - | - |
| Declarations | NAME, FULL_NAME, IS_EXTERNAL |
| Method header | SIGNATURE, MODIFIER_TYPE |
| Method body | PARSER_TYPE_NAME, ORDER, CODE, DISPATCH_TYPE, EVALUATION_STRATEGY, LINE_NUMBER, LINE_NUMBER_END, COLUMN_NUMBER, COLUMN_NUMBER_END, ARGUMENT_INDEX |
| Meta data | LANGUAGE, VERSION |

There are zero edge keys in the base specification.

Note that `base.json` deviates from the JSON standard by allowing comment lines. Any line whose first two non-whitespace characters are `//` is treated as a comment and must be stripped before the definitions are passed to a standard JSON parser.
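The comment rule above can be handled with a small preprocessing step. The following Python sketch strips comment lines, parses the result, and indexes the outgoing edge types permitted per node type. The schema fragment is a heavily abridged, hypothetical stand-in for `base.json`: the member names `id`, `name`, `keys`, `comment`, and `outEdges`, and the ID values, are assumptions made for illustration and may differ from the real file.

```python
import json


def strip_line_comments(text: str) -> str:
    """Drop every line whose first two non-whitespace characters are '//'."""
    kept = [line for line in text.splitlines()
            if not line.lstrip().startswith("//")]
    return "\n".join(kept)


def load_schema(text: str) -> dict:
    """Parse a schema definition after removing its comment lines."""
    return json.loads(strip_line_comments(text))


# Hypothetical, abridged schema fragment (member names and IDs are assumed).
SAMPLE = """
{
  // node types for program structure
  "nodeTypes": [
    {"id": 1, "name": "FILE", "keys": ["NAME"],
     "comment": "A source file", "outEdges": ["AST"]}
  ],
    // edge types
  "edgeTypes": [
    {"id": 3, "name": "AST", "keys": [], "comment": "Syntax tree edge"}
  ]
}
"""

schema = load_schema(SAMPLE)

# Map each node type to the set of edge types it may emit.
allowed_out_edges = {node_type["name"]: set(node_type["outEdges"])
                     for node_type in schema["nodeTypes"]}
```

With this index, a check such as `"AST" in allowed_out_edges["FILE"]` tells a frontend whether an edge it is about to emit is permitted by the schema.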

## Program Structure

Node types: FILE, NAMESPACE_BLOCK

Program structure is concerned with the organization of programs into files, namespaces, and packages. A program is composed of zero or more files (type FILE), each of which contains one or more namespace blocks (type NAMESPACE_BLOCK). Namespace blocks contain type and method declarations (types TYPE_DECL and METHOD). Abstract syntax tree (AST) edges must exist from files to namespace blocks. Structural elements below namespace blocks are not connected to their AST parents by an edge. Instead, the AST_PARENT_TYPE and AST_PARENT_FULL_NAME properties are used to inform the backend about the AST relation of methods (type METHOD) and type declarations (type TYPE_DECL) to their parents. The property FULL_NAME must therefore be a unique identifier for the three node types METHOD, TYPE_DECL, and NAMESPACE_BLOCK. The following figure shows how a Java class definition is represented in a CPG.

![Program Structure]('/img/program-structure.jpg')

The concept of namespace blocks corresponds to the equivalent concept in the C++ programming language, where namespace blocks are used to place declarations into a namespace. Other languages, e.g., Java or Python, do not provide the same type of namespace blocks. However, they allow package declarations at the start of source files, which place all remaining declarations of the source file into a namespace. For these languages, package declarations are translated into corresponding namespace blocks. The name of a namespace block is thus the complete namespace of all the elements within the block, and the full name of a namespace block is a unique identifier for a specific block. For Java, it is enough to prefix the file name to the namespace, because each file contains only one package statement describing its namespace.
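As a rough illustration of this layout, the Python sketch below builds the structural nodes for a single Java class. It is not a definitive encoding: the file name, the `:` separator used when prefixing the file name to the namespace, and the format of the method's full name are all assumptions, and the helper names are hypothetical.

```python
nodes = {}  # node id -> properties, with the node type stored under "label"
edges = []  # (source id, edge type, target id)


def add_node(node_id, label, **properties):
    nodes[node_id] = {"label": label, **properties}
    return node_id


def namespace_block_full_name(file_name, namespace):
    # Assumption: the text only says the file name is prefixed to the
    # namespace; the ":" separator is a hypothetical choice.
    return f"{file_name}:{namespace}"


ns_full_name = namespace_block_full_name("AccountController.java",
                                         "io.shiftleft.controller")

add_node("file", "FILE", NAME="AccountController.java")
add_node("ns", "NAMESPACE_BLOCK",
         NAME="io.shiftleft.controller", FULL_NAME=ns_full_name)
# Files are connected to their namespace blocks by an AST edge.
edges.append(("file", "AST", "ns"))

# Below the namespace block there are no AST edges; parents are named
# through the AST_PARENT_TYPE and AST_PARENT_FULL_NAME properties instead.
add_node("decl", "TYPE_DECL", NAME="AccountController",
         FULL_NAME="io.shiftleft.controller.AccountController",
         AST_PARENT_TYPE="NAMESPACE_BLOCK",
         AST_PARENT_FULL_NAME=ns_full_name)
add_node("method", "METHOD", NAME="getAccountList",
         FULL_NAME="io.shiftleft.controller.AccountController.getAccountList",
         AST_PARENT_TYPE="TYPE_DECL",
         AST_PARENT_FULL_NAME="io.shiftleft.controller.AccountController")


def ast_parent(node_id):
    """Resolve the AST parent from the two AST_PARENT_* properties."""
    props = nodes[node_id]
    for candidate_id, candidate in nodes.items():
        if (candidate["label"] == props["AST_PARENT_TYPE"]
                and candidate.get("FULL_NAME") == props["AST_PARENT_FULL_NAME"]):
            return candidate_id
    return None
```

The lookup in `ast_parent` only works because FULL_NAME is unique per node type, which is why the uniqueness requirement matters.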

## Type Declarations

Node types: TYPE_DECL, TYPE_PARAMETER, MEMBER, TYPE, TYPE_ARGUMENT

Language constructs that declare types are expressed via *type declarations*. Examples of these constructs include classes, interfaces, structures, and enumerations. A type declaration consists of a name, an optional list of type parameters, member variables, and methods. Inheritance relations with other types may also be encoded in *type declarations*.

![Type Declaration]('/img/type-declaration.jpg')

In the CPG, each type declaration is represented by a designated type-declaration node (type TYPE_DECL), with at least a full-name attribute. Member variables (type MEMBER), method declarations (type METHOD), and type parameters (type TYPE_PARAMETER) are connected to the type declaration via AST edges that originate at the type declaration. Inheritance relations are expressed via INHERITS_FROM edges to zero or more other type declarations (type TYPE_DECL), which indicate that the source type declaration inherits from the destination declaration.

Usage of a type, for example in the declaration of a variable, is indicated by a type node (type TYPE). The type node is connected to the corresponding type declaration via a reference edge (type REF), and to type arguments through AST edges (type AST). Type-argument nodes are connected to type parameters by binding edges (type BINDS_TO).
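The edges described in this section can be sketched for a small generic class. The Python example below is illustrative only: the classes `Container` and `Stack`, their full names, and the full-name format of the type usage are hypothetical.

```python
nodes = {}  # node id -> properties, with the node type stored under "label"
edges = []  # (source id, edge type, target id)


def add_node(node_id, label, **properties):
    nodes[node_id] = {"label": label, **properties}
    return node_id


def targets(source, edge_type):
    """Destinations of all edges of the given type leaving `source`."""
    return [dst for src, kind, dst in edges
            if src == source and kind == edge_type]


# Models: class Container<T> { T item; }  and  class Stack extends Container
container = add_node("container", "TYPE_DECL", FULL_NAME="example.Container")
stack = add_node("stack", "TYPE_DECL", FULL_NAME="example.Stack")
t_param = add_node("t_param", "TYPE_PARAMETER", NAME="T")
item = add_node("item", "MEMBER", NAME="item")
edges += [
    (container, "AST", t_param),          # type parameter of the declaration
    (container, "AST", item),             # member variable
    (stack, "INHERITS_FROM", container),  # source inherits from destination
]

# Models a usage such as: Container<String> c;
usage = add_node("usage", "TYPE",
                 FULL_NAME="example.Container<java.lang.String>")
t_arg = add_node("t_arg", "TYPE_ARGUMENT")
edges += [
    (usage, "REF", container),     # type usage refers to its declaration
    (usage, "AST", t_arg),         # type usage owns its type arguments
    (t_arg, "BINDS_TO", t_param),  # argument bound to the declared parameter
]


def supertypes(decl):
    """All type declarations reachable over INHERITS_FROM edges."""
    found, todo = [], targets(decl, "INHERITS_FROM")
    while todo:
        current = todo.pop()
        if current not in found:
            found.append(current)
            todo.extend(targets(current, "INHERITS_FROM"))
    return found


def declaration_of(type_node):
    """Full name of the declaration a TYPE node refers to."""
    (decl,) = targets(type_node, "REF")
    return nodes[decl]["FULL_NAME"]
```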

## Method Declarations

A method declaration consists of a method header and a method body, where the method header is a CPG representation of the method's input and output parameters, and the method body contains the instructions/statements of the method.

### Method Header

Node types: METHOD, METHOD_PARAMETER_IN, METHOD_RETURN, LOCAL, BLOCK, MODIFIER

The term *method* is used in object-oriented programming languages to refer to a procedure that is associated with a class. The term is used here in a broader sense to refer to any named block of code, whether or not it is associated with a type. A method consists of a method header and a method body. The method header is given by a name, a formal return parameter, and a list of formal input parameters and corresponding output parameters. The method body is simply a block of statements.

![Method Header]('/img/method-header.jpg')

In the CPG, each method is represented by a designated method node (type METHOD) that contains, in particular, the method name. Methods are connected to their input parameters (type METHOD_PARAMETER_IN), return parameter (type METHOD_RETURN), modifiers (type MODIFIER), and locals (type LOCAL) through AST edges. The method node is also connected to a block node (type BLOCK), which represents the method body.
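A minimal Python sketch of a method header follows, assuming a method `public static int add(int a, int b)` with one local variable. Two points are assumptions rather than statements of the specification: that the ORDER key gives the position of a parameter, and that the method node reaches its block node through an AST edge. The modifier values `PUBLIC` and `STATIC` are likewise illustrative.

```python
nodes = {
    "m":     {"label": "METHOD", "NAME": "add"},
    "a":     {"label": "METHOD_PARAMETER_IN", "NAME": "a", "ORDER": 1},
    "b":     {"label": "METHOD_PARAMETER_IN", "NAME": "b", "ORDER": 2},
    "ret":   {"label": "METHOD_RETURN"},
    "pub":   {"label": "MODIFIER", "MODIFIER_TYPE": "PUBLIC"},
    "stat":  {"label": "MODIFIER", "MODIFIER_TYPE": "STATIC"},
    "sum":   {"label": "LOCAL", "NAME": "sum"},
    "block": {"label": "BLOCK"},
}

# (source id, edge type, target id); every header element hangs off the
# method node. The edge to the block node is assumed to be an AST edge.
edges = [("m", "AST", child)
         for child in ("b", "a", "ret", "pub", "stat", "sum", "block")]


def ast_children(node_id, label):
    """AST children of `node_id` that have the given node type."""
    return [dst for src, kind, dst in edges
            if src == node_id and kind == "AST"
            and nodes[dst]["label"] == label]


def describe(method_id):
    """Render modifiers, name, and ordered parameter names of a method."""
    params = sorted(ast_children(method_id, "METHOD_PARAMETER_IN"),
                    key=lambda n: nodes[n]["ORDER"])
    modifiers = [nodes[n]["MODIFIER_TYPE"].lower()
                 for n in ast_children(method_id, "MODIFIER")]
    names = ", ".join(nodes[n]["NAME"] for n in params)
    return f"{' '.join(modifiers)} {nodes[method_id]['NAME']}({names})"
```

Here `describe("m")` yields `public static add(a, b)`, even though the parameter edges were inserted out of order.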

### Method Body

Node types: LITERAL, IDENTIFIER, CALL, RETURN, METHOD_REF

Method bodies contain the method implementation, given by the operations the method carries out. Method bodies are represented as control flow graphs over method invocations, a representation that provides a common ground for the instruction, statement, and expression concepts used across machine-level and high-level programming languages. The core elements of the method body representation are thus method invocations ("calls") and control flow edges.

In the CPG, a method invocation is represented by a designated call node (type CALL). Arguments are either identifiers (type IDENTIFIER), literals (type LITERAL), other calls (type CALL), or method references (type METHOD_REF). Each argument has an argument index property (key ARGUMENT_INDEX) that indicates the parameter with which it is associated. Calls are connected to their arguments through outgoing AST edges, and are associated with the called method via their METHOD_FULL_NAME property.

![Call Site]('/img/call-site.jpg')

In addition to identifiers, literals, and calls, method references (type METHOD_REF) are allowed. They represent locations in the code where a method is referenced but not called, as is the case in programming languages where methods are first-class citizens. Method references are connected to method instances by reference edges (type REF).

Return nodes (type RETURN) are created for each location in the method body where control is returned to the caller. Unconditional control flow edges are created from preceding calls to return nodes. All remaining nodes are connected by control flow edges (type CFG) according to execution order and constraints. The method node is treated as the entry node of the control flow graph. Finally, a designated block node (type BLOCK) is created for the method body, with outgoing AST edges to all expressions that correspond to statements.
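To make the interplay of AST and CFG edges concrete, the Python sketch below encodes the body `foo(x, 1); return;` under a few assumptions: argument indices start at 1, the method and callee full names are hypothetical, and the body is straight-line code, so each node has at most one CFG successor.

```python
nodes = {
    "m":   {"label": "METHOD", "FULL_NAME": "example.Main.run"},
    "blk": {"label": "BLOCK"},
    "x":   {"label": "IDENTIFIER", "NAME": "x", "ARGUMENT_INDEX": 1},
    "one": {"label": "LITERAL", "CODE": "1", "ARGUMENT_INDEX": 2},
    "c":   {"label": "CALL", "NAME": "foo",
            "METHOD_FULL_NAME": "example.Main.foo"},
    "ret": {"label": "RETURN"},
}

edges = [
    ("blk", "AST", "c"),    # the call is a statement of the body block
    ("c", "AST", "one"),    # calls own their arguments via AST edges
    ("c", "AST", "x"),
    ("m", "CFG", "x"),      # the method node is the CFG entry node
    ("x", "CFG", "one"),
    ("one", "CFG", "c"),    # arguments are evaluated before the call
    ("c", "CFG", "ret"),    # unconditional edge from the call to RETURN
]


def arguments(call_id):
    """Arguments of a call, ordered by their ARGUMENT_INDEX."""
    args = [dst for src, kind, dst in edges
            if src == call_id and kind == "AST"]
    return sorted(args, key=lambda n: nodes[n]["ARGUMENT_INDEX"])


def execution_order(entry_id):
    """Follow CFG edges from the entry node (straight-line code only)."""
    order, current = [], entry_id
    while current is not None:
        order.append(current)
        successors = [dst for src, kind, dst in edges
                      if src == current and kind == "CFG"]
        current = successors[0] if successors else None
    return order
```

Branches and loops would give nodes several CFG successors, so a real traversal would need a worklist and a visited set instead of this single-successor walk.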

### Background on the Method Body Representation

In machine-level languages, procedure bodies are defined by instructions that are connected by control flow edges to form a control flow graph. Each instruction represents an operation carried out by the machine, which can modify the program state. In contrast, higher-level languages (C and above) typically eliminate the instruction concept in favor of statements. Like instructions, statements can modify the program state. They differ from instructions in that they can consist of multiple expressions. Expressions are anonymous blocks of code that receive input and produce an output value upon evaluation. Inputs to an expression can be literals and identifiers, but they may also be other expressions. In fact, the semantics of statements can be fully encoded via expression trees, with control flow edges attached to the roots of these trees to represent the statement's control flow semantics.

The ability of a statement to represent several expressions allows for concise program formulation. However, it presents challenges to program analysis. While it is possible to create a control flow graph by introducing control flow edges between statements, this graph does not encode the intra-statement control flow. Fortunately, the expression tree, combined with the disambiguation rules of the programming language, fully encodes the evaluation order of expressions within a statement. This allows the evaluation order to be represented unambiguously with control flow edges.

Expressions consist of method evaluations and applications of operators provided by the language. By expressing operators as methods, and allowing methods to receive the return values of other methods as input, all expressions can be represented as method invocations. This yields a program representation for the method body that consists of method invocations connected by control flow edges.
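The lowering of expressions to method invocations can be sketched with Python's standard `ast` module, which already applies the language's precedence rules when it builds the expression tree. The operator method names (for example `<operator>.addition`) are hypothetical, and left-to-right evaluation of operands is an assumption that holds for Python but not for every language.

```python
import ast

# Hypothetical method names standing in for built-in operators.
OPERATOR_NAMES = {
    ast.Add: "<operator>.addition",
    ast.Sub: "<operator>.subtraction",
    ast.Mult: "<operator>.multiplication",
    ast.Div: "<operator>.division",
}


def lower(expr, out):
    """Append (node type, name) pairs in evaluation order.

    Operands are emitted before the call that consumes them, so the
    resulting list is a post-order walk of the expression tree.
    """
    if isinstance(expr, ast.Name):
        out.append(("IDENTIFIER", expr.id))
    elif isinstance(expr, ast.Constant):
        out.append(("LITERAL", repr(expr.value)))
    elif isinstance(expr, ast.BinOp) and type(expr.op) in OPERATOR_NAMES:
        lower(expr.left, out)
        lower(expr.right, out)
        out.append(("CALL", OPERATOR_NAMES[type(expr.op)]))
    elif isinstance(expr, ast.Call) and isinstance(expr.func, ast.Name):
        for argument in expr.args:
            lower(argument, out)
        out.append(("CALL", expr.func.id))
    else:
        raise NotImplementedError(type(expr).__name__)
    return out


order = lower(ast.parse("a + f(b) * 2", mode="eval").body, [])

# Intra-statement control flow: one CFG edge between consecutive nodes.
cfg_edges = list(zip(order, order[1:]))
```

For `a + f(b) * 2`, the calls appear in the order `f`, multiplication, addition, which is the order that operator precedence dictates.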

## Meta Data Block

A metadata block (type META_DATA) is included in the CPG with two fields: a language field (key LANGUAGE) that indicates the programming language from which the graph was generated, and a version field (key VERSION) that holds the specification version. Both fields are free-text strings.

# References

[1] Rodriguez and Neubauer - The Graph Traversal Pattern:

img/call-site.jpg

18.8 KB

img/call-site.odg

14.4 KB
Binary file not shown.

img/program-structure.jpg

49.2 KB

img/program-structure.odg

18.3 KB
Binary file not shown.

img/type-declaration.jpg

42.1 KB

img/type-declaration.odg

14.5 KB
Binary file not shown.
