2. Compilers
A compiler is a translator program that takes a program written in a high-level language (HLL), the source program, and translates it into an equivalent program in a machine-level language (MLL), the target program. An important part of a compiler's job is reporting errors to the programmer.
Executing a program written in an HLL basically involves two steps. The source program must first be compiled, that is, translated into an object program. The resulting object program is then loaded into memory and executed.
3. Compilers
A compiler bridges the semantic gap between a programming language
domain and an execution domain and generates a target program. The
target program may be a machine language program or an object
module.
4. TYPES OF COMPILERS
Based on the specific input it takes and the output it produces, compilers
can be classified into the following types:
●
Traditional Compilers (C, C++, Pascal): These compilers convert a source
program in an HLL into its equivalent in native machine code or object code.
●
Interpreters (LISP, SNOBOL, Java 1.0): These compilers first convert the
source code into an intermediate code, and then interpret (emulate) it to its
equivalent machine code.
●
Cross-Compilers: These are compilers that run on one machine and
produce code for another machine.
5. ●
Incremental Compilers: These compilers separate the source into user-defined
steps, compiling/recompiling step by step and interpreting the steps in a given order.
●
Converters (e.g., COBOL to C++): These programs compile from one
high-level language to another.
●
Just-In-Time (JIT) Compilers (Java, Microsoft .NET): These are runtime
compilers from an intermediate language (bytecode, MSIL) to executable code or
native machine code. They perform type-based verification, which makes the
executable code more trustworthy.
●
Ahead-of-Time (AOT) Compilers (e.g., .NET ngen): These are pre-compilers
to native code for Java and .NET.
●
Binary Compilation: These compilers translate the object code of one
platform into the object code of another platform.
6. PHASES OF A COMPILER
Due to the complexity of the compilation task, a compiler typically proceeds
in a sequence of compilation phases. The phases communicate with
each other via clearly defined interfaces. Generally an interface consists of
a data structure (e.g., a tree) and a set of exported functions. Each phase works
on an abstract intermediate representation of the source program, not on
the source program text itself (except the first phase).
Compiler phases are the individual modules which are executed in sequence to
perform their respective sub-activities; their results are finally integrated
to give the target code.
8. LEXICAL ANALYZER (SCANNER):
The scanner is the first phase and works as the interface between the compiler and
the source language program. It performs the following functions:
●
Reads the characters of the source program and groups them into a stream of
tokens, in which each token represents a logically cohesive sequence of
characters, such as an identifier, a keyword, a punctuation mark, or a
multi-character operator like :=.
●
The character sequence forming a token is called a lexeme of the token.
●
The scanner generates a token-id, and also enters the identifier's name in the
symbol table if it does not already exist there.
●
It also removes comments and unnecessary spaces.
The format of a token is <Token name, Attribute value>.
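As a concrete illustration, the <Token name, Attribute value> pair can be modelled as a small record. This is only a sketch; the token codes and field names below are hypothetical, not those of any particular compiler.

/* A minimal sketch of a token record; the token codes are hypothetical. */
#include <stdio.h>

enum token_name { TOK_ID, TOK_ASSIGN, TOK_NUMBER };

struct token {
    enum token_name name;   /* the abstract token name                      */
    int attribute;          /* e.g. a symbol-table index or a literal value */
};

int main(void) {
    /* For the input "count := 10" a scanner might emit this stream: */
    struct token stream[] = {
        { TOK_ID, 1 },       /* <id, symbol-table entry for "count"> */
        { TOK_ASSIGN, 0 },   /* <assign, ->                          */
        { TOK_NUMBER, 10 }   /* <number, 10>                         */
    };
    for (int i = 0; i < 3; i++)
        printf("<%d, %d>\n", stream[i].name, stream[i].attribute);
    return 0;
}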
9. SYNTAX ANALYZER (PARSER)
The parser interacts with the scanner and with its subsequent
phase, the semantic analyzer, and performs the following functions:
●
Groups the received token stream into syntactic
structures, usually into a structure called a parse tree, whose leaves are
tokens.
●
The interior nodes of this tree represent sequences of tokens that
logically belong together.
●
In other words, it checks the syntax of the program elements.
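To make the tree structure concrete, a parse-tree node might be sketched as below. The node layout and names are illustrative only, not a prescribed representation.

/* A minimal sketch of a parse-tree node: leaves hold tokens, interior
   nodes group children that logically belong together. Names are illustrative. */
#include <stdlib.h>

struct parse_node {
    const char *label;              /* e.g. "expr" for an interior node, or a token lexeme */
    struct parse_node **children;
    int child_count;                /* 0 for a leaf (a token) */
};

struct parse_node *make_node(const char *label, struct parse_node **kids, int n) {
    struct parse_node *node = malloc(sizeof *node);
    node->label = label;
    node->children = kids;
    node->child_count = n;
    return node;
}

int main(void) {
    /* A tiny tree for the expression a + b. */
    struct parse_node *kids[] = {
        make_node("a", NULL, 0),    /* leaf: token a */
        make_node("+", NULL, 0),    /* leaf: token + */
        make_node("b", NULL, 0)     /* leaf: token b */
    };
    struct parse_node *expr = make_node("expr", kids, 3);  /* interior node */
    return expr->child_count == 3 ? 0 : 1;
}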
10. SEMANTIC ANALYZER
This phase receives the syntax tree as input and checks the semantic
correctness of the program. Even though the tokens are valid and syntactically
correct, they may still be incorrect semantically. Therefore the
semantic analyzer checks the semantics (meaning) of the statements formed.
●
The syntactically and semantically correct structures are produced here in
the form of a syntax tree, a DAG, or some other sequential representation
such as a matrix.
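As a small illustration of the kind of check this phase performs, the sketch below verifies that both operands of an arithmetic operator are numeric. The type names and node layout are assumptions made for this example only.

/* A minimal sketch of one semantic check: both operands of an arithmetic
   operator must be numeric. Types and names are illustrative. */
enum type_kind { TYPE_INT, TYPE_REAL, TYPE_BOOL, TYPE_ERROR };

struct expr {
    enum type_kind type;    /* the type computed for this (sub)expression */
};

/* Returns the result type of `left op right`, or TYPE_ERROR on a mismatch. */
enum type_kind check_arith(const struct expr *left, const struct expr *right) {
    int left_ok  = (left->type == TYPE_INT  || left->type == TYPE_REAL);
    int right_ok = (right->type == TYPE_INT || right->type == TYPE_REAL);
    if (!left_ok || !right_ok)
        return TYPE_ERROR;              /* e.g. adding a boolean is semantically wrong */
    if (left->type == TYPE_REAL || right->type == TYPE_REAL)
        return TYPE_REAL;               /* a simple widening rule */
    return TYPE_INT;
}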
11. INTERMEDIATE CODE GENERATOR (ICG)
This phase takes the syntactically and semantically correct structure as
input, and produces its equivalent intermediate notation of the source
program. The Intermediate Code should have two important properties
specified below:
●
It should be easy to produce, and easy to translate into the target
program. Example intermediate code forms are:
●
Three-address code (sketched below),
●
Polish notation, etc.
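A common way to hold three-address code is as quadruples. The sketch below shows how the statement x = a + b * c might be represented; the struct layout and temporary names are illustrative assumptions, not a fixed format.

/* A minimal sketch of three-address code stored as quadruples.
   For x = a + b * c the generator could emit:
       t1 = b * c
       t2 = a + t1
       x  = t2                                           */
struct quad {
    char op;            /* '*', '+', '=' ...              */
    const char *arg1;
    const char *arg2;   /* NULL when unused (e.g. a copy) */
    const char *result;
};

const struct quad example[] = {
    { '*', "b",  "c",  "t1" },
    { '+', "a",  "t1", "t2" },
    { '=', "t2", NULL, "x"  },
};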
12. CODE OPTIMIZER
This phase is optional in some compilers, but it is very useful and beneficial in
terms of saving development time, effort, and cost. This phase performs
the following specific functions:
●
Attempts to improve the intermediate code so as to obtain faster machine code.
Typical functions include loop optimization, removal of redundant
computations, strength reduction, frequency reduction, etc. (a small
before/after sketch is given after this list).
●
Sometimes the data structures used in representing the intermediate
forms may also be changed.
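For instance, strength reduction replaces an expensive operation with a cheaper one, and moving a loop-invariant expression out of the loop removes a redundant computation. The before/after pair below is only an illustration written at the source level; real optimizers perform this on the intermediate code.

/* Before optimization: a multiplication inside the loop and a
   loop-invariant expression recomputed on every iteration. */
void before(int a[], int n, int x, int y) {
    for (int i = 0; i < n; i++) {
        a[i] = i * 4;                 /* strength-reduction candidate    */
        a[i] += (x + y) * (x + y);    /* redundant (loop-invariant) work */
    }
}

/* After optimization: the invariant expression is computed once, and the
   multiplication i * 4 becomes a running sum updated by addition. */
void after(int a[], int n, int x, int y) {
    int t = (x + y) * (x + y);        /* computed once (frequency reduction)  */
    int step = 0;
    for (int i = 0; i < n; i++) {
        a[i] = step + t;              /* same value as i * 4 + (x+y)*(x+y)    */
        step += 4;                    /* strength reduction: * replaced by +  */
    }
}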
13. CODE GENERATOR
This is the final phase of the compiler; it generates the target code,
normally consisting of relocatable machine code, assembly code, or
absolute machine code.
●
Memory locations are selected for each variable used, and variables are
assigned to registers.
●
Intermediate instructions are translated into a sequence of machine
instructions (a small sketch follows below).
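As a rough illustration of instruction selection, the sketch below turns one addition quadruple into a three-instruction sequence. The mnemonics LOAD/ADD/STORE and register R1 are hypothetical, not a specific real instruction set.

/* A minimal sketch of translating the quadruple t2 = a + t1 into
   target instructions. The mnemonics are hypothetical. */
#include <stdio.h>

void emit_add(const char *arg1, const char *arg2, const char *result) {
    printf("LOAD  R1, %s\n", arg1);    /* bring the first operand into a register */
    printf("ADD   R1, %s\n", arg2);    /* add the second operand                  */
    printf("STORE R1, %s\n", result);  /* write the result back to memory         */
}

int main(void) {
    emit_add("a", "t1", "t2");
    return 0;
}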
The compiler also performs symbol table management and error handling
throughout the compilation process. The symbol table is simply a data structure
that stores the different source language constructs and tokens generated during
compilation. These two activities interact with all phases of the compiler.
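A minimal sketch of such a symbol table is given below: a simple linear table with insert-if-absent lookup, as a scanner might use when it meets an identifier. The sizes and field names are illustrative assumptions.

/* A minimal sketch of a symbol table: a linear array of entries with
   insert-if-absent lookup. A real table would check for overflow and
   usually use hashing; this is kept simple on purpose. */
#include <string.h>

#define MAX_SYMBOLS 256

struct symbol {
    char name[64];    /* the identifier's lexeme             */
    int  type;        /* attribute filled in by later phases */
};

static struct symbol table[MAX_SYMBOLS];
static int symbol_count = 0;

/* Returns the index of `name`, inserting it if it is not yet present. */
int lookup_or_insert(const char *name) {
    for (int i = 0; i < symbol_count; i++)
        if (strcmp(table[i].name, name) == 0)
            return i;
    strncpy(table[symbol_count].name, name, sizeof table[0].name - 1);
    table[symbol_count].name[sizeof table[0].name - 1] = '\0';
    return symbol_count++;
}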
18. OVERVIEW OF LEXICAL ANALYSIS
●
To identify the tokens, we need some method of describing the possible
tokens that can appear in the input stream. For this purpose we introduce
regular expressions, a notation that can be used to describe essentially all the
tokens of a programming language.
●
Secondly, having decided what the tokens are, we need some mechanism to
recognize them in the input stream. This is done by the token recognizers,
which are designed using transition diagrams and finite automata.
19. ROLE OF LEXICAL ANALYZER
The main task of the lexical analyzer is to read the input characters of the
source program, group them into lexemes, and produce as output tokens for
each lexeme in the source program. This stream of tokens is sent to the parser
for syntax analysis. It is common for the lexical analyzer to interact with the
symbol table as well. When the lexical analyzer discovers a lexeme constituting
an identifier, it needs to enter that lexeme into the symbol table.
20. Token
A token is a sequence of characters that can be treated as a single logical
entity. The token name is an abstract symbol representing a kind of
lexical unit, e.g., a particular keyword, or a sequence of input characters
denoting an identifier. The token names are the input symbols that
the parser processes.
Typical tokens are:
1) identifiers, 2) keywords, 3) operators, 4) special symbols, 5) constants.
21. Pattern
●
A set of strings in the input for which the same token is produced as
output. This set of strings is described by a rule called a pattern
associated with the token. In the case of a keyword as a token, the
pattern is just the sequence of characters that forms the keyword.
●
A pattern is a rule describing the set of lexemes that can represent a
particular token in the source program.
●
[a-zA-Z][a-zA-Z0-9]* is the identifier pattern (a small checking routine is sketched below).
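As an illustration, the identifier pattern above could be checked by a routine like the following sketch. The function name is made up for the example; note that the pattern, as written, does not allow underscores.

/* A minimal sketch of checking a string against the identifier pattern
   [a-zA-Z][a-zA-Z0-9]*. Returns 1 if the whole string matches, else 0. */
#include <ctype.h>

int is_identifier(const char *s) {
    if (!isalpha((unsigned char)s[0]))      /* first character must be a letter */
        return 0;
    for (int i = 1; s[i] != '\0'; i++)
        if (!isalnum((unsigned char)s[i]))  /* the rest: letters or digits only */
            return 0;
    return 1;
}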
22. Lexeme
A lexeme is a sequence of characters in the source program that
matches the pattern for a token and is identified by the lexical analyzer
as an instance of that token.
Example: In the following C language statement,
printf("Total = %d\n", score);
both printf and score are lexemes matching the pattern for the token id, and
"Total = %d\n" is a lexeme matching the pattern for a literal (or string).
23. Input Buffering
The lexical analyzer scans the input from left to right, one character at a
time. It uses two pointers, the begin pointer (bp) and the forward pointer (fp),
to keep track of the portion of the input scanned.
Input buffering is an important concept in compiler design that refers to
the way in which the compiler reads input from the source code. In many
cases, the compiler reads input one character at a time, which can be a
slow and inefficient process. Input buffering is a technique that allows the
compiler to read input in larger chunks, which can improve performance
and reduce overhead.
24. The basic idea behind input buffering is to read a block of input from the
source code into a buffer, and then process that buffer before reading
the next block. The size of the buffer can vary depending on the specific
needs of the compiler and the characteristics of the source code being
compiled.
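A rough sketch of this idea is given below, with bp and fp kept as indices into a fixed-size buffer that is refilled block by block. The buffer size and names are illustrative; a real scanner normally uses a pair of buffers with sentinel characters so that a lexeme can span a block boundary, which this sketch omits.

/* A minimal sketch of block-wise input buffering with begin (bp) and
   forward (fp) pointers kept as indices into the buffer. */
#include <stdio.h>

#define BUF_SIZE 4096

static char buffer[BUF_SIZE];
static size_t filled = 0;   /* characters delivered by the last read */
static size_t bp = 0;       /* start of the lexeme being recognized  */
static size_t fp = 0;       /* next character to examine             */

/* Refill the buffer with the next block from the source file. */
int refill(FILE *src) {
    filled = fread(buffer, 1, BUF_SIZE, src);
    bp = fp = 0;
    return filled > 0;
}

/* Return the next character, refilling when the buffer is exhausted. */
int next_char(FILE *src) {
    if (fp >= filled && !refill(src))
        return EOF;
    return (unsigned char)buffer[fp++];
}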
25. LEX, the Lexical Analyzer Generator
Lex is a tool used to generate a lexical analyzer. The input notation for the Lex tool
is referred to as the Lex language, and the tool itself is the Lex compiler. The Lex
compiler transforms the input patterns into a transition diagram and generates
code in a file called lex.yy.c; this is a C program that is given to the C compiler,
which produces the object code.
26. The declarations section includes declarations of variables, manifest
constants (identifiers declared to stand for a constant, e.g., the name of a
token), and regular definitions. C declarations in this section appear between %{ . . . %}.
In the translation rules section, we place pattern-action pairs, where
each pair has the form
Pattern {Action}
The auxiliary function definitions section includes the definitions of
functions used to install identifiers and numbers in the symbol table.
27. LEX Program Example:
%{
/* definitions of manifest constants LT,LE,EQ,NE,GT,GE, IF,THEN,
ELSE,ID, NUMBER,
RELOP */
%}
LT, LE, ... are manifest constants for the relational operators <, <=, etc.
IF, THEN, ELSE are manifest constants for the conditional-statement keywords.
ID stands for identifiers (variable names).
NUMBER stands for numeric literals.
RELOP stands for a relational-operator token.
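To show how the generated scanner might be used, here is a hedged sketch of a small driver that could be compiled and linked together with lex.yy.c. It assumes the usual flex-style interface, where yylex() returns the next token code (0 at end of input) and yytext points to the lexeme that was matched; the printed format is made up for the example.

/* A minimal sketch of a driver for the scanner generated in lex.yy.c. */
#include <stdio.h>

extern int yylex(void);     /* produced by Lex in lex.yy.c            */
extern char *yytext;        /* the lexeme matched by the last yylex() */

int main(void) {
    int token;
    while ((token = yylex()) != 0)              /* 0 signals end of input */
        printf("token %d, lexeme \"%s\"\n", token, yytext);
    return 0;
}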
30. Recognition of tokens:
Recognition of tokens is the process of identifying and classifying the basic
symbols or building blocks of a programming language, such as keywords,
identifiers, literals, and symbols.
Token Recognition Techniques:
1. Regular Expressions: Used to define patterns for matching tokens, such as
keywords, identifiers, and literals.
2. Finite State Machines: Used to recognize tokens by traversing a finite
state machine whose states and transitions encode the pattern of each token.
3. Lexical Analysis: Involves scanning the input stream and identifying tokens
based on the defined patterns.
31. Token Recognition Process:
1. Scanning: The input stream is scanned character by character.
2. Pattern Matching: The scanner applies regular expressions or finite
state machines to match the input characters against known token
patterns.
3. Token Identification: When a match is found, the scanner identifies
the token and returns its type and value.
4. Error Handling: If the input stream contains invalid characters or
token patterns, the scanner reports an error.
A small hand-written sketch of this loop, for identifiers and numbers only, follows.
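The sketch below recognizes only identifiers and unsigned integers, character by character, and reports anything else as an error; the token names and the driver are assumptions made for this example.

/* A minimal sketch of a hand-written state-machine scanner that
   recognizes identifiers and unsigned integers. */
#include <ctype.h>
#include <stdio.h>

enum tok { TOK_EOF, TOK_ID, TOK_NUM, TOK_ERROR };

enum tok scan(const char **p, char *lexeme, int max) {
    int n = 0;
    while (isspace((unsigned char)**p)) (*p)++;        /* 1. scanning: skip blanks */
    if (**p == '\0') return TOK_EOF;
    if (isalpha((unsigned char)**p)) {                 /* 2. pattern matching: id  */
        while (isalnum((unsigned char)**p) && n < max - 1)
            lexeme[n++] = *(*p)++;
        lexeme[n] = '\0';
        return TOK_ID;                                 /* 3. token identification  */
    }
    if (isdigit((unsigned char)**p)) {                 /* 2. pattern matching: num */
        while (isdigit((unsigned char)**p) && n < max - 1)
            lexeme[n++] = *(*p)++;
        lexeme[n] = '\0';
        return TOK_NUM;
    }
    lexeme[0] = *(*p)++;                               /* 4. error handling        */
    lexeme[1] = '\0';
    return TOK_ERROR;
}

int main(void) {
    const char *input = "count 42 ;";
    char lex[64];
    enum tok t;
    while ((t = scan(&input, lex, sizeof lex)) != TOK_EOF)
        printf("token %d: %s\n", t, lex);
    return 0;
}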
32. Token Types:
1. Keywords: Reserved words with special meanings, such as "if,"
"while," or "function."
2. Identifiers: Names given to variables, functions, or labels.
3. Literals: Values that are represented directly in the code, such as
numbers, strings, or booleans.
4. Symbols: Special characters used in the programming language,
such as operators (+, -, *, /), separators (,), or punctuation (.).