Tokens in C

In C, tokens are the smallest meaningful elements used to create a program. They include keywords, identifiers, constants, string literals, operators, punctuation marks, and special symbols. When a C program is compiled, it is broken down into these tokens, enabling the compiler to analyse and understand the program's structure. 

Tokenization is a crucial step of the compilation process, as it allows the compiler to generate executable code from the provided C program by organising and categorising its individual elements.

Types of Tokens in C

Tokens are the fundamental building blocks used to construct programs in the C language. A token is the smallest individual element that is meaningful to the compiler. There are six types of tokens:

  • Keywords: These are reserved words in the C language with predefined meanings. Examples include "int", "float", "if", "for", and "while". The original C standard (C89/C90) defines 32 keywords; later standards such as C99 and C11 add a few more.

For example, the keyword "if" introduces a conditional statement that executes a block of code when a certain condition is true.

if (x > 0) {
   printf("x is positive");
}
  • Identifiers: These tokens are user-defined names representing variables, functions, or entities. An identifier must follow certain naming rules and should not be the same as any keyword.

Certain rules are commonly used to recognise identifiers - 

1. The first character of an identifier must be a letter or an underscore. It cannot be a digit.

2. Identifiers in C are case-sensitive, so lowercase and uppercase letters are distinct: total and Total are different identifiers.

3. The maximum length of an identifier is implementation-specific. C89 guarantees that at least the first 31 characters of an internal identifier are significant, and C99 raises this to 63.

4. Commas and blank spaces are not allowed within an identifier.

5. Using C keywords as identifiers is not permissible since they have reserved meanings for specific purposes in the language.

Examples of identifiers: age, _count, total_marks, and MAX_VALUE.

  • Constants: Constants represent fixed values that cannot be altered during program execution. They can be numeric constants (e.g., 10, 3.14) or character/string constants (e.g., 'A', "Hello").

Types of constants, with examples:

    • Integer constant: 20, 41, 94, etc.
    • Octal constant: 011, 033, 077, etc.
    • Floating-point constant: 13.9, 25.7, 87.4, etc.
    • Character constant: 'p', 'q', 'r', etc.
    • String constant: "c++", ".net", "java", etc.
    • Hexadecimal constant: 0x5A, 0x1A, 0x8F, etc. (only the digits 0-9 and the letters A-F may follow the 0x prefix)

  • String literals: These tokens represent sequences of characters enclosed within double quotes. They are commonly used to represent text or messages in a program.

In C, strings are represented as arrays of characters, terminated by a null character '\0'. The null character denotes the end of the string. String literals are always enclosed within double quotes (" ").

When declaring a string in C, you can use different syntaxes. For example:

1. Using character array initialization:

char string[10] = {'s', 'c', 'a', 'l', 'e', 'r', '\0'};

Here, string[10] indicates that 10 bytes of memory space are allocated to hold the string value. Each string character is explicitly specified within single quotes, and the null character '\0' marks the end of the string.

2. Using string literal initialization:

char string[10] = "scaler";

The string is directly initialized with the literal "scaler" in this case. The compiler automatically appends the null character '\0' at the end of the string. Again, string[10] indicates that 10 bytes of memory space are allocated.

3. Letting the compiler deduce the array size:

char string[] = "scaler";

Here, the string is declared without specifying the size. The compiler works out the size from the initializer at compile time, so the array occupies 7 bytes: six characters plus the automatically appended null character '\0'. This is not dynamic memory allocation, which in C is done at run time with functions such as malloc.

  • Operators: Operators are symbols used to perform operations on data. Examples include arithmetic operators (+, -, *, /), logical operators (&&, ||), and relational operators (==, >, <).

Based on the number of operands they take, operators fall into three types - 

  • Unary Operator: Operates on a single operand. Examples include ++ (increment), -- (decrement), ! (logical negation), and sizeof.

  • Binary Operator: Operates on two operands. Examples include arithmetic operators (+, -, *, /), relational operators (==, !=, >, <), logical operators (&&, ||), and assignment operator (=).

  • Ternary Operator: Takes three operands and allows for conditional decision-making. Syntax: condition ? expression1 : expression2. It provides a concise alternative to if-else statements.

  • Special symbols: These tokens are punctuation characters with a fixed meaning in the language, such as brackets, braces, and commas. Escape sequences (\n, \t), which begin with a backslash (\), are not tokens themselves; they appear inside character constants and string literals to represent newlines, tabs, and other special characters.

    • () Parentheses: Used for function calls and declarations.
    • [] Square brackets: Represent array subscripts.
    • , Comma: Used to separate function arguments, parameters, and the variables in a declaration.
    • {} Curly braces: Used to define code blocks, such as function and loop bodies.
    • * Asterisk: Used to declare and dereference pointers, and as the multiplication operator.
    • # Hash: Introduces preprocessor directives such as #include and #define.
    • . Period: Used to access members of a structure or union.
    • ~ Tilde: The bitwise NOT (one's complement) operator, which inverts every bit of its operand.

Classification of Tokens

Tokens in C are sometimes grouped informally into primary and secondary tokens. This grouping is a teaching convention rather than part of the C standard. Here's an elaboration on each:

Primary Tokens

These are the fundamental elements of a programming language. They are directly recognised by the lexer or tokenizer, the component responsible for breaking down the source code into tokens. Primary tokens include 

  • Keywords

  • Identifiers

  • Constants

  • Operators. 

Secondary Tokens

Secondary tokens cover the remaining syntactic elements of a program. Some of them combine simpler pieces into a single token: a string literal groups a sequence of characters between double quotes, and a compound operator such as += joins two operator characters. Secondary tokens include 

  • Strings

  • Special Characters

  • Compound Operators

Rules for Naming Identifiers

There are specific rules to be followed when naming identifiers -

  • Valid characters: An identifier can include letters (uppercase and lowercase), digits, and underscores.

  • First character: The first character of an identifier must be a letter or an underscore.

  • Avoid keywords: You cannot use reserved keywords, such as int, while, etc., as identifiers.

  • Length limitation: The language sets no upper limit on the length of an identifier, but only a certain number of leading characters are guaranteed to be significant (31 in C89 and 63 in C99 for internal identifiers). Keeping names reasonably short avoids two long names being treated as the same identifier by an older compiler.

As long as these rules are followed, any name can be chosen for an identifier; however, it is important to ensure that the chosen name is valid and makes sense. 

Some examples of identifiers include - 

  • age

  • studentName

  • _count

  • total_marks

  • PI

  • MAX_VALUE

  • numberOfStudents

  • myVariable

  • isValid

  • bookTitle

These examples demonstrate valid identifiers that follow the rules mentioned earlier. They consist of letters (both uppercase and lowercase) and underscores; digits would also be allowed anywhere except the first position. The first character is either a letter or an underscore, and none of the names conflicts with a reserved keyword. Identifiers are essential for naming variables, functions, and other elements in a C program, providing meaningful names to represent data and logic.

Tokens and Expressions

In the C programming language, an expression is a combination of operands, operators, and function calls that are evaluated to produce a value. It represents a computation or a calculation that yields a result. Expressions can involve variables, constants, arithmetic operations, logical operations, function calls, etc.

An expression can be as simple as a single constant or variable, or as complex as a combination of multiple operators and operands. Expressions can also be used as parts of larger expressions or as function arguments.

Examples of Expressions:

Arithmetic Expression:

int result = 2 + 3 * 4;

In this example, the expression 2 + 3 * 4 is an arithmetic expression that performs addition and multiplication. Because multiplication has higher precedence than addition, 3 * 4 is evaluated first, and the result, 14, is stored in the variable ‘result’.

Relational Expression:

int x = 5, y = 7;
int isGreater = x > y;

Here, the expression x > y is a relational expression that compares the values of x and y. The result of this expression is either true (1) or false (0), depending on whether ‘x’ is greater than ‘y’. The result is stored in the variable ‘isGreater’.

Tokenization Process

Lexical Analysis:

Lexical analysis, also known as scanning, is the initial phase of the compiler where the source code is divided into individual tokens or lexemes. It analyses the characters of the source code to form these tokens, which are meaningful units such as keywords, identifiers, constants, operators, and punctuation marks.

Check out this C code example to better understand the tokenizing process - 

#include <stdio.h>

int main() {
    int x = 5;
    printf("The value of x is %d\n", x);
    return 0;
}

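During lexical analysis, the source code is divided into tokens. The #include <stdio.h> line is a preprocessor directive, so include and stdio.h are not keywords; the remaining code breaks down as follows:

  • Keywords: int, return

  • Identifiers: main, x, printf

  • Punctuation marks: {, }, (, ), ;, and the comma

  • Operators: =

  • Constants: 5, 0

  • Strings: "The value of x is %d\n"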

Syntax Analysis:

Syntax analysis, also known as parsing, is the second phase of the compiler. It checks whether the sequence of tokens formed during lexical analysis follows the syntax rules defined by the programming language. It builds a parse tree or syntax tree that represents the hierarchical structure of the program based on the language's grammar rules.

Example:

Continuing from the previous example, during syntax analysis, the compiler verifies if the tokens and their arrangement follow the syntax rules of the C language. It checks for the

  • Correct placement of keywords

  • Proper use of operators and punctuation marks

  • Adherence to language-specific grammar.

If the syntax analysis is successful, the program is considered syntactically correct. Otherwise, syntax errors are reported, indicating that the program violates the language's grammar rules.

Practice Problems on Tokens in C

1. Which of the following is not a valid C Token?

A. Identifier

B. Whitespace

C. Punctuation

D. Keyword

Answer: B. Whitespace

2. Which of these is not a valid identifier?

A. myVariable

B. 123cdd

C. _grade

D. variable_start

Answer: B. 123cdd

3. Find the number of Tokens in the following C statement.

printf("Hello, %s!", Bill);

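A. 7

B. 8

C. 9

D. 11

Answer: A. 7 (the tokens are printf, the opening parenthesis, the string "Hello, %s!", the comma, Bill, the closing parenthesis, and the semicolon)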

Conclusion

Tokens in C are the smallest elements that make up a program. Understanding and using tokens correctly is essential for writing error-free C programs, because tokens are what the compiler works with when it processes and analyses code. Knowledge of tokens helps programmers express logic, perform computations, manipulate data, and create efficient software.


FAQs

1. What are the six types of Tokens in C?

The six types of Tokens in C programming include Keywords, Identifiers, Operators, Constants, Strings and Special Characters.

2. What is the role of operators in C programming?

In C programming, operators manipulate values and help control the flow of a program. Arithmetic, logical, and relational operators together cover a wide range of operations.

3. Can an identifier start with a numerical digit in C?

No. In C, an identifier must start with either an underscore or a letter. An identifier that starts with a digit is invalid under the rules of the language.
