Seed7 Manual: Tokens

10. TOKENS

A program consists of a sequence of tokens which may be delimited by white space. There are two types of tokens:

identifiers

literals

Syntax:

program ::= { white_space | token } .
token ::= identifier | literal .

Characters that introduce neither white_space nor a token trigger a parsing error:

*** tst255.sd7(1):3: Illegal character in text "\8;" (U+0008)
(* Illegal character *) \b
------------------------^

10.1 White space

There are three types of white space:

spaces

comments

line comments

White space always terminates a preceding identifier, integer, bigInteger or float literal. Some white space is required to separate otherwise adjacent tokens.

Syntax:

white_space ::= ( space | comment | line_comment ) { space | comment | line_comment } .

10.1.1 Spaces

There are several types of space characters which are ignored except as they separate tokens:

blanks, horizontal tabs, carriage returns and new lines.

Syntax:

space ::= ' ' | TAB | CR | NL .

10.1.2 Comments

Comments are introduced with the characters (* and are terminated with the characters *). For example:

(* This is a comment *)

Comments can span multiple lines and can be nested:

(* This is a comment that continues
   in the next line (* and has a nesting comment inside *) *)

This allows commenting out a larger section of the program, which itself contains comments. Comments cannot occur within string and character literals.

Syntax:

comment ::= '(*' { any_character } '*)' .
any_character ::= simple_literal_character | apostrophe | '"' | '\' | control_character .
control_character ::= NUL | SOH | STX | ETX | EOT | ENQ | ACK | BEL | BS | TAB | LF | VT | FF | CR | SO | SI | DLE | DC1 | DC2 | DC3 | DC4 | NAK | SYN | ETB | CAN | EM | SUB | ESC | FS | GS | RS | US | DEL .

If a comment is not closed at the end of the main file a parsing error is triggered:

*** tst256.sd7(2):4: Unclosed comment
(* Unclosed comment

10.1.3 Line comments

Line comments are introduced with the character # and are terminated with the end of the line. For example:

# This is a comment

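Line comments cannot occur within string, character and numerical literals. Inside a string or character literal the # is an ordinary character, and directly after a sequence of digits it introduces a based number.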

Syntax:

line_comment ::= '#' { any_character } NL .
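
The comment rules above can be illustrated with a small program. This is only a sketch: it assumes the standard library include seed7_05.s7i, and the strings that are written are made up for the illustration.

```seed7
$ include "seed7_05.s7i";

(* A comment may span several lines
   and may contain (* a nested comment *). *)

const proc: main is func
  begin
    writeln("start");  # A line comment ends at the end of the line.
    (* Commenting out code that itself contains a comment:
    writeln("skipped");  (* inner comment *)
    *)
    writeln("# and (* *) are ordinary characters inside a string literal");
  end func;
```

Only the two writeln statements outside the comments are executed. The second one shows that comment introducers lose their meaning inside a string literal.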

10.2 Identifiers

There are three types of identifiers:

name identifiers

special identifiers

brackets

Identifiers can be written adjacent to each other, except that two name identifiers, or two special identifiers, must be separated by white space.

Syntax:

identifier ::= name_identifier | special_identifier | bracket .

10.2.1 Name identifiers

A name identifier is a sequence of letters, digits and underscores ( _ ). The first character must be a letter or an underscore. Examples of name identifiers are:

NUMBER  integer  const  if  UPPER_LIMIT  LowerLimit  x5  _end

Upper and lower case letters are different. Name identifiers may have any length and all characters are significant. The name identifier is terminated with a character which is neither a letter (or _ ) nor a digit. The terminating character is not part of the name identifier.

Syntax:

name_identifier ::= ( letter | underscore ) { letter | digit | underscore } .
letter ::= upper_case_letter | lower_case_letter .
upper_case_letter ::= 'A' | 'B' | 'C' | 'D' | 'E' | 'F' | 'G' | 'H' | 'I' | 'J' | 'K' | 'L' | 'M' | 'N' | 'O' | 'P' | 'Q' | 'R' | 'S' | 'T' | 'U' | 'V' | 'W' | 'X' | 'Y' | 'Z' .
lower_case_letter ::= 'a' | 'b' | 'c' | 'd' | 'e' | 'f' | 'g' | 'h' | 'i' | 'j' | 'k' | 'l' | 'm' | 'n' | 'o' | 'p' | 'q' | 'r' | 's' | 't' | 'u' | 'v' | 'w' | 'x' | 'y' | 'z' .
digit ::= '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9' .
underscore ::= '_' .

10.2.2 Special identifiers

A special identifier is a sequence of special characters. Examples of special identifiers are:

+  :=  <=  *  ->  ,  &

Here is a list of all special characters:

! $ % & * + , - . / : ; < = > ? @ \ ^ ` | ~

Special identifiers may have any length and all characters are significant. The special identifier is terminated with a character which is not a special character. The terminating character is not part of the special identifier.

Syntax:

special_identifier ::= special_character { special_character } .
special_character ::= '!' | '$' | '%' | '&' | '*' | '+' | ',' | '-' | '.' | '/' | ':' | ';' | '<' | '=' | '>' | '?' | '@' | '\' | '^' | '`' | '|' | '~' .

10.2.3 Brackets

A bracket is one of the following characters:

( ) [ ] { }

Note that a bracket consists of only one character. Except for the character sequence (* (which introduces a comment) a bracket is terminated with the next character.

Syntax:

bracket ::= '(' | ')' | '[' | ']' | '{' | '}' .
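
The separation rules for identifiers can be seen in a short program. This is a minimal sketch that assumes the standard library include seed7_05.s7i; the variable name number is chosen only for the illustration.

```seed7
$ include "seed7_05.s7i";

const proc: main is func
  local
    var integer: number is 0;
  begin
    # Name identifiers and special identifiers may be written adjacent:
    number:=number+1;
    # Two special identifiers (here := and -) need white space between them:
    number := - number;
    # Without the space the characters :=- would be read as one special identifier.
    # Each bracket is a token of its own, so (( and )) need no white space:
    writeln((number));
  end func;
```

The program writes -1. The assignment written as number:=-number; would not be accepted, because the scanner reads the longest possible sequence of special characters as a single special identifier.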

10.3 Literals

There are several types of literals: integer, bigInteger, float, character and string literals.

Syntax:

literal ::= integer_literal | biginteger_literal | float_literal | character_literal | string_literal .

10.3.1 Integer literals

An integer literal is a sequence of digits which is taken to be decimal. The sequence of digits may be followed by the letter E or e, an optional + sign and a decimal exponent. A based number is written as a sequence of digits followed by the # character and a sequence of extended digits. The decimal number in front of the # character specifies the base of the number that follows it. The base must be between 2 and 36. As extended digits, the letters A or a stand for 10, B or b for 11, and so on up to Z or z, which stand for 35.

Syntax:

integer_literal ::= decimal_integer [ exponent | based_integer ] .
decimal_integer ::= digit { digit } .
exponent ::= ( 'E' | 'e' ) [ '+' ] decimal_integer .
based_integer ::= '#' extended_digit { extended_digit } .
extended_digit ::= letter | digit .

If an integer literal cannot be read a parsing error is triggered:

*** tst256.sd7(2):14: Integer "12345678901234567890" too big
const integer: tooBig is 12345678901234567890;
---------------------------------------------^
*** tst256.sd7(3):15: Negative exponent in integer literal
const integer: negativeExponent is 1e-1;
-------------------------------------^
*** tst256.sd7(4):16: Digit expected found ";"
const integer: digitExpected is 1e;
----------------------------------^
*** tst256.sd7(5):17: Integer "1E20" too big
const integer: integerWithExponentTooBig is 1e20;
------------------------------------------------^
*** tst256.sd7(6):18: Integer base "37" not between 2 and 36
const integer: baseNotBetween2To36 is 37#0;
----------------------------------------^
*** tst256.sd7(7):19: Extended digit expected found ";"
const integer: extendedDigitExpected is 16#;
-------------------------------------------^
*** tst256.sd7(8):20: Illegal digit "G" in based integer "16#G"
const integer: illegalBasedDigit is 16#G;
----------------------------------------^
*** tst256.sd7(9):21: Based integer "16#ffffffffffffffff" too big
const integer: basedIntegerTooBig is 16#ffffffffffffffff;
--------------------------------------------------------^
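
For contrast with the error cases, the following sketch shows integer literals that follow the syntax above. It assumes the standard library include seed7_05.s7i; the chosen values are arbitrary.

```seed7
$ include "seed7_05.s7i";

const proc: main is func
  begin
    writeln(255);      # decimal
    writeln(16#ff);    # base 16, the same value as 255
    writeln(16#FF);    # upper case extended digits are equivalent
    writeln(2#101);    # base 2, the value 5
    writeln(36#z);     # base 36, the extended digit z stands for 35
    writeln(1e3);      # exponent, the value 1000
    writeln(1E+3);     # the same value with an explicit + sign
  end func;
```

Note that the # after a sequence of digits introduces a based number, while the # after white space starts a line comment.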

10.3.2 BigInteger literals

A bigInteger literal is a sequence of digits followed by the underscore character ( _ ). The sequence of digits is taken to be decimal. A based number is written as a sequence of digits followed by the # character, a sequence of extended digits and the underscore character. The decimal number in front of the # character specifies the base of the number that follows it. The base must be between 2 and 36. As extended digits, the letters A or a stand for 10, B or b for 11, and so on up to Z or z, which stand for 35.

Syntax:

biginteger_literal ::= decimal_integer [ based_integer ] '_' .
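
A minimal sketch of bigInteger literals follows. It assumes the library bigint.s7i, which provides the bigInteger type, in addition to seed7_05.s7i. The values are the ones that were too big for integer literals in the error examples of the previous section.

```seed7
$ include "seed7_05.s7i";
  include "bigint.s7i";

const proc: main is func
  begin
    # Too big for an integer literal, but valid with a trailing underscore:
    writeln(12345678901234567890_);
    # A based bigInteger literal:
    writeln(16#ffffffffffffffff_);
    # Both notations can denote the same value, so this writes TRUE:
    writeln(16#ffffffffffffffff_ = 18446744073709551615_);
  end func;
```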

10.3.3 Float literals

A float literal consists of two decimal integer literals separated by a decimal point. This basic float literal may be followed by the letter E or e, an optional + or - sign and a decimal exponent.

Syntax:

float_literal ::= decimal_integer '.' decimal_integer [ float_exponent ] .
float_exponent ::= ( 'E' | 'e' ) [ '+' | '-' ] decimal_integer .
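
The following sketch shows float literals that match this syntax. It assumes the library float.s7i and uses its digits operator to format the output with a fixed number of decimal places; the values are arbitrary.

```seed7
$ include "seed7_05.s7i";
  include "float.s7i";

const proc: main is func
  begin
    # Digits are required on both sides of the decimal point:
    writeln(3.0 digits 2);
    # Unlike integer literals, float literals allow a negative exponent:
    writeln(1.5e-3 digits 4);
    # An explicit + sign in the exponent is also allowed:
    writeln(2.5E+2 digits 1);
  end func;
```

According to the syntax, forms such as 3. or .5 are not float literals, and 1e3 without a decimal point is an integer literal.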

10.3.4 String literals

A string literal is a sequence of UTF-8 encoded Unicode characters surrounded by double quotes. For example:

""   " "   "\""   "'"   "\'"   "String"   "ch=\" "   "\n\n"
"Euro: \8364;"   "\16#ff;"

In order to represent non-printable characters and certain printable characters the following escape sequences may be used.

audible alert	BEL	`\a`
backspace	BS	`\b`
escape	ESC	`\e`
formfeed	FF	`\f`
newline	NL (LF)	`\n`
carriage return	CR	`\r`
horizontal tab	HT	`\t`
vertical tab	VT	`\v`
backslash	(\)	`\\`
apostrophe	(')	`\'`
double quote	(")	`\"`
control-A		`\A`
...
control-Z		`\Z`

Additionally there are the following possibilities:

Two backslashes with a sequence of blanks, horizontal tabs, carriage returns, new lines and line comments between them are completely ignored. The ignored characters are not part of the string. This can be used to continue a string in the following line. Note that in this case the leading spaces in the new line are not part of the string. It is an error if a backslash is followed by white space and no second backslash ends the sequence.
A backslash followed by an integer literal and a semicolon is interpreted as the character with the specified ordinal number. Note that the integer literal is interpreted as decimal unless it is written as a based integer.

Strings are implemented with a length field and UTF-32 encoding. Strings are not terminated with '\0;' and can therefore also contain binary data.

Syntax:

string_literal ::= '"' { string_literal_element } '"' .
string_literal_element ::= simple_literal_character | escape_sequence | apostrophe .
simple_literal_character ::= letter | digit | bracket | special_literal_character | utf8_encoded_character .
special_literal_character ::= ' ' | '!' | '#' | '$' | '%' | '&' | '*' | '+' | ',' | '-' | '.' | '/' | ':' | ';' | '<' | '=' | '>' | '?' | '@' | '^' | '_' | '`' | '|' | '~' .
escape_sequence ::= '\a' | '\b' | '\e' | '\f' | '\n' | '\r' | '\t' | '\v' | '\\' | '\''' | '\"' | '\' upper_case_letter | '\' { space | line_comment } '\' | '\' integer_literal ';' .
apostrophe ::= ''' .

If a string literal cannot be read a parsing error is triggered:

*** tst256.sd7(2):24: Use \" instead of "" to represent " in a string
const string: wrongQuotationRepresentation is "double "" quotations";
-------------------------------------------------------^
*** tst256.sd7(3):25: Illegal string escape "\z"
const string: illegalStringEscape is "\z";
---------------------------------------^
*** tst256.sd7(4):26: Numerical escape sequences should end with ";" not "x"
const string: wrongNumericEscape is "\1234xyz";
------------------------------------------^
*** tst256.sd7(5):27: The numerical escape sequence "\1234678123467892346;" is too big
const string: numericEscapeTooBig is "asd\1234678123467892346;dfdfg";
-------------------------------------------------------------^
*** tst256.sd7(6):28: String continuations should end with "\" not "c"
const string: backslashExpected is "string \      continuation";
--------------------------------------------------^
*** tst256.sd7(7):29: String literal exceeds source line
const string: exceedsSourceLine is "abc
---------------------------------------^
*** tst256.sd7(8):31: Integer literal expected found "1.5"
const string: integerExpected is "\1.5;";
--------------------------------------^
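
For contrast with the error cases, the sketch below uses the escape sequences correctly. It assumes the standard library include seed7_05.s7i; the constant name continued and the string contents are made up for the illustration.

```seed7
$ include "seed7_05.s7i";

# A string continued in the next line. The white space between
# the two backslashes is not part of the string.
const string: continued is "first part, \
                           \second part";

const proc: main is func
  begin
    writeln("quote: \" backslash: \\");
    writeln("\65;\16#42;C");            # decimal and based numeric escapes, writes ABC
    writeln(continued);                 # writes: first part, second part
    writeln(length("Euro: \8364;"));    # 7, the numeric escape denotes one character
    writeln(length("a\0;b"));           # 3, the character with ordinal 0 does not end the string
  end func;
```

The length function is used for the last two strings so that the result does not depend on how the output file encodes characters.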

10.3.5 Character literals

A character literal is a UTF-8 encoded Unicode character enclosed in apostrophes. For example:

'a'   ' '   '\n'   '!'   '\\'   '2'   '"'   '\"'   '\''   '\8;'

To represent control characters and certain other characters in character literals the same escape sequences as for string literals may be used.

Syntax:

character_literal ::= apostrophe char_literal_element apostrophe .
char_literal_element ::= simple_literal_character | escape_sequence | apostrophe | '"' .

If a char literal cannot be read a parsing error is triggered:

*** tst256.sd7(2):22: "'" expected found ";"
const char: apostropheExpected is 'x;
------------------------------------^
*** tst256.sd7(3):23: Character literal exceeds source line
const char: charExceeds is '
----------------------------^
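
Well-formed character literals can be checked with the ord function, which returns the ordinal number of a character. This is a minimal sketch that assumes the standard library include seed7_05.s7i.

```seed7
$ include "seed7_05.s7i";

const proc: main is func
  begin
    writeln(ord('A'));            # 65
    writeln(ord('\n'));           # 10
    writeln(ord('\A'));           # 1, control-A
    writeln(ord('\''));           # 39, the apostrophe
    writeln(ord('\16#20ac;'));    # 8364, the Euro sign written as based numeric escape
    writeln('\65;' = 'A');        # TRUE, a numeric escape denotes the same character
  end func;
```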

10.4 Unicode characters

Seed7 source code may contain UTF-8 encoded Unicode characters. Unicode is allowed in string and char literals. The names pragma can be used to allow Unicode in name identifiers:

$ names unicode;

Comments and line comments may also contain Unicode, but they are not checked for valid UTF-8. This way code parts with invalid UTF-8 can be commented out. Invalid UTF-8 encodings in identifiers and literals trigger a parsing error:

*** err.sd7(90):61: Overlong UTF-8 encoding used for character "\0;" (U+0000)
ignore("\0;");
-----------^
*** err.sd7(91):62: UTF-16 surrogate character found in UTF-8 encoding "\55296;" (U+d800)
ignore("\55296;");
---------------^
*** err.sd7(92):63: Non Unicode character found "\1114112;" (U+110000)
ignore("\1114112;");
----------^
*** err.sd7(93):64: UTF-8 continuation byte expected found "A"
ignore("í\128;A");
--------------^
*** err.sd7(94):65: Unexpected UTF-8 continuation byte found "\128;" (U+0080)
ignore("\128;");
--------^
*** err.sd7(95):66: Solitary UTF-8 start byte found "\237;" (U+00ed)
ignore("íA");
---------^
*** bom16(1):67: UTF-16 byte order mark found "\65279;" (U+feff)
þÿ
-^