10. TOKENS
A program consists of a sequence of tokens which may be delimited by white space. There are two types of tokens: identifiers and literals.
- Syntax:
- program ::=
- { white_space | token } .
- token ::=
- identifier | literal .
Characters that introduce neither white_space nor a token trigger a parsing error:
*** tst255.sd7(1):5: Illegal character in text "\8;" (U+0008)
(* Illegal character *) \b
------------------------^
10.1 White space
There are three types of white space: spaces, comments and line comments.
White space always terminates a preceding identifier, integer, bigInteger or float literal. Some white space is required to separate otherwise adjacent tokens.
- Syntax:
- white_space ::=
-
( space | comment | line_comment )
{ space | comment | line_comment } .
10.1.1 Spaces
There are several types of space characters which are ignored except as they separate tokens:
- blanks, horizontal tabs, carriage returns and new lines.
- Syntax:
- space ::=
- ' ' | TAB | CR | NL .
10.1.2 Comments
Comments are introduced with the characters (* and terminated with the characters *). For example:
(* This is a comment *)
Comments can span multiple lines and can be nested:
(* This is a comment that continues
in the next line (* and has a nesting comment inside *) *)
This allows commenting out a larger section of the program, which itself contains comments. Comments cannot occur within string and character literals.
- Syntax:
- comment ::=
- '(*' { any_character } '*)' .
- any_character ::=
-
simple_literal_character | apostrophe | '"' | '\' |
control_character .
- control_character ::=
-
NUL | SOH | STX | ETX | EOT | ENQ | ACK | BEL |
BS | TAB | LF | VT | FF | CR | SO | SI |
DLE | DC1 | DC2 | DC3 | DC4 | NAK | SYN | ETB |
CAN | EM | SUB | ESC | FS | GS | RS | US |
DEL .
If a comment is not closed at the end of the main file a parsing error is triggered:
*** tst256.sd7(2):6: Unclosed comment
(* Unclosed comment
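Comment nesting can be handled by counting open comments. The following is a minimal sketch in Python, not the code of the Seed7 scanner; the function name skip_comment is hypothetical. It returns the position after the closing *) and reports an unclosed comment when the text ends first.

```python
def skip_comment(text: str, pos: int) -> int:
    """Return the index just past the comment that starts at text[pos]."""
    if not text.startswith("(*", pos):
        raise ValueError("no comment starts at this position")
    depth = 0
    i = pos
    while i < len(text):
        if text.startswith("(*", i):
            depth += 1      # entering a (possibly nested) comment
            i += 2
        elif text.startswith("*)", i):
            depth -= 1      # leaving one nesting level
            i += 2
            if depth == 0:
                return i    # the outermost comment is closed
        else:
            i += 1
    raise ValueError("Unclosed comment")

source = "(* outer (* inner *) still outer *) x"
print(repr(source[skip_comment(source, 0):]))
```

Because the counter only reaches zero at the outermost *), a commented out section may itself contain complete comments.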
10.1.3 Line comments
Line comments are introduced with the character # and are terminated with the end of the line. For example:
# This is a comment
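Line comments cannot occur within string, character and numerical literals. Within a numerical literal the # character introduces the digits of a based number instead of a line comment.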
- Syntax:
- line_comment ::=
- '#' { any_character } NL .
10.2 Identifiers
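There are three types of identifiers: name identifiers, special identifiers and brackets.
Identifiers can be written adjacent to each other, except that white space is required between two name identifiers and between two special identifiers. Without the white space they would be read as a single identifier.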
- Syntax:
- identifier ::=
- name_identifier | special_identifier | bracket .
10.2.1 Name identifiers
A name identifier is a sequence of letters, digits and underscores ( _ ). The first character must be a letter or an underscore. Examples of name identifiers are:
NUMBER integer const if UPPER_LIMIT LowerLimit x5 _end
Upper and lower case letters are different. Name identifiers may have any length and all characters are significant. The name identifier is terminated with a character which is neither a letter (or _ ) nor a digit. The terminating character is not part of the name identifier.
- Syntax:
- name_identifier ::=
- ( letter | underscore ) { letter | digit | underscore } .
- letter ::=
- upper_case_letter | lower_case_letter .
- upper_case_letter ::=
-
'A' | 'B' | 'C' | 'D' | 'E' | 'F' | 'G' | 'H' | 'I' | 'J' |
'K' | 'L' | 'M' | 'N' | 'O' | 'P' | 'Q' | 'R' | 'S' | 'T' |
'U' | 'V' | 'W' | 'X' | 'Y' | 'Z' .
- lower_case_letter ::=
-
'a' | 'b' | 'c' | 'd' | 'e' | 'f' | 'g' | 'h' | 'i' | 'j' |
'k' | 'l' | 'm' | 'n' | 'o' | 'p' | 'q' | 'r' | 's' | 't' |
'u' | 'v' | 'w' | 'x' | 'y' | 'z' .
- digit ::=
- '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9' .
- underscore ::=
- '_' .
10.2.2 Special identifiers
A special identifier is a sequence of special characters. Examples of special identifiers are:
+ := <= * -> , &
Here is a list of all special characters:
! $ % & * + , - . / : ; < = > ? @ \ ^ ` | ~
Special identifiers may have any length and all characters are significant. The special identifier is terminated with a character which is not a special character. The terminating character is not part of the special identifier.
- Syntax:
- special_identifier ::=
- special_character { special_character } .
- special_character ::=
-
'!' | '$' | '%' | '&' | '*' | '+' | ',' | '-' | '.' | '/' |
':' | ';' | '<' | '=' | '>' | '?' | '@' | '\' | '^' | '`' |
'|' | '~' .
10.2.3 Brackets
A bracket is one of the following characters:
( ) [ ] { }
Note that a bracket consists of only one character. Except for the character sequence (* (which introduces a comment) a bracket is terminated with the next character.
- Syntax:
- bracket ::=
- '(' | ')' | '[' | ']' | '{' | '}' .
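The termination rules for the three identifier types can be modelled with a small scanner. The following is only a sketch in Python, not the Seed7 scanner: the names SPECIAL, TOKEN and identifiers are made up for this illustration, and the sketch ignores comments and literals (so it would wrongly treat (* as a bracket followed by a special identifier).

```python
import re

# The special characters listed in section 10.2.2.
SPECIAL = "!$%&*+,-./:;<=>?@\\^`|~"

# One alternative per identifier type, plus space characters.
TOKEN = re.compile(
    r"(?P<name>[A-Za-z_][A-Za-z0-9_]*)"
    r"|(?P<special>[" + re.escape(SPECIAL) + r"]+)"
    r"|(?P<bracket>[()\[\]{}])"
    r"|(?P<space>[ \t\r\n]+)"
)

def identifiers(text):
    """Split text into (type, identifier) pairs, dropping spaces."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            raise ValueError(f"not an identifier or space at index {pos}")
        if m.lastgroup != "space":
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens

# Name and special identifiers take as many characters as possible,
# a bracket always consists of exactly one character.
print(identifiers("a<-b"))   # "<-" is read as one special identifier
print(identifiers("a< -b"))  # white space separates "<" and "-"
print(identifiers("(())"))   # four separate brackets
```

The second call shows why white space is needed between two special identifiers: without it the scanner has no way to tell where the first one ends.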
10.3 Literals
There are several types of literals: integer, bigInteger, float, string and character literals.
- Syntax:
10.3.1 Integer literals
An integer literal is a sequence of digits which is taken to be decimal. The sequence of digits may be followed by the letter E or e, an optional + sign and a decimal exponent. A negative exponent is not allowed in an integer literal.
Based numbers are written as a decimal base, the # character and a sequence of extended digits. The base must be a number between 2 and 36. As extended digits the letters A or a stand for 10, B or b for 11 and so on up to Z or z, which stand for 35.
- Syntax:
- integer_literal ::=
- decimal_integer [ exponent | based_integer ] .
- decimal_integer ::=
- digit { digit } .
- exponent ::=
- ( 'E' | 'e' ) [ '+' ] decimal_integer .
- based_integer ::=
- '#' extended_digit { extended_digit } .
- extended_digit ::=
- letter | digit .
If an integer literal cannot be read a parsing error is triggered:
*** tst256.sd7(2):10: Integer "12345678901234567890" too big
const integer: tooBig is 12345678901234567890;
---------------------------------------------^
*** tst256.sd7(3):11: Negative exponent in integer literal
const integer: negativeExponent is 1e-1;
-------------------------------------^
*** tst256.sd7(4):12: Digit expected found ";"
const integer: digitExpected is 1e;
----------------------------------^
*** tst256.sd7(5):13: Integer "1E20" too big
const integer: integerWithExponentTooBig is 1e20;
------------------------------------------------^
*** tst256.sd7(6):14: Integer base "37" not between 2 and 36
const integer: baseNotBetween2To36 is 37#0;
----------------------------------------^
*** tst256.sd7(7):15: Extended digit expected found ";"
const integer: extendedDigitExpected is 16#;
-------------------------------------------^
*** tst256.sd7(8):16: Illegal digit "G" in based integer "16#G"
const integer: illegalBasedDigit is 16#G;
----------------------------------------^
*** tst256.sd7(9):17: Based integer "16#ffffffffffffffff" too big
const integer: basedIntegerTooBig is 16#ffffffffffffffff;
--------------------------------------------------------^
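The rules for decimal, exponent and based integer literals can be traced with a short model. This is a sketch in Python, not the Seed7 implementation. The name parse_integer_literal is hypothetical, and the limit INT_MAX assumes 64-bit signed integers, which is suggested by the "too big" errors above but not stated in this chapter. The error texts only imitate the messages shown above.

```python
import re

INT_MAX = 2**63 - 1  # assumption: integers are 64-bit signed
DIGITS = "0123456789abcdefghijklmnopqrstuvwxyz"

# decimal_integer [ exponent | based_integer ]
LITERAL = re.compile(r"([0-9]+)(?:[Ee]\+?([0-9]+)|#([A-Za-z0-9]+))?")

def parse_integer_literal(lit: str) -> int:
    m = LITERAL.fullmatch(lit)
    if m is None:
        raise ValueError(f"not an integer literal: {lit!r}")
    digits, exponent, based = m.groups()
    if based is not None:
        base = int(digits)  # the number before # is the base
        if not 2 <= base <= 36:
            raise ValueError(f'Integer base "{base}" not between 2 and 36')
        value = 0
        for ch in based:
            d = DIGITS.index(ch.lower())  # A/a is 10 ... Z/z is 35
            if d >= base:
                raise ValueError(f'Illegal digit "{ch}" in based integer "{lit}"')
            value = value * base + d
    elif exponent is not None:
        value = int(digits) * 10 ** int(exponent)
    else:
        value = int(digits)
    if value > INT_MAX:
        raise ValueError(f'Integer "{lit}" too big')
    return value

print(parse_integer_literal("1e3"))    # exponent form
print(parse_integer_literal("16#ff"))  # based form
```

A literal such as 1e-1 does not match the pattern at all, because the grammar allows only an optional + sign in the exponent of an integer literal.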
10.3.2 BigInteger literals
A bigInteger literal is a sequence of digits followed by the underscore character ( _ ). The sequence of digits is taken to be decimal. Based numbers are written as a decimal base, the # character, a sequence of extended digits and the underscore character. The same rules as for integer literals apply: the base must be a number between 2 and 36, and as extended digits the letters A or a stand for 10, B or b for 11 and so on up to Z or z, which stand for 35.
- Syntax:
- biginteger_literal ::=
- decimal_integer [ based_integer ] '_' .
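The bigInteger syntax differs from the integer syntax only in the trailing underscore and the missing exponent. As a rough illustration, the following Python sketch (not Seed7 code; parse_biginteger_literal is a made-up name) reads such literals. Python integers have arbitrary size, so no "too big" check is needed in this model.

```python
import re

# decimal_integer [ based_integer ] '_'
BIG_LITERAL = re.compile(r"([0-9]+)(?:#([A-Za-z0-9]+))?_")

def parse_biginteger_literal(lit: str) -> int:
    m = BIG_LITERAL.fullmatch(lit)
    if m is None:
        raise ValueError(f"not a bigInteger literal: {lit!r}")
    digits, based = m.groups()
    if based is None:
        return int(digits)
    base = int(digits)
    if not 2 <= base <= 36:
        raise ValueError(f'Integer base "{base}" not between 2 and 36')
    # int() accepts the letters a-z and A-Z as digits 10 to 35
    # and raises ValueError for a digit that is too large for the base.
    return int(based, base)

print(parse_biginteger_literal("12345678901234567890_"))
print(parse_biginteger_literal("16#ffffffffffffffff_"))
```

Both values in the usage lines are rejected as integer literals in the error listing of section 10.3.1, but they are accepted here because of the trailing underscore.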
10.3.3 Float literals
A float literal consists of two decimal integer literals separated by a decimal point. This basic float literal may be followed by the letter E or e, an optional + or - sign and a decimal exponent.
- Syntax:
- float_literal ::=
- decimal_integer '.' decimal_integer [ float_exponent ] .
- float_exponent ::=
- ( 'E' | 'e' ) [ '+' | '-' ] decimal_integer .
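The float syntax requires digits on both sides of the decimal point. The following Python sketch (an illustration only; FLOAT_LITERAL and is_float_literal are hypothetical names) expresses the grammar as a regular expression and shows which spellings it accepts.

```python
import re

# decimal_integer '.' decimal_integer [ float_exponent ]
FLOAT_LITERAL = re.compile(r"[0-9]+\.[0-9]+(?:[Ee][+-]?[0-9]+)?")

def is_float_literal(text: str) -> bool:
    """True if text as a whole has the form of a float literal."""
    return FLOAT_LITERAL.fullmatch(text) is not None

for candidate in ["3.14", "1.0e-3", "2.5E+10", "1.", ".5", "1e5"]:
    print(candidate, is_float_literal(candidate))
```

According to the grammar, 1. and .5 are not float literals because a decimal integer is missing on one side of the point, and 1e5 has the form of an integer literal instead.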
10.3.4 String literals
A string literal is a sequence of UTF-8 encoded Unicode characters surrounded by double quotes. For example:
"" " " "\"" "'" "\'" "String" "ch=\" " "\n\n" "Euro: \8364;" "\16#ff;"
In order to represent non-printable characters and certain printable characters the following escape sequences may be used.
audible alert     BEL       \a
backspace         BS        \b
escape            ESC       \e
formfeed          FF        \f
newline           NL (LF)   \n
carriage return   CR        \r
horizontal tab    HT        \t
vertical tab      VT        \v
backslash         (\)       \\
apostrophe        (')       \'
double quote      (")       \"
control-A                   \A
...
control-Z                   \Z
Additionally there are the following possibilities:
- Two backslashes with a sequence of blanks, horizontal tabs, carriage returns and new lines between them are completely ignored. The ignored characters are not part of the string. This can be used to continue a string in the following line. Note that in this case the leading spaces in the new line are not part of the string.
- A backslash followed by an integer literal and a semicolon is interpreted as character with the specified ordinal number. Note that the integer literal is interpreted decimal unless it is written as based integer.
Strings are implemented with a length field and UTF-32 encoding. Strings are not terminated with '\0;' and can therefore also contain binary data.
- Syntax:
- string_literal ::=
- '"' { string_literal_element } '"' .
- string_literal_element ::=
- simple_literal_character | escape_sequence | apostrophe .
- simple_literal_character ::=
-
letter | digit | bracket | special_literal_character |
utf8_encoded_character .
- special_literal_character ::=
-
' ' | '!' | '#' | '$' | '%' | '&' | '*' | '+' | ',' | '-' |
'.' | '/' | ':' | ';' | '<' | '=' | '>' | '?' | '@' | '^' |
'_' | '`' | '|' | '~' .
- escape_sequence ::=
-
'\a' | '\b' | '\e' | '\f' | '\n' | '\r' | '\t' | '\v' |
'\\' | '\''' | '\"' | '\' upper_case_letter |
'\' { space } '\' | '\' integer_literal ';' .
- apostrophe ::=
- ''' .
If a string literal cannot be read a parsing error is triggered:
*** tst256.sd7(2):20: Use \" instead of "" to represent " in a string
const string: wrongQuotationRepresentation is "double "" quotations";
-------------------------------------------------------^
*** tst256.sd7(3):21: Illegal string escape "\z"
const string: illegalStringEscape is "\z";
---------------------------------------^
*** tst256.sd7(4):22: Numerical escape sequences should end with ";" not "x"
const string: wrongNumericEscape is "\1234xyz";
------------------------------------------^
*** tst256.sd7(5):23: The numerical escape sequence "\1234678123467892346;" is too big
const string: numericEscapeTooBig is "asd\1234678123467892346;dfdfg";
-------------------------------------------------------------^
*** tst256.sd7(6):24: String continuations should end with "\" not "c"
const string: backslashExpected is "string \ continuation";
--------------------------------------------------^
*** tst256.sd7(7):25: String literal exceeds source line
const string: exceedsSourceLine is "abc
---------------------------------------^
*** tst256.sd7(8):27: Integer literal expected found "1.5"
const string: integerExpected is "\1.5;";
--------------------------------------^
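How the escape sequences of a string literal map to characters can be shown with a small decoder. The following is a sketch in Python, not the Seed7 scanner. The name decode_string_body is hypothetical, the function works on the text between the double quotes, and as a simplification it accepts only decimal and based numbers (no exponent) in numerical escapes.

```python
import re

SIMPLE = {"a": "\a", "b": "\b", "e": "\x1b", "f": "\f", "n": "\n",
          "r": "\r", "t": "\t", "v": "\v", "\\": "\\", "'": "'", '"': '"'}
NUMERIC = re.compile(r"([0-9]+)(?:#([0-9A-Za-z]+))?;")  # \8364; or \16#ff;
CONTINUATION = re.compile(r"[ \t\r\n]+\\")              # spaces up to next \

def decode_string_body(body: str) -> str:
    out = []
    i = 0
    while i < len(body):
        if body[i] != "\\":
            out.append(body[i])
            i += 1
            continue
        nxt = body[i + 1:i + 2]          # character after the backslash
        numeric = NUMERIC.match(body, i + 1)
        continuation = CONTINUATION.match(body, i + 1)
        if nxt in SIMPLE:
            out.append(SIMPLE[nxt])
            i += 2
        elif "A" <= nxt <= "Z":          # \A is control-A (1) ... \Z is 26
            out.append(chr(ord(nxt) - ord("A") + 1))
            i += 2
        elif numeric:
            digits, based = numeric.groups()
            if based is None:
                code = int(digits)
            elif 2 <= int(digits) <= 36:
                code = int(based, int(digits))
            else:
                raise ValueError("base not between 2 and 36")
            if code > 0x10FFFF:
                raise ValueError("numerical escape sequence is too big")
            out.append(chr(code))
            i = numeric.end()
        elif continuation:               # ignored, not part of the string
            i = continuation.end()
        else:
            raise ValueError(f"Illegal string escape at index {i}")
    return "".join(out)

print(decode_string_body("Euro: \\8364;"))
print(decode_string_body("abc\\\n    \\def"))
```

The second call models a string continuation: the backslash at the end of the first line, the line break, the leading spaces and the second backslash all disappear from the result.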
10.3.5 Character literals
A character literal is a UTF-8 encoded Unicode character enclosed in apostrophes. For example:
'a' ' ' '\n' '!' '\\' '2' '"' '\"' '\'' '\8;'
To represent control characters and certain other characters in character literals the same escape sequences as for string literals may be used.
- Syntax:
If a char literal cannot be read a parsing error is triggered:
*** tst256.sd7(2):18: "'" expected found ";"
const char: apostropheExpected is 'x;
------------------------------------^
*** tst256.sd7(3):19: Character literal exceeds source line
const char: charExceeds is '
----------------------------^
10.4 Unicode characters
Seed7 source code may contain UTF-8 encoded Unicode characters. Unicode is allowed in string and char literals. The pragma names can be used to allow Unicode in name identifiers:
$ names unicode;
Comments and line comments may also contain Unicode, but they are not checked for valid UTF-8. This way code parts with invalid UTF-8 can be commented out. Invalid UTF-8 encodings in identifiers and literals trigger a parsing error:
*** err.sd7(90):58: Overlong UTF-8 encoding used for character "\0;" (U+0000)
ignore("\0;");
-----------^
*** err.sd7(91):59: UTF-16 surrogate character found in UTF-8 encoding "\55296;" (U+d800)
ignore("\55296;");
---------------^
*** err.sd7(92):60: Non Unicode character found "\1114112;" (U+110000)
"\1114112;");
----------^
*** err.sd7(93):61: UTF-8 continuation byte expected found "A"
ignore("í\128;A");
--------------^
*** err.sd7(94):62: Unexpected UTF-8 continuation byte found "\128;" (U+0080)
ignore("\128;");
--------^
*** err.sd7(95):63: Solitary UTF-8 start byte found "\237;" (U+00ed)
ignore("íA");
---------^
*** bom16(1):64: UTF-16 byte order mark found "\65279;" (U+feff)
þÿ
-^
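The kinds of invalid encodings listed above can be reproduced with any strict UTF-8 decoder. The following sketch uses the strict UTF-8 decoder of Python for this purpose. It is not how Seed7 performs the check, and the name is_valid_utf8 is made up for the illustration. The byte sequences are hand-built examples of each error class.

```python
def is_valid_utf8(data: bytes) -> bool:
    """True if data is well-formed UTF-8 according to Python's strict decoder."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

samples = {
    "valid text": "Euro: \u20ac".encode("utf-8"),
    "overlong encoding of U+0000": b"\xc0\x80",
    "UTF-16 surrogate U+D800": b"\xed\xa0\x80",
    "beyond U+10FFFF": b"\xf4\x90\x80\x80",
    "continuation byte expected": b"\xedA",
    "unexpected continuation byte": b"\x80",
    "UTF-16 byte order mark": b"\xfe\xff",
}
for description, data in samples.items():
    print(description, is_valid_utf8(data))
```

Only the first sample is accepted. A scanner that skips this check inside comments, as described above, would let the other byte sequences pass when they are commented out.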