-
Notifications
You must be signed in to change notification settings - Fork 12.6k
Codebase Compiler Scanner
One of the smallest parts of the compiler's critical path to AST then type-checking. It exists to create a stream of syntax tokens for use by another object. The scanner gets most of its usage via a parser instance.
You create a Scanner instance via createScanner
, there are two main modes of scanning: Standard and JSX.
Then there are specific functions for scanning inside JSDoc comments.
First you set the text of a scanner to be your JS source code via setText
, this takes the source code and
an optional start and end range in which you want to scan. This range is an important part of what keeps
TypeScript's server mode fast via incremental parsing.
At a high level, the scanner works by a having another object call scanner.scan
this:
-
Pulls out the character at the current position (
pos
) in the text -
Runs a very big switch statement against known characters which start syntax
-
Then move look forwards till the end of that list of chars to decode if it's what we think it is. here's an example of what happens when the character found is a
!
:case CharacterCodes.exclamation: // Look to see if it's "!=" if (text.charCodeAt(pos + 1) === CharacterCodes.equals) { // Also check if it's "!==" if (text.charCodeAt(pos + 2) === CharacterCodes.equals) { // for !== move the position forwards to the end of the symbol // then set the token return pos += 3, token = SyntaxKind.ExclamationEqualsEqualsToken; } // Move it two, set the token return pos += 2, token = SyntaxKind.ExclamationEqualsToken; } // Move forwards one character and set the token pos++; return token = SyntaxKind.ExclamationToken;
-
A
SyntaxKind
is returned fromscan
and it's up to the scanner owner to do work with those tokens. The scanner keeps track of a few useful values for that:-
getStartPos
- where the token was started including preceding whitespace -
getTextPos
- the end position of the current token -
getTokenText
- the text between the token start and end -
getTokenValue
- some syntax contains a value which can be represented as a string, a good example is literally a string. The"
or'
are not included in the value.
-
Tokens themselves have what we call a "full start" and a "token start". The "token start" is the more natural version, which is the position in the file where the text of a token begins. The "full start" is the point at which the scanner began scanning since the last significant token. When concerned with trivia, we are often more concerned with the full start.
Function | Description |
---|---|
ts.Node.getStart |
Gets the position in text where the first token of a node started. |
ts.Node.getFullStart |
Gets the position of the "full start" of the first token owned by the node. |
When creating a scanner you get to choose whether whitespace should be returned in the stream of tokens. This is nearly always off, but it is used inside the formatter and for syntax highlighting via the TSServer via a classifier.
Syntax trivia represent the parts of the source text that are largely insignificant for normal understanding of the code, such as whitespace, comments, and even conflict markers.
Because trivia are not part of the normal language syntax (barring ECMAScript ASI rules) and can appear anywhere between any two tokens, they are not included in the syntax tree. Yet, because they are important when implementing a feature like refactoring and to maintain full fidelity with the source text, they are still accessible through our APIs on demand.
Because the EndOfFileToken
can have nothing following it (neither token nor trivia), all trivia naturally precedes some non-trivia token, and resides between that token's "full start" and the "token start"
It is a convenient notion to state that a comment "belongs" to a Node
in a more natural manner though. For instance, it might be visually clear that the genie
function declaration owns the last two comments in the following example:
var x = 10; // This is x.
/**
* Postcondition: Grants all three wishes.
*/
function genie([wish1, wish2, wish3]: [Wish, Wish, Wish]) {
while (true) {
}
} // End function
This is despite the fact that the function declaration's full start occurs directly after var x = 10;
.
We follow Roslyn's notion of trivia ownership for comment ownership. In general, a token owns any trivia after it on the same line up to the next token. Any comment after that line is associated with the following token. The first token in the source file gets all the initial trivia, and the last sequence of trivia in the file is tacked onto the end-of-file token, which otherwise has zero width.
For most basic uses, comments are the "interesting" trivia. The comments that belong to a Node which can be fetched through the following functions:
Function | Description |
---|---|
ts.getLeadingCommentRanges |
Given the source text and position within that text, returns ranges of comments between the first line break following the given position and the token itself (probably most useful with ts.Node.getFullStart ). |
ts.getTrailingCommentRanges |
Given the source text and position within that text, returns ranges of comments until the first line break following the given position (probably most useful with ts.Node.getEnd ). |
As an example, imagine this portion of a source file:
debugger;/*hello*/
//bye
/*hi*/ function
The full start for the function
keyword begins at the /*hello*/
comment, but getLeadingCommentRanges
will only return the last 2 comments:
d e b u g g e r ; / * h e l l o * / _ _ _ _ _ [CR] [NL] _ _ _ _ / / b y e [CR] [NL] _ _ / * h i * / _ _ _ _ f u n c t i o n
↑ ↑ ↑ ↑ ↑
full start look for first comment second comment token start
leading comments
starting here
Appropriately, calling getTrailingCommentRanges
on the end of the debugger statement will extract the /*hello*/
comment.
In the event that you are concerned with richer information of the token stream, createScanner
also has a skipTrivia
flag which you can set to false
, and use setText
/setTextPos
to scan at different points in a file.
Some of the more complicated aspects of JSX support is mostly handled back in the parser, however JSX support in the scanner uses specific syntax tokens.
One way for the scanner to keep track of scan issues, or internal state is via TokenFlags
. Any example of
this is in scanning a number. TypeScript supports underscores in numbers 100_000
, when scanning a number literal
if it detects a CharacterCodes._
then the flag TokenFlags.ContainsSeparator
is set and later on that is used
to ensure the tokenValue
is set correctly.
Because the scanner is only interested in passing out tokens as it sees them, it doesn't really have a memory of previous tokens. This means that occasionally the controlling object will need to rewind and re-run the scanner with a different type of context. This is called rescanning.
Here's a scanner playground - adding TypeScript will show you the tokens generated by a single scanner. It's worth noting that this doesn't represent that actual results of the scanner when using TypeScript, because the parser controls re-scanning and this playground doesn't do that.
News
Debugging TypeScript
- Performance
- Performance-Tracing
- Debugging-Language-Service-in-VS-Code
- Getting-logs-from-TS-Server-in-VS-Code
- JavaScript-Language-Service-in-Visual-Studio
- Providing-Visual-Studio-Repro-Steps
Contributing to TypeScript
- Contributing to TypeScript
- TypeScript Design Goals
- Coding Guidelines
- Useful Links for TypeScript Issue Management
- Writing Good Design Proposals
- Compiler Repo Notes
- Deployment
Building Tools for TypeScript
- Architectural Overview
- Using the Compiler API
- Using the Language Service API
- Standalone Server (tsserver)
- TypeScript MSBuild In Depth
- Debugging Language Service in VS Code
- Writing a Language Service Plugin
- Docker Quickstart
FAQs
The Main Repo