Tokenizer

The Tokenizer class assists with the tokenization of PHP source code and the subsequent navigation of those tokens. It provides standardized tokens and does helpful things for you, such as matching braces.

Terminology

Token

The PHP function token_get_all() is the basis for this class.  As such, when a token is returned, it will be an array with a very similar structure to the returned values from token_get_all().  A token is an array with the following keys:

$token = array(
    0 => T_STRING,  // Token constant
    1 => 'doSomething',  // Content of the token
    2 => 27,  // Line number
    4 => 108,  // Matching index, see below
);

Unlike token_get_all(), which returns some tokens as plain strings, this class returns every token as an array.  It also defines new token constants to cover those cases.
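
To illustrate the difference, here is a minimal sketch of how raw token_get_all() output could be normalized so that every token is an array.  The function name normalizeTokens() is hypothetical, and using the character itself as the token type is an assumption made for the example; the class uses its own constants instead.

```php
<?php
// Hypothetical helper for illustration; not part of the Tokenizer class.
function normalizeTokens(string $source): array
{
    $tokens = array();
    $line = 1;
    foreach (token_get_all($source) as $raw) {
        if (is_array($raw)) {
            // Already array(type, content, line).
            $tokens[] = $raw;
            // Remember the line on which this token ends.
            $line = $raw[2] + substr_count($raw[1], "\n");
        } else {
            // Plain string such as '(' or ';': wrap it in an array.
            // Using the character as the type is an assumption of this sketch.
            $tokens[] = array(0 => $raw, 1 => $raw, 2 => $line);
        }
    }
    return $tokens;
}

$tokens = normalizeTokens("<?php\nfoo();\n");
```

After this call every element of $tokens is an array, including the parentheses and the semicolon, each carrying the line number it was found on.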

Matching Index

The matching index will be set only if the token has another token that completes it.  This applies to braces, brackets, parentheses, backticks, and PHP tags.  For instance, an open brace token (T_CURLY_BRACE) would match with a close brace token (T_TOKENIZER_BRACE_RIGHT).  The close brace's index is stored as the open brace's matching index, and vice versa.  This way you can quickly jump to the end of a function or see if there are any parameters in a function call.

If there is an imbalance in the braces, such as with invalid PHP code, the matching index may be set to false.  If the token doesn't support matching, the matching index won't be set in the token's array.
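
The pairing described above can be sketched with a stack.  This is only an illustration under stated assumptions: matchBrackets() is a hypothetical function, it works on raw token_get_all() output, it handles only (), [] and {}, and it ignores the backtick, PHP tag, and string-interpolation cases that the class covers.

```php
<?php
// Hypothetical sketch: pair (), [] and {} in raw token_get_all() output.
// Returns array(index => matching index, or false when unbalanced).
function matchBrackets(array $rawTokens): array
{
    $closerFor = array('(' => ')', '[' => ']', '{' => '}');
    $matches = array();
    $stack = array(); // indexes of openers still waiting for a closer
    foreach ($rawTokens as $index => $token) {
        if (!is_string($token)) {
            continue; // only single-character tokens are considered here
        }
        if (isset($closerFor[$token])) {
            $matches[$index] = false; // unmatched until a closer is found
            $stack[] = $index;
        } elseif (in_array($token, $closerFor, true)) {
            $matches[$index] = false;
            $open = end($stack);
            if ($open !== false && $closerFor[$rawTokens[$open]] === $token) {
                array_pop($stack);
                $matches[$open] = $index;
                $matches[$index] = $open;
            }
        }
    }
    return $matches;
}

$balanced = matchBrackets(token_get_all('<?php foo(1);'));
$unbalanced = matchBrackets(token_get_all('<?php foo(;'));
```

In the balanced case the two parentheses point at each other; in the unbalanced case the lone open parenthesis keeps false, mirroring the behavior described for invalid code.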

ArrayAccess and Iterator

The class implements ArrayAccess and Iterator so you can access tokens directly as though the object were an array.  This means $tokenizer[28] and foreach ($tokenizer as $token) will both work.
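
For readers unfamiliar with these two interfaces, the following is a minimal read-only sketch of the pattern.  TokenListSketch is a hypothetical stand-in, not the Tokenizer class itself, and the typed method signatures assume PHP 8.0 or later.

```php
<?php
// Hypothetical stand-in for illustration; the real Tokenizer has more methods.
class TokenListSketch implements ArrayAccess, Iterator
{
    private array $tokens;
    private int $position = 0;

    public function __construct(array $tokens)
    {
        $this->tokens = array_values($tokens);
    }

    // ArrayAccess: reads are allowed, writes are rejected.
    public function offsetExists(mixed $offset): bool
    {
        return isset($this->tokens[$offset]);
    }

    public function offsetGet(mixed $offset): mixed
    {
        return $this->tokens[$offset] ?? null;
    }

    public function offsetSet(mixed $offset, mixed $value): void
    {
        throw new ErrorException('Tokens are read-only');
    }

    public function offsetUnset(mixed $offset): void
    {
        throw new ErrorException('Tokens are read-only');
    }

    // Iterator: walk the list from the first token to the last.
    public function current(): mixed
    {
        return $this->tokens[$this->position] ?? null;
    }

    public function key(): mixed
    {
        return $this->position;
    }

    public function next(): void
    {
        $this->position++;
    }

    public function rewind(): void
    {
        $this->position = 0;
    }

    public function valid(): bool
    {
        return isset($this->tokens[$this->position]);
    }
}

$list = new TokenListSketch(token_get_all('<?php echo 1;'));
$second = $list[1];              // goes through offsetGet()
foreach ($list as $index => $token) {
    // $index comes from key(), $token from current()
}
```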

Public Methods

current()

Returns the current token array, which is what foreach ($tokenizer as $token) receives.  Implemented primarily for the Iterator interface.

findTokens($tokenList, $callback = null)

$tokenList can be a single token constant or an array of token constants.  The Tokenizer object scans all of the tokens and builds an array of every token whose type is in the list.  If $callback is defined, it is called for each matching token with the token array as its only parameter.  After iterating over the tokens, the method returns the array of matching token arrays to the caller.
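
The behavior described above could look roughly like the following free function.  findTokensSketch() is a hypothetical name, it takes the token list as an explicit first argument because it is not a method, and calling the callback once per matching token is an assumption of this sketch.

```php
<?php
// Hypothetical free function mirroring the description of findTokens().
function findTokensSketch(array $tokens, $tokenList, ?callable $callback = null): array
{
    $wanted = (array) $tokenList; // accept one constant or a list of them
    $found = array();
    foreach ($tokens as $token) {
        if (is_array($token) && in_array($token[0], $wanted, true)) {
            $found[] = $token;
            if ($callback !== null) {
                $callback($token); // assumption: called once per match
            }
        }
    }
    return $found;
}

$tokens = token_get_all('<?php $a = $b + 1;');
$names = array();
$variables = findTokensSketch($tokens, T_VARIABLE, function (array $token) use (&$names) {
    $names[] = $token[1];
});
```

Here $variables holds the two T_VARIABLE token arrays and the callback has collected their text into $names.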

getFilename()

Returns the name of the file that was tokenized.  Only useful if you use Tokenizer::tokenizeFile().  Will return null if you used Tokenizer::tokenizeString().

getLine($token)

Returns the line number of the passed-in token.  Created so you don't need to remember which index of the token array is the line number.

getMatch($token)

Returns the index of the token that matches this one.  If the token has no match, this returns null.  See "Matching Index" above.

getName($tokenConstant) or getName($token)

Returns the string name of the token constant, or of the token type stored in the token's array.  It returns the string "T_DOUBLE_COLON" instead of "T_PAAMAYIM_NEKUDOTAYIM" because that's the only Hebrew token in a sea of English tokens, so the names might as well all be in one language.

$string = $tokenizer->getName(T_STRING);
// $string is now 'T_STRING'
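
As a rough sketch of how such a lookup could be built on PHP's own token_name() function, the hypothetical helper below accepts either a token constant or a token array and applies the same renaming.  It is an illustration only and does not know about the custom constants the class defines.

```php
<?php
// Hypothetical helper built on PHP's token_name(); not the class's own code.
function tokenNameSketch($tokenOrConstant): string
{
    // A token array carries its constant in slot 0.
    $constant = is_array($tokenOrConstant) ? $tokenOrConstant[0] : $tokenOrConstant;
    $name = token_name($constant);
    // Normalize the Hebrew name to its English alias.
    return $name === 'T_PAAMAYIM_NEKUDOTAYIM' ? 'T_DOUBLE_COLON' : $name;
}

$fromConstant = tokenNameSketch(T_STRING);
$fromToken = tokenNameSketch(array(T_DOUBLE_COLON, '::', 1));
```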

getNameAt($index)

Gets the token at the specified index and returns the result of getName() on that token.

getNameRelative($offset)

Gets the token at the given offset relative to the current position and returns the result of getName() on that token.

getNextImportantToken()

Increments the current position and returns the next token that Tokenizer::isImportant() says is important.  Returns the token array if one was found, null otherwise.

getNextToken()

Increments the current position and returns the token array for the next token.

getPreviousImportantToken()

Decrements the current position until no more tokens are available or Tokenizer::isImportant() says the token is important.  Returns the token array if one was found, null otherwise.

getPreviousToken()

Decrements the current position and returns the token array for the previous token.

getTokenAt($index)

Retrieves the token array at a specific index.  It's the same as using $tokenizer[$index] (via ArrayAccess).

getTokenRelative($offset)

Retrieves a token array at an offset from the current token.

key()

Returns the current index.  Used for the Iterator interface.

static initialize()

Called automatically by loading the class.  Only documented here so that people don't call it themselves.  Ensures that PHP has the tokenizer extension and defines custom token names.

isImportant($token = null)

Returns true if the specified token is not whitespace, a comment, or a doc comment.  If a token is not specified, it uses the current token.
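
The importance test, and the forward scan that getNextImportantToken() is described as doing, can be sketched as follows.  Both function names are hypothetical, and the sketch works on raw token_get_all() output with an explicit position instead of the object's internal pointer.

```php
<?php
// Hypothetical sketch of the importance test on raw token_get_all() output.
function isImportantSketch($token): bool
{
    $skipped = array(T_WHITESPACE, T_COMMENT, T_DOC_COMMENT);
    // Plain-string tokens such as ';' are always important.
    return !(is_array($token) && in_array($token[0], $skipped, true));
}

// Returns the index of the next important token after $position, or null.
function nextImportantIndex(array $tokens, int $position): ?int
{
    for ($i = $position + 1; $i < count($tokens); $i++) {
        if (isImportantSketch($tokens[$i])) {
            return $i;
        }
    }
    return null;
}

$tokens = token_get_all('<?php /* note */ echo 1;');
$index = nextImportantIndex($tokens, 0); // skips the comment and whitespace
```

Scanning backwards, as getPreviousImportantToken() is described as doing, would be the same loop with the index decreasing.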

isValid()

Returns a boolean indicating if we think the code looks valid.  Right now it only checks if the braces match, but the class could be extended to make sure that tokens are not in invalid locations, similar to a lint check by PHP.

next()

Moves the current index to the next token.  Implemented for the Iterator interface.

offsetExists($offset)

Returns true if the specified index exists.  Implemented for the ArrayAccess interface.  Note that this is an index, not a relative offset value.

offsetGet($offset)

Returns the token array at the given index or null if it does not exist.  Implemented for the ArrayAccess interface.  Note that this is an index, not a relative offset value.

offsetSet($offset, $value)

Throws an ErrorException because changing tokens is not allowed.  Implemented for the ArrayAccess interface.

offsetUnset($offset)

Throws an ErrorException because changing tokens is not allowed.  Implemented for the ArrayAccess interface.

rewind()

Resets the internal pointer to the beginning.  Implemented for the Iterator interface.

static tokenizeFile($filename)

Reads a given file, tokenizes it, and returns a Tokenizer instance.

static tokenizeString($string)

Tokenizes the given string and returns a Tokenizer instance.

valid()

Returns true if there is a token at the current index.  Implemented for the Iterator interface.