Lexer (JTidy r938)

java.lang.Object
- org.w3c.tidy.Lexer

```
public class Lexer
extends java.lang.Object
```
Lexer for html parser.
Given a file stream fp, it returns a sequence of tokens.

- GetToken(fp) gets the next token.
- UngetToken(fp) provides one level of undo.

The tags include an attribute list:

- a linked list of attribute/value nodes
- each node has 2 null-terminated strings
- entities are replaced in attribute values

White space is compacted if not in preformatted mode. If not in preformatted mode, leading white space is discarded and subsequent white space sequences are compacted to single space chars.

If XmlTags is no, tag names are folded to upper case and attribute names to lower case.

Not yet done:

- Doctype subset and marked sections
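
The white space rule described above (outside preformatted mode) can be illustrated with a small standalone sketch. It does not use JTidy and is not the lexer's actual code; the class and method names are hypothetical, and the `wasWhite` flag only mirrors the role the `waswhite` field appears to play.

```java
// Standalone sketch (not JTidy code): collapse white space the way the
// class description outlines for text that is not preformatted.
final class WhitespaceSketch {

    /** Drops leading white space and folds each later run into one space. */
    static String compact(String text) {
        StringBuilder out = new StringBuilder();
        boolean wasWhite = true; // start "in white space" so a leading run is dropped
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == ' ' || c == '\t' || c == '\n' || c == '\r' || c == '\f') {
                if (!wasWhite) {
                    out.append(' '); // first white space char of a run becomes one space
                }
                wasWhite = true;
            } else {
                out.append(c);
                wasWhite = false;
            }
        }
        return out.toString();
    }
}
```

Note that a trailing run is still kept as a single space in this sketch; whether that space survives is decided later, when the surrounding tags are known.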

Version:

$Revision: 927 $ ($Author: aditsu $)

Author:

Dave Raggett dsr@w3.org , Andy Quick ac.quick@sympatico.ca (translation to Java), Fabrizio Giustina

Field Summary

Fields
Modifier and Type	Field and Description
`protected short`	`badAccess` for accessibility errors.
`protected short`	`badChars` for bad char encodings.
`protected boolean`	`badDoctype` set if html or PUBLIC is missing.
`protected short`	`badForm` for mismatched/mispositioned form tags.
`protected short`	`badLayout` for bad style errors.
`protected int`	`columns` at start of current token.
`protected Configuration`	`configuration` configuration.
`protected int`	`doctype` version as given by doctype (if any).
`protected short`	`errors` count of errors.
`protected java.io.PrintWriter`	`errout` error output stream.
`protected boolean`	`excludeBlocks` Netscape compatibility.
`protected boolean`	`exiled` true if moved out of table.
`static short`	`IGNORE_MARKUP` state: ignore markup.
`static short`	`IGNORE_WHITESPACE` state: ignore whitespace.
`protected StreamIn`	`in` file stream.
`protected Node`	`inode` Inline stack for compatibility with Mosaic.
`protected int`	`insert` for inferring inline tags.
`protected boolean`	`insertspace` when space is moved after end tag.
`protected java.util.Stack`	`istack` stack.
`protected int`	`istackbase` start of frame.
`protected boolean`	`isvoyager` true if xmlns attribute on html element.
`protected byte[]`	`lexbuf` Lexer character buffer. Parse tree nodes span onto this buffer, which contains the concatenated text contents of all of the elements.
`protected int`	`lexlength` allocated.
`protected int`	`lexsize` used.
`protected int`	`lines` lines seen.
`static short`	`MIXED_CONTENT` state: mixed content.
`static short`	`PREFORMATTED` state: preformatted.
`protected boolean`	`pushed` true after token has been pushed back.
`protected Report`	`report` report.
`protected Node`	`root` Root node is saved here.
`protected boolean`	`seenEndBody` already seen end body tag?
`protected boolean`	`seenEndHtml` already seen end html tag?
`protected short`	`state` state of lexer's finite state machine.
`protected Style`	`styles` used for cleaning up presentation markup.
`protected Node`	`token` current node.
`protected int`	`txtend` end of current node.
`protected int`	`txtstart` start of current node.
`protected short`	`versions` bit vector of HTML versions.
`protected short`	`warnings` count of warnings in this document.
`protected boolean`	`waswhite` used to collapse contiguous white space.

Constructor Summary

Constructors
Constructor and Description
`Lexer(StreamIn in, Configuration configuration, Report report)` Instantiates a new Lexer.

Method Summary

Methods
Modifier and Type	Method and Description
`void`	`addByte(int c)` Adds a byte to lexer buffer.
`void`	`addCharToLexer(int c)` Store char c as UTF-8 encoded byte stream.
`boolean`	`addGenerator(Node root)` Add meta element for Tidy.
`void`	`addStringLiteral(java.lang.String str)` calls addCharToLexer for any char in the string.
`void`	`addStringToLexer(java.lang.String str)` Adds a string to lexer buffer.
`short`	`apparentVersion()` Return the html version used in document.
`boolean`	`canPrune(Node element)` Can the given element be removed?
`void`	`changeChar(byte c)` Substitute the last char in buffer.
`boolean`	`checkDocTypeKeyWords(Node doctype)` Check system keywords (keywords should be uppercase).
`AttVal`	`cloneAttributes(AttVal attrs)` Clones an attribute value and adds any ASP or PHP node to the node list.
`Node`	`cloneNode(Node node)` Clones a node and adds it to the node list.
`void`	`deferDup()` Defer duplicates when entering a table or other element where the inlines shouldn't be duplicated.
`boolean`	`endOfInput()` Has end of input stream been reached?
`short`	`findGivenVersion(Node doctype)` Examine DOCTYPE to identify version.
`Node`	`findLastLI(Node list)`
`boolean`	`fixDocType(Node root)` Fixup doctype if missing.
`void`	`fixHTMLNameSpace(Node root, java.lang.String profile)` Fix xhtml namespace.
`void`	`fixId(Node node)` duplicate name attribute as an id and check if id and name match.
`boolean`	`fixXmlDecl(Node root)` Ensure XML document starts with `<?xml version="1.0"?>`.
`Node`	`getCDATA(Node container)` Create a text node for the contents of a CDATA element like style or script which ends with </foo> for some foo.
`Node`	`getToken(short mode)` Gets a token.
`short`	`htmlVersion()` Choose what version to use for new doctype.
`java.lang.String`	`htmlVersionName()` Choose what version to use for new doctype.
`Node`	`inferredTag(java.lang.String name)` Generates and inserts a new node.
`int`	`inlineDup(Node node)` This has the effect of inserting "missing" inline elements around the contents of blocklevel elements such as P, TD, TH, DIV, PRE etc.
`boolean`	`inlineDup1(Node node, Node element)`
`Node`	`insertedToken()`
`static boolean`	`isCSS1Selector(java.lang.String buf)` In CSS1, selectors can contain only the characters A-Z, 0-9, and Unicode characters 161-255, plus dash (-); they cannot start with a dash or a digit; they can also contain escaped characters and any Unicode character as a numeric code (see next item).
`boolean`	`isPushed(Node node)` Is the node in the stack?
`boolean`	`isPushedLast(Node element, Node node)`
`static boolean`	`isValidAttrName(java.lang.String attr)` Check if attr is a valid name.
`Node`	`newLineNode()` Adds a new line node.
`Node`	`newNode()` Creates a new node and adds it to the node list.
`Node`	`newNode(short type, byte[] textarray, int start, int end)` Creates a new node and adds it to the node list.
`Node`	`newNode(short type, byte[] textarray, int start, int end, java.lang.String element)` Creates a new node and adds it to the node list.
`Node`	`parseAsp()` Parser for ASP within start tags. Some people use ASP to customize attributes. Tidy isn't really well suited to dealing with ASP. This is a workaround for attributes, but it won't deal with the case where the ASP is used to tailor the attribute value.
`java.lang.String`	`parseAttribute(boolean[] isempty, Node[] asp, Node[] php)` consumes the '>' terminating start tags.
`AttVal`	`parseAttrs(boolean[] isempty)` Parse tag attributes.
`void`	`parseEntity(short mode)` Parse an html entity.
`Node`	`parsePhp()` PHP is like ASP but is based upon XML processing instructions, e.g.
`int`	`parseServerInstruction()` Invoked when < is seen in place of an attribute value, but terminates on whitespace if not ASP, PHP or Tango. This routine recognizes ' and " quoted strings.
`char`	`parseTagName()` Parses a tag name.
`java.lang.String`	`parseValue(java.lang.String name, boolean foldCase, boolean[] isempty, int[] pdelim)` Parse an attribute value.
`void`	`popInline(Node node)` Pop a copy of an inline node from the stack.
`protected boolean`	`preContent(Node node)` Is content acceptable for pre elements?
`void`	`pushInline(Node node)` Push a copy of an inline node onto the stack, but don't push if implicit or OBJECT or APPLET (implicit tags are ones generated from the istack). One issue arises with pushing inlines when the tag is already pushed.
`boolean`	`setXHTMLDocType(Node root)` Adds a new xhtml doctype to the document.
`boolean`	`switchInline(Node element, Node node)`
`void`	`ungetToken()`
`protected void`	`updateNodeTextArrays(byte[] oldtextarray, byte[] newtextarray)` Update `oldtextarray` in the current nodes.

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

- Field Detail
  - IGNORE_WHITESPACE
```
public static final short IGNORE_WHITESPACE
```
    state: ignore whitespace.
    
    See Also:
    Constant Field Values
  - MIXED_CONTENT
```
public static final short MIXED_CONTENT
```
    state: mixed content.
    
    See Also:
    Constant Field Values
  - PREFORMATTED
```
public static final short PREFORMATTED
```
    state: preformatted.
    
    See Also:
    Constant Field Values
  - IGNORE_MARKUP
```
public static final short IGNORE_MARKUP
```
    state: ignore markup.
    
    See Also:
    Constant Field Values
  - in
```
protected StreamIn in
```
    file stream.
  - errout
```
protected java.io.PrintWriter errout
```
    error output stream.
  - badAccess
```
protected short badAccess
```
    for accessibility errors.
  - badLayout
```
protected short badLayout
```
    for bad style errors.
  - badChars
```
protected short badChars
```
    for bad char encodings.
  - badForm
```
protected short badForm
```
    for mismatched/mispositioned form tags.
  - warnings
```
protected short warnings
```
    count of warnings in this document.
  - errors
```
protected short errors
```
    count of errors.
  - lines
```
protected int lines
```
    lines seen.
  - columns
```
protected int columns
```
    at start of current token.
  - waswhite
```
protected boolean waswhite
```
    used to collapse contiguous white space.
  - pushed
```
protected boolean pushed
```
    true after token has been pushed back.
  - insertspace
```
protected boolean insertspace
```
    when space is moved after end tag.
  - excludeBlocks
```
protected boolean excludeBlocks
```
    Netscape compatibility.
  - exiled
```
protected boolean exiled
```
    true if moved out of table.
  - isvoyager
```
protected boolean isvoyager
```
    true if xmlns attribute on html element.
  - versions
```
protected short versions
```
    bit vector of HTML versions.
  - doctype
```
protected int doctype
```
    version as given by doctype (if any).
  - badDoctype
```
protected boolean badDoctype
```
    set if html or PUBLIC is missing.
  - txtstart
```
protected int txtstart
```
    start of current node.
  - txtend
```
protected int txtend
```
    end of current node.
  - state
```
protected short state
```
    state of lexer's finite state machine.
  - token
```
protected Node token
```
    current node.
  - lexbuf
```
protected byte[] lexbuf
```
    Lexer character buffer. Parse tree nodes span onto this buffer, which contains the concatenated text contents of all of the elements. Lexsize must be reset for each file. Byte buffer of UTF-8 chars.
  - lexlength
```
protected int lexlength
```
    allocated.
  - lexsize
```
protected int lexsize
```
    used.
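
    The relationship between lexbuf, lexlength and lexsize can be pictured with a standalone sketch. It is not JTidy code: the field names mirror the ones documented here, but the doubling growth policy and the addText/textOf helpers are assumptions made for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Standalone sketch (not JTidy code). Field names follow the documented
// fields; the growth policy and helper methods are assumptions.
final class LexBufferSketch {
    byte[] lexbuf = new byte[8]; // shared text buffer for all nodes
    int lexlength = 8;           // bytes allocated
    int lexsize = 0;             // bytes used

    /** Appends one byte, doubling the buffer when it is full. */
    void addByte(int c) {
        if (lexsize + 1 > lexlength) {
            lexlength *= 2;
            lexbuf = Arrays.copyOf(lexbuf, lexlength);
        }
        lexbuf[lexsize++] = (byte) c;
    }

    /** Appends text as UTF-8 and returns the {start, end} span it occupies. */
    int[] addText(String text) {
        int start = lexsize;
        for (byte b : text.getBytes(StandardCharsets.UTF_8)) {
            addByte(b);
        }
        return new int[] {start, lexsize};
    }

    /** Decodes the bytes of one span back into a string. */
    String textOf(int[] span) {
        return new String(lexbuf, span[0], span[1] - span[0], StandardCharsets.UTF_8);
    }
}
```

    In this sketch a node would only need to remember its span, much as txtstart and txtend mark the current node. Because growing replaces the array, anything holding a reference to the old array has to be updated, which is presumably the job of updateNodeTextArrays.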
  - inode
```
protected Node inode
```
    Inline stack for compatibility with Mosaic. For deferring text node.
  - insert
```
protected int insert
```
    for inferring inline tags.
  - istack
```
protected java.util.Stack istack
```
    stack.
  - istackbase
```
protected int istackbase
```
    start of frame.
  - styles
```
protected Style styles
```
    used for cleaning up presentation markup.
  - configuration
```
protected Configuration configuration
```
    configuration.
  - seenEndBody
```
protected boolean seenEndBody
```
    already seen end body tag?
  - seenEndHtml
```
protected boolean seenEndHtml
```
    already seen end html tag?
  - report
```
protected Report report
```
    report.
  - root
```
protected Node root
```
    Root node is saved here.
- Constructor Detail
  - Lexer
```
public Lexer(StreamIn in,
     Configuration configuration,
     Report report)
```
    Instantiates a new Lexer.
    
    Parameters:
    in - StreamIn
    configuration - configuration instance
    report - report instance, for reporting errors
- Method Detail
  - newNode
```
public Node newNode()
```
    Creates a new node and adds it to the node list.
    
    Returns:
    Node
  - newNode
```
public Node newNode(short type,
           byte[] textarray,
           int start,
           int end)
```
    Creates a new node and adds it to the node list.
    
    Parameters:
    type - node type: Node.ROOT_NODE | Node.DOCTYPE_TAG | Node.COMMENT_TAG | Node.PROC_INS_TAG | Node.TEXT_NODE | Node.START_TAG | Node.END_TAG | Node.START_END_TAG | Node.CDATA_TAG | Node.SECTION_TAG | Node.ASP_TAG | Node.JSTE_TAG | Node.PHP_TAG | Node.XML_DECL
    textarray - array of bytes contained in the Node
    start - start position
    end - end position
    
    Returns:
    Node
  - newNode
```
public Node newNode(short type,
           byte[] textarray,
           int start,
           int end,
           java.lang.String element)
```
    Creates a new node and adds it to the node list.
    
    Parameters:
    type - node type: Node.ROOT_NODE | Node.DOCTYPE_TAG | Node.COMMENT_TAG | Node.PROC_INS_TAG | Node.TEXT_NODE | Node.START_TAG | Node.END_TAG | Node.START_END_TAG | Node.CDATA_TAG | Node.SECTION_TAG | Node.ASP_TAG | Node.JSTE_TAG | Node.PHP_TAG | Node.XML_DECL
    textarray - array of bytes contained in the Node
    start - start position
    end - end position
    element - tag name
    
    Returns:
    Node
  - cloneNode
```
public Node cloneNode(Node node)
```
    Clones a node and adds it to the node list.
    
    Parameters:
    node - Node
    
    Returns:
    cloned Node
  - cloneAttributes
```
public AttVal cloneAttributes(AttVal attrs)
```
    Clones an attribute value and adds any ASP or PHP node to the node list.
    
    Parameters:
    attrs - original AttVal
    
    Returns:
    cloned AttVal
  - updateNodeTextArrays
```
protected void updateNodeTextArrays(byte[] oldtextarray,
                        byte[] newtextarray)
```
    Update oldtextarray in the current nodes.
    
    Parameters:
    oldtextarray - previous text array
    newtextarray - new text array
  - newLineNode
```
public Node newLineNode()
```
    Adds a new line node. Used for creating preformatted text from Word2000.
    
    Returns:
    new line node
  - endOfInput
```
public boolean endOfInput()
```
    Has end of input stream been reached?
    
    Returns:
    true if the end of the input stream has been reached
  - addByte
```
public void addByte(int c)
```
    Adds a byte to lexer buffer.
    
    Parameters:
    c - byte to add
  - changeChar
```
public void changeChar(byte c)
```
    Substitute the last char in buffer.
    
    Parameters:
    c - new char
  - addCharToLexer
```
public void addCharToLexer(int c)
```
    Store char c as UTF-8 encoded byte stream.
    
    Parameters:
    c - char to store
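
    Storing a char as a UTF-8 byte stream follows the standard UTF-8 bit layout. The sketch below is standalone and is not taken from JTidy; the class and method names are hypothetical, and it writes to a ByteArrayOutputStream rather than to the lexer buffer. It performs no validation of surrogates or out-of-range values.

```java
import java.io.ByteArrayOutputStream;

// Standalone sketch (not JTidy code): standard UTF-8 encoding of one code point.
// No validation is done for surrogates or values above 0x10FFFF.
final class Utf8Sketch {

    /** Writes code point c to out as one to four UTF-8 bytes. */
    static void addChar(ByteArrayOutputStream out, int c) {
        if (c < 0x80) {                 // 0xxxxxxx
            out.write(c);
        } else if (c < 0x800) {         // 110xxxxx 10xxxxxx
            out.write(0xC0 | (c >> 6));
            out.write(0x80 | (c & 0x3F));
        } else if (c < 0x10000) {       // 1110xxxx 10xxxxxx 10xxxxxx
            out.write(0xE0 | (c >> 12));
            out.write(0x80 | ((c >> 6) & 0x3F));
            out.write(0x80 | (c & 0x3F));
        } else {                        // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
            out.write(0xF0 | (c >> 18));
            out.write(0x80 | ((c >> 12) & 0x3F));
            out.write(0x80 | ((c >> 6) & 0x3F));
            out.write(0x80 | (c & 0x3F));
        }
    }

    /** Convenience wrapper returning the encoded bytes of a single code point. */
    static byte[] encode(int c) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        addChar(out, c);
        return out.toByteArray();
    }
}
```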
  - addStringToLexer
```
public void addStringToLexer(java.lang.String str)
```
    Adds a string to lexer buffer.
    
    Parameters:
    str - String to add
  - parseEntity
```
public void parseEntity(short mode)
```
    Parse an html entity.
    
    Parameters:
    mode - mode
  - parseTagName
```
public char parseTagName()
```
    Parses a tag name.
    
    Returns:
    first char after the tag name
  - addStringLiteral
```
public void addStringLiteral(java.lang.String str)
```
    calls addCharToLexer for any char in the string.
    
    Parameters:
    str - input String
  - htmlVersion
```
public short htmlVersion()
```
    Choose what version to use for new doctype.
    
    Returns:
    html version constant
  - htmlVersionName
```
public java.lang.String htmlVersionName()
```
    Choose what version to use for new doctype.
    
    Returns:
    html version name
  - addGenerator
```
public boolean addGenerator(Node root)
```
    Add meta element for Tidy. If the meta tag is already present, update release date.
    
    Parameters:
    root - root node
    
    Returns:
    true if the tag has been added
  - checkDocTypeKeyWords
```
public boolean checkDocTypeKeyWords(Node doctype)
```
    Check system keywords (keywords should be uppercase).
    
    Parameters:
    doctype - doctype node
    
    Returns:
    true if doctype keywords are all uppercase
  - findGivenVersion
```
public short findGivenVersion(Node doctype)
```
    Examine DOCTYPE to identify version.
    
    Parameters:
    doctype - doctype node
    
    Returns:
    version code
  - fixHTMLNameSpace
```
public void fixHTMLNameSpace(Node root,
                    java.lang.String profile)
```
    Fix xhtml namespace.
    
    Parameters:
    root - root Node
    profile - current profile
  - setXHTMLDocType
```
public boolean setXHTMLDocType(Node root)
```
    Adds a new xhtml doctype to the document.
    
    Parameters:
    root - root node
    
    Returns:
    true if a doctype has been added
  - apparentVersion
```
public short apparentVersion()
```
    Return the html version used in document.
    
    Returns:
    version code
  - fixDocType
```
public boolean fixDocType(Node root)
```
    Fixup doctype if missing.
    
    Parameters:
    root - root node
    
    Returns:
    false if current version has not been identified
  - fixXmlDecl
```
public boolean fixXmlDecl(Node root)
```
    Ensure XML document starts with <?xml version="1.0"?>. Add encoding attribute if not using ASCII or UTF-8 output.
    
    Parameters:
    root - root node
    
    Returns:
    always true
  - inferredTag
```
public Node inferredTag(java.lang.String name)
```
    Generates and inserts a new node.
    
    Parameters:
    name - tag name
    
    Returns:
    generated node
  - getCDATA
```
public Node getCDATA(Node container)
```
    Create a text node for the contents of a CDATA element like style or script which ends with </foo> for some foo.
    
    Parameters:
    container - container node
    
    Returns:
    cdata node
  - ungetToken
```
public void ungetToken()
```
  - getToken
```
public Node getToken(short mode)
```
    Gets a token.
    
    Parameters:
    mode - one of the following:
    
    MixedContent -- for elements which don't accept PCDATA
    
    Preformatted -- white space preserved as is
    
    IgnoreMarkup -- for CDATA elements such as script, style
    
    Returns:
    next Node
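
    The one-level undo that getToken and ungetToken provide together can be sketched without JTidy. The class below is hypothetical and works on plain strings instead of Node objects; the `token` and `pushed` names echo the documented fields, but the logic is an assumption about the general technique, not the lexer's code.

```java
import java.util.Iterator;
import java.util.List;

// Standalone sketch (not JTidy code): a token source with one level of undo.
final class PushbackSketch {
    private final Iterator<String> source;
    private String token;    // most recently returned token
    private boolean pushed;  // true after the token has been pushed back

    PushbackSketch(List<String> tokens) {
        this.source = tokens.iterator();
    }

    /** Returns the pushed-back token if any, else the next one, or null at end of input. */
    String getToken() {
        if (pushed) {
            pushed = false;
            return token;
        }
        token = source.hasNext() ? source.next() : null;
        return token;
    }

    /** Makes the next getToken() call return the current token again. */
    void ungetToken() {
        pushed = true;
    }
}
```

    Calling ungetToken twice in a row has no additional effect here, which is what "one level" of undo means in this sketch.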
  - parseAsp
```
public Node parseAsp()
```
    Parser for ASP within start tags. Some people use ASP to customize attributes. Tidy isn't really well suited to dealing with ASP. This is a workaround for attributes, but it won't deal with the case where the ASP is used to tailor the attribute value. Here is an example of a workaround for using ASP in attribute values: href='<%=rsSchool.Fields("ID").Value%>', where the ASP that generates the attribute value is masked from Tidy by the quote marks.
    
    Returns:
    parsed Node
  - parsePhp
```
public Node parsePhp()
```
    PHP is like ASP but is based upon XML processing instructions, e.g. <?php ... ?>.
    
    Returns:
    parsed Node
  - parseAttribute
```
public java.lang.String parseAttribute(boolean[] isempty,
                              Node[] asp,
                              Node[] php)
```
    consumes the '>' terminating start tags.
    
    Parameters:
    isempty - flag is passed as array so it can be modified
    asp - asp Node, passed as array so it can be modified
    php - php Node, passed as array so it can be modified
    
    Returns:
    parsed attribute
  - parseServerInstruction
```
public int parseServerInstruction()
```
    Invoked when < is seen in place of an attribute value, but terminates on whitespace if not ASP, PHP or Tango. This routine recognizes ' and " quoted strings.
    
    Returns:
    delimiter
  - parseValue
```
public java.lang.String parseValue(java.lang.String name,
                          boolean foldCase,
                          boolean[] isempty,
                          int[] pdelim)
```
    Parse an attribute value.
    
    Parameters:
    name - attribute name
    foldCase - fold case?
    isempty - is attribute empty? Passed as an array reference to allow modification
    pdelim - delimiter, passed as an array reference to allow modification
    
    Returns:
    parsed value
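
    Several parse methods take one-element arrays (isempty, pdelim, asp, php) so that the callee can hand back extra results, since Java passes primitives by value. A standalone sketch of that idiom follows; the unquote helper is hypothetical and far simpler than real attribute value parsing.

```java
// Standalone sketch (not JTidy code): a one-element array used as an out-parameter.
final class OutParamSketch {

    /**
     * Hypothetical helper: strips matching surrounding quotes from a raw value.
     * The delimiter found (or 0 if the value is unquoted) is reported via pdelim[0].
     */
    static String unquote(String raw, int[] pdelim) {
        pdelim[0] = 0;
        if (raw.length() >= 2) {
            char first = raw.charAt(0);
            char last = raw.charAt(raw.length() - 1);
            if ((first == '"' || first == '\'') && first == last) {
                pdelim[0] = first; // visible to the caller after the call returns
                return raw.substring(1, raw.length() - 1);
            }
        }
        return raw;
    }
}
```

    A caller allocates `int[] pdelim = new int[1]`, passes it in, and reads `pdelim[0]` afterwards.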
  - isValidAttrName
```
public static boolean isValidAttrName(java.lang.String attr)
```
    Check if attr is a valid name.
    
    Parameters:
    attr - String to check, must be non-null
    
    Returns:
    true if attr is a valid name.
  - isCSS1Selector
```
public static boolean isCSS1Selector(java.lang.String buf)
```
    In CSS1, selectors can contain only the characters A-Z, 0-9, and Unicode characters 161-255, plus dash (-); they cannot start with a dash or a digit; they can also contain escaped characters and any Unicode character as a numeric code (see next item). The backslash followed by at most four hexadecimal digits (0..9A..F) stands for the Unicode character with that number. Any character except a hexadecimal digit can be escaped to remove its special meaning, by putting a backslash in front.
    
    Parameters:
    buf - css selector name
    
    Returns:
    true if the given string is a valid css1 selector name
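
    The rules quoted above can be turned into a small checker. The following is a standalone sketch written from that description only, not JTidy's implementation; the names are hypothetical. It assumes that lowercase letters are allowed alongside A-Z and that an empty string or a dangling backslash is invalid.

```java
// Standalone sketch (not JTidy code), written from the CSS1 rules quoted above.
// Assumptions: a-z is accepted like A-Z; empty names and a trailing '\' are invalid.
final class Css1SelectorSketch {

    static boolean isHex(char c) {
        return (c >= '0' && c <= '9') || (c >= 'a' && c <= 'f') || (c >= 'A' && c <= 'F');
    }

    static boolean isCss1Name(String buf) {
        if (buf.isEmpty()) {
            return false;
        }
        int i = 0;
        while (i < buf.length()) {
            char c = buf.charAt(i);
            if (c == '\\') {
                i++;
                if (i == buf.length()) {
                    return false; // backslash with nothing after it
                }
                if (isHex(buf.charAt(i))) {
                    // numeric escape: one to four hexadecimal digits
                    int digits = 0;
                    while (i < buf.length() && digits < 4 && isHex(buf.charAt(i))) {
                        i++;
                        digits++;
                    }
                } else {
                    i++; // any other character may be escaped
                }
                continue;
            }
            boolean letter = (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z');
            boolean digit = c >= '0' && c <= '9';
            boolean high = c >= 161 && c <= 255;
            if (!(letter || digit || high || c == '-')) {
                return false;
            }
            if (i == 0 && (digit || c == '-')) {
                return false; // must not start with a dash or a digit
            }
            i++;
        }
        return true;
    }
}
```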
  - parseAttrs
```
public AttVal parseAttrs(boolean[] isempty)
```
    Parse tag attributes.
    
    Parameters:
    isempty - is tag empty?
    
    Returns:
    parsed attribute/value list
  - pushInline
```
public void pushInline(Node node)
```
    Push a copy of an inline node onto the stack, but don't push if implicit or OBJECT or APPLET (implicit tags are ones generated from the istack). One issue arises with pushing inlines when the tag is already pushed. For instance: <p><em> text <p><em> more text shouldn't be mapped to <p><em> text </em></p><p><em><em> more text </em></em>
    
    Parameters:
    node - Node to be pushed
  - popInline
```
public void popInline(Node node)
```
    Pop a copy of an inline node from the stack.
    
    Parameters:
    node - Node to be popped
  - isPushed
```
public boolean isPushed(Node node)
```
    Is the node in the stack?
    
    Parameters:
    node - Node
    
    Returns:
    true if the node is found in the stack
  - isPushedLast
```
public boolean isPushedLast(Node element,
                   Node node)
```
  - inlineDup
```
public int inlineDup(Node node)
```
    This has the effect of inserting "missing" inline elements around the contents of block-level elements such as P, TD, TH, DIV, PRE etc. This procedure is called at the start of ParseBlock when the inline stack is not empty, as will be the case in: <i><h1>italic heading</h1></i> which is then treated as equivalent to <h1><i>italic heading</i></h1>. This is implemented by setting the lexer into a mode where it gets tokens from the inline stack rather than from the input stream.
    
    Parameters:
    node - original node
    
    Returns:
    stack size
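
    The idea of replaying open inline elements inside a block can be shown with a much-simplified standalone sketch. It is not JTidy code: it tracks tag names as strings, ignores implicit tags, attributes and the istackbase frame, and all names are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Standalone sketch (not JTidy code): open inline elements are remembered on a
// stack and re-opened inside each block element that starts while they are open.
final class InlineDupSketch {
    private final Deque<String> istack = new ArrayDeque<>(); // oldest inline first

    void pushInline(String name) {
        istack.addLast(name);
    }

    void popInline() {
        istack.pollLast();
    }

    /** Start tags to insert right after a block element's own start tag. */
    List<String> inlineDup() {
        List<String> inserted = new ArrayList<>();
        for (String name : istack) {
            inserted.add("<" + name + ">");
        }
        return inserted;
    }
}
```

    With `i` pushed, a following `<h1>` would receive `<i>` as its first child token, which corresponds to the `<h1><i>italic heading</i></h1>` reading described above.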
  - inlineDup1
```
public boolean inlineDup1(Node node,
                 Node element)
```
  - insertedToken
```
public Node insertedToken()
```
    Returns:
  - switchInline
```
public boolean switchInline(Node element,
                   Node node)
```
  - findLastLI
```
public Node findLastLI(Node list)
```
  - canPrune
```
public boolean canPrune(Node element)
```
    Can the given element be removed?
    
    Parameters:
    element - node
    
    Returns:
    true if the element can be removed
  - fixId
```
public void fixId(Node node)
```
    duplicate name attribute as an id and check if id and name match.
    
    Parameters:
    node - Node to check for name/id attributes
  - deferDup
```
public void deferDup()
```
    Defer duplicates when entering a table or other element where the inlines shouldn't be duplicated.
  - preContent
```
protected boolean preContent(Node node)
```
    Is content acceptable for pre elements?
    
    Parameters:
    node - content
    
    Returns:
    true if node is acceptable in pre elements
