org.w3c.tidy
Class Lexer

java.lang.Object
  extended by org.w3c.tidy.Lexer

public class Lexer
extends java.lang.Object

Lexer for html parser.

Given a file stream fp, it returns a sequence of tokens.

GetToken(fp) gets the next token.
UngetToken(fp) provides one level undo.

The tags include an attribute list:
- linked list of attribute/value nodes
- each node has 2 null-terminated strings
- entities are replaced in attribute values

White space is compacted if not in preformatted mode. If not in preformatted mode, then leading white space is discarded and subsequent white space sequences compacted to single space chars.

If XmlTags is no, then tag names are folded to upper case and attribute names to lower case.

Not yet done:
- Doctype subset and marked sections

Version:
$Revision: 927 $ ($Author: aditsu $)
Author:
Dave Raggett dsr@w3.org, Andy Quick ac.quick@sympatico.ca (translation to Java), Fabrizio Giustina

Field Summary
protected  short badAccess
          for accessibility errors.
protected  short badChars
          for bad char encodings.
protected  boolean badDoctype
          set if html or PUBLIC is missing.
protected  short badForm
          for mismatched/mispositioned form tags.
protected  short badLayout
          for bad style errors.
protected  int columns
          at start of current token.
protected  Configuration configuration
          configuration.
protected  int doctype
          version as given by doctype (if any).
protected  short errors
          count of errors.
protected  java.io.PrintWriter errout
          error output stream.
protected  boolean excludeBlocks
          Netscape compatibility.
protected  boolean exiled
          true if moved out of table.
static short IGNORE_MARKUP
          state: ignore markup.
static short IGNORE_WHITESPACE
          state: ignore whitespace.
protected  StreamIn in
          file stream.
protected  Node inode
          Inline stack for compatibility with Mosaic.
protected  int insert
          for inferring inline tags.
protected  boolean insertspace
          when space is moved after end tag.
protected  java.util.Stack istack
          stack.
protected  int istackbase
          start of frame.
protected  boolean isvoyager
          true if xmlns attribute on html element.
protected  byte[] lexbuf
          Lexer character buffer. Parse tree nodes span onto this buffer, which contains the concatenated text contents of all of the elements.
protected  int lexlength
          allocated.
protected  int lexsize
          used.
protected  int lines
          lines seen.
static short MIXED_CONTENT
          state: mixed content.
static short PREFORMATTED
          state: preformatted.
protected  boolean pushed
          true after token has been pushed back.
protected  Report report
          report.
protected  Node root
          Root node is saved here.
protected  boolean seenEndBody
          already seen end body tag?
protected  boolean seenEndHtml
          already seen end html tag?
protected  short state
          state of lexer's finite state machine.
protected  Style styles
          used for cleaning up presentation markup.
protected  Node token
          current node.
protected  int txtend
          end of current node.
protected  int txtstart
          start of current node.
protected  short versions
          bit vector of HTML versions.
protected  short warnings
          count of warnings in this document.
protected  boolean waswhite
          used to collapse contiguous white space.
 
Constructor Summary
Lexer(StreamIn in, Configuration configuration, Report report)
          Instantiates a new Lexer.
 
Method Summary
 void addByte(int c)
          Adds a byte to lexer buffer.
 void addCharToLexer(int c)
          Store char c as UTF-8 encoded byte stream.
 boolean addGenerator(Node root)
          Add meta element for Tidy.
 void addStringLiteral(java.lang.String str)
          Calls addCharToLexer for each char in the string.
 void addStringToLexer(java.lang.String str)
          Adds a string to lexer buffer.
 short apparentVersion()
          Return the html version used in document.
 boolean canPrune(Node element)
          Can the given element be removed?
 void changeChar(byte c)
          Substitute the last char in buffer.
 boolean checkDocTypeKeyWords(Node doctype)
          Check system keywords (keywords should be uppercase).
 AttVal cloneAttributes(AttVal attrs)
          Clones an attribute value and adds any ASP or PHP node to the node list.
 Node cloneNode(Node node)
          Clones a node and adds it to the node list.
 void deferDup()
          Defer duplicates when entering a table or other element where the inlines shouldn't be duplicated.
 boolean endOfInput()
          Has end of input stream been reached?
 short findGivenVersion(Node doctype)
          Examine DOCTYPE to identify version.
 boolean fixDocType(Node root)
          Fixup doctype if missing.
 void fixHTMLNameSpace(Node root, java.lang.String profile)
          Fix xhtml namespace.
 void fixId(Node node)
          Duplicates the name attribute as an id and checks whether id and name match.
 boolean fixXmlDecl(Node root)
          Ensure XML document starts with <?xml version="1.0"?>.
 Node getCDATA(Node container)
          Create a text node for the contents of a CDATA element like style or script which ends with </foo> for some foo.
 Node getToken(short mode)
          Gets a token.
 short htmlVersion()
          Choose what version to use for new doctype.
 java.lang.String htmlVersionName()
          Choose what version to use for new doctype.
 Node inferredTag(java.lang.String name)
          Generates and inserts a new node.
 int inlineDup(Node node)
          This has the effect of inserting "missing" inline elements around the contents of blocklevel elements such as P, TD, TH, DIV, PRE etc.
 Node insertedToken()
           
static boolean isCSS1Selector(java.lang.String buf)
          In CSS1, selectors can contain only the characters A-Z, 0-9, and Unicode characters 161-255, plus dash (-); they cannot start with a dash or a digit; they can also contain escaped characters and any Unicode character as a numeric code (see next item).
 boolean isPushed(Node node)
          Is the node in the stack?
static boolean isValidAttrName(java.lang.String attr)
          Check if attr is a valid name.
 Node newLineNode()
          Adds a new line node.
 Node newNode()
          Creates a new node and adds it to the node list.
 Node newNode(short type, byte[] textarray, int start, int end)
          Creates a new node and adds it to the node list.
 Node newNode(short type, byte[] textarray, int start, int end, java.lang.String element)
          Creates a new node and adds it to the node list.
 Node parseAsp()
          Parser for ASP within start tags. Some people use ASP to customize attributes. Tidy isn't really well suited to dealing with ASP. This is a workaround for attributes, but won't deal with the case where the ASP is used to tailor the attribute value.
 java.lang.String parseAttribute(boolean[] isempty, Node[] asp, Node[] php)
          Consumes the '>' terminating start tags.
 AttVal parseAttrs(boolean[] isempty)
          Parse tag attributes.
 void parseEntity(short mode)
          Parse an html entity.
 Node parsePhp()
          PHP is like ASP but is based upon XML processing instructions, e.g.
 int parseServerInstruction()
          Invoked when < is seen in place of an attribute value, but terminates on whitespace if not ASP, PHP or Tango. This routine recognizes ' and " quoted strings.
 char parseTagName()
          Parses a tag name.
 java.lang.String parseValue(java.lang.String name, boolean foldCase, boolean[] isempty, int[] pdelim)
          Parse an attribute value.
 void popInline(Node node)
          Pop a copy of an inline node from the stack.
protected  boolean preContent(Node node)
          Is content acceptable for pre elements?
 void pushInline(Node node)
          Push a copy of an inline node onto the stack, but don't push if implicit or OBJECT or APPLET (implicit tags are ones generated from the istack). One issue arises with pushing inlines when the tag is already pushed.
 boolean setXHTMLDocType(Node root)
          Adds a new xhtml doctype to the document.
 void ungetToken()
           
protected  void updateNodeTextArrays(byte[] oldtextarray, byte[] newtextarray)
          Update oldtextarray in the current nodes.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

IGNORE_WHITESPACE

public static final short IGNORE_WHITESPACE
state: ignore whitespace.

See Also:
Constant Field Values

MIXED_CONTENT

public static final short MIXED_CONTENT
state: mixed content.

See Also:
Constant Field Values

PREFORMATTED

public static final short PREFORMATTED
state: preformatted.

See Also:
Constant Field Values

IGNORE_MARKUP

public static final short IGNORE_MARKUP
state: ignore markup.

See Also:
Constant Field Values

in

protected StreamIn in
file stream.


errout

protected java.io.PrintWriter errout
error output stream.


badAccess

protected short badAccess
for accessibility errors.


badLayout

protected short badLayout
for bad style errors.


badChars

protected short badChars
for bad char encodings.


badForm

protected short badForm
for mismatched/mispositioned form tags.


warnings

protected short warnings
count of warnings in this document.


errors

protected short errors
count of errors.


lines

protected int lines
lines seen.


columns

protected int columns
at start of current token.


waswhite

protected boolean waswhite
used to collapse contiguous white space.
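
The class description says that outside preformatted mode leading white space is discarded and later runs are compacted to single spaces. A minimal standalone sketch of that rule follows. The class and method names are hypothetical and not part of JTidy, the local flag only mirrors the role described for waswhite, and the set of characters treated as white space is an assumption.

```java
/** Standalone sketch of white space compaction; not JTidy code. */
final class WhitespaceCompactor {

    /**
     * Drops leading white space and collapses each later run of white
     * space to one space character. Preformatted text is returned as is.
     */
    static String compact(String text, boolean preformatted) {
        if (preformatted) {
            return text;
        }
        StringBuilder out = new StringBuilder();
        // Starts as true so that leading white space is discarded.
        boolean wasWhite = true;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            // Assumed white space set: space, tab, newline, CR, form feed.
            boolean white = c == ' ' || c == '\t' || c == '\n' || c == '\r' || c == '\f';
            if (white) {
                if (!wasWhite) {
                    out.append(' ');
                }
                wasWhite = true;
            } else {
                out.append(c);
                wasWhite = false;
            }
        }
        return out.toString();
    }
}
```

In this sketch a trailing run of white space still yields one trailing space; whether the real lexer trims it is not stated in this page.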


pushed

protected boolean pushed
true after token has been pushed back.


insertspace

protected boolean insertspace
when space is moved after end tag.


excludeBlocks

protected boolean excludeBlocks
Netscape compatibility.


exiled

protected boolean exiled
true if moved out of table.


isvoyager

protected boolean isvoyager
true if xmlns attribute on html element.


versions

protected short versions
bit vector of HTML versions.
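
A bit vector of versions is typically narrowed with a bitwise AND each time a construct is seen that only some versions allow. The sketch below illustrates that idea only. The constant names and values are hypothetical and are not JTidy's actual version constants, and the page does not confirm that the lexer narrows the field this way.

```java
/** Standalone sketch of a version bit vector; constants are made up. */
final class VersionBits {
    // Hypothetical flags, one bit per HTML version.
    static final short HTML20 = 1;
    static final short HTML32 = 2;
    static final short HTML40_STRICT = 4;
    static final short HTML40_LOOSE = 8;
    static final short ALL = (short) (HTML20 | HTML32 | HTML40_STRICT | HTML40_LOOSE);

    /** Keeps only the versions that also allow the construct just seen. */
    static short constrain(short versions, short allowedByConstruct) {
        return (short) (versions & allowedByConstruct);
    }

    /** True if the given version bit is still set in the vector. */
    static boolean supports(short versions, short version) {
        return (versions & version) != 0;
    }
}
```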


doctype

protected int doctype
version as given by doctype (if any).


badDoctype

protected boolean badDoctype
set if html or PUBLIC is missing.


txtstart

protected int txtstart
start of current node.


txtend

protected int txtend
end of current node.


state

protected short state
state of lexer's finite state machine.


token

protected Node token
current node.


lexbuf

protected byte[] lexbuf
Lexer character buffer. Parse tree nodes span onto this buffer, which contains the concatenated text contents of all of the elements. The lexsize field must be reset for each file. Byte buffer of UTF-8 chars.


lexlength

protected int lexlength
allocated.


lexsize

protected int lexsize
used.


inode

protected Node inode
Inline stack for compatibility with Mosaic. For deferring text node.


insert

protected int insert
for inferring inline tags.


istack

protected java.util.Stack istack
stack.


istackbase

protected int istackbase
start of frame.


styles

protected Style styles
used for cleaning up presentation markup.


configuration

protected Configuration configuration
configuration.


seenEndBody

protected boolean seenEndBody
already seen end body tag?


seenEndHtml

protected boolean seenEndHtml
already seen end html tag?


report

protected Report report
report.


root

protected Node root
Root node is saved here.

Constructor Detail

Lexer

public Lexer(StreamIn in,
             Configuration configuration,
             Report report)
Instantiates a new Lexer.

Parameters:
in - StreamIn
configuration - configuration instance
report - report instance, for reporting errors
Method Detail

newNode

public Node newNode()
Creates a new node and adds it to the node list.

Returns:
Node

newNode

public Node newNode(short type,
                    byte[] textarray,
                    int start,
                    int end)
Creates a new node and adds it to the node list.

Parameters:
type - node type: Node.ROOT_NODE | Node.DOCTYPE_TAG | Node.COMMENT_TAG | Node.PROC_INS_TAG | Node.TEXT_NODE | Node.START_TAG | Node.END_TAG | Node.START_END_TAG | Node.CDATA_TAG | Node.SECTION_TAG | Node.ASP_TAG | Node.JSTE_TAG | Node.PHP_TAG | Node.XML_DECL
textarray - array of bytes contained in the Node
start - start position
end - end position
Returns:
Node

newNode

public Node newNode(short type,
                    byte[] textarray,
                    int start,
                    int end,
                    java.lang.String element)
Creates a new node and adds it to the node list.

Parameters:
type - node type: Node.ROOT_NODE | Node.DOCTYPE_TAG | Node.COMMENT_TAG | Node.PROC_INS_TAG | Node.TEXT_NODE | Node.START_TAG | Node.END_TAG | Node.START_END_TAG | Node.CDATA_TAG | Node.SECTION_TAG | Node.ASP_TAG | Node.JSTE_TAG | Node.PHP_TAG | Node.XML_DECL
textarray - array of bytes contained in the Node
start - start position
end - end position
element - tag name
Returns:
Node

cloneNode

public Node cloneNode(Node node)
Clones a node and adds it to the node list.

Parameters:
node - Node
Returns:
cloned Node

cloneAttributes

public AttVal cloneAttributes(AttVal attrs)
Clones an attribute value and adds any ASP or PHP node to the node list.

Parameters:
attrs - original AttVal
Returns:
cloned AttVal

updateNodeTextArrays

protected void updateNodeTextArrays(byte[] oldtextarray,
                                    byte[] newtextarray)
Update oldtextarray in the current nodes.

Parameters:
oldtextarray - previous text array
newtextarray - new text array

newLineNode

public Node newLineNode()
Adds a new line node. Used for creating preformatted text from Word2000.

Returns:
new line node

endOfInput

public boolean endOfInput()
Has end of input stream been reached?

Returns:
true if the end of the input stream has been reached

addByte

public void addByte(int c)
Adds a byte to lexer buffer.

Parameters:
c - byte to add
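
The lexlength ("allocated") and lexsize ("used") fields suggest a byte array that grows on demand. The sketch below shows that pattern in isolation. The class name, the initial capacity and the doubling policy are assumptions for illustration and are not taken from JTidy.

```java
import java.util.Arrays;

/** Standalone sketch of a growable byte buffer; not JTidy code. */
final class GrowableBytes {
    private byte[] buf = new byte[8]; // allocated storage (compare lexlength)
    private int size = 0;             // bytes in use (compare lexsize)

    /** Appends the low 8 bits of c, growing the array when it is full. */
    void addByte(int c) {
        if (size == buf.length) {
            // Assumed policy: double the capacity on overflow.
            buf = Arrays.copyOf(buf, buf.length * 2);
        }
        buf[size++] = (byte) c;
    }

    /** Replaces the last stored byte, if there is one. */
    void changeLast(byte c) {
        if (size > 0) {
            buf[size - 1] = c;
        }
    }

    int size() {
        return size;
    }

    int capacity() {
        return buf.length;
    }

    byte get(int index) {
        if (index < 0 || index >= size) {
            throw new IndexOutOfBoundsException("index " + index);
        }
        return buf[index];
    }
}
```

Growing replaces the backing array, which is presumably why this class also documents updateNodeTextArrays(oldtextarray, newtextarray); the sketch does not model nodes that hold a reference to the old array.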

changeChar

public void changeChar(byte c)
Substitute the last char in buffer.

Parameters:
c - new char

addCharToLexer

public void addCharToLexer(int c)
Store char c as UTF-8 encoded byte stream.

Parameters:
c - char to store
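
Storing a char as UTF-8 means emitting one to four bytes depending on the size of the code point. The sketch below shows the standard encoding rules as a standalone helper. The class and method names are hypothetical, and it returns a new array instead of writing into lexbuf.

```java
import java.io.ByteArrayOutputStream;

/** Standalone sketch of UTF-8 encoding for one code point; not JTidy code. */
final class Utf8Sketch {

    /** Encodes code point c as 1 to 4 UTF-8 bytes. Surrogates are not rejected. */
    static byte[] encode(int c) {
        if (c < 0 || c > 0x10FFFF) {
            throw new IllegalArgumentException("not a code point: " + c);
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream(4);
        if (c < 0x80) {
            out.write(c);                         // 0xxxxxxx
        } else if (c < 0x800) {
            out.write(0xC0 | (c >> 6));           // 110xxxxx
            out.write(0x80 | (c & 0x3F));         // 10xxxxxx
        } else if (c < 0x10000) {
            out.write(0xE0 | (c >> 12));          // 1110xxxx
            out.write(0x80 | ((c >> 6) & 0x3F));
            out.write(0x80 | (c & 0x3F));
        } else {
            out.write(0xF0 | (c >> 18));          // 11110xxx
            out.write(0x80 | ((c >> 12) & 0x3F));
            out.write(0x80 | ((c >> 6) & 0x3F));
            out.write(0x80 | (c & 0x3F));
        }
        return out.toByteArray();
    }
}
```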

addStringToLexer

public void addStringToLexer(java.lang.String str)
Adds a string to lexer buffer.

Parameters:
str - String to add

parseEntity

public void parseEntity(short mode)
Parse an html entity.

Parameters:
mode - mode

parseTagName

public char parseTagName()
Parses a tag name.

Returns:
first char after the tag name

addStringLiteral

public void addStringLiteral(java.lang.String str)
Calls addCharToLexer for each char in the string.

Parameters:
str - input String

htmlVersion

public short htmlVersion()
Choose what version to use for new doctype.

Returns:
html version constant

htmlVersionName

public java.lang.String htmlVersionName()
Choose what version to use for new doctype.

Returns:
html version name

addGenerator

public boolean addGenerator(Node root)
Add meta element for Tidy. If the meta tag is already present, update release date.

Parameters:
root - root node
Returns:
true if the tag has been added

checkDocTypeKeyWords

public boolean checkDocTypeKeyWords(Node doctype)
Check system keywords (keywords should be uppercase).

Parameters:
doctype - doctype node
Returns:
true if doctype keywords are all uppercase
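
One way to picture this check is to look for each keyword case-insensitively and fail if any occurrence is not written in upper case. The sketch below works on a plain String rather than a Node. The keyword list (PUBLIC and SYSTEM) and all names are assumptions for illustration, not the list JTidy uses.

```java
/** Standalone sketch of an upper case keyword check; not JTidy code. */
final class DoctypeKeywordSketch {
    // Assumed keyword list for illustration only.
    private static final String[] KEYWORDS = {"PUBLIC", "SYSTEM"};

    /** True if every occurrence of a keyword is spelled in upper case. */
    static boolean keywordsAreUppercase(String doctypeText) {
        for (String keyword : KEYWORDS) {
            int len = keyword.length();
            for (int i = 0; i + len <= doctypeText.length(); i++) {
                boolean matchesIgnoringCase = doctypeText.regionMatches(true, i, keyword, 0, len);
                boolean matchesExactly = doctypeText.regionMatches(false, i, keyword, 0, len);
                if (matchesIgnoringCase && !matchesExactly) {
                    return false;
                }
            }
        }
        return true;
    }
}
```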

findGivenVersion

public short findGivenVersion(Node doctype)
Examine DOCTYPE to identify version.

Parameters:
doctype - doctype node
Returns:
version code

fixHTMLNameSpace

public void fixHTMLNameSpace(Node root,
                             java.lang.String profile)
Fix xhtml namespace.

Parameters:
root - root Node
profile - current profile

setXHTMLDocType

public boolean setXHTMLDocType(Node root)
Adds a new xhtml doctype to the document.

Parameters:
root - root node
Returns:
true if a doctype has been added

apparentVersion

public short apparentVersion()
Return the html version used in document.

Returns:
version code

fixDocType

public boolean fixDocType(Node root)
Fixup doctype if missing.

Parameters:
root - root node
Returns:
false if current version has not been identified

fixXmlDecl

public boolean fixXmlDecl(Node root)
Ensure XML document starts with <?xml version="1.0"?>. Add encoding attribute if not using ASCII or UTF-8 output.

Parameters:
root - root node
Returns:
always true

inferredTag

public Node inferredTag(java.lang.String name)
Generates and inserts a new node.

Parameters:
name - tag name
Returns:
generated node

getCDATA

public Node getCDATA(Node container)
Create a text node for the contents of a CDATA element like style or script which ends with </foo> for some foo.

Parameters:
container - container node
Returns:
cdata node

ungetToken

public void ungetToken()

getToken

public Node getToken(short mode)
Gets a token.

Parameters:
mode - one of the following:
  • MixedContent -- for elements which don't accept PCDATA
  • Preformatted -- white space preserved as is
  • IgnoreMarkup -- for CDATA elements such as script, style
Returns:
next Node
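
The class description notes that ungetToken provides one level of undo, and the pushed field is documented as "true after token has been pushed back". The sketch below shows how such a one-slot pushback can work. It uses String tokens from an Iterator instead of Node objects and a StreamIn, and every name in it is hypothetical.

```java
import java.util.Iterator;

/** Standalone sketch of one-level token pushback; not JTidy code. */
final class PushbackTokens {
    private final Iterator<String> source;
    private String token;   // most recently returned token
    private boolean pushed; // true after the token has been pushed back

    PushbackTokens(Iterator<String> source) {
        this.source = source;
    }

    /** Returns the pushed-back token if there is one, else reads the next one. */
    String getToken() {
        if (pushed) {
            pushed = false;
            return token;
        }
        token = source.hasNext() ? source.next() : null; // null marks end of input
        return token;
    }

    /** Makes the next getToken() call return the current token again. */
    void ungetToken() {
        pushed = true;
    }
}
```

Because only a flag is stored, calling ungetToken() twice in a row still undoes a single token, which matches the "one level" wording.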

parseAsp

public Node parseAsp()
Parser for ASP within start tags. Some people use ASP to customize attributes. Tidy isn't really well suited to dealing with ASP. This is a workaround for attributes, but won't deal with the case where the ASP is used to tailor the attribute value. Here is an example of a workaround for using ASP in attribute values:

href='<%=rsSchool.Fields("ID").Value%>'

where the ASP that generates the attribute value is masked from Tidy by the quote marks.

Returns:
parsed Node

parsePhp

public Node parsePhp()
PHP is like ASP but is based upon XML processing instructions, e.g. <?php ... ?>.

Returns:
parsed Node

parseAttribute

public java.lang.String parseAttribute(boolean[] isempty,
                                       Node[] asp,
                                       Node[] php)
Consumes the '>' terminating start tags.

Parameters:
isempty - flag is passed as array so it can be modified
asp - asp Node, passed as array so it can be modified
php - php Node, passed as array so it can be modified
Returns:
parsed attribute
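
Java passes primitives by value, so a one-element array is a common way for a method to report a second result to its caller, as the isempty, asp and php parameters do here. The sketch below shows the idiom with a hypothetical helper that is unrelated to JTidy's real parsing logic.

```java
/** Standalone sketch of the array-as-out-parameter idiom; not JTidy code. */
final class OutParamSketch {

    /**
     * Returns the tag name and reports through isempty[0] whether the tag
     * text ends with "/>". The tag must start with '<'.
     */
    static String parseName(String tag, boolean[] isempty) {
        if (tag.isEmpty() || tag.charAt(0) != '<') {
            throw new IllegalArgumentException("tag must start with '<'");
        }
        isempty[0] = tag.endsWith("/>"); // second result, visible to the caller
        int end = 1;
        while (end < tag.length() && Character.isLetterOrDigit(tag.charAt(end))) {
            end++;
        }
        return tag.substring(1, end);
    }
}
```

A caller allocates the holder first, for example boolean[] isempty = new boolean[1], and reads isempty[0] after the call.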

parseServerInstruction

public int parseServerInstruction()
Invoked when < is seen in place of an attribute value, but terminates on whitespace if not ASP, PHP or Tango. This routine recognizes ' and " quoted strings.

Returns:
delimiter

parseValue

public java.lang.String parseValue(java.lang.String name,
                                   boolean foldCase,
                                   boolean[] isempty,
                                   int[] pdelim)
Parse an attribute value.

Parameters:
name - attribute name
foldCase - fold case?
isempty - is attribute empty? Passed as an array reference to allow modification
pdelim - delimiter, passed as an array reference to allow modification
Returns:
parsed value

isValidAttrName

public static boolean isValidAttrName(java.lang.String attr)
Check if attr is a valid name.

Parameters:
attr - String to check, must be non-null
Returns:
true if attr is a valid name.

isCSS1Selector

public static boolean isCSS1Selector(java.lang.String buf)
In CSS1, selectors can contain only the characters A-Z, 0-9, and Unicode characters 161-255, plus dash (-); they cannot start with a dash or a digit; they can also contain escaped characters and any Unicode character as a numeric code (see next item). The backslash followed by at most four hexadecimal digits (0..9A..F) stands for the Unicode character with that number. Any character except a hexadecimal digit can be escaped to remove its special meaning, by putting a backslash in front.

Parameters:
buf - css selector name
Returns:
true if the given string is a valid css1 selector name
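
The rules above can be read as a small character-by-character check. The sketch below follows the description literally and is not JTidy's implementation. It assumes that lower case letters count as A-Z, that an empty string is invalid, and that a trailing backslash is invalid; the class and method names are hypothetical.

```java
/** Standalone sketch of the CSS1 selector rules described above. */
final class Css1SelectorSketch {

    static boolean isCss1Selector(String buf) {
        if (buf == null || buf.isEmpty()) {
            return false; // assumption: empty names are rejected
        }
        int i = 0;
        while (i < buf.length()) {
            char c = buf.charAt(i);
            if (c == '\\') {
                i++;
                if (i >= buf.length()) {
                    return false; // assumption: dangling backslash is invalid
                }
                if (isHex(buf.charAt(i))) {
                    // Numeric code: at most four hexadecimal digits.
                    int digits = 0;
                    while (i < buf.length() && digits < 4 && isHex(buf.charAt(i))) {
                        i++;
                        digits++;
                    }
                } else {
                    i++; // any other escaped character
                }
                continue;
            }
            boolean letter = (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z');
            boolean digit = c >= '0' && c <= '9';
            boolean high = c >= 161 && c <= 255;
            boolean first = i == 0;
            // Digits and dashes are allowed anywhere except the first position.
            if (letter || high || (!first && (digit || c == '-'))) {
                i++;
            } else {
                return false;
            }
        }
        return true;
    }

    private static boolean isHex(char c) {
        return (c >= '0' && c <= '9') || (c >= 'a' && c <= 'f') || (c >= 'A' && c <= 'F');
    }
}
```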

parseAttrs

public AttVal parseAttrs(boolean[] isempty)
Parse tag attributes.

Parameters:
isempty - is tag empty?
Returns:
parsed attribute/value list

pushInline

public void pushInline(Node node)
Push a copy of an inline node onto the stack, but don't push if implicit or OBJECT or APPLET (implicit tags are ones generated from the istack). One issue arises with pushing inlines when the tag is already pushed. For instance:

<p><em> text <p><em> more text

shouldn't be mapped to

<p><em> text </em></p><p><em><em> more text </em></em>

Parameters:
node - Node to be pushed
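
The push rules described above can be sketched with tag names in place of Node objects. Everything below is hypothetical: the class, the lower case tag names, and the blanket "skip if already pushed" rule, which is only one way to avoid the doubled em shown in the example.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Standalone sketch of the inline stack push rules; not JTidy code. */
final class InlineStackSketch {
    private final Deque<String> istack = new ArrayDeque<>();

    /** Is a tag with this name already on the stack? */
    boolean isPushed(String tag) {
        return istack.contains(tag);
    }

    /** Pushes the tag unless it is implicit, OBJECT, APPLET, or already pushed. */
    void pushInline(String tag, boolean implicit) {
        if (implicit || tag.equals("object") || tag.equals("applet")) {
            return;
        }
        if (isPushed(tag)) {
            return; // avoids the doubled inline from the example above
        }
        istack.push(tag);
    }

    int size() {
        return istack.size();
    }
}
```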

popInline

public void popInline(Node node)
Pop a copy of an inline node from the stack.

Parameters:
node - Node to be popped

isPushed

public boolean isPushed(Node node)
Is the node in the stack?

Parameters:
node - Node
Returns:
true if the node is found in the stack

inlineDup

public int inlineDup(Node node)
This has the effect of inserting "missing" inline elements around the contents of blocklevel elements such as P, TD, TH, DIV, PRE etc. This procedure is called at the start of ParseBlock when the inline stack is not empty, as will be the case in:

<i><h1>italic heading</h1></i>

which is then treated as equivalent to

<h1><i>italic heading</i></h1>

This is implemented by setting the lexer into a mode where it gets tokens from the inline stack rather than from the input stream.

Parameters:
node - original node
Returns:
stack size
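
The "tokens from the inline stack rather than from the input stream" mode can be pictured as a token source with a replay cursor. The sketch below uses String tokens, and all names are hypothetical. The local insert variable borrows its name from the field documented as "for inferring inline tags", but its role here is an assumption.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/** Standalone sketch of replaying open inlines inside a block; not JTidy code. */
final class InlineReplaySketch {
    private final List<String> istack;    // open inline start tags, oldest first
    private final Iterator<String> input; // the remaining input tokens
    private int insert = -1;              // next stack index to replay, -1 when reading input

    InlineReplaySketch(List<String> openInlines, Iterator<String> input) {
        this.istack = new ArrayList<>(openInlines);
        this.input = input;
    }

    /** Called at the start of a block; returns how many inlines will be replayed. */
    int inlineDup() {
        if (!istack.isEmpty()) {
            insert = 0;
        }
        return istack.size();
    }

    /** Returns replayed inline tags first, then tokens from the input. */
    String getToken() {
        if (insert >= 0) {
            String replayed = istack.get(insert++);
            if (insert == istack.size()) {
                insert = -1; // replay finished, go back to the input
            }
            return replayed;
        }
        return input.hasNext() ? input.next() : null;
    }
}
```

With i open and the input positioned just after the h1 start tag, the token sequence becomes the i start tag, the heading text, then the h1 end tag, which corresponds to the second form in the example.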

insertedToken

public Node insertedToken()
Returns:

canPrune

public boolean canPrune(Node element)
Can the given element be removed?

Parameters:
element - node
Returns:
true if the element can be removed

fixId

public void fixId(Node node)
Duplicates the name attribute as an id and checks whether id and name match.

Parameters:
node - Node to check for name/id attributes

deferDup

public void deferDup()
Defer duplicates when entering a table or other element where the inlines shouldn't be duplicated.


preContent

protected boolean preContent(Node node)
Is content acceptable for pre elements?

Parameters:
node - content
Returns:
true if node is acceptable in pre elements