public class Lexer
extends java.lang.Object
Given a file stream fp it returns a sequence of tokens.

GetToken(fp) gets the next token.
UngetToken(fp) provides one level undo.

The tags include an attribute list:
- linked list of attribute/value nodes
- each node has 2 null-terminated strings
- entities are replaced in attribute values

White space is compacted if not in preformatted mode. If not in preformatted mode then leading white space is discarded and subsequent white space sequences compacted to single space chars.

If XmlTags is no then Tag names are folded to upper case and attribute names to lower case.

Not yet done:
- Doctype subset and marked sections
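The white space rule above can be illustrated with a small standalone sketch. It does not use any JTidy class; the class name `WhitespaceCompactionSketch` and the method `compact` are hypothetical, and the set of characters treated as white space is an assumption.

```java
final class WhitespaceCompactionSketch {
    /**
     * Hypothetical helper, not part of JTidy. In non-preformatted mode it
     * discards leading white space and compacts each later white space run
     * to a single space char. In preformatted mode the text is unchanged.
     */
    static String compact(String text, boolean preformatted) {
        if (preformatted) {
            return text;
        }
        StringBuilder out = new StringBuilder();
        // Starts as true so that leading white space is discarded.
        boolean wasWhite = true;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            // Assumption: these five characters count as white space.
            boolean white = c == ' ' || c == '\t' || c == '\n' || c == '\r' || c == '\f';
            if (white) {
                if (!wasWhite) {
                    out.append(' ');
                }
                wasWhite = true;
            } else {
                out.append(c);
                wasWhite = false;
            }
        }
        return out.toString();
    }
}
```

With this sketch, `compact("  a \n\t b  c ", false)` yields `"a b c "`: a trailing run is compacted to one space rather than removed.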
Modifier and Type | Field and Description |
---|---|
protected short | badAccess: for accessibility errors. |
protected short | badChars: for bad char encodings. |
protected boolean | badDoctype: set if html or PUBLIC is missing. |
protected short | badForm: for mismatched/mispositioned form tags. |
protected short | badLayout: for bad style errors. |
protected int | columns: at start of current token. |
protected Configuration | configuration: configuration. |
protected int | doctype: version as given by doctype (if any). |
protected short | errors: count of errors. |
protected java.io.PrintWriter | errout: error output stream. |
protected boolean | excludeBlocks: Netscape compatibility. |
protected boolean | exiled: true if moved out of table. |
static short | IGNORE_MARKUP: state: ignore markup. |
static short | IGNORE_WHITESPACE: state: ignore whitespace. |
protected StreamIn | in: file stream. |
protected Node | inode: Inline stack for compatibility with Mosaic. |
protected int | insert: for inferring inline tags. |
protected boolean | insertspace: when space is moved after end tag. |
protected java.util.Stack | istack: stack. |
protected int | istackbase: start of frame. |
protected boolean | isvoyager: true if xmlns attribute on html element. |
protected byte[] | lexbuf: Lexer character buffer. Parse tree nodes span onto this buffer, which contains the concatenated text contents of all of the elements. |
protected int | lexlength: allocated. |
protected int | lexsize: used. |
protected int | lines: lines seen. |
static short | MIXED_CONTENT: state: mixed content. |
static short | PREFORMATTED: state: preformatted. |
protected boolean | pushed: true after token has been pushed back. |
protected Report | report: report. |
protected Node | root: Root node is saved here. |
protected boolean | seenEndBody: already seen end body tag? |
protected boolean | seenEndHtml: already seen end html tag? |
protected short | state: state of lexer's finite state machine. |
protected Style | styles: used for cleaning up presentation markup. |
protected Node | token: current node. |
protected int | txtend: end of current node. |
protected int | txtstart: start of current node. |
protected short | versions: bit vector of HTML versions. |
protected short | warnings: count of warnings in this document. |
protected boolean | waswhite: used to collapse contiguous white space. |
Constructor and Description |
---|
Lexer(StreamIn in, Configuration configuration, Report report): Instantiates a new Lexer. |
Modifier and Type | Method and Description |
---|---|
void | addByte(int c): Adds a byte to lexer buffer. |
void | addCharToLexer(int c): Store char c as UTF-8 encoded byte stream. |
boolean | addGenerator(Node root): Add meta element for Tidy. |
void | addStringLiteral(java.lang.String str): Calls addCharToLexer for each char in the string. |
void | addStringToLexer(java.lang.String str): Adds a string to lexer buffer. |
short | apparentVersion(): Return the html version used in document. |
boolean | canPrune(Node element): Can the given element be removed? |
void | changeChar(byte c): Substitute the last char in buffer. |
boolean | checkDocTypeKeyWords(Node doctype): Check system keywords (keywords should be uppercase). |
AttVal | cloneAttributes(AttVal attrs): Clones an attribute value and adds any ASP or PHP node to the node list. |
Node | cloneNode(Node node): Clones a node and adds it to the node list. |
void | deferDup(): Defer duplicates when entering a table or other element where the inlines shouldn't be duplicated. |
boolean | endOfInput(): Has end of input stream been reached? |
short | findGivenVersion(Node doctype): Examine DOCTYPE to identify version. |
Node | findLastLI(Node list) |
boolean | fixDocType(Node root): Fixup doctype if missing. |
void | fixHTMLNameSpace(Node root, java.lang.String profile): Fix xhtml namespace. |
void | fixId(Node node): Duplicate name attribute as an id and check if id and name match. |
boolean | fixXmlDecl(Node root): Ensure XML document starts with <?XML version="1.0"?>. |
Node | getCDATA(Node container): Create a text node for the contents of a CDATA element like style or script which ends with </foo> for some foo. |
Node | getToken(short mode): Gets a token. |
short | htmlVersion(): Choose what version to use for new doctype. |
java.lang.String | htmlVersionName(): Choose what version to use for new doctype. |
Node | inferredTag(java.lang.String name): Generates and inserts a new node. |
int | inlineDup(Node node): This has the effect of inserting "missing" inline elements around the contents of blocklevel elements such as P, TD, TH, DIV, PRE etc. |
boolean | inlineDup1(Node node, Node element) |
Node | insertedToken() |
static boolean | isCSS1Selector(java.lang.String buf): In CSS1, selectors can contain only the characters A-Z, 0-9, and Unicode characters 161-255, plus dash (-); they cannot start with a dash or a digit; they can also contain escaped characters and any Unicode character as a numeric code (see next item). |
boolean | isPushed(Node node): Is the node in the stack? |
boolean | isPushedLast(Node element, Node node) |
static boolean | isValidAttrName(java.lang.String attr): Check if attr is a valid name. |
Node | newLineNode(): Adds a new line node. |
Node | newNode(): Creates a new node and adds it to the node list. |
Node | newNode(short type, byte[] textarray, int start, int end): Creates a new node and adds it to the node list. |
Node | newNode(short type, byte[] textarray, int start, int end, java.lang.String element): Creates a new node and adds it to the node list. |
Node | parseAsp(): Parser for ASP within start tags. Some people use ASP to customize attributes. Tidy isn't really well suited to dealing with ASP. This is a workaround for attributes, but it won't deal with the case where the ASP is used to tailor the attribute value. |
java.lang.String | parseAttribute(boolean[] isempty, Node[] asp, Node[] php): Consumes the '>' terminating start tags. |
AttVal | parseAttrs(boolean[] isempty): Parse tag attributes. |
void | parseEntity(short mode): Parse an html entity. |
Node | parsePhp(): PHP is like ASP but is based upon XML processing instructions, e.g. <?php ... ?>. |
int | parseServerInstruction(): Invoked when < is seen in place of an attribute value, but terminates on whitespace if not ASP, PHP or Tango. This routine recognizes ' and " quoted strings. |
char | parseTagName(): Parses a tag name. |
java.lang.String | parseValue(java.lang.String name, boolean foldCase, boolean[] isempty, int[] pdelim): Parse an attribute value. |
void | popInline(Node node): Pop a copy of an inline node from the stack. |
protected boolean | preContent(Node node): Is content acceptable for pre elements? |
void | pushInline(Node node): Push a copy of an inline node onto stack, but don't push if implicit or OBJECT or APPLET (implicit tags are ones generated from the istack). One issue arises with pushing inlines when the tag is already pushed. |
boolean | setXHTMLDocType(Node root): Adds a new xhtml doctype to the document. |
boolean | switchInline(Node element, Node node) |
void | ungetToken() |
protected void | updateNodeTextArrays(byte[] oldtextarray, byte[] newtextarray): Update oldtextarray in the current nodes. |
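The character rules quoted for isCSS1Selector in the table above can be sketched as a standalone check. This is a simplified illustration, not JTidy's implementation: the class name is hypothetical, lowercase a-z is assumed to be allowed alongside A-Z, and an escape is assumed to be a backslash followed either by one to four hex digits (a numeric code) or by any single other character.

```java
final class Css1SelectorSketch {
    /** Hypothetical re-statement of the CSS1 selector character rules. */
    static boolean isCss1Selector(String buf) {
        if (buf == null || buf.isEmpty()) {
            return false;
        }
        int i = 0;
        boolean first = true;
        while (i < buf.length()) {
            char c = buf.charAt(i);
            if (c == '\\') {
                i++; // skip the backslash
                if (i >= buf.length()) {
                    return false; // a dangling backslash escapes nothing
                }
                // Assumption: a numeric code is 1 to 4 hex digits.
                int hex = 0;
                while (i < buf.length() && hex < 4 && Character.digit(buf.charAt(i), 16) >= 0) {
                    i++;
                    hex++;
                }
                if (hex == 0) {
                    i++; // an escaped literal character
                }
            } else {
                boolean letter = (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z');
                boolean digit = c >= '0' && c <= '9';
                boolean high = c >= 161 && c <= 255;
                boolean dash = c == '-';
                if (!(letter || digit || high || dash)) {
                    return false;
                }
                if (first && (digit || dash)) {
                    return false; // cannot start with a dash or a digit
                }
                i++;
            }
            first = false;
        }
        return true;
    }
}
```

Under these assumptions, names such as `header` and `main-nav2` are accepted, while `2col`, `-x` and `a b` are rejected.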
public static final short IGNORE_WHITESPACE
public static final short MIXED_CONTENT
public static final short PREFORMATTED
public static final short IGNORE_MARKUP
protected StreamIn in
protected java.io.PrintWriter errout
protected short badAccess
protected short badLayout
protected short badChars
protected short badForm
protected short warnings
protected short errors
protected int lines
protected int columns
protected boolean waswhite
protected boolean pushed
protected boolean insertspace
protected boolean excludeBlocks
protected boolean exiled
protected boolean isvoyager
protected short versions
protected int doctype
protected boolean badDoctype
protected int txtstart
protected int txtend
protected short state
protected Node token
protected byte[] lexbuf
protected int lexlength
protected int lexsize
protected Node inode
protected int insert
protected java.util.Stack istack
protected int istackbase
protected Style styles
protected Configuration configuration
protected boolean seenEndBody
protected boolean seenEndHtml
protected Report report
protected Node root
public Lexer(StreamIn in, Configuration configuration, Report report)
Parameters:
in - StreamIn
configuration - configuration instance
report - report instance, for reporting errors

public Node newNode()
public Node newNode(short type, byte[] textarray, int start, int end)
Parameters:
type - node type: Node.ROOT_NODE | Node.DOCTYPE_TAG | Node.COMMENT_TAG | Node.PROC_INS_TAG | Node.TEXT_NODE | Node.START_TAG | Node.END_TAG | Node.START_END_TAG | Node.CDATA_TAG | Node.SECTION_TAG | Node.ASP_TAG | Node.JSTE_TAG | Node.PHP_TAG | Node.XML_DECL
textarray - array of bytes contained in the Node
start - start position
end - end position

public Node newNode(short type, byte[] textarray, int start, int end, java.lang.String element)
Parameters:
type - node type: Node.ROOT_NODE | Node.DOCTYPE_TAG | Node.COMMENT_TAG | Node.PROC_INS_TAG | Node.TEXT_NODE | Node.START_TAG | Node.END_TAG | Node.START_END_TAG | Node.CDATA_TAG | Node.SECTION_TAG | Node.ASP_TAG | Node.JSTE_TAG | Node.PHP_TAG | Node.XML_DECL
textarray - array of bytes contained in the Node
start - start position
end - end position
element - tag name

public Node cloneNode(Node node)
Parameters:
node - Node

public AttVal cloneAttributes(AttVal attrs)
Parameters:
attrs - original AttVal

protected void updateNodeTextArrays(byte[] oldtextarray, byte[] newtextarray)
Update oldtextarray in the current nodes.
Parameters:
oldtextarray - previous text array
newtextarray - new text array

public Node newLineNode()
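The field summary says that parse tree nodes span onto the lexer buffer, and updateNodeTextArrays takes a previous and a new text array. One way these two facts could fit together is sketched below: when a growable buffer is reallocated, every node that still points at the old array is repointed to the new one. This is an assumption about the mechanism, not a description of JTidy's code; `LexBufferSketch` and `Span` are hypothetical names.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class LexBufferSketch {
    /** Hypothetical stand-in for a node: a start/end span over a shared byte array. */
    static final class Span {
        byte[] textarray;
        final int start;
        final int end;

        Span(byte[] textarray, int start, int end) {
            this.textarray = textarray;
            this.start = start;
            this.end = end;
        }

        String text() {
            return new String(textarray, start, end - start, StandardCharsets.UTF_8);
        }
    }

    private byte[] lexbuf = new byte[4]; // small on purpose, to force growth
    private int lexsize = 0;             // number of bytes used
    private final List<Span> nodes = new ArrayList<>();

    /** Appends one byte, doubling the buffer when it is full. */
    void addByte(int c) {
        if (lexsize == lexbuf.length) {
            byte[] bigger = Arrays.copyOf(lexbuf, lexbuf.length * 2);
            updateNodeTextArrays(lexbuf, bigger);
            lexbuf = bigger;
        }
        lexbuf[lexsize++] = (byte) c;
    }

    /** Repoints every span that references the old array to the new one. */
    private void updateNodeTextArrays(byte[] oldtextarray, byte[] newtextarray) {
        for (Span s : nodes) {
            if (s.textarray == oldtextarray) {
                s.textarray = newtextarray;
            }
        }
    }

    /** Creates a span over the current buffer and remembers it for later updates. */
    Span newNode(int start, int end) {
        Span s = new Span(lexbuf, start, end);
        nodes.add(s);
        return s;
    }
}
```

Because spans store offsets rather than copies, only the array reference has to change after a reallocation; the start and end positions stay valid.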
public boolean endOfInput()
Returns:
true if end of input stream has been reached

public void addByte(int c)
Parameters:
c - byte to add

public void changeChar(byte c)
Parameters:
c - new char

public void addCharToLexer(int c)
Parameters:
c - char to store

public void addStringToLexer(java.lang.String str)
Parameters:
str - String to add

public void parseEntity(short mode)
Parameters:
mode - mode

public char parseTagName()
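addCharToLexer is described as storing a char as a UTF-8 encoded byte stream. The standard UTF-8 byte layout can be sketched as follows. This is a standalone illustration with a hypothetical class name, not JTidy's code, and it does not reject surrogate code points.

```java
import java.io.ByteArrayOutputStream;

final class Utf8EncodeSketch {
    /** Encodes one Unicode code point as 1 to 4 UTF-8 bytes. */
    static byte[] encode(int c) {
        if (c < 0 || c > 0x10FFFF) {
            throw new IllegalArgumentException("not a Unicode code point: " + c);
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        if (c < 0x80) {
            out.write(c);                          // 0xxxxxxx
        } else if (c < 0x800) {
            out.write(0xC0 | (c >> 6));            // 110xxxxx
            out.write(0x80 | (c & 0x3F));          // 10xxxxxx
        } else if (c < 0x10000) {
            out.write(0xE0 | (c >> 12));           // 1110xxxx
            out.write(0x80 | ((c >> 6) & 0x3F));
            out.write(0x80 | (c & 0x3F));
        } else {
            out.write(0xF0 | (c >> 18));           // 11110xxx
            out.write(0x80 | ((c >> 12) & 0x3F));
            out.write(0x80 | ((c >> 6) & 0x3F));
            out.write(0x80 | (c & 0x3F));
        }
        return out.toByteArray();
    }
}
```

A lexer that keeps its text in a byte buffer would append each of these bytes in turn, which is presumably why the API offers both addByte and addCharToLexer.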
public void addStringLiteral(java.lang.String str)
Parameters:
str - input String

public short htmlVersion()
public java.lang.String htmlVersionName()
public boolean addGenerator(Node root)
Parameters:
root - root node
Returns:
true if the tag has been added

public boolean checkDocTypeKeyWords(Node doctype)
Parameters:
doctype - doctype node

public short findGivenVersion(Node doctype)
Parameters:
doctype - doctype node

public void fixHTMLNameSpace(Node root, java.lang.String profile)
Parameters:
root - root Node
profile - current profile

public boolean setXHTMLDocType(Node root)
Parameters:
root - root node
Returns:
true if a doctype has been added

public short apparentVersion()
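The versions field is described as a bit vector of HTML versions, and apparentVersion returns the version used in the document. A bit vector of this kind is typically narrowed by AND-ing it with the set of versions that allow each construct seen. The sketch below shows that idea only; the constants, the class name, and the rule for picking a version from the remaining bits are all hypothetical and not taken from JTidy.

```java
final class VersionMaskSketch {
    // Hypothetical bit assignments, one bit per HTML version.
    static final short HTML20 = 1;
    static final short HTML32 = 2;
    static final short HTML40_STRICT = 4;
    static final short HTML40_LOOSE = 8;
    static final short ALL = HTML20 | HTML32 | HTML40_STRICT | HTML40_LOOSE;

    private short versions = ALL; // every version is a candidate at the start

    /** Keeps only the versions that permit the construct just seen. */
    void constrain(short allowedIn) {
        versions &= allowedIn;
    }

    /**
     * Assumption: report the lowest remaining bit, or 0 when no version
     * fits every construct that was seen.
     */
    short apparentVersion() {
        return (short) Integer.lowestOneBit(versions);
    }
}
```

For example, after a construct allowed in HTML 3.2 and 4.0 loose, followed by one allowed in 4.0 strict and loose, only the `HTML40_LOOSE` bit remains.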
public boolean fixDocType(Node root)
Parameters:
root - root node
Returns:
false if current version has not been identified

public boolean fixXmlDecl(Node root)
Ensure XML document starts with <?XML version="1.0"?>. Add encoding attribute if not using ASCII or UTF-8 output.
Parameters:
root - root node

public Node inferredTag(java.lang.String name)
Parameters:
name - tag name

public Node getCDATA(Node container)
Parameters:
container - container node

public void ungetToken()
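The class description says that ungetToken provides one level of undo, and the pushed field is "true after token has been pushed back". A minimal standalone sketch of that pattern follows. It uses strings in place of Node objects, and the class name is hypothetical.

```java
import java.util.Iterator;
import java.util.List;

final class PushbackSketch {
    private final Iterator<String> source;
    private String token;   // the most recently returned token
    private boolean pushed; // true after the token has been pushed back

    PushbackSketch(List<String> tokens) {
        this.source = tokens.iterator();
    }

    /** Returns the pushed-back token if there is one, else the next token, or null at end of input. */
    String getToken() {
        if (pushed) {
            pushed = false;
            return token;
        }
        token = source.hasNext() ? source.next() : null;
        return token;
    }

    /** One level of undo: the next getToken call repeats the last token. */
    void ungetToken() {
        pushed = true;
    }
}
```

Since only a flag is kept, calling `ungetToken()` twice in a row has the same effect as calling it once, which is what "one level" implies.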
public Node getToken(short mode)
Parameters:
mode - one of the following:
- MixedContent -- for elements which don't accept PCDATA
- Preformatted -- white space preserved as is
- IgnoreMarkup -- for CDATA elements such as script, style

public Node parseAsp()
A workaround for ASP in attribute values is to quote the value, for example
href='<%=rsSchool.Fields("ID").Value%>'
where the ASP that generates the attribute value is masked from Tidy by the quotemarks.

public Node parsePhp()
PHP is like ASP but is based upon XML processing instructions, e.g. <?php ... ?>.

public java.lang.String parseAttribute(boolean[] isempty, Node[] asp, Node[] php)
Parameters:
isempty - flag is passed as array so it can be modified
asp - asp Node, passed as array so it can be modified
php - php Node, passed as array so it can be modified

public int parseServerInstruction()
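Several of these methods take one-element arrays such as boolean[] isempty so that the callee can modify a value the caller reads afterwards, since Java passes primitives by value. The sketch below shows the pattern with a hypothetical helper, `readName`, which is not JTidy's parseAttribute and uses a much simpler notion of an attribute name.

```java
final class OutParamSketch {
    /**
     * Hypothetical helper: returns the leading run of letters and digits
     * from the remainder of a start tag, or null if there is none.
     * isempty[0] is set to true when the remainder ends with "/>".
     */
    static String readName(String tagRemainder, boolean[] isempty) {
        String s = tagRemainder.trim();
        isempty[0] = s.endsWith("/>"); // written through the array, visible to the caller
        int end = 0;
        while (end < s.length() && Character.isLetterOrDigit(s.charAt(end))) {
            end++;
        }
        return end == 0 ? null : s.substring(0, end);
    }
}
```

A caller allocates the array once, for example `boolean[] isempty = { false };`, passes it in, and reads `isempty[0]` after the call.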
public java.lang.String parseValue(java.lang.String name, boolean foldCase, boolean[] isempty, int[] pdelim)
Parameters:
name - attribute name
foldCase - fold case?
isempty - is attribute empty? Passed as an array reference to allow modification
pdelim - delimiter, passed as an array reference to allow modification

public static boolean isValidAttrName(java.lang.String attr)
Parameters:
attr - String to check, must be non-null
Returns:
true if attr is a valid name.

public static boolean isCSS1Selector(java.lang.String buf)
Parameters:
buf - css selector name
Returns:
true if the given string is a valid css1 selector name

public AttVal parseAttrs(boolean[] isempty)
Parameters:
isempty - is tag empty?

public void pushInline(Node node)
One issue arises with pushing inlines when the tag is already pushed. For example:
<p><em> text <p><em> more text
shouldn't be mapped to
<p><em> text </em></p><p><em><em> more text </em></em>
Parameters:
node - Node to be pushed

public void popInline(Node node)
Parameters:
node - Node to be popped

public boolean isPushed(Node node)
Parameters:
node - Node
Returns:
true if the node is found in the stack

public int inlineDup(Node node)
Inserts "missing" inline elements around the contents of block-level elements. For example,
<i><h1>italic heading</h1></i>
is treated as equivalent to
<h1><i>italic heading</i></h1>
This is implemented by setting the lexer into a mode where it gets tokens from the inline stack rather than from the input stream.
Parameters:
node - original node

public Node insertedToken()
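The inline stack methods above (pushInline, popInline, inlineDup, insertedToken) can be modelled in a few lines to show how tokens might be replayed from the stack instead of being read from the input. This is a simplified sketch using tag names as strings. The class name, the `hasInserted` helper, and the rule of never pushing a tag that is already on the stack are assumptions, not JTidy's actual behaviour.

```java
import java.util.ArrayList;
import java.util.List;

final class InlineStackSketch {
    private final List<String> istack = new ArrayList<>();
    private int insert = -1; // index of the next entry to replay, -1 when idle

    /** Assumption: a tag that is already pushed is not pushed again. */
    void pushInline(String tag) {
        if (!istack.contains(tag)) {
            istack.add(tag);
        }
    }

    /** Removes the most recent entry for the tag, if any. */
    void popInline(String tag) {
        int i = istack.lastIndexOf(tag);
        if (i >= 0) {
            istack.remove(i);
        }
    }

    /** Switches to replay mode and returns how many inline tags will be replayed. */
    int inlineDup() {
        if (istack.isEmpty()) {
            return 0;
        }
        insert = 0;
        return istack.size();
    }

    /** True while tokens should come from the stack rather than the input. */
    boolean hasInserted() {
        return insert >= 0;
    }

    /** Returns the next replayed tag and leaves replay mode after the last one. */
    String insertedToken() {
        if (insert < 0) {
            throw new IllegalStateException("not in replay mode");
        }
        String tag = istack.get(insert++);
        if (insert >= istack.size()) {
            insert = -1;
        }
        return tag;
    }
}
```

In the `<i><h1>italic heading</h1></i>` case, `i` would be on the stack when the heading opens, so a parser built on this sketch would call `inlineDup()` and then receive `i` from `insertedToken()` as the first token inside the heading.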
public boolean canPrune(Node element)
Parameters:
element - node
Returns:
true if the element can be removed

public void fixId(Node node)
Parameters:
node - Node to check for name/id attributes

public void deferDup()

protected boolean preContent(Node node)
Parameters:
node - content
Returns:
true if node is acceptable in pre elements