We desire to create a pushdown automaton (PDA) that uses the following "alphabet" (by alphabet I mean a unique set of symbol strings/keys):
+aaa
+bbb
+ccc
+ddd
+eee
-aaa
-bbb
-ccc
-ddd
-eee
The "symbols" (3-letter sequences with a + or - prefix) in this alphabet are used to make trees. Here are some example trees:
+eee-eee
+aaa+bbb-bbb-aaa
+bbb+aaa+ccc+eee-eee-ccc+ccc-ccc+ddd+ccc+eee-eee-ccc-ddd-aaa-bbb
Visualized as an actual tree, they would be more like:
+eee
-eee

+aaa
  +bbb
  -bbb
-aaa

+bbb
  +aaa
    +ccc
      +eee
      -eee
    -ccc
    +ccc
    -ccc
    +ddd
      +ccc
        +eee
        -eee
      -ccc
    -ddd
  -aaa
-bbb
So given this alphabet and these example trees, the question is how you would write a generic pushdown automaton to parse these trees. The rules are:
- Any letter pair (open/close pair) can have any number of nested children, and it doesn't matter what letter pairs are nested as children.
How would you write a pushdown automaton in JavaScript to parse a string into an AST?
By this I mean the implementation must literally have a stack, states, and transitions; not an ad-hoc parser, and not even a recursive descent parser. It should be an iterative while loop stepping through transitions somehow, not using recursion.
The output should be a very simple "AST" that just looks like this (for the +aaa+bbb-bbb-aaa tree):
{
  "type": "aaa",
  "children": [
    {
      "type": "bbb",
      "children": []
    }
  ]
}
I am wondering how to build a PDA so I can, in my specific case working on a custom programming language, convert a rather involved and complex AST I am working on and parse it into an object graph. That question is too complicated to write out on SO, and too hard to simplify, which is why I am asking here about a very basic PDA implementation in JavaScript.
I am interested to see how you keep track of the context at each branch point when doing the PDA algorithm and what the transitions look like.
Note: I've heard of / seen "2-state" PDAs occasionally mentioned here and there. It sounds like you have one state to do the computation, and one "final" state. Maybe even just a single state. If it's possible to do it sort of like this, then that would be cool. If not, don't worry about it.
If necessary, you can first write a Context Free Grammar and use that to dynamically construct a PDA, but the key question here is how the PDA is technically implemented. I think it's probably possible / straightforward to just write each transition function by hand, but I'm not entirely sure. Either way works for me.
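For concreteness, here is a rough sketch (my own notation, nothing official) of what such a grammar could look like, written as a plain JavaScript object; the nonterminal names Tree and Forest are placeholders I made up:
// Hypothetical CFG for the +xxx ... -xxx language, as a JS data structure.
// Tree   -> "+xxx" Forest "-xxx"   (one production per type xxx)
// Forest -> Tree Forest | ε
const types = ["aaa", "bbb", "ccc", "ddd", "eee"];
const grammar = {
  Tree: types.map(t => ["+" + t, "Forest", "-" + t]),
  Forest: [["Tree", "Forest"], []] // [] is the empty (ε) production
};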
- Are you after the JS code for a general deterministic PDA? Or are you after the PDA definition for your grammar? Or both? The former is likely readily available (it's just a transition map and a stack). The latter is just fancier parenthesis matching, also readily available. What specifically are you after and what have you tried and where are you stuck? – Welbog Commented May 28, 2021 at 10:40
- 1 "How would you write a pushdown automaton in JavaScript to parse a string into an AST?" - why are you trying to write a PDA? Why not have a look at parsers instead? You'd need to change the theoretical formalism of the PDA if you want to output something. – Bergi Commented May 28, 2021 at 12:43
- 2 So basically you're trying to reinvent XSLT? Or a parser generator? I don't see how this question of "How to parse a (tagged) parenthesis tree?" does help with that. – Bergi Commented May 28, 2021 at 13:00
- 1 It sounds like you're basically trying to design a piler. This is a huge undertaking. Look into high-level articles on piler design, specifically the main steps: tokenization, parsing and creating an execution tree. – Welbog Commented May 28, 2021 at 13:03
- 1 You may find Reg Braithwaite's related article interesting reading. – Scott Sauyet Commented May 28, 2021 at 19:34
5 Answers
It is not possible to create a PDA for producing this tree data structure because:
- a PDA has a finite number of states and stack symbols, and no output tape. Yet the number of trees that this PDA should be able to somehow represent (and make available) is infinite. If we consider the nodes of a tree in an object-oriented structure, then there are an infinite number of different nodes possible, as a node's identity is determined also by the references it has. This is in conflict with the finite set of symbols that a PDA has. If as alternative we would not go for an object-oriented output, then we gain nothing: the input already is a serialized representation of the tree.
- Even if you would add an output tape to this automaton, making it a rewrite system, you wouldn't gain much, since the input is already representing the tree in a serialized format. If I understood correctly, you are not interested in serialized output (like JSON, which would be trivial as the output order is the same as the input order), but structured output (JavaScript-like objects).
So, as others have also noted, we need to loosen the rules somehow. We can think of several (combinations of) ways to do that:
- Allow for an infinite number of states, or
- Allow for an infinite stack alphabet, or
- Allow the PDA to have access to more of the stack than just its top element
- Let the transition table refer to a particular attribute of the stack's top element, not the element as a whole -- allowing infinite possibilities for other attributes of the stacked elements, while ensuring that the values of this particular attribute belong to a finite set.
- Keep the tree building outside of the PDA
- ...Any other measures for solving the incompatibility of a PDA with the requirement to produce a structured tree.
Some of these options would imply an infinite transition table, in which case it cannot really be a table, but could be a function.
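For example (a quick sketch of that idea, not part of the implementation further below), such a function could accept any "+xxx"/"-xxx" pair without enumerating them:
// Transition *function* instead of a (potentially infinite) transition table:
// it accepts any "+xxx" opening tag and matches any "-xxx" closing tag
// against the value on top of the stack.
function transition(input, stackTop) {
    if (input.startsWith("+")) return { action: "push", value: input.slice(1) };
    if (input.startsWith("-") && input.slice(1) === stackTop) return { action: "pop" };
    return null; // no transition available: reject
}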
Implementation choices
1. For the generic engine
Considering your focus on tree-building, and the requirement to have "a stack, states, and transitions", I went for the following implementation choices, and sometimes simplifications:
- I take strategy 4 from the above list
- The generic engine will itself create the tree. So it is only generic in a limited sense. It can only validate and generate trees.
- It will (only) take a transition table as configuration input
- It has only two states (boolean), which indicate whether all is OK so far, or not.
- The OK-state can only go from true to false when there is no matching transition in the transition table
- Once the OK-state is false, the engine will not even attempt to find a transition, but will just ignore any further input.
- By consequence the transition table will not include the current state and future state parts, as they are both implied to be true (i.e. "OK").
- The tree will be encoded in the stack. The "data" property of a stack element will be the special attribute that serves to identify transitions. The "children" attribute of each stack element will serve to define the tree and is not considered part of the stack alphabet that the transition table targets.
- The stack starts out with one element. Its "children" attribute will be the output of the engine. At every step this output can be consulted, since this element is never removed from the stack.
- The input symbols cannot include the empty string, which is reserved for signalling the end of the input stream. This allows the engine to give feedback on whether that is a valid moment to end the input stream.
- The engine requires that its stack is "empty" (has just 1 entry) when the end of the input is indicated.
- I assumed that input symbols do not contain spaces: the space is used as a delimiter in the transition lookup algorithm
- The given transition table is converted to a (hash)map to allow for fast lookup given the current state and the data at the top of the stack. The keys in this map are concatenations of input and stack data values, and it is here that the space was chosen as delimiter. If a space is a valid character in an input symbol, then another encoding mechanism should be selected for serializing this key-pair (e.g. JSON), but I wanted to keep this simple.
- The engine does not need to get the set of input symbols, nor the set of stack symbols as configuration input, as these will be implied from the transition table.
- The engine's initial state is set to OK (it could be set differently by simple property assignment, but that would be useless as it would then ignore input)
2. For the specific +/- input format
The more specific part of the solution deals with the actual input format you have given in the question. One function will convert the set of types to a transition table, and another will tokenize the input based on the "+" and "-" characters, but without any validation (no symbol check, nor symbol length check, ...), as this will all surface as an error anyway when the engine is called.
Any white space in the input is ignored.
Implementation
// To peek the top element of a stack:
Array.prototype.peek = function () { return this[this.length - 1] };

function createParser(transitions) {
    function createNode(data) {
        return {
            data,
            children: []
        };
    }
    function addChild(parentNode, data) {
        const childNode = createNode(data);
        parentNode.children.push(childNode);
        return childNode;
    }
    let isOK = true; // It's a 2-state (boolean) engine. Transitions implicitly only apply to OK-state
    const stack = [createNode("")]; // Stack is private, and always has Sentinel node with value ""
    // Create a map for the transitions table, for faster lookup
    // We use space as a delimiter for the key pair, assuming an input symbol does not include spaces.
    const transMap = new Map(transitions.map(({whenInput, whenStack, thenPushValue}) =>
        [whenInput + " " + whenStack, thenPushValue]
    ));
    const parser = {
        read(input) { // Returns true when parser can accept more input after this one
            // Once the engine is in an error state, it will not process any further inputs
            if (!isOK) {
                return false;
            }
            // Consider the empty string as the end-of-input marker
            if (input === "") {
                // Even when state is OK, the stack should now also be "empty"
                isOK &&= stack.length === 1;
                return false; // Not an error condition, but indication that no more input is expected
            }
            // Transitions do not include state requirements, nor new state definitions.
            // It is assumed that a transition can only occur in an OK state, and that all
            // included transitions lead to an OK state.
            const pushValue = transMap.get(input + " " + stack.peek().data);
            if (pushValue === undefined) { // No matching transition in the table implies that state is not OK
                isOK = false;
            } else {
                // As this is a two-state engine, with defined transitions only between OK states,
                // each defined transition will affect the stack: so it's either a push or pop.
                // A push is identified by the (non-empty) value to be pushed. An empty string denotes a pop.
                if (pushValue) {
                    stack.push(addChild(stack.peek(), pushValue));
                } else {
                    stack.pop();
                }
            }
            return isOK;
        },
        get isOK() { return isOK }, // Expose the (boolean) state via a getter, so it reflects the current state
        output: stack[0].children // Don't expose the stack, but just the forest encoded in it
    };
    return parser;
}

function createTransition(whenInput, whenStack, thenPushValue) {
    return {whenInput, whenStack, thenPushValue}; // First two values imply the third
}

// Functions specific for the input format in the question:

function createTransitionTable(types) {
    // Specific to the input structure (with + and - tags) given in the question
    // An empty string in the second column represents an empty stack
    return [
        // Transitions for opening tags: O(n²)
        ...["", ...types].flatMap(stack =>
            types.map(type => createTransition("+" + type, stack, type))
        ),
        // Transitions for closing tags
        ...types.map(type => createTransition("-" + type, type, ""))
    ];
}

function tokenizer(str) { // Could be a generator, but I went for a function-returning function
    str = str.replace(/\s+/g, ""); // remove white space from input string
    let current = 0; // Maintain state between `getNextToken` function calls
    function getNextToken() {
        const start = current++;
        while (current < str.length && str[current] !== "+" && str[current] !== "-") {
            current++;
        }
        const token = str.slice(start, current);
        console.log("read", token); // For debugging
        return token;
    }
    return getNextToken;
}

// Example input language, as defined in the question:
const types = ["aaa", "bbb", "ccc", "ddd", "eee"];
const transitionTable = createTransitionTable(types);
const parser = createParser(transitionTable);

// Example input for it:
const rawInput = `
    +eee-eee
    +aaa+bbb-bbb-aaa
    +bbb+aaa+ccc+eee-eee-ccc+ccc-ccc+ddd+ccc+eee-eee-ccc-ddd-aaa-bbb`;

const getNextToken = tokenizer(rawInput);

// Process tokens
while (parser.read(getNextToken())) {}

console.log("Parser state is OK?: ", parser.isOK);
console.log("Parser output:\n");
console.log(JSON.stringify(parser.output, null, 3));
I should note at the outset that I'm no computer scientist and have no real experience writing compiler code. So there may be glaring holes in the implementation or even the basic ideas. But if you want the thoughts of a work-a-day programmer who found this an interesting problem, here they are.
We can write a pda function that simply recognizes our grammar, one that we can use like this. (Here we go only from aaa to ccc, but you could easily extend it to eee or whatever.)
const {push: PUSH, pop: POP} = pda

const myParser = pda ('S', ['S'], [
//                     ^     ^
//                     |     `------------------- accepting states
//                     +------------------------- initial state

//  +----------------------------------------- current state
//  |    +------------------------------------ token
//  |    |       +---------------------------- top of stack
//  |    |       |     +---------------------- new state
//  |    |       |     |    +----------------- stack action
//  V    V       V     V    V
  [ 'S', '+aaa', '',   'A', PUSH ('A') ],
  [ 'S', '+bbb', '',   'B', PUSH ('B') ],
  [ 'S', '+ccc', '',   'C', PUSH ('C') ],
  [ 'A', '+aaa', '',   'A', PUSH ('A') ],
  [ 'A', '-aaa', 'AA', 'A', POP        ],
  [ 'A', '-aaa', 'BA', 'B', POP        ],
  [ 'A', '-aaa', 'CA', 'C', POP        ],
  [ 'A', '-aaa', '',   'S', POP        ],
  [ 'A', '+bbb', '',   'B', PUSH ('B') ],
  [ 'A', '+ccc', '',   'C', PUSH ('C') ],
  [ 'B', '+aaa', '',   'A', PUSH ('A') ],
  [ 'B', '+bbb', '',   'B', PUSH ('B') ],
  [ 'B', '-bbb', 'AB', 'A', POP        ],
  [ 'B', '-bbb', 'BB', 'B', POP        ],
  [ 'B', '-bbb', 'CB', 'C', POP        ],
  [ 'B', '-bbb', '',   'S', POP        ],
  [ 'B', '+ccc', '',   'C', PUSH ('C') ],
  [ 'C', '+aaa', '',   'A', PUSH ('A') ],
  [ 'C', '+bbb', '',   'B', PUSH ('B') ],
  [ 'C', '+ccc', '',   'C', PUSH ('C') ],
  [ 'C', '-ccc', 'AC', 'A', POP        ],
  [ 'C', '-ccc', 'BC', 'B', POP        ],
  [ 'C', '-ccc', 'CC', 'C', POP        ],
  [ 'C', '-ccc', '',   'S', POP        ],
])
And we would use it to test a series of tokens, like this:
myParser (['+aaa', '-aaa']) //=> true
myParser (['+aaa', '-bbb']) //=> false
myParser (['+aaa', '+bbb', '+ccc', '-ccc', '+aaa', '-aaa', '-bbb', '-aaa']) //=> true
This does not stick exactly to the mathematical definition of a PDA. We don't have a symbol to delineate the beginning of the stack, and we test the top two values of the stack, not just the top one. But it's reasonably close.
However, this just reports whether a string of tokens is in the grammar. You want something more than that. You need to use this to build a syntax tree. It's very difficult to see how to do this in the abstract. But it's easy enough to generate a sequence of events from that parsing that you could use. One approach would be just to capture the new node value at every push to the stack and capture every pop from the stack.
With that, we might tie it to something like this:
const forestBuilder = () => { // multiple-rooted, so a forest not a tree
  const top = (xs) => xs [ xs .length - 1 ]
  const forest = {children: []}
  let stack = [forest]
  return {
    push: (name) => {
      const node = {name: name, children: []}
      top (stack) .children .push (node)
      stack.push (node)
    },
    pop: () => stack.pop(),
    end: () => forest.children
  }
}
const {push, pop, end} = forestBuilder ()
push ('aaa')
push ('bbb')
pop ()
push ('ccc')
push ('aaa')
pop()
pop()
pop()
push ('bbb')
push ('aaa')
end()
which would yield something like this:
[
  {
    "name": "aaa",
    "children": [
      {
        "name": "bbb",
        "children": []
      },
      {
        "name": "ccc",
        "children": [
          {
            "name": "aaa",
            "children": []
          }
        ]
      }
    ]
  },
  {
    "name": "bbb",
    "children": [
      {
        "name": "aaa",
        "children": []
      }
    ]
  }
]
So if we supply our pda function with some event listeners for the pushes and pops (also for completion and errors), we might be able to build your tree from a series of tokens.
Here is one attempt to do this:
console .clear ()

const pda = (() => {
  const PUSH = Symbol(), POP = Symbol()
  const id = (x) => x
  return Object .assign (
    (start, accepting, transitions) =>
      (tokens = [], onPush = id, onPop = id, onComplete = id, onError = () => false) => {
        let stack = []
        let state = start
        for (let token of tokens) {
          const transition = transitions .find (([st, tk, top]) =>
            st == state &&
            tk == token &&
            (top .length == 0 || stack .slice (-top.length) .join ('') == top)
          )
          if (!transition) {
            return onError (token, stack)
          }
          const [, , , nst, action] = transition
          state = nst
          action (stack)
          if (action [PUSH]) {onPush (token)}
          if (action [POP]) {onPop ()}
        }
        return onComplete (!!accepting .includes (state))
      },
    {
      push: (token) => Object.assign((stack) => stack .push (token), {[PUSH]: true}),
      pop: Object.assign ((stack) => stack .pop (), {[POP]: true}),
    }
  )
})()
const {push: PUSH, pop: POP} = pda

const myParser = pda ('S', ['S'], [
//                     ^     ^
//                     |     `------------------- accepting states
//                     +------------------------- initial state

//  +----------------------------------------- current state
//  |    +------------------------------------ token
//  |    |       +---------------------------- top of stack
//  |    |       |     +---------------------- new state
//  |    |       |     |    +----------------- stack action
//  V    V       V     V    V
  [ 'S', '+aaa', '',   'A', PUSH ('A') ],
  [ 'S', '+bbb', '',   'B', PUSH ('B') ],
  [ 'S', '+ccc', '',   'C', PUSH ('C') ],
  [ 'A', '+aaa', '',   'A', PUSH ('A') ],
  [ 'A', '-aaa', 'AA', 'A', POP        ],
  [ 'A', '-aaa', 'BA', 'B', POP        ],
  [ 'A', '-aaa', 'CA', 'C', POP        ],
  [ 'A', '-aaa', '',   'S', POP        ],
  [ 'A', '+bbb', '',   'B', PUSH ('B') ],
  [ 'A', '+ccc', '',   'C', PUSH ('C') ],
  [ 'B', '+aaa', '',   'A', PUSH ('A') ],
  [ 'B', '+bbb', '',   'B', PUSH ('B') ],
  [ 'B', '-bbb', 'AB', 'A', POP        ],
  [ 'B', '-bbb', 'BB', 'B', POP        ],
  [ 'B', '-bbb', 'CB', 'C', POP        ],
  [ 'B', '-bbb', '',   'S', POP        ],
  [ 'B', '+ccc', '',   'C', PUSH ('C') ],
  [ 'C', '+aaa', '',   'A', PUSH ('A') ],
  [ 'C', '+bbb', '',   'B', PUSH ('B') ],
  [ 'C', '+ccc', '',   'C', PUSH ('C') ],
  [ 'C', '-ccc', 'AC', 'A', POP        ],
  [ 'C', '-ccc', 'BC', 'B', POP        ],
  [ 'C', '-ccc', 'CC', 'C', POP        ],
  [ 'C', '-ccc', '',   'S', POP        ],
])
const forestBuilder = () => {
  const top = (xs) => xs [ xs .length - 1 ]
  const forest = {children: []}
  let stack = [forest]
  return {
    push: (name) => {
      const node = {name: name .slice (1), children: []}
      top (stack) .children .push (node)
      stack.push (node)
    },
    pop: () => stack.pop(),
    end: () => forest.children
  }
}

const {push, pop, end} = forestBuilder ()

console .log (myParser (
  ["+ccc", "-ccc", "+aaa", "+bbb", "-bbb", "-aaa", "+bbb", "+aaa", "+ccc", "+bbb", "-bbb", "-ccc", "+ccc", "-ccc", "+aaa", "+ccc", "+ccc", "-ccc", "-ccc", "-aaa", "-aaa", "-bbb"],
  push,
  pop,
  (accepted) => accepted ? end () : 'Error: ill-formed',
  (token, stack) => `Error: token = '${token}', stack = ${JSON.stringify(stack)}`
))
There are lots of ways this could go. Perhaps the opening event should contain not only the token but the value pushed on the stack. There might be a good way to generate that table of transitions from a more declarative syntax. We might want a different version of the stack action column, one that takes strings instead of functions. And so on. But it still might be a decent start.
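For instance, the table could be generated from a list of types along these lines (my own sketch, reusing the PUSH/POP helpers from above; not something I have polished):
// Generate the transition table for the pda function from a list of types.
// State names are the first letter of each type, uppercased ('aaa' -> 'A').
const makeTransitions = (types) => {
  const stateOf = (t) => t [0] .toUpperCase ()
  const rows = []
  for (const from of ['S', ...types .map (stateOf)]) {
    for (const t of types) {
      rows .push ([from, '+' + t, '', stateOf (t), PUSH (stateOf (t))]) // any state may open any tag
    }
  }
  for (const t of types) {
    for (const under of types) {
      rows .push ([stateOf (t), '-' + t, stateOf (under) + stateOf (t), stateOf (under), POP])
    }
    rows .push ([stateOf (t), '-' + t, '', 'S', POP]) // nothing underneath: back to the start state
  }
  return rows
}

// const myParser = pda ('S', ['S'], makeTransitions (['aaa', 'bbb', 'ccc']))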
In pseudocode, a DPDA can be implemented like this:
transitions <- map from (state, char, char) to (state, char[0-2])
stack <- stack initialized with a sentinel value 'Z'
input <- string of tokens to parse
currentState <- initial state
for each inputCharacter in input:
    stackCharacter <- stack.pop()
    currentState, charsToPush <- transitions[(currentState, inputCharacter, stackCharacter)]
    if currentState is not valid:
        return false
    for charToPush in charsToPush.reverse():
        stack.push(charToPush)
return (currentState in acceptingStates) and (stack contains only 'Z')
A PDA such as parenthesis matching is specified like this:
transitions <- {
    (0, '(', 'Z') -> (0, "(Z"),
    (0, '(', '(') -> (0, "(("),
    (0, ')', 'Z') -> nope,
    (0, ')', '(') -> (0, ""),
}
acceptingStates <- [0]
initialState <- 0
Note that the above is deterministic. General PDAs are nondeterministic and not all context-free languages can be decided by DPDAs. Yours can, but care has to be taken in how you specify transitions.
To make it more general (nondeterministic) the transition map needs to map to a list of (state, char[]) tuples instead of just one; each step in the loop needs to consider all matching tuples instead of just one.
Does that help?
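For what it's worth, a rough JavaScript rendering of that pseudocode (my own sketch; the string keys for the transition map are just one possible encoding) could look like this, using the parenthesis-matching transitions above:
// Deterministic PDA runner following the pseudocode above.
function runDPDA(transitions, acceptingStates, initialState, input) {
  const stack = ['Z'];                      // 'Z' is the bottom-of-stack sentinel
  let state = initialState;
  for (const ch of input) {
    const top = stack.pop();
    const result = transitions[`${state},${ch},${top}`];
    if (result === undefined) return false; // no transition: reject
    const [nextState, toPush] = result;
    state = nextState;
    for (const c of [...toPush].reverse()) stack.push(c);
  }
  return acceptingStates.includes(state) && stack.length === 1 && stack[0] === 'Z';
}

// Parenthesis matching, as specified above:
const parenTransitions = {
  '0,(,Z': [0, '(Z'],
  '0,(,(': [0, '(('],
  '0,),(': [0, ''],
};
console.log(runDPDA(parenTransitions, [0], 0, '(()())')); // true
console.log(runDPDA(parenTransitions, [0], 0, '(()'));    // false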
For your grammar specifically, your tokens are these "+aaa", "-aaa" things. Your alphabet is finite but very large, so you don't want to have to specify everything in your transition map. So here's a decision you have to make: do you want a pure PDA (fully specified map) or do you want to code a PDA-like thing that isn't quite a PDA to avoid that?
If the latter, you want to add a check in your loop that matches the identity of the + and - tokens. You might as well just write your own code at this point, because creating a generic parser that can handle everything is a ton of work. It's easier to just write a parser for your specific needs.
And this is why people have created libraries like flap.js, because this stuff is complicated.
To elaborate on a comment I made, if you make the transition function arbitrary instead of a map, you can express your language that way.
transition <- arbitrary function taking input (state, token, token) and output (state, token[0-2])
stack <- stack initialized with a sentinel token 'Z'
input <- string of tokens to parse
currentState <- initial state
for each inputToken in input:
    stackToken <- stack.pop()
    currentState, tokensToPush <- transition(currentState, inputToken, stackToken)
    if currentState is not valid:
        return false
    for tokenToPush in tokensToPush.reverse():
        stack.push(tokenToPush)
return (currentState in acceptingStates) and (stack contains only 'Z')
Define transition like this:
function transition(state, input, stack):
    if (input.type = +)
        return (0, [input, stack])
    else if (input.id = stack.id)
        return (0, [])
    else
        return nope
Where your tokens have a "type" (+ or -) and an "id" ("aaa", "bbb", etc).
You have to be careful with an arbitrary transition function, as the solution is now less constrained and more likely to accidentally not be context-free.
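A rough JavaScript sketch of that variant (my own code, with tokens as { type, id } objects rather than anything this answer prescribes):
const SENTINEL = { id: 'Z' }; // bottom-of-stack marker

function transition(state, input, stackTop) {
  if (input.type === '+') return [0, [input, stackTop]]; // keep stackTop, push the open tag
  if (input.id === stackTop.id) return [0, []];          // matching close tag: net pop
  return null;                                           // no valid move: reject
}

function accepts(tokens) {
  const stack = [SENTINEL];
  let state = 0;
  for (const input of tokens) {
    const result = transition(state, input, stack.pop());
    if (result === null) return false;
    const [nextState, toPush] = result;
    state = nextState;
    for (const t of [...toPush].reverse()) stack.push(t);
  }
  return state === 0 && stack.length === 1 && stack[0] === SENTINEL;
}

// "+aaa+bbb-bbb-aaa"
console.log(accepts([
  { type: '+', id: 'aaa' }, { type: '+', id: 'bbb' },
  { type: '-', id: 'bbb' }, { type: '-', id: 'aaa' },
])); // true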
The solution below works in three stages:
- The input string is tokenized into sequences and prefixes
- The tokenized result, as an array, is converted to an AST via transition rules
- The AST is converted to the desired JSON output.
First, defining the tokens, transitions, and tokenization function:
class Token{
  constructor(vals, t_type){
    this.vals = vals;
    this.t_type = t_type
  }
  type_eq(t_type){
    return this.t_type === t_type
  }
  static val_eq(t1, t2){
    //check if two tokens objects t1 and t2 represent a +tag and a -tag
    return t1.t_type === 'o_tag' && t2.t_type === 'c_tag' && t1.vals[1].vals[0] === t2.vals[1].vals[0]
  }
}
var lexer = [[/^\-/, 'minus'], [/^\+/, 'plus'], [/^[a-z]+/, 'label']];
var transitions = [
  {'pattern':['plus', 'label'], 'result':'o_tag', 't_eq':false},
  {'pattern':['minus', 'label'], 'result':'c_tag', 't_eq':false},
  {'pattern':['o_tag', 'c_tag'], 'result':'block', 't_eq':true},
  {'pattern':['o_tag', 'block', 'c_tag'], 'result':'block', 't_eq':true},
  {'pattern':['block', 'block'], 'result':'block', 't_eq':false}
]
function* tokenize(s){
  //tokenize an input string `s`
  //@yield Token object
  while (s.length){
    for (var [p, t] of lexer){
      var m = s.match(p)
      if (m){
        yield (new Token([m[0]], t))
        s = s.substring(m[0].length)
        break
      }
    }
  }
}
Next, defining a function that takes in the tokenized string and runs an iterative shift-reduce on a stack to build the AST from the bottom up:
function pattern_match(stack, pattern){
  //takes in the stack from `shift_reduce` and attempts to match `pattern` from the back
  if (pattern.length > stack.length){
    return false
  }
  return Array.from(Array(pattern.length).keys()).every(x => stack[stack.length-1-x].type_eq(pattern[pattern.length - 1-x]))
}
function shift_reduce(tokens){
  //consumes `tokens` until empty and returns the resulting tree if in a valid state
  var stack = []
  while (true){
    //per your comment, the line below displays the contents of the stack at each iteration
    console.log(stack.map(x => x.t_type))
    if (!stack.length){
      //stack is empty, push a token on to it
      stack.push(tokens.shift())
    }
    var f = false;
    for (var {pattern:p, result:r, t_eq:tq} of transitions){
      //try to match patterns from `transitions`
      if (pattern_match(stack, p)){
        var new_vals = p.map(_ => stack.pop()).reverse();
        if (!tq || Token.val_eq(new_vals[0], new_vals[new_vals.length-1])){
          //match found
          f = true
          stack.push((new Token(new_vals, r)))
          break
        }
        else{
          while (new_vals.length){
            stack.push(new_vals.shift())
          }
        }
      }
    }
    if (!f){
      if (!tokens.length){
        //no more tokens to consume, return root of the token tree.
        if (stack.length > 1){
          //stack was not reduced to a single token, thus an invalid state
          throw new Error('invalid state')
        }
        return stack[0]
      }
      //no match found, push another token from `tokens` onto the stack
      stack.push(tokens.shift())
    }
  }
}
Lastly, a function to convert the AST to JSON:
function* to_json(tree){
  if (tree.vals.every(x => x.t_type === 'block')){
    for (var i of tree.vals){
      yield* to_json(i)
    }
  }
  else{
    yield {'type':tree.vals[0].vals[1].vals[0], ...(tree.vals.length === 2 ? {} : {'children':[...to_json(tree.vals[1])]})}
  }
}
Putting it all together:
function to_tree(s){
  var tokens = [...tokenize(s)] //get the tokenized string
  var tree = shift_reduce(tokens) //build the AST from the tokens
  var json_tree = [...to_json(tree)] //convert AST to JSON
  return json_tree
}
console.log(to_tree('+eee-eee'))
console.log(to_tree('+aaa+bbb-bbb-aaa'))
console.log(to_tree('+bbb+aaa+ccc+eee-eee-ccc+ccc-ccc+ddd+ccc+eee-eee-ccc-ddd-aaa-bbb'))
Output:
[
  {
    "type": "eee"
  }
]
[
  {
    "type": "aaa",
    "children": [
      {
        "type": "bbb"
      }
    ]
  }
]
[
  {
    "type": "bbb",
    "children": [
      {
        "type": "aaa",
        "children": [
          {
            "type": "ccc",
            "children": [
              {
                "type": "eee"
              }
            ]
          },
          {
            "type": "ccc"
          },
          {
            "type": "ddd",
            "children": [
              {
                "type": "ccc",
                "children": [
                  {
                    "type": "eee"
                  }
                ]
              }
            ]
          }
        ]
      }
    ]
  }
]
I agree with @trincot's answer (except for his claim that it isn't a PDA).
I'm not sure about the complicated pattern, but the simple one you have is nearly trivial to build a machine for. It is a DCFG (deterministic context free grammar), i.e. it is the intersection of a regular expression and a Dyck (parenthesis matching) machine. All DCFGs encode a PDA. Thus, my disagreement where he says it isn't a PDA.
There is a regular-expression part, because your "tokens" (the parentheses) are not a single character long, so you need to write a regular expression that turns those character sequences into single-symbol tokens: +aaa -> one token, e.g. '(', -aaa -> another token ')', +bbb -> still another '[', ... Note the characters I picked for the tokens aren't arbitrary (although they could be) but are chosen to help you visualize this as parenthesis matching. Note how they pair up.
Your list of tokens will be finite (your strings, while unbounded, are still finite). And there will be two (or three) types of tokens: left parens (brackets, etc.), right parens, and things that are neither (i.e. that don't need to match). On a left paren, you push something on the stack. On a right paren, you pop the stack. On a neither, you either ignore the stack or push and pop both--both models work.
The FSM that runs the machine needs one state for each pair. On a push you enter that state and that tells you what kind of token you need to see to pop it. If you see a different popping token, you have an error.
Now, as long as your token types are easily divided into these three types of tokens, the problem is trivial. If your tokens aren't, for example if you are looking for palindromes without a mid-point token (i.e. some token can be both a left paren and a right paren and you cannot tell from left context which it is), the problem becomes non-deterministic and you will need to implement a GLR-type parser that keeps a parse forest of alternatives that are still candidates (and, if the input is ambiguous, ends up with more than one possible tree).
However, I think if you are trying to parse ASTs, you won't have that issue. You really have a very simplified version of the SLR (most basic LR parsing algorithm) at the parens level. And the conversion of sequences to regular expressions is also likely to be trivial, because they will just be a set of fixed strings.
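As a rough sketch of that idea (my own code, not something this answer provides): lex the fixed strings into open/close tokens, push on an opening token, and pop-and-compare on a closing one.
// Dyck-style matcher over "+xxx"/"-xxx" tokens: the pushed value records which
// closing token is expected, playing the role of a "state per pair".
function matches(input, types = ['aaa', 'bbb', 'ccc', 'ddd', 'eee']) {
  const tokenRe = new RegExp(`[+-](?:${types.join('|')})`, 'g'); // the "regular" lexing layer
  const tokens = input.match(tokenRe) ?? [];
  const stack = [];
  for (const tok of tokens) {
    const kind = tok[0], type = tok.slice(1);
    if (kind === '+') stack.push(type);          // left paren: push
    else if (stack.pop() !== type) return false; // right paren: must match the top
  }
  return stack.length === 0;
}

console.log(matches('+aaa+bbb-bbb-aaa')); // true
console.log(matches('+aaa-bbb'));         // false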