regex - How to implement Lexical Analysis in Javascript - Stack Overflow

Hey folks, thanks for readingI am currently attempting to do a Google-style calculator. You input a str

Hey folks, thanks for reading

I am currently attempting to do a Google-style calculator. You input a string, it determines if it can be calculated and returns the result.

I began slowly with the basics : + - / * and parenthesis handling.

I am willing to improve the calculator over time, and having learned a bit about lexical analysis a while ago, I built a list of tokens and associated regular expression patterns.

This kind of work is easily applicable with languages such as Lex and Yacc, except I am developping a Javascript-only application.

I tried to transcript the idea into Javascript but I can't figure out how to handle everything in a clean and beautiful way, especially nested parenthesis.


Analysis

Let's define what a calculator query is:

// NON TERMINAL EXPRESSIONS //
query     -> statement
query     -> ε // means end of query

statement -> statement operator statement
statement -> ( statement )
statement -> prefix statement
statement -> number

number    -> integer
number    -> float

// TERMINAL EXPRESSIONS //
operator  -> [+*/%^-]

prefix    -> -

integer   -> [0-9]+

float     -> [0-9]+[.,][0-9]+

Javascript

Lexical Analysis consists in verifying there is nothing that doesn't look like one of the terminal expressions : operator, prefixes, integer and float. Which can be shortened to one regular expression:

(I added spaces to make it more readable)

var calcPat = 
/^ (\s*
    ( ([+/*%^-]) | ([0-9]+) | ([0-9]+[.,][0-9]+) | (\() | (\)) )
)+ \s* $/;

If this test passes, query is lexically correct and needs to be grammar-checked to determine if it can be calculated. This is the tricky part

I am not going to paste code because it is not clean nor easily understandable, but I am going to explain the process I followed and why I'm stuck:

I created a method called isStatement(string) that's supposed to call itself recursively. The main idea is to split the string into 'potential' statements and check if they really are statements and form one altogether.
Process is the following:

-If the first two tokens are a number followed by an operator:

-Then,
-- If the remaining is just one token and it is a number:
--- Then this is a statement.
--- Else, check if the remaining tokens form a statement (recursive call)

-Else, If the first token is a parenthesis
-Then, Find matching closing parenthesis and check if what's inside is a statement (recursion)
-- Also check if there is something after closing parenthesis and if it forms a statement when associated with the parenthesis structure.


What's the problem ?

My problem is that I cannot find matching parenthesis when there is nested structures. How can I do that ? Also, as you can see, this is not a particurlarly generic and clean grammar-checking algorithm. Do you have any idea to improve this pattern ?

Thank you so much for having taken the time to read everything. Gael

(PS: As you probably noticed, I am not a native english speaker ! Sorry for mistakes and all !)

Hey folks, thanks for reading

I am currently attempting to do a Google-style calculator. You input a string, it determines if it can be calculated and returns the result.

I began slowly with the basics : + - / * and parenthesis handling.

I am willing to improve the calculator over time, and having learned a bit about lexical analysis a while ago, I built a list of tokens and associated regular expression patterns.

This kind of work is easily applicable with languages such as Lex and Yacc, except I am developping a Javascript-only application.

I tried to transcript the idea into Javascript but I can't figure out how to handle everything in a clean and beautiful way, especially nested parenthesis.


Analysis

Let's define what a calculator query is:

// NON TERMINAL EXPRESSIONS //
query     -> statement
query     -> ε // means end of query

statement -> statement operator statement
statement -> ( statement )
statement -> prefix statement
statement -> number

number    -> integer
number    -> float

// TERMINAL EXPRESSIONS //
operator  -> [+*/%^-]

prefix    -> -

integer   -> [0-9]+

float     -> [0-9]+[.,][0-9]+

Javascript

Lexical Analysis consists in verifying there is nothing that doesn't look like one of the terminal expressions : operator, prefixes, integer and float. Which can be shortened to one regular expression:

(I added spaces to make it more readable)

var calcPat = 
/^ (\s*
    ( ([+/*%^-]) | ([0-9]+) | ([0-9]+[.,][0-9]+) | (\() | (\)) )
)+ \s* $/;

If this test passes, query is lexically correct and needs to be grammar-checked to determine if it can be calculated. This is the tricky part

I am not going to paste code because it is not clean nor easily understandable, but I am going to explain the process I followed and why I'm stuck:

I created a method called isStatement(string) that's supposed to call itself recursively. The main idea is to split the string into 'potential' statements and check if they really are statements and form one altogether.
Process is the following:

-If the first two tokens are a number followed by an operator:

-Then,
-- If the remaining is just one token and it is a number:
--- Then this is a statement.
--- Else, check if the remaining tokens form a statement (recursive call)

-Else, If the first token is a parenthesis
-Then, Find matching closing parenthesis and check if what's inside is a statement (recursion)
-- Also check if there is something after closing parenthesis and if it forms a statement when associated with the parenthesis structure.


What's the problem ?

My problem is that I cannot find matching parenthesis when there is nested structures. How can I do that ? Also, as you can see, this is not a particurlarly generic and clean grammar-checking algorithm. Do you have any idea to improve this pattern ?

Thank you so much for having taken the time to read everything. Gael

(PS: As you probably noticed, I am not a native english speaker ! Sorry for mistakes and all !)

Share Improve this question edited Feb 27, 2024 at 16:12 VLAZ 29.1k9 gold badges63 silver badges84 bronze badges asked Jan 18, 2011 at 16:38 Gabriel S.Gabriel S. 1,9432 gold badges21 silver badges30 bronze badges 5
  • 2 You might want to give this tool a try: pegjs.majda.cz/online – Bart Kiers Commented Jan 18, 2011 at 16:50
  • That's pretty cool, but (judging from what that's called) it produces PEG parsers (packrat parsers I guess), which are really a whole 'nuther thing. The classic "lexer + parser" approach is for building LL or LR parsers for context-free grammars (or almost context-free), and PEG describes a different class of language than CFGs. – Pointy Commented Jan 18, 2011 at 17:34
  • Oh, and of particular note is the fact that with a packrat parser for a PEG language, you don't need a lexical analyzer at all :-) – Pointy Commented Jan 18, 2011 at 17:36
  • @bart-kiers: great tool, thanks. I tried to build a parser myself a couple years ago and gave up (thankfully, it was just a hobby, not any job requirement). Maybe I'll have to pull that project back off the shelf. – keithjgrant Commented Jan 18, 2011 at 17:48
  • @bart-kiers: Thanks for the link, I managed to build a working parser out of it, but since I don't know how it works and other legal stuff, I can't just start making money on their back like that ! – Gabriel S. Commented Jan 18, 2011 at 18:11
Add a ment  | 

1 Answer 1

Reset to default 10

You've got the right idea about what lexical analysis is, but you seem to have gotten confused about the distinction between the token grammar and the language grammar. Those are two different things.

  • The token grammar is the set of patterns (usually regular expressions) that describe the tokens for the language to be parsed. The regular expressions are expressions over a character set.

  • The language grammar (or target grammar, I suppose) is the grammar for the language you want to parse. This grammar is expressed in terms of tokens.

You cannot write a regular expression to parse algebraic notation. You just can't. You can write a grammar for it, but it's not a regular grammar. What you want to do is recognize separate tokens, which in your case could be done with a regular expression somewhat like what you've got. The trick is that you're not really applying that expression to the overall sentence to be parsed. Instead, you want to match a token at the current point in the sentence.

Now, because you've got Javascript regular expressions to work with, you could e up with a regular expression designed to match a string of tokens. The trick with that will be ing up with a way to identify which token was matched out of the list of possibilities. The Javascript regex engine can give you back arrays of groups, so maybe you could build something on top of that.

edit — I'm trying to work out how you could put together a (somewhat) general-purpose tokenizer builder, starting from a list of separate regular expressions (one for each token). It's possibly not very plicated, and it'd be pretty fun to have around.

发布者:admin,转转请注明出处:http://www.yc00.com/questions/1743703993a4493053.html

相关推荐

  • regex - How to implement Lexical Analysis in Javascript - Stack Overflow

    Hey folks, thanks for readingI am currently attempting to do a Google-style calculator. You input a str

    16小时前
    20

发表回复

评论列表(0条)

  • 暂无评论

联系我们

400-800-8888

在线咨询: QQ交谈

邮件:admin@example.com

工作时间:周一至周五,9:30-18:30,节假日休息

关注微信