ANTLR parsing string losing characters?


I have the following grammar:

    grammar lucene;

    /*
     * Parser rules
     */
    query           : orexpr WHITESPACE* NEWLINE? EOF
                    ;
    orexpr          : expr ((ORTOKEN | SPACE)? expr)*       /* OR exp */
                    ;
    expr            : LPAREN orexpr RPAREN                  /* grouping */
                    | expr ANDTOKEN expr                    /* AND exp */
                    | expr NOTTOKEN expr                    /* NOT exp */
                    | required
                    | prohibited
                    | proximity
                    | fuzzy
                    | boosted
                    | phrase
                    | term
                    ;
    proximity       : phrase TILDE INT
                    ;
    fuzzy           : term TILDE FLOAT?
                    ;
    boosted         : (term | phrase) ACCENT (FLOAT | INT)
                    ;
    required        : PLUSTOKEN WHITESPACE? term
                    ;
    prohibited      : MINUSTOKEN WHITESPACE? term
                    ;
    term            : (ALPHANUM+ ( '*' | '?' )? ALPHANUM*)
                    ;
    phrase          : '"' ( ~'\\"' | . )*? '"'
                    ;

    /*
     * Lexer rules
     */
    ALPHANUM        : CHARACTER
                    | NUM
                    ;
    CHARACTER       : 'a'..'z'
                    | 'A'..'Z'
                    ;
    FLOAT           : NUM* '.' NUM+
                    ;
    INT             : NUM+
                    ;
    NUM             : '0'..'9'
                    ;
    LPAREN          : '('
                    ;
    RPAREN          : ')'
                    ;
    ANDTOKEN        : ' AND '
                    ;
    NOTTOKEN        : ' NOT '
                    | ' !'
                    ;
    ORTOKEN         : ' OR '
                    ;
    PLUSTOKEN       : '+'
                    ;
    MINUSTOKEN      : '-'
                    ;
    TILDE           : '~'
                    ;
    ACCENT          : '^'
                    ;
    SPACE           : ' '
                    ;
    CR              : '\r'
                    | '\n'
                    ;
    WHITESPACE      : ( SPACE | '\t' ) -> skip ;
    NEWLINE         : ('\r'?'\n'|'\r') -> skip;

The intention is to handle string literals with the phrase rule, but when the string contains characters such as "." or ":" I get the following errors when checking it with TestRig (using java org.antlr.v4.gui.TestRig lucene query -gui):

    line 1:15 token recognition error at: '. '
    line 1:28 token recognition error at: ':'
    [@0,0:0='"',<9>,1:0]
    [@1,1:1='p',<4>,1:1]
    [@2,2:2='r',<4>,1:2]
    [@3,3:3='o',<4>,1:3]
    [@4,4:4='v',<4>,1:4]
    [@5,5:5='i',<4>,1:5]
    [@6,6:6='d',<4>,1:6]
    [@7,7:7='e',<4>,1:7]
    [@8,8:8='d',<4>,1:8]
    [@9,9:9=' ',<19>,1:9]
    [@10,10:10='t',<4>,1:10]
    [@11,11:11='e',<4>,1:11]
    [@12,12:12='r',<4>,1:12]
    [@13,13:13='m',<4>,1:13]
    [@14,14:14='s',<4>,1:14]
    [@15,17:17='f',<4>,1:17]
    [@16,18:18='o',<4>,1:18]
    [@17,19:19='r',<4>,1:19]
    [@18,20:20=' ',<19>,1:20]
    [@19,21:21='e',<4>,1:21]
    [@20,22:22='x',<4>,1:22]
    [@21,23:23='a',<4>,1:23]
    [@22,24:24='m',<4>,1:24]
    [@23,25:25='p',<4>,1:25]
    [@24,26:26='l',<4>,1:26]
    [@25,27:27='e',<4>,1:27]
    [@26,29:29=' ',<19>,1:29]
    [@27,30:30='a',<4>,1:30]
    [@28,31:31='t',<4>,1:31]
    [@29,32:32='e',<4>,1:32]
    [@30,33:33='r',<4>,1:33]
    [@31,34:34='m',<4>,1:34]
    [@32,35:35='"',<9>,1:35]
    [@33,38:37='<EOF>',<-1>,2:0]

Any idea why I am getting the errors mentioned?

It gets worse when there is a character instead of a space after the dot: that character is lost.

Update 10/07/2015

Fixed: I fixed the problem by updating the phrase rule (I took some of the changes recommended by @grosengberg, but not the whole grammar as given, since it did not work as desired):

    phrase          : LITERAL
                    ;

    LITERAL         : '"' ( '\\"' | .)*? '"'
                    ;

This gave the desired result. I then updated the grammar to accept the rest of the rules, and later changed the initial rules to solve an operator precedence problem, but now I am getting a mutual left-recursion error. The conflicting rules follow:

    expr            : orexpr
                    | andexpr
                    | prohibited
                    | required
                    | boosted
                    | fuzzy
                    | spannear
                    | proximity
                    | term
                    | phrase
                    | groupexpr
                    ;

    orexpr          : expr ((WS+ | WS+ OR WS+) orexpr)+
                    | expr
                    ;

    andexpr         : expr (WS+ AND WS+ andexpr)+
                    | expr (WS+ notexpr)+
                    | expr
                    ;

    notexpr         : NOT WS+ expr
                    ;

Any idea on how to fix this issue? I have separated the orexpr and andexpr rules because I use them to identify these rules in the visitor I am writing.

This cleaned-up version should help, but it looks like you need to spend a fair bit more time with the documentation. Getting TDAR (The Definitive ANTLR 4 Reference) is highly recommended.

    grammar lucene;

    query   : expr+ EOF ;
    expr    : LPAREN expr RPAREN    /* grouping */
            | expr ANDTOKEN expr    /* AND exp */
            | expr NOTTOKEN expr    /* NOT exp */
            | expr ORTOKEN expr     /* OR exp */
            | required
            | prohibited
            | proximity
            | fuzzy
            | boosted
            | phrase
            | term
            ;

    proximity   : phrase TILDE INT ;
    fuzzy       : term TILDE FLOAT? ;
    boosted     : (term | phrase) ACCENT (FLOAT | INT) ;
    required    : PLUSTOKEN term ;
    prohibited  : MINUSTOKEN term ;
    term        : alphanum+ ( STAR | QMARK )? alphanum* ;
    alphanum    : CHARACTER | NUM ;
    phrase      : STRING ;

    ANDTOKEN    : ' AND ' ;
    NOTTOKEN    : ' NOT ' | ' !' ;
    ORTOKEN     : ' OR ' ;
    FLOAT       : NUM* '.' NUM+ ;
    INT         : NUM+ ;
    STRING      : '"' .*? '"' ;

    WHITESPACE  : [ \t\r\n]+ -> skip;

    CHARACTER   : 'a'..'z' | 'A'..'Z' ;
    NUM         : '0'..'9' ;
    LPAREN      : '(' ;
    RPAREN      : ')' ;
    PLUSTOKEN   : '+' ;
    MINUSTOKEN  : '-' ;
    STAR        : '*' ;
    QMARK       : '?' ;
    BANG        : '!' ;
    TILDE       : '~' ;
    ACCENT      : '^' ;

I simplified the expr rule to include the OR variant; you run into the left-recursion problem when the variants are separated into their own rules.
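If the rules were split only so that a visitor can tell the operators apart, alternative labels may be worth a look. The following is a minimal sketch on a reduced grammar, not the full Lucene grammar; the grammar name LabeledExpr, the TERM token, and the label names are all hypothetical.

```antlr
grammar LabeledExpr;        // hypothetical name, for illustration only

// All recursion is direct (expr refers only to itself), which ANTLR 4 accepts.
// Once one alternative is labeled, every alternative of the rule must be labeled.
expr    : LPAREN expr RPAREN    # GroupExpr
        | expr AND expr         # AndExpr
        | expr NOT expr         # NotExpr
        | expr OR expr          # OrExpr
        | TERM                  # TermExpr
        ;

// Keyword tokens are listed before TERM so they win when both match the same text.
AND     : 'AND' ;
NOT     : 'NOT' ;
OR      : 'OR' ;
LPAREN  : '(' ;
RPAREN  : ')' ;
TERM    : [a-zA-Z0-9]+ ;
WS      : [ \t\r\n]+ -> skip ;
```

When the parser is generated with the -visitor option, each label gets its own visit method (visitAndExpr, visitOrExpr, and so on), so the operators stay distinguishable without separate rules. Alternatives listed earlier bind more tightly, which is how the precedence is expressed.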

If alphanum is left as a lexer rule (ALPHANUM), the parser will only ever see ALPHANUM tokens; it will never see discrete CHARACTER and NUM tokens.
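To illustrate the point about ALPHANUM, here is a small sketch with a hypothetical grammar name (AlphanumDemo). It relies on the lexer's tie-breaking behaviour: when several lexer rules match the same text, the one listed first in the grammar wins.

```antlr
grammar AlphanumDemo;       // hypothetical name, for illustration only

// Parser rules: built from the discrete tokens the lexer actually emits.
term        : alphanum+ EOF ;
alphanum    : CHARACTER | NUM ;

// Lexer rules: each input character becomes a CHARACTER or a NUM token.
CHARACTER   : 'a'..'z' | 'A'..'Z' ;
NUM         : '0'..'9' ;

// If the rule below were added above CHARACTER and NUM, it would match the
// same single characters and, being listed first, would win every tie, so
// the parser would receive only ALPHANUM tokens:
// ALPHANUM : CHARACTER | NUM ;
```

Another option, if CHARACTER and NUM are never needed as tokens in their own right, is to keep ALPHANUM in the lexer and declare the other two as fragment rules, which are helpers that never produce tokens.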

Similarly, since you are skipping whitespace in the lexer, the parser will never see those tokens, so they cannot be used in parser rules.
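A short sketch of what that means in practice, again with hypothetical names (SkipDemo, TERM): the parser rule simply omits any whitespace reference, because skipped text never reaches it.

```antlr
grammar SkipDemo;           // hypothetical name, for illustration only

// No WHITESPACE reference here: "+foo" and "+ foo" both arrive
// at the parser as the two tokens PLUSTOKEN TERM.
required    : PLUSTOKEN TERM EOF ;

PLUSTOKEN   : '+' ;
TERM        : [a-zA-Z0-9]+ ;
WHITESPACE  : [ \t\r\n]+ -> skip ;
```

If the whitespace needs to be kept for some later use, the rule can send it to a hidden channel with -> channel(HIDDEN) instead of skipping it; the parser still does not see it as input.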

The RHS of the string rule (STRING) has to be in a lexer rule. For testing, you can add a parser rule that references the STRING rule.
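The reason is that the wildcard means different things in the two places: in a lexer rule `.` matches any single character, while in a parser rule it matches any single token. Written as a parser rule, the phrase still depends on the lexer producing a token for every character inside the quotes, and characters such as "." and ":" have no matching lexer rule, hence the token recognition errors. A minimal test sketch, with the hypothetical names StringDemo and phraseTest:

```antlr
grammar StringDemo;         // hypothetical name, for illustration only

// Throwaway parser rule that exists only to exercise the STRING token.
phraseTest  : STRING EOF ;

// In the lexer, '.' matches any character, so '.' and ':' inside the
// quotes are consumed as part of the STRING token.
STRING      : '"' .*? '"' ;
WHITESPACE  : [ \t\r\n]+ -> skip ;
```

Running java org.antlr.v4.gui.TestRig StringDemo phraseTest -tokens on a quoted input should then list a single STRING token followed by EOF.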

