grammar - ANTLR parsing string losing characters?
I have the following grammar:
    grammar lucene;

    /*
     * Parser rules
     */
    query      : orexpr WHITESPACE* NEWLINE? EOF ;

    orexpr     : expr ((ORTOKEN | SPACE)? expr)*    /* or exp */
               ;

    expr       : LPAREN orexpr RPAREN               /* grouping */
               | expr ANDTOKEN expr                 /* and exp */
               | expr NOTTOKEN expr                 /* not exp */
               | required
               | prohibited
               | proximity
               | fuzzy
               | boosted
               | phrase
               | term
               ;

    proximity  : phrase TILDE INT ;
    fuzzy      : term TILDE FLOAT? ;
    boosted    : (term | phrase) ACCENT (FLOAT | INT) ;
    required   : PLUSTOKEN WHITESPACE? term ;
    prohibited : MINUSTOKEN WHITESPACE? term ;
    term       : (ALPHANUM+ ( '*' | '?' )? ALPHANUM*) ;
    phrase     : '"' ( ~'\\"' | . )*? '"' ;

    /*
     * Lexer rules
     */
    ALPHANUM   : CHARACTER | NUM ;
    CHARACTER  : 'a'..'z' | 'A'..'Z' ;
    FLOAT      : NUM* '.' NUM+ ;
    INT        : NUM+ ;
    NUM        : '0'..'9' ;
    LPAREN     : '(' ;
    RPAREN     : ')' ;
    ANDTOKEN   : ' AND ' ;
    NOTTOKEN   : ' NOT ' | ' !' ;
    ORTOKEN    : ' OR ' ;
    PLUSTOKEN  : '+' ;
    MINUSTOKEN : '-' ;
    TILDE      : '~' ;
    ACCENT     : '^' ;
    SPACE      : ' ' ;
    CR         : '\r' | '\n' ;
    WHITESPACE : ( SPACE | '\t' ) -> skip ;
    NEWLINE    : ('\r'?'\n'|'\r') -> skip;
The intention is to handle string literals with the phrase rule, but when the string contains characters such as "." or ":" I get the following errors when checking it with TestRig (using java org.antlr.v4.gui.TestRig lucene query -gui):
    line 1:15 token recognition error at: '. '
    line 1:28 token recognition error at: ':'
    [@0,0:0='"',<9>,1:0]
    [@1,1:1='p',<4>,1:1]
    [@2,2:2='r',<4>,1:2]
    [@3,3:3='o',<4>,1:3]
    [@4,4:4='v',<4>,1:4]
    [@5,5:5='i',<4>,1:5]
    [@6,6:6='d',<4>,1:6]
    [@7,7:7='e',<4>,1:7]
    [@8,8:8='d',<4>,1:8]
    [@9,9:9=' ',<19>,1:9]
    [@10,10:10='t',<4>,1:10]
    [@11,11:11='e',<4>,1:11]
    [@12,12:12='r',<4>,1:12]
    [@13,13:13='m',<4>,1:13]
    [@14,14:14='s',<4>,1:14]
    [@15,17:17='f',<4>,1:17]
    [@16,18:18='o',<4>,1:18]
    [@17,19:19='r',<4>,1:19]
    [@18,20:20=' ',<19>,1:20]
    [@19,21:21='e',<4>,1:21]
    [@20,22:22='x',<4>,1:22]
    [@21,23:23='a',<4>,1:23]
    [@22,24:24='m',<4>,1:24]
    [@23,25:25='p',<4>,1:25]
    [@24,26:26='l',<4>,1:26]
    [@25,27:27='e',<4>,1:27]
    [@26,29:29=' ',<19>,1:29]
    [@27,30:30='a',<4>,1:30]
    [@28,31:31='t',<4>,1:31]
    [@29,32:32='e',<4>,1:32]
    [@30,33:33='r',<4>,1:33]
    [@31,34:34='m',<4>,1:34]
    [@32,35:35='"',<9>,1:35]
    [@33,38:37='<EOF>',<-1>,2:0]
Any idea why I am getting the errors mentioned?
It gets worse when there are other characters instead of a space after the dot: the character following the dot is lost as well.
Update 10/07/2015
Fixed: I fixed the problem by updating the phrase rule (I took some of the changes recommended by @grosengberg, but not all of them, as the given grammar did not work as desired):
    phrase  : LITERAL ;
    LITERAL : '"' ( '\\"' | . )*? '"' ;
That gave the desired result. I then updated the grammar to accept the rest of the rules, and later changed the initial rules to solve an operator precedence problem, but now I am getting a mutual left-recursion error. The conflicting rules follow:
    expr    : orexpr
            | andexpr
            | prohibited
            | required
            | boosted
            | fuzzy
            | spannear
            | proximity
            | term
            | phrase
            | groupexpr
            ;

    orexpr  : expr ((WS+ | WS+ OR WS+) orexpr)+
            | expr
            ;

    andexpr : expr (WS+ AND WS+ andexpr)+
            | expr (WS+ notexpr)+
            | expr
            ;

    notexpr : NOT WS+ expr ;
Any idea on how to fix this issue? I have separated the rules orexpr and andexpr because I use them to identify these rules in the visitor I am writing.
This cleaned-up version should help, but it looks like you need to spend a fair bit more time with the documentation. Getting TDAR is highly recommended.
    grammar lucene;

    query      : expr+ EOF ;

    expr       : LPAREN expr RPAREN       /* grouping */
               | expr ANDTOKEN expr       /* and exp */
               | expr NOTTOKEN expr       /* not exp */
               | expr ORTOKEN expr        /* or exp */
               | required
               | prohibited
               | proximity
               | fuzzy
               | boosted
               | phrase
               | term
               ;

    proximity  : phrase TILDE INT ;
    fuzzy      : term TILDE FLOAT? ;
    boosted    : (term | phrase) ACCENT (FLOAT | INT) ;
    required   : PLUSTOKEN term ;
    prohibited : MINUSTOKEN term ;
    term       : alphanum+ ( STAR | QMARK )? alphanum* ;
    alphanum   : CHARACTER | NUM ;
    phrase     : STRING ;

    ANDTOKEN   : ' AND ' ;
    NOTTOKEN   : ' NOT ' | ' !' ;
    ORTOKEN    : ' OR ' ;

    FLOAT      : NUM* '.' NUM+ ;
    INT        : NUM+ ;
    STRING     : '"' .*? '"' ;
    WHITESPACE : [ \t\r\n]+ -> skip;

    CHARACTER  : 'a'..'z' | 'A'..'Z' ;
    NUM        : '0'..'9' ;
    LPAREN     : '(' ;
    RPAREN     : ')' ;
    PLUSTOKEN  : '+' ;
    MINUSTOKEN : '-' ;
    STAR       : '*' ;
    QMARK      : '?' ;
    BANG       : '!' ;
    TILDE      : '~' ;
    ACCENT     : '^' ;
Simplified the expr rule to include the OR variant. You will run into a left-recursion problem if it is separated into its own rule.
If alphanum is left as a lexer rule (ALPHANUM), the parser will only ever see ALPHANUM tokens; it will never see discrete CHARACTER and NUM tokens.
Similarly, since you are skipping whitespace in the lexer, the parser will never see those tokens, so they cannot be used in the parser rules.
The right-hand side of the string rule (STRING) has to be in a lexer rule. For testing, you can add a parser rule that references the string rule.
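The follow-up in the question separated orexpr and andexpr so that a visitor could tell them apart. One possible way to keep a single left-recursive expr rule and still distinguish the operators is to use ANTLR 4 alternative labels. The sketch below is not part of the answer above: the label names and the atom helper rule are illustrative assumptions, and the token names are those of the cleaned-up grammar.

```antlr
// Sketch only: label names (GroupExpr, AndExpr, ...) and the atom rule
// are hypothetical. Once one alternative of a rule is labeled, every
// alternative of that rule must be labeled, and a label must not clash
// with an existing rule name.
expr
    : LPAREN expr RPAREN      # GroupExpr
    | expr ANDTOKEN expr      # AndExpr
    | expr NOTTOKEN expr      # NotExpr
    | expr ORTOKEN expr       # OrExpr
    | atom                    # AtomExpr
    ;

// Non-recursive alternatives grouped in one helper rule.
atom
    : required
    | prohibited
    | proximity
    | fuzzy
    | boosted
    | phrase
    | term
    ;
```

With labels like these, the generated visitor gets one method per labeled alternative (for example visitAndExpr and visitOrExpr), and the order of the alternatives still determines operator precedence.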