diff --git a/CHANGELOG.txt b/CHANGELOG.txt
new file mode 100644
index 000000000..ce36b1fa8
--- /dev/null
+++ b/CHANGELOG.txt
@@ -0,0 +1,3 @@
+MontySolr 0.1, 2011-05-19
+-------------------------
+- Initial release
\ No newline at end of file
diff --git a/COPYRIGHT.txt b/COPYRIGHT.txt
new file mode 100644
index 000000000..a46d0c12b
--- /dev/null
+++ b/COPYRIGHT.txt
@@ -0,0 +1,21 @@
+All MontySolr code is copyright (C) 2011 by the original authors.
+
+This program is free software; you can redistribute it and/or modify
+it under the terms of the GNU General Public License as published by
+the Free Software Foundation; either version 2 of the License, or (at
+your option) any later version.
+
+This program is distributed in the hope that it will be useful, but
+WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
+or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License
+for more details.
+
+You should have received a copy of the GNU General Public License
+along with this program as the file LICENSE.txt; if not, please see
+http://www.gnu.org/licenses/old-licenses/gpl-2.0.txt.
+
+MontySolr includes works under other copyright notices and distributed
+according to the terms of the GNU General Public License or a compatible
+license, including:
+
+  jzlib - Copyright (c) 2000-2003 ymnk, JCraft,Inc.
\ No newline at end of file
diff --git a/INSTALL.txt b/INSTALL.txt
new file mode 100644
index 000000000..e5a1ed8cb
--- /dev/null
+++ b/INSTALL.txt
@@ -0,0 +1,210 @@
+
+
+For the impatient:
+
+  $ cd montysolr
+  $ cp build.properties.default build.properties
+  $ vim build.properties   # review the config and set the correct paths
+  $ ant automatic-install
+
+
+HOWEVER: this will work only if you already have the prerequisites installed OR if we
+built them for you. In any case, give it a try!
+
+The automatic installation will:
+
+  * check for a correct version of Python
+  * check for the JCC module (if available in our repository, we will offer to download and install it for you)
+  * check for the PyLucene package (again, we'll download and install it if available)
+  * build the wrapper for Solr (as a Python module)
+  * build MontySolr (as a Python module)
+  * run some basic tests
+
+== STANDARD INSTALLATION ==
+
+To build MontySolr you need:
+
+  * Java JDK >= 1.6 (tested with both OpenJDK and Sun Java JDK)
+  * ant >= 1.6
+  * Python >= 2.4 (if you want to take advantage of multiprocessing, then at least Python 2.5)
+  * JCC module for Python
+  * PyLucene
+  * setuptools for Python (needed for the installation of PyLucene)
+
+
+The recommended installation is this:
+
+  * Install JCC
+    - http://lucene.apache.org/pylucene/documentation/install.html (see also http://pypi.python.org/pypi/JCC/)
+    - NOTE: you must build JCC in a shared mode (default now)
+
+    {{{
+    export JCC_LFLAGS='-framework:JavaVM:-framework:Python'   # only needed on Mac OS X
+    python setup.py build
+    python setup.py install
+    }}}
+
+  * Install PyLucene (see the PYLUCENE notes below)
+
+  * Create a copy of the build properties, then build and run MontySolr:
+
+    $ cp build.properties.default build.properties
+    $ vim build.properties    # edit and review the configuration
+    $ ant assemble-example    # assemble the demo example
+    $ ant montysolr-build     # compile montysolr
+    $ ant run-montysolr       # run the demo
+
+
+=== JCC NOTES ===
+
+http://lucene.apache.org/pylucene/jcc/index.html
+
+JCC is a code generator by Andi Vajda. It wraps Java classes in a thin layer of C++, and thanks to
+that we can build Python modules from the Java code. Thanks to JCC we can use Java inside Python, and also
+Python inside Java!
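+To make the last point concrete, here is a minimal sketch of the Java-inside-Python direction,
+using PyLucene (assuming a standard PyLucene installation; initVM and CLASSPATH are generated
+by JCC for every extension it builds):
+
+ {{{
+ import lucene
+
+ # boot the Java VM embedded in this Python process
+ lucene.initVM(lucene.CLASSPATH)
+
+ # from here on, wrapped Java classes behave like Python classes
+ directory = lucene.RAMDirectory()
+
+ # handy when checking that PyLucene matches your Solr (see the FAQ)
+ print 'lucene version:', lucene.VERSION
+ }}}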
+JCC must be built in a shared mode (default now). To check it, you can do:
+
+ $ python -c "from jcc import config; print 'version=', config.VERSION, ', shared=', config.SHARED"
+
+If shared is not 'True', then you have to rebuild JCC:
+
+ $ cd /some/dir/with/jcc
+ $ export USE_DISTUTILS
+ $ python setup.py build
+ $ python setup.py install
+
+ -- if you are on Mac OS X, the sequence is: --
+
+ $ cd /some/dir/with/jcc
+ $ export JCC_LFLAGS='-framework:JavaVM:-framework:Python'
+ $ export USE_DISTUTILS
+ $ python setup.py build
+ $ python setup.py install
+
+If your system does not have the correct setup, JCC will warn you and will also provide instructions on
+how to fix it.
+
+
+=== PYLUCENE ===
+
+http://lucene.apache.org/pylucene
+
+PyLucene is also built by JCC; it is simply normal Lucene wrapped into a Python package.
+
+Because we don't want to duplicate code, MontySolr takes advantage of PyLucene built as a separate module.
+For this to work, the pylucene, solr and also montysolr packages must be built in a shared mode.
+Shared mode is the default on many recent systems now, but it is good to check that.
+
+ TODO: build a small program that checks shared mode (a first sketch follows below)
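+A minimal sketch of such a checker (assuming PyLucene is installed; jcc.config and lucene.VERSION
+are standard attributes of those packages, the rest is illustrative):
+
+ {{{
+ # check_shared.py - verify that JCC is in shared mode and report versions
+ import sys
+
+ from jcc import config
+ if not config.SHARED:
+     sys.exit('JCC %s is NOT built in shared mode, please rebuild it' % config.VERSION)
+
+ import lucene
+ print 'jcc %s (shared), pylucene/lucene %s' % (config.VERSION, lucene.VERSION)
+ print 'check that %s matches the lucene jars of your Solr instance' % lucene.VERSION
+ }}}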
+In MontySolr we use generics support for Java (this makes your life as a Java programmer much easier
+in Python). Unfortunately, generics support is not yet a default option. So if you have already built
+PyLucene, you will have to rebuild it (in my experience, the inclusion of generics does not have a negative
+impact -- besides you having to work with newer versions of Java, but you do that already, right?)
+
+So the build of Lucene is:
+
+ $ export JCCFLAGS=
+ $ make
+ $ make install
+
+
+
+== FAQ ==
+
+Q: What version of Solr shall I use?
+
+You can use both Solr 1.4 and 3.x, but make sure that your PyLucene is the same version, *including minor versions*!
+I.e. if your Solr is using Lucene 2.9.3, then your PyLucene must also have 2.9.3.
+
+
+
+Q: Do I need a separate distribution of the lucene sources, or is what's in the solr distribution enough?
+
+It is enough. But PyLucene has the lucene jars inside, so inevitably you will end up with two sets
+of lucene jars. This is not a problem though. Just make sure that your PyLucene is using the same
+version as your Solr instance!
+
+
+
+
+Q: When I start montysolr, I see errors "ImportError: No module named solr_java" ... or "lucene", "foo", "bar" etc.
+
+The message comes from the Python interpreter, which cannot find some module. If you installed lucene or
+any other module into a non-standard location, then you have to make that location known to Python.
+
+ * use PYTHONPATH
+   - e.g. "export PYTHONPATH=/some/path:/some/other/path"
+ * if you start montysolr with ant ("ant run-montysolr")
+   - edit build.properties
+   - python_path=/some/path:/some/other/path
+
+If the missing module is "solr_java" then you did not finish the installation properly; you can fix it by:
+
+ $ ant solr-build
+
+You will find the solr_java module inside "./montysolr/build/dist" -- from there it can be easily installed:
+
+ $ cd ./montysolr/build/dist
+ $ easy_install solr_java-0.1-py2.6-macosx-10.6-universal.egg
+
+
+
+
+Q: When building montysolr, I get these errors:
+
+ build/_solr_java/__wrap__.cpp: In function ‘PyObject* org::apache::solr::search::function::t_DualFloatFunction_createWeight(org::apache::solr::search::function::t_DualFloatFunction*, PyObject*)’:
+ build/_solr_java/__wrap__.cpp:2325: error: ‘parameters_’ is not a member of ‘java::util::t_Map’
+
+Your PyLucene is not built with generics support. Please see above for how to fix it.
+
+
+
+Q: I am using python virtualenv, will it cause problems?
+
+Absolutely not.
+
+Q: Are there any limitations on what I can run inside Python?
+
+No; if it works in a normal Python session, it will also work inside Solr (provided that you set up the correct paths,
+have enough memory to host both systems etc.)
+
+
+Q: I have several versions of Python installed, how do I run MontySolr with a non-standard one?
+
+In the past, I was doing something like this:
+
+# first set up environment vars that make the system use a different Python
+
+$ export LD_LIBRARY_PATH=/opt/rchyla/python26/lib/:/opt/rchyla/python26/lib/python2.6/lib-dynload/
+$ export PYTHONHOME=/opt/rchyla/python26/
+$ export PYTHONPATH=/opt/rchyla/workspace/:/opt/rchyla/workspace/solrpie/python/
+$ export PATH=/opt/rchyla/python26/bin:/opt/rchyla/python26:/afs/cern.ch/user/r/rchyla/public/jdk1.6.0_18/bin/:$PATH
+
+# then use ant to run MontySolr as a daemon
+
+$ export SOLRPIE_ARGS='--port 8443 --daemon'
+$ export SOLRPIE_JVMARGS='-d64 -Xmx2048m -Dsolrpie.max_workers=5 -Dsolrpie.max_threads=200'
+$ export SOLRPIE_MAX_WORKERS=5
+$ export SOLRPIE_NEWENVIRONMENT=false
+$ ant run-solrpie-daemon
+
+
diff --git a/LICENSE.txt b/LICENSE.txt
new file mode 100644
index 000000000..5b6e7c66c
--- /dev/null
+++ b/LICENSE.txt
@@ -0,0 +1,340 @@
+                    GNU GENERAL PUBLIC LICENSE
+                       Version 2, June 1991
+
+ Copyright (C) 1989, 1991 Free Software Foundation, Inc.
+ 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+ Everyone is permitted to copy and distribute verbatim copies
+ of this license document, but changing it is not allowed.
+
+                            Preamble
+
+  The licenses for most software are designed to take away your
+freedom to share and change it. By contrast, the GNU General Public
+License is intended to guarantee your freedom to share and change free
+software--to make sure the software is free for all its users. This
+General Public License applies to most of the Free Software
+Foundation's software and to any other program whose authors commit to
+using it. (Some other Free Software Foundation software is covered by
+the GNU Library General Public License instead.) You can apply it to
+your programs, too.
+
+  When we speak of free software, we are referring to freedom, not
+price. Our General Public Licenses are designed to make sure that you
+have the freedom to distribute copies of free software (and charge for
+this service if you wish), that you receive source code or can get it
+if you want it, that you can change the software or use pieces of it
+in new free programs; and that you know you can do these things.
+
+  To protect your rights, we need to make restrictions that forbid
+anyone to deny you these rights or to ask you to surrender the rights.
+These restrictions translate to certain responsibilities for you if you
+distribute copies of the software, or if you modify it.
+ + For example, if you distribute copies of such a program, whether +gratis or for a fee, you must give the recipients all the rights that +you have. You must make sure that they, too, receive or can get the +source code. And you must show them these terms so they know their +rights. + + We protect your rights with two steps: (1) copyright the software, and +(2) offer you this license which gives you legal permission to copy, +distribute and/or modify the software. + + Also, for each author's protection and ours, we want to make certain +that everyone understands that there is no warranty for this free +software. If the software is modified by someone else and passed on, we +want its recipients to know that what they have is not the original, so +that any problems introduced by others will not reflect on the original +authors' reputations. + + Finally, any free program is threatened constantly by software +patents. We wish to avoid the danger that redistributors of a free +program will individually obtain patent licenses, in effect making the +program proprietary. To prevent this, we have made it clear that any +patent must be licensed for everyone's free use or not licensed at all. + + The precise terms and conditions for copying, distribution and +modification follow. + + GNU GENERAL PUBLIC LICENSE + TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION + + 0. This License applies to any program or other work which contains +a notice placed by the copyright holder saying it may be distributed +under the terms of this General Public License. The "Program", below, +refers to any such program or work, and a "work based on the Program" +means either the Program or any derivative work under copyright law: +that is to say, a work containing the Program or a portion of it, +either verbatim or with modifications and/or translated into another +language. (Hereinafter, translation is included without limitation in +the term "modification".) Each licensee is addressed as "you". + +Activities other than copying, distribution and modification are not +covered by this License; they are outside its scope. The act of +running the Program is not restricted, and the output from the Program +is covered only if its contents constitute a work based on the +Program (independent of having been made by running the Program). +Whether that is true depends on what the Program does. + + 1. You may copy and distribute verbatim copies of the Program's +source code as you receive it, in any medium, provided that you +conspicuously and appropriately publish on each copy an appropriate +copyright notice and disclaimer of warranty; keep intact all the +notices that refer to this License and to the absence of any warranty; +and give any other recipients of the Program a copy of this License +along with the Program. + +You may charge a fee for the physical act of transferring a copy, and +you may at your option offer warranty protection in exchange for a fee. + + 2. You may modify your copy or copies of the Program or any portion +of it, thus forming a work based on the Program, and copy and +distribute such modifications or work under the terms of Section 1 +above, provided that you also meet all of these conditions: + + a) You must cause the modified files to carry prominent notices + stating that you changed the files and the date of any change. 
+ + b) You must cause any work that you distribute or publish, that in + whole or in part contains or is derived from the Program or any + part thereof, to be licensed as a whole at no charge to all third + parties under the terms of this License. + + c) If the modified program normally reads commands interactively + when run, you must cause it, when started running for such + interactive use in the most ordinary way, to print or display an + announcement including an appropriate copyright notice and a + notice that there is no warranty (or else, saying that you provide + a warranty) and that users may redistribute the program under + these conditions, and telling the user how to view a copy of this + License. (Exception: if the Program itself is interactive but + does not normally print such an announcement, your work based on + the Program is not required to print an announcement.) + +These requirements apply to the modified work as a whole. If +identifiable sections of that work are not derived from the Program, +and can be reasonably considered independent and separate works in +themselves, then this License, and its terms, do not apply to those +sections when you distribute them as separate works. But when you +distribute the same sections as part of a whole which is a work based +on the Program, the distribution of the whole must be on the terms of +this License, whose permissions for other licensees extend to the +entire whole, and thus to each and every part regardless of who wrote it. + +Thus, it is not the intent of this section to claim rights or contest +your rights to work written entirely by you; rather, the intent is to +exercise the right to control the distribution of derivative or +collective works based on the Program. + +In addition, mere aggregation of another work not based on the Program +with the Program (or with a work based on the Program) on a volume of +a storage or distribution medium does not bring the other work under +the scope of this License. + + 3. You may copy and distribute the Program (or a work based on it, +under Section 2) in object code or executable form under the terms of +Sections 1 and 2 above provided that you also do one of the following: + + a) Accompany it with the complete corresponding machine-readable + source code, which must be distributed under the terms of Sections + 1 and 2 above on a medium customarily used for software interchange; or, + + b) Accompany it with a written offer, valid for at least three + years, to give any third party, for a charge no more than your + cost of physically performing source distribution, a complete + machine-readable copy of the corresponding source code, to be + distributed under the terms of Sections 1 and 2 above on a medium + customarily used for software interchange; or, + + c) Accompany it with the information you received as to the offer + to distribute corresponding source code. (This alternative is + allowed only for noncommercial distribution and only if you + received the program in object code or executable form with such + an offer, in accord with Subsection b above.) + +The source code for a work means the preferred form of the work for +making modifications to it. For an executable work, complete source +code means all the source code for all modules it contains, plus any +associated interface definition files, plus the scripts used to +control compilation and installation of the executable. 
However, as a +special exception, the source code distributed need not include +anything that is normally distributed (in either source or binary +form) with the major components (compiler, kernel, and so on) of the +operating system on which the executable runs, unless that component +itself accompanies the executable. + +If distribution of executable or object code is made by offering +access to copy from a designated place, then offering equivalent +access to copy the source code from the same place counts as +distribution of the source code, even though third parties are not +compelled to copy the source along with the object code. + + 4. You may not copy, modify, sublicense, or distribute the Program +except as expressly provided under this License. Any attempt +otherwise to copy, modify, sublicense or distribute the Program is +void, and will automatically terminate your rights under this License. +However, parties who have received copies, or rights, from you under +this License will not have their licenses terminated so long as such +parties remain in full compliance. + + 5. You are not required to accept this License, since you have not +signed it. However, nothing else grants you permission to modify or +distribute the Program or its derivative works. These actions are +prohibited by law if you do not accept this License. Therefore, by +modifying or distributing the Program (or any work based on the +Program), you indicate your acceptance of this License to do so, and +all its terms and conditions for copying, distributing or modifying +the Program or works based on it. + + 6. Each time you redistribute the Program (or any work based on the +Program), the recipient automatically receives a license from the +original licensor to copy, distribute or modify the Program subject to +these terms and conditions. You may not impose any further +restrictions on the recipients' exercise of the rights granted herein. +You are not responsible for enforcing compliance by third parties to +this License. + + 7. If, as a consequence of a court judgment or allegation of patent +infringement or for any other reason (not limited to patent issues), +conditions are imposed on you (whether by court order, agreement or +otherwise) that contradict the conditions of this License, they do not +excuse you from the conditions of this License. If you cannot +distribute so as to satisfy simultaneously your obligations under this +License and any other pertinent obligations, then as a consequence you +may not distribute the Program at all. For example, if a patent +license would not permit royalty-free redistribution of the Program by +all those who receive copies directly or indirectly through you, then +the only way you could satisfy both it and this License would be to +refrain entirely from distribution of the Program. + +If any portion of this section is held invalid or unenforceable under +any particular circumstance, the balance of the section is intended to +apply and the section as a whole is intended to apply in other +circumstances. + +It is not the purpose of this section to induce you to infringe any +patents or other property right claims or to contest validity of any +such claims; this section has the sole purpose of protecting the +integrity of the free software distribution system, which is +implemented by public license practices. 
Many people have made +generous contributions to the wide range of software distributed +through that system in reliance on consistent application of that +system; it is up to the author/donor to decide if he or she is willing +to distribute software through any other system and a licensee cannot +impose that choice. + +This section is intended to make thoroughly clear what is believed to +be a consequence of the rest of this License. + + 8. If the distribution and/or use of the Program is restricted in +certain countries either by patents or by copyrighted interfaces, the +original copyright holder who places the Program under this License +may add an explicit geographical distribution limitation excluding +those countries, so that distribution is permitted only in or among +countries not thus excluded. In such case, this License incorporates +the limitation as if written in the body of this License. + + 9. The Free Software Foundation may publish revised and/or new versions +of the General Public License from time to time. Such new versions will +be similar in spirit to the present version, but may differ in detail to +address new problems or concerns. + +Each version is given a distinguishing version number. If the Program +specifies a version number of this License which applies to it and "any +later version", you have the option of following the terms and conditions +either of that version or of any later version published by the Free +Software Foundation. If the Program does not specify a version number of +this License, you may choose any version ever published by the Free Software +Foundation. + + 10. If you wish to incorporate parts of the Program into other free +programs whose distribution conditions are different, write to the author +to ask for permission. For software which is copyrighted by the Free +Software Foundation, write to the Free Software Foundation; we sometimes +make exceptions for this. Our decision will be guided by the two goals +of preserving the free status of all derivatives of our free software and +of promoting the sharing and reuse of software generally. + + NO WARRANTY + + 11. BECAUSE THE PROGRAM IS LICENSED FREE OF CHARGE, THERE IS NO WARRANTY +FOR THE PROGRAM, TO THE EXTENT PERMITTED BY APPLICABLE LAW. EXCEPT WHEN +OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES +PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED +OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF +MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS +TO THE QUALITY AND PERFORMANCE OF THE PROGRAM IS WITH YOU. SHOULD THE +PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING, +REPAIR OR CORRECTION. + + 12. IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING +WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY AND/OR +REDISTRIBUTE THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES, +INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING +OUT OF THE USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED +TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY +YOU OR THIRD PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER +PROGRAMS), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE +POSSIBILITY OF SUCH DAMAGES. 
+ + END OF TERMS AND CONDITIONS + + How to Apply These Terms to Your New Programs + + If you develop a new program, and you want it to be of the greatest +possible use to the public, the best way to achieve this is to make it +free software which everyone can redistribute and change under these terms. + + To do so, attach the following notices to the program. It is safest +to attach them to the start of each source file to most effectively +convey the exclusion of warranty; and each file should have at least +the "copyright" line and a pointer to where the full notice is found. + + + Copyright (C) + + This program is free software; you can redistribute it and/or modify + it under the terms of the GNU General Public License as published by + the Free Software Foundation; either version 2 of the License, or + (at your option) any later version. + + This program is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + GNU General Public License for more details. + + You should have received a copy of the GNU General Public License + along with this program; if not, write to the Free Software + Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + + +Also add information on how to contact you by electronic and paper mail. + +If the program is interactive, make it output a short notice like this +when it starts in an interactive mode: + + Gnomovision version 69, Copyright (C) year name of author + Gnomovision comes with ABSOLUTELY NO WARRANTY; for details type `show w'. + This is free software, and you are welcome to redistribute it + under certain conditions; type `show c' for details. + +The hypothetical commands `show w' and `show c' should show the appropriate +parts of the General Public License. Of course, the commands you use may +be called something other than `show w' and `show c'; they could even be +mouse-clicks or menu items--whatever suits your program. + +You should also get your employer (if you work as a programmer) or your +school, if any, to sign a "copyright disclaimer" for the program, if +necessary. Here is a sample; alter the names: + + Yoyodyne, Inc., hereby disclaims all copyright interest in the program + `Gnomovision' (which makes passes at compilers) written by James Hacker. + + , 1 April 1989 + Ty Coon, President of Vice + +This General Public License does not permit incorporating your program into +proprietary programs. If your program is a subroutine library, you may +consider it more useful to permit linking proprietary applications with the +library. If this is what you want to do, use the GNU Library General +Public License instead of this License. diff --git a/README b/README new file mode 100644 index 000000000..ff9b64155 --- /dev/null +++ b/README @@ -0,0 +1,59 @@ +CONTENTS OF THIS FILE +--------------------- + +* About MontySolr +* Configuration and features +* Developing for MontySolr + +ABOUT MONTYSOLR +------------ + +MontySolr is an open source extension that makes it possible to include Python +code inside Solr (http://lucene.apache.org/solr). You can call Python routines +from the Java side, as well as control (most of the) Solr operations from the +Python side. + + +CONFIGURATION AND FEATURES +-------------------------- + +MontySolr (what you get when you download and extract montysolr-x.y.tgz) is only +an extension for Solr. 
You will need a separate Solr instance as well as a few
+dependencies to use MontySolr.
+
+
+More about configuration:
+ * Install, upgrade, and maintenance:
+   See INSTALL.txt in the same directory as this document.
+ * Learn about how to extend MontySolr:
+   See docs/technical-details.txt
+ * See also: https://svnweb.cern.ch/trac/rcarepo/wiki/MontySolr
+
+
+DEVELOPING FOR MONTYSOLR
+------------------------
+
+MontySolr provides a very simple API, and the layer between Solr and Python is
+intentionally kept minimal. In most cases you will simply use MontySolr
+as a communication layer between Solr and your own Python-written system. In that
+case you don't need to make any changes inside MontySolr; you will only write simple
+Python code that controls the business logic between Solr and Python.
+
+More about writing wrappers to call your Python system(s):
+ * Hello world example
+   See docs/hello-world.txt
+ * To understand details of the wrappers
+   See docs/how-to-wrap.txt
+
+
+
+If you need new functionality that is not present in MontySolr, search for
+existing solutions or discussion on the mailing list:
+
+ * Invenio Development Team
+
+
+
+ * For more information about developing
+   See docs/development.txt
+
diff --git a/build.properties.default b/build.properties.default
new file mode 100644
index 000000000..99d0cc46a
--- /dev/null
+++ b/build.properties.default
@@ -0,0 +1,41 @@
+
+# These are the main variables you may need to change for a successful build
+# of the montysolr modules. More detailed settings can be changed in the configuration
+# section of the build.xml file. Be careful to remove trailing whitespace.
+
+# based on the installed version of JCC, you have to select the
+# correct invocation
+
+
+jcc=jcc.__main__
+
+# folder where Solr lives
+solr_home=/x/dev/workspace/apache-solr-1.4.1
+
+# your custom solr configuration
+webdist=./example
+
+# python executable
+python=python
+
+# PYTHONPATH : the run-* tasks (eg. run-solr) start a JVM,
+# and that JVM has a Python inside. This Python interpreter
+# needs to know where to look for python packages.
+# By default, run-* tasks add the ./python and ./build/dist folders
+# to the path, but here you can add to or change them; this
+# PYTHONPATH will be prepended
+
+python_path=/opt/invenio/lib/python
+
+# Needed only if you want to regenerate the InvenioQueryParser syntax,
+# which is probably not what you need to do
+
+javacc.home=/some/path/javacc-5.0
diff --git a/build.xml b/build.xml
new file mode 100644
index 000000000..d394f2481
--- /dev/null
+++ b/build.xml
@@ -0,0 +1,728 @@
+ Java extensions for Invenio - Java search engine made python-friendly
+
+[build.xml: project definition, property/classpath setup and build targets -- the XML markup was stripped in extraction; only embedded text survives]
+
+ The properties of the project are not set correctly. Copy "build.properties.default" -> "build.properties" and edit the new file if necessary.
+ ${toString:montysolr.classpath}
+
+[further build.xml targets -- XML markup stripped in extraction; the surviving echo messages follow]
+
+Running montysolr as:
+========
+java -cp '${jcc_egg}/jcc/classes${path.separator}${classes.dir}${path.separator}${toString:montysolr.classpath}'
+     -Dsolr.solr.home=${webdist.home}/solr -Dsolr.data.dir=${webdist.home}/solr/data
+     -Djava.library.path=${jcc_egg}
+     ${env.MONTYSOLR_JVMARGS}
+     --webroot ${webdist.webapp}
+     --context /solr
+     ${env.MONTYSOLR_ARGS}
+
+ One or more of the JavaCC .jj files are newer than their corresponding
+ .java files. Run the "javacc" target to regenerate the artifacts.
+
+ ##################################################################
+ JavaCC not found.
+ JavaCC Home: ${javacc.home}
+ JavaCC JAR: ${javacc.jar}
+
+ Please download and install JavaCC from:
+
+ <http://javacc.dev.java.net>
+
+ Then, create a build.properties file either in your home
+ directory, or within the Lucene directory and set the javacc.home
+ property to the path where JavaCC is installed. For example,
+ if you installed JavaCC in /usr/local/java/javacc-3.2, then set the
+ javacc.home property to:
+
+ javacc.home=/usr/local/java/javacc-3.2
+
+ If you get an error like the one below, then you have not installed
+ things correctly. Please check all your paths and try again.
+
+ java.lang.NoClassDefFoundError: org.javacc.parser.Main
+ ##################################################################
+
+ The MontySolr example was assembled from the original Solr example
+ (${solr.home}/example)
+ See ${montysolr.home}/examples/README.txt for instructions on how to
+ run the demos.
+
+ Downloading solr ${solr.version} from ${solr.url}
+
+ Building the Solr example
+
diff --git a/common-build.xml b/common-build.xml
new file mode 100644
index 000000000..8c35de8c6
--- /dev/null
+++ b/common-build.xml
@@ -0,0 +1,42 @@
+
+ This file is designed for importing into a main build file, and not intended
+ for standalone use.
+[common-build.xml: shared macro definitions (including a download macro writing to @{dest}) -- XML markup stripped in extraction]
+
diff --git a/docs/development.txt b/docs/development.txt
new file mode 100644
index 000000000..e69de29bb
diff --git a/docs/hello-world.txt b/docs/hello-world.txt
new file mode 100644
index 000000000..e69de29bb
diff --git a/docs/how-to-wrap.txt b/docs/how-to-wrap.txt
new file mode 100644
index 000000000..e69de29bb
diff --git a/docs/technical-details.txt b/docs/technical-details.txt
new file mode 100644
index 000000000..09b53394b
--- /dev/null
+++ b/docs/technical-details.txt
@@ -0,0 +1,157 @@
+= Technical Details of the Solr-Python-Invenio integration =
+
+== Embedding Solr ==
+
+There are four ways that I found/considered to embed Solr (a sketch of the second one
+follows this list):
+
+ * org.apache.solr.client.solrj.embedded.EmbeddedSolrServer
+   - this is the default way, Solr is running as an embedded process, not inside a servlet container
+   - the parent process is responsible for querying Solr using the Solr API
+   - this is implemented in rca.solr.GetRecIds
+ * org.apache.solr.servlet.DirectSolrConnection
+   - this is like the above, but easier -- all the queries are sent as strings, everything is just a string.
+     This solution is very flexible and probably suitable for quick integration
+ * embed the servlet container (in this case Jetty)
+   - this is the craziest, but potentially the most powerful solution (so I went for it)
+   - we encapsulate the servlet container and let it run Solr as normal; everything is just as it would be
+     in a normal server deployment
+   - a test is implemented in rca.solr.JettyRunner
+ * embed the Python VM in the Java VM
+   - well, this is the best (not sure if crazy) solution, which Andi showed me after I asked
+   - it embeds the PythonVM inside Java and is therefore more straightforward, possibly cleaner than
+     communicating with Jetty from Python, because there I would have to devise ways to initiate
+     calls to Python from Java
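+For illustration, a minimal sketch of the second option driven from Python (assuming a JCC-built
+wrapper module named solr_java that exposes org.apache.solr.servlet.DirectSolrConnection; the
+module name and the paths are illustrative):
+
+ {{{
+ import solr_java
+
+ # boot the JVM embedded in this Python process
+ solr_java.initVM(solr_java.CLASSPATH)
+
+ # everything is just a string: path+params in, XML response out
+ conn = solr_java.DirectSolrConnection('/path/to/solr/home', '/path/to/solr/data')
+ print conn.request('/select?q=*:*', None)
+ }}}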
+== Servlet Jetty Embedding ==
+
+
+OK, so I had to discover these things (and who knows what else will have to be discovered):
+
+ - classloaders and servlets
+   - normally there are 3 class loaders in java (bootstrap, extension and system/application)
+   - class loaders ask their parents before loading new classes
+   - HOWEVER, for servlets this is not true; a servlet classloader can (and actually should) ignore
+     the classes already loaded by its parent class loaders -- well, jetty creates a new classloader
+     that does not have a parent classloader
+   - THEREFORE - my singletons were not absolute singletons
+   - There are two ways to solve this:
+     - tell Jetty to behave (like normal java) (I use this)
+     - register a parent classloader
+
+ - extra classes
+   - it is important that the montysolr singleton classes ARE NOT present in webapp/WEB-INF
+   - anything there might be loaded by that classloader separately, even if we specified the classpath
+     or configured jetty's classloader
+   - JUST KEEP montysolr classes OUT OF the webapp!
+
+
+=== Good Practice ===
+ - keep them separated: what belongs to Solr should live inside solr,
+   what belongs to Jetty lives inside webapp
+ - our code starts Jetty, Jetty reads Solr, Solr will load our classes, BUT we include things
+   in neither the solr nor the webapps folders!
+
+ - HOWEVER, certain special features must be activated in the solr configuration - so WE EDIT
+   solrconfig.xml
+
+
+
+=== TODO ===
+ - ~~run JettyRunner in Python~~
+ - organize imports (look at the solr distro -- jars from solr/dist should be included)
+ - fix build.xml
+ - can we wrap jetty start.jar? -- to do the hard job of setting the classpath, but also make sure
+   we can load our singleton?
+
+
+== Embed PythonVM ==
+
+ - JCC must be built in a shared mode (default)
+ - however, on Mac, LFLAGS must also contain '-framework Python', otherwise you get an error when
+   trying to start the PythonVM
+ {{{
+ System.loadLibrary("jcc");
+ Exception in thread "main" java.lang.UnsatisfiedLinkError:
+ /Library/Python/2.6/site-packages/JCC-2.6-py2.6-macosx-10.6-universal.egg/libjcc.dylib:
+ Symbol not found: _PyExc_RuntimeError
+
+Andi's explanation:
+That's because Python's shared library wasn't found. The reason is that, by default, Python's shared lib is not on JCC's link line because normally JCC is loaded into a Python process and the dynamic linker thus finds the symbols needed inside the process.
+
+Here, since you're not starting inside a Python process, you need to add '-framework Python' to JCC's LFLAGS in setup.py so that the dynamic linker can find the Python VM shared lib and load it.
+
+ }}}
+ - so change the JCC setup.py, or add LFLAGS and rebuild
+ {{{
+ export JCC_LFLAGS='-framework:JavaVM:-framework:Python'
+ python setup.py build
+ python setup.py install
+ }}}
+
+ - the extension that I build with JCC is also runnable from inside Python, therefore I can build one
+   extension and run java from python, or python from java (see the sketch below)
+ - to do that, I have to be careful to compile the wrapper only for chosen classes
+   - do not use --jar montysolr.jar for the compilation
+   - or make sure that some classes (namely org.apache.jcc.PythonVM) are not called from my public classes
+   - because during the build, the jcc compiler will try to load the extension which we are just compiling
+ - JCC does not like singletons (neither on the Python side, nor on the Java side - it was hanging at vm.acquireThreadState())
+ - it is possible to use Interfaces, however the wrapped java class must have the basic methods in itself
+   (not inherited; I tried that, even with protected fields, and it was throwing an UnsatisfiedLinkError)
+ - be careful to exclude *from solr* the classes that you build for your own extension (jcc will ignore
+   them if they were already built; I was getting a mysterious UnsatisfiedLinkError because they were
+   included from solr but I didn't know that)
+
+ - by default, lucene is built with --no-generics (and I was also building other modules without
+   generics); however, that makes it difficult to write things in Python and pass them to Java.
+   Therefore it is necessary to have the modules built with generics. MontySolr builds with generics
+   by default, but if lucene was not built like that, we probably cannot share them.
+
+   So to rectify that, build lucene with generics (an empty JCCFLAGS switches --no-generics off) -
+   worked fine on Mac 10.6 w/ Python 2.6 and WindowsXP w/ Python 2.5
+ {{{
+ export JCCFLAGS=
+ make
+ make install
+ }}}
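+To make the "one extension, two directions" point concrete, here is a minimal sketch of the
+Python-hosted direction (initVM and CLASSPATH are generated by JCC for every extension it builds;
+the module name montysolr and the maxheap value are illustrative):
+
+ {{{
+ import montysolr
+
+ # here Python hosts the JVM; in the embedded case it is the other way around
+ montysolr.initVM(montysolr.CLASSPATH, maxheap='1024m')
+
+ # wrapped Java classes are now regular attributes of the module
+ print montysolr.CLASSPATH
+ }}}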
+== Programming for Java ==
+
+ 1. Watch out for invisible differences between Python and Java wrapped objects:
+
+ {{{
+ dictname = String('dictname')  # a wrapped java.lang.String
+ dictname.__hash__()            # -> 1926449949
+ s = 'dictname'                 # a native Python str
+ s.__hash__()                   # -> 1024421145
+
+ # Python dictionary lookups go through __hash__, so this will
+ # not find anything
+ d = {'dictname': 1}
+ d[dictname]
+
+ # However, this will work
+ d[str(dictname)]
+ }}}
+
+ 2. Printing big objects will crash the VM (if there is not enough memory)
+
+ 3. If running invenio and using the MySQLdb extension - or basically any other
+    C extension - java must run in a compatible mode, which is usually 32bit:
+    java -d32 ....
+
+ 4. Invalid memory access error....
+
+    This error has two possible causes. Either you are trying to access a java native method that
+    does not exist -- JCC then helps you to find out by dying immediately.
+
+    The other cause is actually more tricky: it is the intbitset module of Invenio.
+    This module is written in C and in case of error it prints no diagnostic messages,
+    and an error inside the C extension brings down the whole JavaVM. To find
+    this kind of error, run the unittests for the python code.
+
+
+
+== TO PUT SOMEWHERE ==
+
+ start JettyRunner with -Dsolr.data.dir=/some/path/to/solr/data
+ - otherwise solr creates an empty index in ./solr/data
+ - this is configurable in solr/conf/solrconfig.xml
+
+
\ No newline at end of file
diff --git a/examples/README.txt b/examples/README.txt
new file mode 100644
index 000000000..46306e30f
--- /dev/null
+++ b/examples/README.txt
@@ -0,0 +1,6 @@
+This folder holds the configuration files of the various Solr examples/demos.
+The examples contain only the changed files; please run the following commands
+to assemble the full structure:
+
+ $ cd montysolr
+ $ ant examples-assemble
\ No newline at end of file
diff --git a/examples/invenio/etc/jetty.xml b/examples/invenio/etc/jetty.xml
new file mode 100755
index 000000000..ebf3eebde
--- /dev/null
+++ b/examples/invenio/etc/jetty.xml
@@ -0,0 +1,212 @@
+[Jetty server configuration: maxFormContentSize 1000000, thread pool (10 min / 50 max threads, 60s idle), connector limits, context deployer (/contexts), webapp deployer (/webapps with /etc/webdefault.xml) and a /yyyy_mm_dd.request.log request log kept for 90 days (GMT) -- XML markup stripped in extraction]
diff --git a/examples/invenio/etc/logging.properties b/examples/invenio/etc/logging.properties
new file mode 100644
index 000000000..6a54f49d7
--- /dev/null
+++ b/examples/invenio/etc/logging.properties
@@ -0,0 +1,12 @@
+# Default global logging level:
+.level= SEVERE
+
+# Write to a file:
+handlers= java.util.logging.FileHandler
+
+# Write log messages in XML format:
+#java.util.logging.FileHandler.formatter = java.util.logging.XMLFormatter
+java.util.logging.FileHandler.formatter = java.util.logging.SimpleFormatter
+
+# Log to the current working directory, with log files named solrxxx.log
+java.util.logging.FileHandler.pattern = /tmp/solr-test-%u.log
\ No newline at end of file
diff --git a/examples/invenio/etc/webdefault.xml b/examples/invenio/etc/webdefault.xml
new file mode 100644
index 000000000..66bbdd5f8
--- /dev/null
+++ b/examples/invenio/etc/webdefault.xml
@@ -0,0 +1,379 @@
+ Default web.xml file.
+ This file is applied to a Web application before its own WEB-INF/web.xml file
+
+[Jetty default webapp descriptor: NoTLDJarPattern, the default servlet (acceptRanges, dirAllowed, cache sizes, gzip off), the jsp servlet (*.jsp etc.), the invoker servlet (/servlet/*), a 30-minute session timeout, welcome files (index.html, index.htm, index.jsp) and per-locale encoding mappings -- XML markup stripped in extraction]
diff --git a/examples/invenio/solr/conf/admin-extra.html b/examples/invenio/solr/conf/admin-extra.html
new file mode 100755
index 000000000..aa739da86
--- /dev/null
+++ b/examples/invenio/solr/conf/admin-extra.html
@@ -0,0 +1,31 @@
+[commented-out hooks for extending the Solr admin page -- markup stripped in extraction]
diff --git a/examples/invenio/solr/conf/data-config-test-java.xml b/examples/invenio/solr/conf/data-config-test-java.xml
new file mode 100644
index 000000000..3dec8773e
--- /dev/null
+++ b/examples/invenio/solr/conf/data-config-test-java.xml
@@ -0,0 +1,53 @@
+[DataImportHandler data config used by the Java tests -- XML markup stripped in extraction]
\ No newline at end of file
diff --git a/examples/invenio/solr/conf/data-config.xml b/examples/invenio/solr/conf/data-config.xml
new file mode 100644
index 000000000..3a0080d5a
--- /dev/null
+++ b/examples/invenio/solr/conf/data-config.xml
@@ -0,0 +1,91 @@
+[DataImportHandler data config -- XML markup stripped in extraction]
\ No newline at end of file
diff --git a/examples/invenio/solr/conf/elevate.xml b/examples/invenio/solr/conf/elevate.xml
new file mode 100755
index 000000000..9b4caec69
--- /dev/null
+++ b/examples/invenio/solr/conf/elevate.xml
@@ -0,0 +1,36 @@
+[query elevation examples from the stock Solr configuration -- XML markup stripped in extraction]
diff --git a/examples/invenio/solr/conf/mapping-ISOLatin1Accent.txt b/examples/invenio/solr/conf/mapping-ISOLatin1Accent.txt
new file mode 100755
index 000000000..ede774258
--- /dev/null
+++ b/examples/invenio/solr/conf/mapping-ISOLatin1Accent.txt
@@ -0,0 +1,246 @@
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License");
you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +# Syntax: +# "source" => "target" +# "source".length() > 0 (source cannot be empty.) +# "target".length() >= 0 (target can be empty.) + +# example: +# "À" => "A" +# "\u00C0" => "A" +# "\u00C0" => "\u0041" +# "ß" => "ss" +# "\t" => " " +# "\n" => "" + +# À => A +"\u00C0" => "A" + +# Á => A +"\u00C1" => "A" + +#  => A +"\u00C2" => "A" + +# à => A +"\u00C3" => "A" + +# Ä => A +"\u00C4" => "A" + +# Å => A +"\u00C5" => "A" + +# Æ => AE +"\u00C6" => "AE" + +# Ç => C +"\u00C7" => "C" + +# È => E +"\u00C8" => "E" + +# É => E +"\u00C9" => "E" + +# Ê => E +"\u00CA" => "E" + +# Ë => E +"\u00CB" => "E" + +# Ì => I +"\u00CC" => "I" + +# Í => I +"\u00CD" => "I" + +# Î => I +"\u00CE" => "I" + +# Ï => I +"\u00CF" => "I" + +# IJ => IJ +"\u0132" => "IJ" + +# Ð => D +"\u00D0" => "D" + +# Ñ => N +"\u00D1" => "N" + +# Ò => O +"\u00D2" => "O" + +# Ó => O +"\u00D3" => "O" + +# Ô => O +"\u00D4" => "O" + +# Õ => O +"\u00D5" => "O" + +# Ö => O +"\u00D6" => "O" + +# Ø => O +"\u00D8" => "O" + +# Œ => OE +"\u0152" => "OE" + +# Þ +"\u00DE" => "TH" + +# Ù => U +"\u00D9" => "U" + +# Ú => U +"\u00DA" => "U" + +# Û => U +"\u00DB" => "U" + +# Ü => U +"\u00DC" => "U" + +# Ý => Y +"\u00DD" => "Y" + +# Ÿ => Y +"\u0178" => "Y" + +# à => a +"\u00E0" => "a" + +# á => a +"\u00E1" => "a" + +# â => a +"\u00E2" => "a" + +# ã => a +"\u00E3" => "a" + +# ä => a +"\u00E4" => "a" + +# å => a +"\u00E5" => "a" + +# æ => ae +"\u00E6" => "ae" + +# ç => c +"\u00E7" => "c" + +# è => e +"\u00E8" => "e" + +# é => e +"\u00E9" => "e" + +# ê => e +"\u00EA" => "e" + +# ë => e +"\u00EB" => "e" + +# ì => i +"\u00EC" => "i" + +# í => i +"\u00ED" => "i" + +# î => i +"\u00EE" => "i" + +# ï => i +"\u00EF" => "i" + +# ij => ij +"\u0133" => "ij" + +# ð => d +"\u00F0" => "d" + +# ñ => n +"\u00F1" => "n" + +# ò => o +"\u00F2" => "o" + +# ó => o +"\u00F3" => "o" + +# ô => o +"\u00F4" => "o" + +# õ => o +"\u00F5" => "o" + +# ö => o +"\u00F6" => "o" + +# ø => o +"\u00F8" => "o" + +# œ => oe +"\u0153" => "oe" + +# ß => ss +"\u00DF" => "ss" + +# þ => th +"\u00FE" => "th" + +# ù => u +"\u00F9" => "u" + +# ú => u +"\u00FA" => "u" + +# û => u +"\u00FB" => "u" + +# ü => u +"\u00FC" => "u" + +# ý => y +"\u00FD" => "y" + +# ÿ => y +"\u00FF" => "y" + +# ff => ff +"\uFB00" => "ff" + +# fi => fi +"\uFB01" => "fi" + +# fl => fl +"\uFB02" => "fl" + +# ffi => ffi +"\uFB03" => "ffi" + +# ffl => ffl +"\uFB04" => "ffl" + +# ſt => ft +"\uFB05" => "ft" + +# st => st +"\uFB06" => "st" diff --git a/examples/invenio/solr/conf/protwords.txt b/examples/invenio/solr/conf/protwords.txt new file mode 100755 index 000000000..1dfc0abec --- /dev/null +++ b/examples/invenio/solr/conf/protwords.txt @@ -0,0 +1,21 @@ +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. 
You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+#-----------------------------------------------------------------------
+# Use a protected word file to protect against the stemmer reducing two
+# unrelated words to the same base word.
+
+# Some non-words that normally won't be encountered,
+# just to test that they won't be stemmed.
+dontstems
+zwhacky
+
diff --git a/examples/invenio/solr/conf/schema.xml b/examples/invenio/solr/conf/schema.xml
new file mode 100755
index 000000000..3b63345be
--- /dev/null
+++ b/examples/invenio/solr/conf/schema.xml
@@ -0,0 +1,666 @@
+[Solr schema: field type and field definitions, with unique key "id" and default search field "all" -- XML markup stripped in extraction]
diff --git a/examples/invenio/solr/conf/scripts.conf b/examples/invenio/solr/conf/scripts.conf
new file mode 100755
index 000000000..f58b262ae
--- /dev/null
+++ b/examples/invenio/solr/conf/scripts.conf
@@ -0,0 +1,24 @@
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+user=
+solr_hostname=localhost
+solr_port=8983
+rsyncd_port=18983
+data_dir=
+webapp_name=solr
+master_host=
+master_data_dir=
+master_status_dir=
diff --git a/examples/invenio/solr/conf/solrconfig.xml b/examples/invenio/solr/conf/solrconfig.xml
new file mode 100755
index 000000000..66ebbbdf2
--- /dev/null
+++ b/examples/invenio/solr/conf/solrconfig.xml
@@ -0,0 +1,1106 @@
+[solrconfig.xml: the stock Solr 1.4 example configuration (caches, standard/dismax request handlers, spellcheck, term vector, clustering, elevation, replication, admin and highlighting sections) extended for MontySolr: an "invenio" search handler whose defaults use the iq query parser (iq.mode=maxinv, default field fulltext, operator AND), an invenio-formatter component in the component chain, firstSearcher warming queries such as "solr rocks" and {!iq iq.mode=maxinv}refersto:recid:... / citedby:recid:..., and three DataImportHandler instances reading data-config.xml and data-config-test-java.xml -- XML markup stripped in extraction]
diff --git a/examples/invenio/solr/conf/spellings.txt b/examples/invenio/solr/conf/spellings.txt
new file mode 100755
index 000000000..d7ede6f56
--- /dev/null
+++ b/examples/invenio/solr/conf/spellings.txt
@@ -0,0 +1,2 @@
+pizza
+history
\ No newline at end of file
diff --git a/examples/invenio/solr/conf/stopwords.txt b/examples/invenio/solr/conf/stopwords.txt
new file mode 100755
index 000000000..b5824da32
--- /dev/null
+++ b/examples/invenio/solr/conf/stopwords.txt
@@ -0,0 +1,58 @@
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +#----------------------------------------------------------------------- +# a couple of test stopwords to test that the words are really being +# configured from this file: +stopworda +stopwordb + +#Standard english stop words taken from Lucene's StopAnalyzer +a +an +and +are +as +at +be +but +by +for +if +in +into +is +it +no +not +of +on +or +s +such +t +that +the +their +then +there +these +they +this +to +was +will +with + diff --git a/examples/invenio/solr/conf/synonyms.txt b/examples/invenio/solr/conf/synonyms.txt new file mode 100755 index 000000000..b0e31cb7e --- /dev/null +++ b/examples/invenio/solr/conf/synonyms.txt @@ -0,0 +1,31 @@ +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +#----------------------------------------------------------------------- +#some test synonym mappings unlikely to appear in real input text +aaa => aaaa +bbb => bbbb1 bbbb2 +ccc => cccc1,cccc2 +a\=>a => b\=>b +a\,a => b\,b +fooaaa,baraaa,bazaaa + +# Some synonym groups specific to this example +GB,gib,gigabyte,gigabytes +MB,mib,megabyte,megabytes +Television, Televisions, TV, TVs +#notice we use "gib" instead of "GiB" so any WordDelimiterFilter coming +#after us won't split it into two words. + +# Synonym mappings can be used for spelling correction too +pixima => pixma + diff --git a/examples/invenio/solr/conf/xslt/example.xsl b/examples/invenio/solr/conf/xslt/example.xsl new file mode 100755 index 000000000..6832a1d4c --- /dev/null +++ b/examples/invenio/solr/conf/xslt/example.xsl @@ -0,0 +1,132 @@ + + + + + + + + + + + + + + + <xsl:value-of select="$title"/> + + + +

+
+ This has been formatted by the sample "example.xsl" transform - + use your own XSLT to get a nicer page +
+ + + +
+ + + +
+ + + + +
+
+
+ + + + + + + + + + + + + + javascript:toggle("");? +
+ + exp + + + + + +
+ + +
+ + + + + + + +
    + +
  • +
    +
+ + +
+ + + + + + + + + + + + + + + + + + + + +
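
The stylesheet above is applied server-side by Solr's XSLT response writer; adding wt=xslt&tr=example.xsl to a query URL returns the rendered HTML page instead of raw XML. A minimal sketch of checking that from Java -- the host, port and query string are assumptions taken from this example setup:

{{{
import java.net.URL;
import org.apache.commons.io.IOUtils;

// Hedged sketch: fetch search results rendered through example.xsl by
// Solr's XSLT response writer. localhost:8983/solr is assumed from the
// example configuration above.
public class XsltResponseCheck {
    public static void main(String[] args) throws Exception {
        String html = IOUtils.toString(new URL(
            "http://localhost:8983/solr/select?q=*:*&wt=xslt&tr=example.xsl")
            .openStream());
        System.out.println(html.contains("<html") ? "rendered OK" : html);
    }
}
}}}
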
diff --git a/examples/invenio/solr/conf/xslt/example_atom.xsl b/examples/invenio/solr/conf/xslt/example_atom.xsl new file mode 100755 index 000000000..e1c7d5a2a --- /dev/null +++ b/examples/invenio/solr/conf/xslt/example_atom.xsl @@ -0,0 +1,67 @@ + + + + + + + + + + + + + + Example Solr Atom 1.0 Feed + + This has been formatted by the sample "example_atom.xsl" transform - + use your own XSLT to get a nicer Atom feed. + + + Apache Solr + solr-user@lucene.apache.org + + + + + + tag:localhost,2007:example + + + + + + + + + <xsl:value-of select="str[@name='name']"/> + + tag:localhost,2007: + + + + + + diff --git a/examples/invenio/solr/conf/xslt/example_rss.xsl b/examples/invenio/solr/conf/xslt/example_rss.xsl new file mode 100755 index 000000000..3e09e654d --- /dev/null +++ b/examples/invenio/solr/conf/xslt/example_rss.xsl @@ -0,0 +1,66 @@ + + + + + + + + + + + + + Example Solr RSS 2.0 Feed + http://localhost:8983/solr + + This has been formatted by the sample "example_rss.xsl" transform - + use your own XSLT to get a nicer RSS feed. + + en-us + http://localhost:8983/solr + + + + + + + + + + + <xsl:value-of select="str[@name='name']"/> + + http://localhost:8983/solr/select?q=id: + + + + + + + http://localhost:8983/solr/select?q=id: + + + + diff --git a/examples/invenio/solr/conf/xslt/luke.xsl b/examples/invenio/solr/conf/xslt/luke.xsl new file mode 100755 index 000000000..6e9a064d7 --- /dev/null +++ b/examples/invenio/solr/conf/xslt/luke.xsl @@ -0,0 +1,337 @@ + + + + + + + + + Solr Luke Request Handler Response + + + + + + + + + <xsl:value-of select="$title"/> + + + + + +

+ +

+
+ +
+ +

Index Statistics

+ +
+ +

Field Statistics

+ + + +

Document Statistics

+ + + + +
+ + + + + +
+ +
+ + +
+ +
+ +
+
+
+ + + + + + + + + + + + + + + + + + + + + +
+

+ +

+ +
+ +
+
+
+ + +
+ + 50 + 800 + 160 + blue + +
+
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ +
+ background-color: ; width: px; height: px; +
+
+ +
+
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
    + +
  • + +
  • +
    +
+ + +
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + 1 + + + + + + + + - + + - + + - + + - + + - + + - + + - + + - + + - + + - + + - + + - + + - + + + + + + + + + + + + + + + + + +
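
luke.xsl formats the XML emitted by Solr's LukeRequestHandler. To see the raw statistics the stylesheet renders, the handler can be queried directly; the /solr/admin/luke path is the stock mapping and an assumption here:

{{{
import java.net.URL;
import org.apache.commons.io.IOUtils;

// Hedged sketch: pull the raw index statistics that luke.xsl turns into
// the "Index Statistics" table. numTerms=0 skips the per-field top terms
// to keep the response small.
public class LukeStatsCheck {
    public static void main(String[] args) throws Exception {
        String xml = IOUtils.toString(new URL(
            "http://localhost:8983/solr/admin/luke?numTerms=0").openStream());
        System.out.println(xml.contains("numDocs") ? "luke handler alive" : xml);
    }
}
}}}
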
diff --git a/examples/invenio/solr/conf/xslt/twitter.xsl b/examples/invenio/solr/conf/xslt/twitter.xsl new file mode 100755 index 000000000..0bc3e4087 --- /dev/null +++ b/examples/invenio/solr/conf/xslt/twitter.xsl @@ -0,0 +1,140 @@ + + + + + + + + + + + + + + + <xsl:value-of select="$title"/> + + + +

+
+ Demo Twitter indexing with Python +
+
+ + + + + + + +
+ + + +
+ + + +
+ + + + +
+
+
+ + + + + + + + + + + + + + javascript:toggle("");? +
+ + exp + + + + + +
+ + +
+ + + + + + + +
    + +
  • +
    +
+ + +
+ + + + + + + + + + + + + + + + + + + + +
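
twitter.xsl renders the output of the Twitter demo, in which TwitterAPIHandler (defined later in this patch) fetches new tweets for the search term before results are returned. A hedged sketch of driving it over HTTP -- the qt=twitter_api handler name mirrors the "twitter_api" receiver used by that class and is an assumption:

{{{
import java.net.URL;
import java.net.URLEncoder;
import org.apache.commons.io.IOUtils;

// Hedged sketch: run a search through the Twitter demo handler and have
// the response rendered by twitter.xsl. The handler name "twitter_api"
// is an assumption mirroring TwitterAPIHandler below.
public class TwitterDemoQuery {
    public static void main(String[] args) throws Exception {
        String q = URLEncoder.encode("montysolr", "UTF-8");
        String html = IOUtils.toString(new URL(
            "http://localhost:8983/solr/select?q=" + q
            + "&qt=twitter_api&wt=xslt&tr=twitter.xsl").openStream());
        System.out.println(html);
    }
}
}}}
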
diff --git a/examples/twitter/etc/jetty.xml b/examples/twitter/etc/jetty.xml new file mode 100755 index 000000000..ebf3eebde --- /dev/null +++ b/examples/twitter/etc/jetty.xml @@ -0,0 +1,212 @@ + + + + + + + + + + + + + + + + + org.mortbay.jetty.Request.maxFormContentSize + 1000000 + + + + + + + + + 10 + 50 + 60 + + + + + + + + + + + + + + + + + + + + + + 50000 + 1500 + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + /contexts + 1 + + + + + + + + + + + + + + + + + + + + + + /webapps + false + true + false + /etc/webdefault.xml + + + + + + + + + + + + + + + + + + + + + + + + + + + + /yyyy_mm_dd.request.log + 90 + true + false + GMT + + + + + + + + true + + true + + + diff --git a/examples/twitter/etc/logging.properties b/examples/twitter/etc/logging.properties new file mode 100644 index 000000000..6a54f49d7 --- /dev/null +++ b/examples/twitter/etc/logging.properties @@ -0,0 +1,12 @@ +# Default global logging level: +.level= SEVERE + +# Write to a file: +handlers= java.util.logging.FileHandler + +# Write log messages in XML format: +#java.util.logging.FileHandler.formatter = java.util.logging.XMLFormatter +java.util.logging.FileHandler.formatter = java.util.logging.SimpleFormatter + +# Log to the current working directory, with log files named solrxxx.log +java.util.logging.FileHandler.pattern = /tmp/solr-test-%u.log \ No newline at end of file diff --git a/examples/twitter/etc/webdefault.xml b/examples/twitter/etc/webdefault.xml new file mode 100644 index 000000000..66bbdd5f8 --- /dev/null +++ b/examples/twitter/etc/webdefault.xml @@ -0,0 +1,379 @@ + + + + + + + + + + + + + + + + + + + + + + + Default web.xml file. + This file is applied to a Web application before it's own WEB_INF/web.xml file + + + + + + + + + + org.mortbay.jetty.webapp.NoTLDJarPattern + start.jar|ant-.*\.jar|dojo-.*\.jar|jetty-.*\.jar|jsp-api-.*\.jar|junit-.*\.jar|servlet-api-.*\.jar|dnsns\.jar|rt\.jar|jsse\.jar|tools\.jar|sunpkcs11\.jar|sunjce_provider\.jar|xerces.*\.jar| + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + default + org.mortbay.jetty.servlet.DefaultServlet + + acceptRanges + true + + + dirAllowed + true + + + redirectWelcome + false + + + maxCacheSize + 2000000 + + + maxCachedFileSize + 254000 + + + maxCachedFiles + 1000 + + + gzip + false + + + useFileMappedBuffer + false + + + 0 + + + default / + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + jsp + org.apache.jasper.servlet.JspServlet + + logVerbosityLevel + WARNING + + + fork + false + + + xpoweredBy + false + + + 0 + + + + jsp + *.jsp + *.jspf + *.jspx + *.xsp + *.JSP + *.JSPF + *.JSPX + *.XSP + + + + + + + + + + + + + + + + + + + + + + + + invoker + org.mortbay.jetty.servlet.Invoker + + verbose + true + + + nonContextServlets + true + + + dynamicParam + anyValue + + 1 + + + invoker /servlet/* + + + + + + + 30 + + + + + + + + + + + + + index.html + index.htm + index.jsp + + + + + arISO-8859-6 + beISO-8859-5 + bgISO-8859-5 + caISO-8859-1 + csISO-8859-2 + daISO-8859-1 + deISO-8859-1 + elISO-8859-7 + enISO-8859-1 + esISO-8859-1 + etISO-8859-1 + fiISO-8859-1 + frISO-8859-1 + hrISO-8859-2 + huISO-8859-2 + isISO-8859-1 + itISO-8859-1 + iwISO-8859-8 + jaShift_JIS + koEUC-KR + ltISO-8859-2 + lvISO-8859-2 + mkISO-8859-5 + nlISO-8859-1 + noISO-8859-1 + plISO-8859-2 + ptISO-8859-1 + roISO-8859-2 + ruISO-8859-5 + shISO-8859-5 + skISO-8859-2 + 
slISO-8859-2 + sqISO-8859-2 + srISO-8859-5 + svISO-8859-1 + trISO-8859-9 + ukISO-8859-5 + zhGB2312 + zh_TWBig5 + + + + + + diff --git a/examples/twitter/solr/conf/admin-extra.html b/examples/twitter/solr/conf/admin-extra.html new file mode 100755 index 000000000..aa739da86 --- /dev/null +++ b/examples/twitter/solr/conf/admin-extra.html @@ -0,0 +1,31 @@ + + + diff --git a/examples/twitter/solr/conf/data-config-test-java.xml b/examples/twitter/solr/conf/data-config-test-java.xml new file mode 100644 index 000000000..3dec8773e --- /dev/null +++ b/examples/twitter/solr/conf/data-config-test-java.xml @@ -0,0 +1,53 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/examples/twitter/solr/conf/data-config.xml b/examples/twitter/solr/conf/data-config.xml new file mode 100644 index 000000000..3a0080d5a --- /dev/null +++ b/examples/twitter/solr/conf/data-config.xml @@ -0,0 +1,91 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/examples/twitter/solr/conf/elevate.xml b/examples/twitter/solr/conf/elevate.xml new file mode 100755 index 000000000..9b4caec69 --- /dev/null +++ b/examples/twitter/solr/conf/elevate.xml @@ -0,0 +1,36 @@ + + + + + + + + + + + + + + + + + + diff --git a/examples/twitter/solr/conf/mapping-ISOLatin1Accent.txt b/examples/twitter/solr/conf/mapping-ISOLatin1Accent.txt new file mode 100755 index 000000000..ede774258 --- /dev/null +++ b/examples/twitter/solr/conf/mapping-ISOLatin1Accent.txt @@ -0,0 +1,246 @@ +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +# Syntax: +# "source" => "target" +# "source".length() > 0 (source cannot be empty.) +# "target".length() >= 0 (target can be empty.) 
+ +# example: +# "À" => "A" +# "\u00C0" => "A" +# "\u00C0" => "\u0041" +# "ß" => "ss" +# "\t" => " " +# "\n" => "" + +# À => A +"\u00C0" => "A" + +# Á => A +"\u00C1" => "A" + +#  => A +"\u00C2" => "A" + +# à => A +"\u00C3" => "A" + +# Ä => A +"\u00C4" => "A" + +# Å => A +"\u00C5" => "A" + +# Æ => AE +"\u00C6" => "AE" + +# Ç => C +"\u00C7" => "C" + +# È => E +"\u00C8" => "E" + +# É => E +"\u00C9" => "E" + +# Ê => E +"\u00CA" => "E" + +# Ë => E +"\u00CB" => "E" + +# Ì => I +"\u00CC" => "I" + +# Í => I +"\u00CD" => "I" + +# Î => I +"\u00CE" => "I" + +# Ï => I +"\u00CF" => "I" + +# IJ => IJ +"\u0132" => "IJ" + +# Ð => D +"\u00D0" => "D" + +# Ñ => N +"\u00D1" => "N" + +# Ò => O +"\u00D2" => "O" + +# Ó => O +"\u00D3" => "O" + +# Ô => O +"\u00D4" => "O" + +# Õ => O +"\u00D5" => "O" + +# Ö => O +"\u00D6" => "O" + +# Ø => O +"\u00D8" => "O" + +# Œ => OE +"\u0152" => "OE" + +# Þ +"\u00DE" => "TH" + +# Ù => U +"\u00D9" => "U" + +# Ú => U +"\u00DA" => "U" + +# Û => U +"\u00DB" => "U" + +# Ü => U +"\u00DC" => "U" + +# Ý => Y +"\u00DD" => "Y" + +# Ÿ => Y +"\u0178" => "Y" + +# à => a +"\u00E0" => "a" + +# á => a +"\u00E1" => "a" + +# â => a +"\u00E2" => "a" + +# ã => a +"\u00E3" => "a" + +# ä => a +"\u00E4" => "a" + +# å => a +"\u00E5" => "a" + +# æ => ae +"\u00E6" => "ae" + +# ç => c +"\u00E7" => "c" + +# è => e +"\u00E8" => "e" + +# é => e +"\u00E9" => "e" + +# ê => e +"\u00EA" => "e" + +# ë => e +"\u00EB" => "e" + +# ì => i +"\u00EC" => "i" + +# í => i +"\u00ED" => "i" + +# î => i +"\u00EE" => "i" + +# ï => i +"\u00EF" => "i" + +# ij => ij +"\u0133" => "ij" + +# ð => d +"\u00F0" => "d" + +# ñ => n +"\u00F1" => "n" + +# ò => o +"\u00F2" => "o" + +# ó => o +"\u00F3" => "o" + +# ô => o +"\u00F4" => "o" + +# õ => o +"\u00F5" => "o" + +# ö => o +"\u00F6" => "o" + +# ø => o +"\u00F8" => "o" + +# œ => oe +"\u0153" => "oe" + +# ß => ss +"\u00DF" => "ss" + +# þ => th +"\u00FE" => "th" + +# ù => u +"\u00F9" => "u" + +# ú => u +"\u00FA" => "u" + +# û => u +"\u00FB" => "u" + +# ü => u +"\u00FC" => "u" + +# ý => y +"\u00FD" => "y" + +# ÿ => y +"\u00FF" => "y" + +# ff => ff +"\uFB00" => "ff" + +# fi => fi +"\uFB01" => "fi" + +# fl => fl +"\uFB02" => "fl" + +# ffi => ffi +"\uFB03" => "ffi" + +# ffl => ffl +"\uFB04" => "ffl" + +# ſt => ft +"\uFB05" => "ft" + +# st => st +"\uFB06" => "st" diff --git a/examples/twitter/solr/conf/protwords.txt b/examples/twitter/solr/conf/protwords.txt new file mode 100755 index 000000000..1dfc0abec --- /dev/null +++ b/examples/twitter/solr/conf/protwords.txt @@ -0,0 +1,21 @@ +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +#----------------------------------------------------------------------- +# Use a protected word file to protect against the stemmer reducing two +# unrelated words to the same base word. + +# Some non-words that normally won't be encountered, +# just to test that they won't be stemmed. 
+dontstems +zwhacky + diff --git a/examples/twitter/solr/conf/schema.xml b/examples/twitter/solr/conf/schema.xml new file mode 100755 index 000000000..8c5ad68dc --- /dev/null +++ b/examples/twitter/solr/conf/schema.xml @@ -0,0 +1,666 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + id + + + all + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + diff --git a/examples/twitter/solr/conf/scripts.conf b/examples/twitter/solr/conf/scripts.conf new file mode 100755 index 000000000..f58b262ae --- /dev/null +++ b/examples/twitter/solr/conf/scripts.conf @@ -0,0 +1,24 @@ +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +user= +solr_hostname=localhost +solr_port=8983 +rsyncd_port=18983 +data_dir= +webapp_name=solr +master_host= +master_data_dir= +master_status_dir= diff --git a/examples/twitter/solr/conf/solrconfig.xml b/examples/twitter/solr/conf/solrconfig.xml new file mode 100755 index 000000000..023e94490 --- /dev/null +++ b/examples/twitter/solr/conf/solrconfig.xml @@ -0,0 +1,1106 @@ + + + + + + ${solr.abortOnConfigurationError:true} + + + + + + + + + + + + + + + + ${solr.data.dir:./solr/data} + + + + + + false + + 10 + + + + + 32 + + 10000 + 1000 + 10000 + + + + + + + + + + + + + native + + + + + + + false + 32 + 10 + + + + + + + + false + + + true + + + + + + + + 1 + + 0 + + + + + false + + + + + + + + + + + + + + + + + + + + + + + + + + + 1024 + + + + + + + + + + + + + + + + true + + + + + + + + 20 + + + 200 + + + + + + + + + + + + + solr rocks010 + static firstSearcher warming query from solrconfig.xml + {!iq iq.mode=maxinv}refersto:recid:100010invenio + {!iq iq.mode=maxinv}citedby:recid:100010invenio + + + + + false + + + 2 + + + + + + + + + + + + + + + + + + + + + + + explicit + + + + + + + + + + + query + invenio-formatter + facet + mlt + highlight + stats + debug + + + + iq + explicit + invenio + maxinv + fulltext + AND + + + + + + + + + + + + + dismax + explicit + 0.01 + + text^0.5 features^1.0 name^1.2 sku^1.5 id^10.0 manu^1.1 cat^1.4 + + + text^0.2 features^1.1 name^1.5 manu^1.4 manu_exact^1.9 + + + popularity^0.5 recip(price,1,1000,1000)^0.3 + + + id,name,price,score + + + 2<-1 5<-2 6<90% + + 100 + *:* + + text features name + + 0 + + name + regex + + + + + + + dismax + explicit + text^0.5 features^1.0 name^1.2 sku^1.5 id^10.0 + 2<-1 5<-2 6<90% + + incubationdate_dt:[* TO NOW/DAY-1MONTH]^2.2 + + + + inStock:true + + + + cat + manu_exact + price:[* TO 500] + price:[500 TO *] + + + + + + + + + + textSpell + + + default + name + ./spellchecker + + + + + + + + + + + + + + + + false + + false + + 1 + + + spellcheck + + + + + + + + true + + + tvComponent + + + + + + + + + default + + org.carrot2.clustering.lingo.LingoClusteringAlgorithm + + 20 + + + stc + org.carrot2.clustering.stc.STCClusteringAlgorithm + + + + + true + default + true + + name + id + + features + + true + + + + false + + + clusteringComponent + + + + + + + + text + true + ignored_ + + + true + links + ignored_ + + + + + + + + + + true + + + termsComponent + + + + + + + + string + elevate.xml + + + + + + explicit + + + elevator + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + standard + solrpingquery + all + + + + + + + explicit + true + + + + + + + + + 100 + + + + + + + + 70 + + 0.5 + + [-\w ,/\n\"']{20,200} + + + + + + + ]]> + ]]> + + + + + + + + + + + + + + 5 + + + + + + + + + + + + + solr + + + + + + + + + data-config.xml + false + false + + + + + data-config.xml + false + false + + + + + data-config-test-java.xml + false + false + + + + + + diff --git a/examples/twitter/solr/conf/spellings.txt b/examples/twitter/solr/conf/spellings.txt new file mode 100755 index 000000000..d7ede6f56 --- /dev/null +++ b/examples/twitter/solr/conf/spellings.txt @@ -0,0 +1,2 @@ +pizza +history \ No newline at end of file diff --git a/examples/twitter/solr/conf/stopwords.txt b/examples/twitter/solr/conf/stopwords.txt new file mode 100755 index 000000000..b5824da32 --- /dev/null +++ b/examples/twitter/solr/conf/stopwords.txt @@ -0,0 +1,58 @@ +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. 
See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +#----------------------------------------------------------------------- +# a couple of test stopwords to test that the words are really being +# configured from this file: +stopworda +stopwordb + +#Standard english stop words taken from Lucene's StopAnalyzer +a +an +and +are +as +at +be +but +by +for +if +in +into +is +it +no +not +of +on +or +s +such +t +that +the +their +then +there +these +they +this +to +was +will +with + diff --git a/examples/twitter/solr/conf/synonyms.txt b/examples/twitter/solr/conf/synonyms.txt new file mode 100755 index 000000000..b0e31cb7e --- /dev/null +++ b/examples/twitter/solr/conf/synonyms.txt @@ -0,0 +1,31 @@ +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +#----------------------------------------------------------------------- +#some test synonym mappings unlikely to appear in real input text +aaa => aaaa +bbb => bbbb1 bbbb2 +ccc => cccc1,cccc2 +a\=>a => b\=>b +a\,a => b\,b +fooaaa,baraaa,bazaaa + +# Some synonym groups specific to this example +GB,gib,gigabyte,gigabytes +MB,mib,megabyte,megabytes +Television, Televisions, TV, TVs +#notice we use "gib" instead of "GiB" so any WordDelimiterFilter coming +#after us won't split it into two words. + +# Synonym mappings can be used for spelling correction too +pixima => pixma + diff --git a/examples/twitter/solr/conf/xslt/example.xsl b/examples/twitter/solr/conf/xslt/example.xsl new file mode 100755 index 000000000..6832a1d4c --- /dev/null +++ b/examples/twitter/solr/conf/xslt/example.xsl @@ -0,0 +1,132 @@ + + + + + + + + + + + + + + + <xsl:value-of select="$title"/> + + + +

+
+ This has been formatted by the sample "example.xsl" transform - + use your own XSLT to get a nicer page +
+ + + +
+ + + +
+ + + + +
+
+
+ + + + + + + + + + + + + + javascript:toggle("");? +
+ + exp + + + + + +
+ + +
+ + + + + + + +
    + +
  • +
    +
+ + +
+ + + + + + + + + + + + + + + + + + + + +
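
The same stylesheet can also be exercised offline with plain JAXP, which is convenient when editing it without a running Solr. The file paths below are assumptions; response.xml would be captured from a normal query:

{{{
import java.io.File;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hedged sketch: apply example.xsl to a saved Solr XML response and
// write the rendered HTML to a file.
public class OfflineXslt {
    public static void main(String[] args) throws Exception {
        Transformer t = TransformerFactory.newInstance().newTransformer(
            new StreamSource(new File("solr/conf/xslt/example.xsl")));
        t.transform(new StreamSource(new File("response.xml")),
                    new StreamResult(new File("response.html")));
    }
}
}}}
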
diff --git a/examples/twitter/solr/conf/xslt/example_atom.xsl b/examples/twitter/solr/conf/xslt/example_atom.xsl new file mode 100755 index 000000000..e1c7d5a2a --- /dev/null +++ b/examples/twitter/solr/conf/xslt/example_atom.xsl @@ -0,0 +1,67 @@ + + + + + + + + + + + + + + Example Solr Atom 1.0 Feed + + This has been formatted by the sample "example_atom.xsl" transform - + use your own XSLT to get a nicer Atom feed. + + + Apache Solr + solr-user@lucene.apache.org + + + + + + tag:localhost,2007:example + + + + + + + + + <xsl:value-of select="str[@name='name']"/> + + tag:localhost,2007: + + + + + + diff --git a/examples/twitter/solr/conf/xslt/example_rss.xsl b/examples/twitter/solr/conf/xslt/example_rss.xsl new file mode 100755 index 000000000..3e09e654d --- /dev/null +++ b/examples/twitter/solr/conf/xslt/example_rss.xsl @@ -0,0 +1,66 @@ + + + + + + + + + + + + + Example Solr RSS 2.0 Feed + http://localhost:8983/solr + + This has been formatted by the sample "example_rss.xsl" transform - + use your own XSLT to get a nicer RSS feed. + + en-us + http://localhost:8983/solr + + + + + + + + + + + <xsl:value-of select="str[@name='name']"/> + + http://localhost:8983/solr/select?q=id: + + + + + + + http://localhost:8983/solr/select?q=id: + + + + diff --git a/examples/twitter/solr/conf/xslt/luke.xsl b/examples/twitter/solr/conf/xslt/luke.xsl new file mode 100755 index 000000000..6e9a064d7 --- /dev/null +++ b/examples/twitter/solr/conf/xslt/luke.xsl @@ -0,0 +1,337 @@ + + + + + + + + + Solr Luke Request Handler Response + + + + + + + + + <xsl:value-of select="$title"/> + + + + + +

+ +

+
+ +

Index Statistics

+ +
+ +

Field Statistics

+ + + +

Document Statistics

+ + + + + + + + + + +
+ +
+ + +
+ +
+ +
+
+
+ + + + + + + + + + + + + + + + + + + + + +
+

+ +

+ +
+ +
+
+
+ + +
+ + 50 + 800 + 160 + blue + +
+
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ +
+ background-color: ; width: px; height: px; +
+
+ +
+
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
    + +
  • + +
  • +
    +
+ + +
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + 1 + + + + + + + + - + + - + + - + + - + + - + + - + + - + + - + + - + + - + + - + + - + + - + + + + + + + + + + + + + + + + + + diff --git a/examples/twitter/solr/conf/xslt/twitter.xsl b/examples/twitter/solr/conf/xslt/twitter.xsl new file mode 100755 index 000000000..0bc3e4087 --- /dev/null +++ b/examples/twitter/solr/conf/xslt/twitter.xsl @@ -0,0 +1,140 @@ + + + + + + + + + + + + + + + <xsl:value-of select="$title"/> + + + +

+
+ Demo Twitter indexing with Python +
+
+ + + + + + + +
+ + + +
+ + + +
+ + + + +
+
+
+ + + + + + + + + + + + + + javascript:toggle("");? +
+ + exp + + + + + +
+ + +
+ + + + + + + +
    + +
  • +
    +
+ + +
+ + + + + + + + + + + + + + + + + + + + +
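
The XSLT examples end here; the remainder of the patch is the Java side of the Java-Python bridge. For orientation, a request handler hands work over to Python roughly as follows -- condensed from TwitterAPIHandler below; the receiver name must match whatever the Python side registers:

{{{
import invenio.montysolr.jni.MontySolrVM;
import invenio.montysolr.jni.PythonMessage;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.request.SolrQueryResponse;

// Hedged sketch, condensed from TwitterAPIHandler below. The message is
// a HashMap carried over the JCC bridge into the Python VM; sendMessage()
// blocks (a semaphore caps concurrent workers) until Python has filled
// in any results via setResults().
public class BridgeSketch {
    public static Object callPython(SolrQueryRequest req, SolrQueryResponse rsp)
            throws InterruptedException {
        PythonMessage message = MontySolrVM.INSTANCE
            .createMessage("twitter_api")   // receiver registered in Python
            .setSender("BridgeSketch")
            .setSolrQueryRequest(req)
            .setSolrQueryResponse(rsp);
        MontySolrVM.INSTANCE.sendMessage(message);
        return message.getResults();        // whatever Python stored, if anything
    }
}
}}}
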
diff --git a/lib/LICENSE.jzlib.txt b/lib/LICENSE.jzlib.txt new file mode 100644 index 000000000..cdce5007d --- /dev/null +++ b/lib/LICENSE.jzlib.txt @@ -0,0 +1,29 @@ +JZlib 0.0.* were released under the GNU LGPL license. Later, we have switched +over to a BSD-style license. + +------------------------------------------------------------------------------ +Copyright (c) 2000,2001,2002,2003 ymnk, JCraft,Inc. All rights reserved. + +Redistribution and use in source and binary forms, with or without +modification, are permitted provided that the following conditions are met: + + 1. Redistributions of source code must retain the above copyright notice, + this list of conditions and the following disclaimer. + + 2. Redistributions in binary form must reproduce the above copyright + notice, this list of conditions and the following disclaimer in + the documentation and/or other materials provided with the distribution. + + 3. The names of the authors may not be used to endorse or promote products + derived from this software without specific prior written permission. + +THIS SOFTWARE IS PROVIDED ``AS IS'' AND ANY EXPRESSED OR IMPLIED WARRANTIES, +INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND +FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL JCRAFT, +INC. OR ANY CONTRIBUTORS TO THIS SOFTWARE BE LIABLE FOR ANY DIRECT, INDIRECT, +INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT +LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, +OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF +LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING +NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, +EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
diff --git a/lib/junit-3.8.2.jar b/lib/junit-3.8.2.jar
new file mode 100755
index 000000000..c8f711d05
Binary files /dev/null and b/lib/junit-3.8.2.jar differ
diff --git a/lib/jzlib-1.0.7.jar b/lib/jzlib-1.0.7.jar
new file mode 100644
index 000000000..dd760655e
Binary files /dev/null and b/lib/jzlib-1.0.7.jar differ
diff --git a/src/java/invenio/montysolr/JettyRunner.java b/src/java/invenio/montysolr/JettyRunner.java
new file mode 100644
index 000000000..b7d95bcd5
--- /dev/null
+++ b/src/java/invenio/montysolr/JettyRunner.java
@@ -0,0 +1,147 @@
+package invenio.montysolr;
+
+import java.io.File;
+import java.net.URL;
+import org.apache.commons.io.IOUtils;
+import org.mortbay.jetty.Connector;
+import org.mortbay.jetty.Server;
+import org.mortbay.jetty.bio.SocketConnector;
+import org.mortbay.jetty.webapp.WebAppContext;
+import org.mortbay.jetty.webapp.WebAppClassLoader;
+
+
+public class JettyRunner {
+    int port = 8983;
+    String context = "/test";
+    String webroot = "/x/dev/workspace/test-solr/webapp";
+    Server server;
+    boolean isRunning = false;
+
+    public JettyRunner() {
+        System.out.println("JettyRunner loaded");
+    }
+
+    public JettyRunner(String[] args) throws Exception {
+        System.out.println("JettyRunner loaded");
+        this.configure(args);
+    }
+
+    public void configure(String[] params) throws Exception {
+        for (int i = 0; i < params.length; i++) {
+            String t = params[i];
+            if (t.contains("port")) {
+                port = new Integer(params[i + 1]);
+            } else if (t.contains("context")) {
+                context = params[i + 1];
+            } else if (t.contains("solr.home")) {
+                System.setProperty("solr.solr.home", params[i + 1]);
+            } else if (t.contains("webroot")) {
+                webroot = params[i + 1];
+            } else {
+                throw new Exception("Unknown option " + t);
+            }
+            i++;
+        }
+
+        File h = new File(System.getProperty("solr.solr.home"));
+        if (!h.exists()) {
+            throw new Exception("solr.solr.home not set or does not exist");
+        }
+
+    }
+
+    public void start() throws Exception {
+        if (!isRunning) {
+
+            server = new Server(port);
+
+            WebAppContext ctx = new WebAppContext(server, webroot, context);
+
+            // this sets the normal java class-loading policy, where system
+            // classes (and classes loaded first) have higher priority;
+            // this is important for our singleton to work, otherwise there
+            // are different classloaders and the singletons are not singletons
+            // across webapps
+            ctx.setParentLoaderPriority(true);
+
+            // this also works and has the same effect (I don't know what the
+            // implications of one or the other method are)
+            //ctx.setClassLoader(this.getClass().getClassLoader());
+
+            SocketConnector connector = new SocketConnector();
+            connector.setMaxIdleTime(1000 * 60 * 60);
+            connector.setSoLingerTime(-1);
+            connector.setPort(port);
+            server.setConnectors(new Connector[] { connector });
+
+            server.setStopAtShutdown(true);
+
+            server.start();
+            port = connector.getLocalPort();
+            isRunning = true;
+        }
+    }
+
+    public void stop() throws Exception {
+        if (isRunning) {
+            server.stop();
+            isRunning = false;
+        }
+    }
+
+    private void testJSP() throws Exception
+    {
+        // Currently not an extensive test, but it does fire up the JSP pages and make
+        // sure they compile ok
+
+        String queryPath = "http://localhost:" + port + context + "/";
+        String adminPath = "http://localhost:" + port + context + "/admin/";
+
+        String html = IOUtils.toString( new URL(queryPath).openStream() );
+        assert html.contains("");
+        int start_pos = html.indexOf("name=\"docs\">") + 12;
+        System.out.println(html);
+
+        // special caching query
+        html = IOUtils.toString( new
URL(queryPath+"select/?q=*%3A*&version=2.2&start=0&rows=10&indent=on&qt=recidspython").openStream()); + start_pos = html.indexOf("name=\"docs\">") + 12; + System.out.println(html); + } + + /** + * @param args + * @throws Exception + */ + public static void main(String[] args) throws Exception { + System.out.println("bootstrap.Main loader = " + JettyRunnerPythonVM.class.getClassLoader().toString()); + JettyRunnerPythonVM jr = null; + try { + jr = new JettyRunnerPythonVM(); + ClassLoader currentContextLoader = Thread.currentThread().getContextClassLoader(); + Thread.currentThread().setContextClassLoader(jr.getClass().getClassLoader()); // load jetty + //Thread.currentThread().setContextClassLoader(currentContextLoader ); + + jr.configure(args); + + if (jr.daemonMode) { + jr.join(); + } + else { + jr.start(); + jr.testJSP(); + jr.stop(); + } + } catch (Exception e) { + // TODO Auto-generated catch block + e.printStackTrace(); + jr.stop(); + } + + } + +} diff --git a/src/java/invenio/montysolr/SolrRunner.java b/src/java/invenio/montysolr/SolrRunner.java new file mode 100644 index 000000000..70332b5c3 --- /dev/null +++ b/src/java/invenio/montysolr/SolrRunner.java @@ -0,0 +1,30 @@ +package invenio.montysolr; + +import org.apache.solr.client.solrj.embedded.JettySolrRunner; +import java.io.File; +import java.lang.System; + + +public class SolrRunner { + + static File rootDir = new File("/x/dev/workspace/apache-solr-1.4.1/example/"); + static File homeDir = new File(rootDir, "solr"); + static File dataDir = new File(homeDir, "data"); + static File confDir = new File(homeDir, "conf"); + + + public static JettySolrRunner createJetty() throws Exception { + System.setProperty("solr.solr.home", homeDir.toString()); + System.setProperty("solr.data.dir", dataDir.toString()); + JettySolrRunner jetty = new JettySolrRunner("/solr", 0, rootDir.toString() + "/etc/jetty.xml"); + jetty.start(); + return jetty; + } + + public static void main(String[] args) throws Exception { + JettySolrRunner jetty = createJetty(); + System.out.println(jetty); + System.out.println(jetty.getLocalPort()); + } + +} diff --git a/src/java/invenio/montysolr/examples/TwitterAPIHandler.java b/src/java/invenio/montysolr/examples/TwitterAPIHandler.java new file mode 100644 index 000000000..a6d7758c3 --- /dev/null +++ b/src/java/invenio/montysolr/examples/TwitterAPIHandler.java @@ -0,0 +1,54 @@ +package invenio.montysolr.examples; + +import invenio.montysolr.jni.PythonMessage; +import invenio.montysolr.jni.MontySolrVM; + +import org.apache.solr.handler.RequestHandlerBase; +import org.apache.solr.request.SolrQueryRequest; +import org.apache.solr.request.SolrQueryResponse; + + +public class TwitterAPIHandler extends RequestHandlerBase{ + @Override + + public void handleRequestBody(SolrQueryRequest req, SolrQueryResponse rsp) throws Exception + { + + long start = System.currentTimeMillis(); + + PythonMessage message = MontySolrVM.INSTANCE + .createMessage("twitter_api") + .setSender(this.getClass().getSimpleName()) + .setSolrQueryRequest(req) + .setSolrQueryResponse(rsp); + + MontySolrVM.INSTANCE.sendMessage(message); + + long end = System.currentTimeMillis(); + + rsp.add( "QTime", end-start); + } + + + //////////////////////// SolrInfoMBeans methods ////////////////////// + + @Override + public String getVersion() { + return ""; + } + + @Override + public String getDescription() { + return "Adds new Tweets each time search term is passed"; + } + + @Override + public String getSourceId() { + return ""; + } + + @Override + public String 
getSource() { + return ""; + } +} diff --git a/src/java/invenio/montysolr/jni/BasicBridge.java b/src/java/invenio/montysolr/jni/BasicBridge.java new file mode 100644 index 000000000..77d63ba58 --- /dev/null +++ b/src/java/invenio/montysolr/jni/BasicBridge.java @@ -0,0 +1,60 @@ +package invenio.montysolr.jni; + +import java.util.ArrayList; +import java.util.List; + +import org.apache.solr.core.SolrResourceLoader; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +/** + * This is abstract class that holds testing methods and basic methods + * of each bridge + * @author rca + * + */ + +public abstract class BasicBridge { + + public static final Logger log = LoggerFactory.getLogger(BasicBridge.class); + + protected String bridgeName = null; + + public String getName() { + return this.bridgeName; + } + public void setName(String name) { + this.bridgeName = name; + } + + // ------------- java testing methods ----------------- + + + public void testPrint() { + System.out.println("java is printing, instance: " + this.toString() + + " from thread id: " + Thread.currentThread().getId()); + } + + public String testReturnString() { + return "java is printing, instance: " + this.toString() + + " from thread id: " + Thread.currentThread().getId(); + } + + public List testReturnListOfStrings() { + ArrayList l = new ArrayList(); + l.add(getName()); + l.add(this.toString()); + l.add(Long.toString(Thread.currentThread().getId())); + return l; + } + + public List testReturnListOfIntegers() { + ArrayList l = new ArrayList(); + l.add(0); + l.add(1); + return l; + } + + + +} diff --git a/src/java/invenio/montysolr/jni/MontySolrBridge.java b/src/java/invenio/montysolr/jni/MontySolrBridge.java new file mode 100644 index 000000000..e2d30585b --- /dev/null +++ b/src/java/invenio/montysolr/jni/MontySolrBridge.java @@ -0,0 +1,63 @@ +package invenio.montysolr.jni; + + + +import org.apache.jcc.PythonVM; + + + +/** + * This class is used for calling Invenio from inside JavaVM + * + * @author rca + * + */ + +public class MontySolrBridge extends BasicBridge implements PythonBridge { + + private long pythonObject; + protected String bridgeName; + + + public void pythonExtension(long pythonObject) { + this.pythonObject = pythonObject; + } + + public long pythonExtension() { + return this.pythonObject; + } + + public void finalize() throws Throwable { + pythonDecRef(); + } + + public native void pythonDecRef(); + + + + /** + * The main method that passes the PythonMessage instance + * to the remote site over the JNI/JCC bridge + * @param message + */ + @Override + public void sendMessage(PythonMessage message) { + PythonVM vm = PythonVM.get(); + vm.acquireThreadState(); + receive_message(message); + vm.releaseThreadState(); + } + public native void receive_message(PythonMessage message); + + + /** + * Just some testing methods, should be removed after + * the code stabilizes + */ + public native void test_print(); + public native String test_return_string(); + +} + + + diff --git a/src/java/invenio/montysolr/jni/MontySolrVM.java b/src/java/invenio/montysolr/jni/MontySolrVM.java new file mode 100644 index 000000000..5edae2b99 --- /dev/null +++ b/src/java/invenio/montysolr/jni/MontySolrVM.java @@ -0,0 +1,117 @@ +package invenio.montysolr.jni; + +import org.apache.jcc.PythonException; +import org.apache.jcc.PythonVM; + + + +import java.util.concurrent.Semaphore; + +public enum MontySolrVM { + INSTANCE; + + private PythonVM vm = null; + + private Semaphore semaphore = + new 
Semaphore((System.getProperty("montysolr.max_workers") != null ? new Integer(System.getProperty("montysolr.max_workers")) : 1), true); + + public PythonVM start(String programName) { + if (vm == null) + vm = PythonVM.start(programName); + return vm; + } + + /** + * Creates a new instance of the bridge over the Python waters. + * This instance can be used to send the PythonMessage, but until + * sendMessage is called, it does nothing + * @return {@link PythonBridge} + */ + public PythonBridge getBridge() { + return PythonVMBridge.start(); + } + + /** + * Creates a PythonMessage that wraps all the parameters that will be delivered + * to the remote side. It will also contain any return value + * @param receiver + * @return void + */ + public PythonMessage createMessage(String receiver) { + return new PythonMessage(receiver); + } + + /** + * Passes the message over to the remote site, this method is just a factory + * the passing is done by the Bridge itself + * @throws InterruptedException + */ + + public void sendMessage(PythonMessage message) throws InterruptedException { + PythonBridge b = getBridge(); + try { + semaphore.acquire(); + b.sendMessage(message); + } finally { + semaphore.release(); + } + + } + + + +} + +/** + * This class MUST NOT be a singleton. It serves the purpose of communicating w + * Python VM. You get the bridge and use methods of the bridge. + * + * @author rca + * + */ + +class PythonVMBridge { + static protected PythonBridge bridge; + + protected PythonVMBridge() { + + } + + static public PythonBridge start() throws PythonException { + if (System.getProperty("python.bridge") == null) { + return start("SimpleBridge"); + } + else { + return start(System.getProperty("python.bridge")); + } + } + + static public PythonBridge start(String bridgeName) throws PythonException { + if (bridgeName == "montysolr.java_bridge.SimpleBridge" || + bridgeName == "SimpleBridge") { + return start("montysolr.java_bridge", "SimpleBridge"); + } + return bridge; + } + + static public PythonBridge start(String moduleName, String className) throws PythonException + { + if (bridge == null) + { + PythonVM vm = PythonVM.get(); + bridge = (PythonBridge)vm.instantiate(moduleName, className); + bridge.setName(moduleName+'.'+className); + } + + return bridge; + } + + static public PythonBridge get() throws PythonException { + return start("SimpleBridge"); + //return bridge; + } + + +} + + diff --git a/src/java/invenio/montysolr/jni/PythonBridge.java b/src/java/invenio/montysolr/jni/PythonBridge.java new file mode 100644 index 000000000..9efcd2ed9 --- /dev/null +++ b/src/java/invenio/montysolr/jni/PythonBridge.java @@ -0,0 +1,36 @@ +/** + * + */ +package invenio.montysolr.jni; + + +/** + * All the bridge implementation between Java<->Python are required to have + * these basic methods + * + * @author rca + * + */ +public interface PythonBridge { + + /** + * Returns the name of this Bridge implementation + * @return + */ + public String getName(); + + /** + * Sets the name of the bridge after the Bridge was instantiated by the + * PythonVMBridge + * @void + */ + public void setName(String name); + + + /** + * Generic method for sending a PythonMessage into the remote JNI side + * @param message + */ + public void sendMessage(PythonMessage message); + +} diff --git a/src/java/invenio/montysolr/jni/PythonMessage.java b/src/java/invenio/montysolr/jni/PythonMessage.java new file mode 100644 index 000000000..cb816c566 --- /dev/null +++ b/src/java/invenio/montysolr/jni/PythonMessage.java @@ -0,0 +1,122 @@ 
+package invenio.montysolr.jni; + +import java.util.AbstractCollection; +import java.util.HashMap; +import java.util.Map; +import java.util.Map.Entry; +import java.util.Set; + +import org.apache.solr.client.solrj.SolrRequest; +import org.apache.solr.client.solrj.SolrResponse; +import org.apache.solr.core.SolrResourceLoader; +import org.apache.solr.request.SolrQueryRequest; +import org.apache.solr.request.SolrQueryResponse; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +public class PythonMessage extends HashMap{ + + /** + * + */ + private static final long serialVersionUID = -3744935985066647405L; + public static final Logger log = LoggerFactory.getLogger(PythonMessage.class); + + + public PythonMessage(String receiver) { + this.put("receiver", receiver); + } + + public String getSender() { + return (String) this.get("sender"); + } + + public PythonMessage setSender(String sender) { + this.put("sender", sender); + return this; + } + + public String getReceiver() { + return (String) this.get("receiver"); + } + + public PythonMessage setReceiver(String receiver) { + this.put("receiver", receiver); + return this; + } + + + public SolrQueryRequest getSolrQueryRequest() { + return (SolrQueryRequest) this.get("SolrQueryRequest"); + } + public PythonMessage setSolrQueryRequest(SolrQueryRequest sqr) { + this.put("SolrQueryRequest", sqr); + return this; + } + + public SolrQueryResponse getSolrQueryResponse() { + return (SolrQueryResponse) this.get("SolrQueryResponse"); + } + public PythonMessage setSolrQueryResponse(SolrQueryResponse srp) { + this.put("SolrQueryResponse", srp); + return this; + } + + public PythonMessage setParam(String name, Object value) { + this.put(name, value); + return this; + } + + public Object getParam(String name) { + return this.get(name); + } + + public int[] getParamArray_int(String name) { + return (int[]) this.get(name); + } + public String[] getParamArray_str(String name) { + return (String[]) this.get(name); + } + + public Object getResults() { + return this.get("#result"); + } + + public void setResults(Object result) { + this.put("#result", result); + } + + + public String toString() { + Set> s = this.entrySet(); + StringBuilder out = new StringBuilder(); + for (Entry e: s) { + out.append(e.getKey()); + out.append("="); + Object v = e.getValue(); + if (v instanceof AbstractCollection) { + if (((AbstractCollection) v).size() > 10) { + out.append("@" + v.getClass()); + } + else { + out.append(v); + } + } + else { + out.append(v); + } + out.append(","); + } + return out.toString(); + } + + public void threadInfo(String s) { + log.info("[Python] " + s + this.getInfo()); + } + + private String getInfo() { + return " [Thread=" + Thread.currentThread().getName() + " id=" + Thread.currentThread().getId() + + " time=" + System.currentTimeMillis() + "]"; + } + +} diff --git a/src/java/invenio/montysolr/util/DebuggingMethods.java b/src/java/invenio/montysolr/util/DebuggingMethods.java new file mode 100644 index 000000000..c72610e02 --- /dev/null +++ b/src/java/invenio/montysolr/util/DebuggingMethods.java @@ -0,0 +1,25 @@ +package invenio.montysolr.util; + +import org.apache.jcc.PythonVM; + +public class DebuggingMethods { + + public static void discoverImportableModules() { + String[] modules = {"montysolr_java", "montysolr_java.solrpye.invenio", "montysolr_java.solrpye", "solrpye.invenio"}; + String[] classes = {"TestX", "InvenioSolrBridge", "Emql"}; + Object m = null; + PythonVM vm = PythonVM.get(); + for (int i=0;i= size) + return false; + return 
(bytes[bit / 8] & BIT_MASK[bit % 8]) != 0; + } + + protected static void setBit(int bit, byte[] bytes) { + int size = bytes == null ? 0 : bytes.length * 8; + if (bit >= size) + throw new ArrayIndexOutOfBoundsException("Byte array too small"); + bytes[bit / 8] |= BIT_MASK[bit % 8]; + } + + public static void main(String[] args) throws DataFormatException, IOException { + + int min = 0; + int max = 5000; + InvenioBitSet ibs = new InvenioBitSet(max); + for (int i = 0; i < 500; i++) { + int r = min + (int) (Math.random() * ((max - min) + 1)); + ibs.set(r); + } + ByteArrayOutputStream b = ibs.fastDump(); + InvenioBitSet bs; + bs = InvenioBitSet.fastLoad(b.toByteArray()); + System.out.println("set lengths: " + ibs.cardinality() + " - " + bs.cardinality()); + System.out.println("Set equals? " + ibs.equals(bs)); + + } +} diff --git a/src/java/org/ads/solr/InvenioBitSet.java b/src/java/org/ads/solr/InvenioBitSet.java new file mode 100755 index 000000000..1fb205e6f --- /dev/null +++ b/src/java/org/ads/solr/InvenioBitSet.java @@ -0,0 +1,73 @@ +package org.ads.solr; + +import java.io.*; +import java.util.BitSet; +import java.util.Arrays; + +public class InvenioBitSet extends BitSet { + + private static final long serialVersionUID = 1L; + + public InvenioBitSet() { + super(); + } + + public InvenioBitSet(int nbits) { + super(nbits); + } + + // TODO: remove the trailing 8 bytes (added by intbitset format) and test + public InvenioBitSet(byte[] bytes) { + this(bytes == null? 0 : bytes.length * 8); + for (int i = 0; i < size(); i++) { + if (isBitOn(i, bytes)) + set(i); + } + } + + // convert to a byte array to be used as intbitset in Invenio + public byte[] toByteArray() { + + if (size() == 0) + return new byte[0]; + + // Find highest bit + int hiBit = -1; + for (int i = 0; i < size(); i++) { + if (get(i)) + hiBit = i; + } + + // was: int n = (hiBit + 8) / 8; + // +128 (64 for trailing zeros used in intbitset and 64 to avoid trancating) + int n = ((hiBit + 128) / 64) * 8; + byte[] bytes = new byte[n]; + if (n == 0) + return bytes; + + Arrays.fill(bytes, (byte)0); + for (int i=0; i= size) + return false; + return (bytes[bit/8] & BIT_MASK[bit%8]) != 0; + } + + protected static void setBit(int bit, byte[] bytes) { + int size = bytes == null ? 0 : bytes.length*8; + if (bit >= size) + throw new ArrayIndexOutOfBoundsException("Byte array too small"); + bytes[bit/8] |= BIT_MASK[bit%8]; + } +} diff --git a/src/java/org/apache/lucene/queryParser/CharStream.java b/src/java/org/apache/lucene/queryParser/CharStream.java new file mode 100644 index 000000000..fc9606f0d --- /dev/null +++ b/src/java/org/apache/lucene/queryParser/CharStream.java @@ -0,0 +1,115 @@ +/* Generated By:JavaCC: Do not edit this line. CharStream.java Version 5.0 */ +/* JavaCCOptions:STATIC=false,SUPPORT_CLASS_VISIBILITY_PUBLIC=true */ +package org.apache.lucene.queryParser; + +/** + * This interface describes a character stream that maintains line and + * column number positions of the characters. It also has the capability + * to backup the stream to some extent. An implementation of this + * interface is used in the TokenManager implementation generated by + * JavaCCParser. + * + * All the methods except backup can be implemented in any fashion. backup + * needs to be implemented correctly for the correct operation of the lexer. + * Rest of the methods are all used to get information like line number, + * column number and the String that constitutes a token and are not used + * by the lexer. 
Hence their implementation won't affect the generated lexer's + * operation. + */ + +public +interface CharStream { + + /** + * Returns the next character from the selected input. The method + * of selecting the input is the responsibility of the class + * implementing this interface. Can throw any java.io.IOException. + */ + char readChar() throws java.io.IOException; + + @Deprecated + /** + * Returns the column position of the character last read. + * @deprecated + * @see #getEndColumn + */ + int getColumn(); + + @Deprecated + /** + * Returns the line number of the character last read. + * @deprecated + * @see #getEndLine + */ + int getLine(); + + /** + * Returns the column number of the last character for current token (being + * matched after the last call to BeginTOken). + */ + int getEndColumn(); + + /** + * Returns the line number of the last character for current token (being + * matched after the last call to BeginTOken). + */ + int getEndLine(); + + /** + * Returns the column number of the first character for current token (being + * matched after the last call to BeginTOken). + */ + int getBeginColumn(); + + /** + * Returns the line number of the first character for current token (being + * matched after the last call to BeginTOken). + */ + int getBeginLine(); + + /** + * Backs up the input stream by amount steps. Lexer calls this method if it + * had already read some characters, but could not use them to match a + * (longer) token. So, they will be used again as the prefix of the next + * token and it is the implemetation's responsibility to do this right. + */ + void backup(int amount); + + /** + * Returns the next character that marks the beginning of the next token. + * All characters must remain in the buffer between two successive calls + * to this method to implement backup correctly. + */ + char BeginToken() throws java.io.IOException; + + /** + * Returns a string made up of characters from the marked token beginning + * to the current buffer position. Implementations have the choice of returning + * anything that they want to. For example, for efficiency, one might decide + * to just return null, which is a valid implementation. + */ + String GetImage(); + + /** + * Returns an array of characters that make up the suffix of length 'len' for + * the currently matched token. This is used to build up the matched string + * for use in actions in the case of MORE. A simple and inefficient + * implementation of this is as follows : + * + * { + * String t = GetImage(); + * return t.substring(t.length() - len, t.length()).toCharArray(); + * } + */ + char[] GetSuffix(int len); + + /** + * The lexer calls this function to indicate that it is done with the stream + * and hence implementations can free any resources held by this class. + * Again, the body of this function can be just empty and it will not + * affect the lexer's operation. + */ + void Done(); + +} +/* JavaCC - OriginalChecksum=6b854f7f279fcc2b052037ffc369be2d (do not edit this line) */ diff --git a/src/java/org/apache/lucene/queryParser/InvenioQueryParser.java b/src/java/org/apache/lucene/queryParser/InvenioQueryParser.java new file mode 100644 index 000000000..77cbc567c --- /dev/null +++ b/src/java/org/apache/lucene/queryParser/InvenioQueryParser.java @@ -0,0 +1,1932 @@ +/* Generated By:JavaCC: Do not edit this line. 
InvenioQueryParser.java */
+package org.apache.lucene.queryParser;
+
+import java.io.IOException;
+import java.io.StringReader;
+import java.text.Collator;
+import java.text.DateFormat;
+import java.util.ArrayList;
+import java.util.Calendar;
+import java.util.Date;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Locale;
+import java.util.Map;
+import java.util.Vector;
+
+import org.apache.lucene.analysis.Analyzer;
+import org.apache.lucene.analysis.CachingTokenFilter;
+import org.apache.lucene.analysis.TokenStream;
+import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
+import org.apache.lucene.analysis.tokenattributes.TermAttribute;
+import org.apache.lucene.document.DateField;
+import org.apache.lucene.document.DateTools;
+import org.apache.lucene.index.Term;
+import org.apache.lucene.search.BooleanClause;
+import org.apache.lucene.search.BooleanQuery;
+import org.apache.lucene.search.FuzzyQuery;
+import org.apache.lucene.search.MultiTermQuery;
+import org.apache.lucene.search.MatchAllDocsQuery;
+import org.apache.lucene.search.MultiPhraseQuery;
+import org.apache.lucene.search.PhraseQuery;
+import org.apache.lucene.search.PrefixQuery;
+import org.apache.lucene.search.Query;
+import org.apache.lucene.search.TermRangeQuery;
+import org.apache.lucene.search.TermQuery;
+import org.apache.lucene.search.WildcardQuery;
+import org.apache.lucene.util.Parameter;
+import org.apache.lucene.util.Version;
+
+/**
+ * This class is generated by JavaCC.  The most important method is
+ * {@link #parse(String)}.
+ *
+ * The syntax for query strings is as follows:
+ * A Query is a series of clauses.
+ * A clause may be prefixed by:
+ * <ul>
+ * <li> a plus (<code>+</code>) or a minus (<code>-</code>) sign, indicating
+ * that the clause is required or prohibited respectively; or
+ * <li> a term followed by a colon, indicating the field to be searched.
+ * This enables one to construct queries which search multiple fields.
+ * </ul>
+ *
+ * A clause may be either:
+ * <ul>
+ * <li> a term, indicating all the documents that contain this term; or
+ * <li> a nested query, enclosed in parentheses.  Note that this may be used
+ * with a <code>+</code>/<code>-</code> prefix to require any of a set of
+ * terms.
+ * </ul>
+ *
+ * Thus, in BNF, the query grammar is:
+ * <pre>
+ *   Query  ::= ( Clause )*
+ *   Clause ::= ["+", "-"] [&lt;TERM&gt; ":"] ( &lt;TERM&gt; | "(" Query ")" )
+ * </pre>
+ *
+ * <p>
+ * Examples of appropriately formatted queries can be found in the query syntax
+ * documentation.
+ * </p>
+ *
+ * <p>
+ * In {@link TermRangeQuery}s, InvenioQueryParser tries to detect date values, e.g.
+ * <tt>date:[6/1/2005 TO 6/4/2005]</tt> produces a range query that searches
+ * for "date" fields between 2005-06-01 and 2005-06-04.  Note that the format
+ * of the accepted input depends on {@link #setLocale(Locale) the locale}.
+ * By default a date is converted into a search term using the deprecated
+ * {@link DateField} for compatibility reasons.
+ * To use the new {@link DateTools} to convert dates, a
+ * {@link org.apache.lucene.document.DateTools.Resolution} has to be set.
+ * </p>
+ *
+ * <p>
+ * The date resolution that shall be used for RangeQueries can be set
+ * using {@link #setDateResolution(DateTools.Resolution)}
+ * or {@link #setDateResolution(String, DateTools.Resolution)}.  The former
+ * sets the default date resolution for all fields, whereas the latter can
+ * be used to set field specific date resolutions.  Field specific date
+ * resolutions take, if set, precedence over the default date resolution.
+ * </p>
+ *
+ * <p>
+ * If you use neither {@link DateField} nor {@link DateTools} in your
+ * index, you can create your own
+ * query parser that inherits InvenioQueryParser and overwrites
+ * {@link #getRangeQuery(String, String, String, boolean)} to
+ * use a different method for date conversion.
+ * </p>
+ *
+ * <p>Note that InvenioQueryParser is <em>not</em> thread-safe.</p>
+ *
+ * <p><b>NOTE</b>: there is a new InvenioQueryParser in contrib, which matches
+ * the same syntax as this class, but is more modular,
+ * enabling substantial customization to how a query is created.
+ *
+ * <p><b>NOTE</b>: You must specify the required {@link Version}
+ * compatibility when creating InvenioQueryParser:
+ */ +public class InvenioQueryParser implements InvenioQueryParserConstants { + + private static final int CONJ_NONE = 0; + private static final int CONJ_AND = 1; + private static final int CONJ_OR = 2; + + private static final int MOD_NONE = 0; + private static final int MOD_NOT = 10; + private static final int MOD_REQ = 11; + private static final int MOD_SECOND = 12; + + // make it possible to call setDefaultOperator() without accessing + // the nested class: + /** Alternative form of InvenioQueryParser.Operator.AND */ + public static final Operator AND_OPERATOR = Operator.AND; + /** Alternative form of InvenioQueryParser.Operator.OR */ + public static final Operator OR_OPERATOR = Operator.OR; + + /** The actual operator that parser uses to combine query terms */ + private Operator operator = OR_OPERATOR; + + boolean lowercaseExpandedTerms = true; + MultiTermQuery.RewriteMethod multiTermRewriteMethod = MultiTermQuery.CONSTANT_SCORE_AUTO_REWRITE_DEFAULT; + boolean allowLeadingWildcard = false; + boolean enablePositionIncrements = true; + + Analyzer analyzer; + String field; + int phraseSlop = 0; + float fuzzyMinSim = FuzzyQuery.defaultMinSimilarity; + int fuzzyPrefixLength = FuzzyQuery.defaultPrefixLength; + Locale locale = Locale.getDefault(); + + // the default date resolution + DateTools.Resolution dateResolution = null; + // maps field names to date resolutions + Map fieldToDateResolution = null; + + // The collator to use when determining range inclusion, + // for use when constructing RangeQuerys. + Collator rangeCollator = null; + + /** The default operator for parsing queries. + * Use {@link InvenioQueryParser#setDefaultOperator} to change it. + */ + static public final class Operator extends Parameter { + private Operator(String name) { + super(name); + } + static public final Operator OR = new Operator("OR"); + static public final Operator AND = new Operator("AND"); + } + + + /** Constructs a query parser. + * @param f the default field for query terms. + * @param a used to find terms in the query text. + * @deprecated Use {@link #InvenioQueryParser(Version, String, Analyzer)} instead + */ + public InvenioQueryParser(String f, Analyzer a) { + this(Version.LUCENE_24, f, a); + } + + /** Constructs a query parser. + * @param matchVersion Lucene version to match. See {@link above) + * @param f the default field for query terms. + * @param a used to find terms in the query text. + */ + public InvenioQueryParser(Version matchVersion, String f, Analyzer a) { + this(new FastCharStream(new StringReader(""))); + analyzer = a; + field = f; + if (matchVersion.onOrAfter(Version.LUCENE_29)) { + enablePositionIncrements = true; + } else { + enablePositionIncrements = false; + } + } + + /** Parses a query string, returning a {@link org.apache.lucene.search.Query}. + * @param query the query string to be parsed. + * @throws ParseException if the parsing fails + */ + public Query parse(String query) throws ParseException { + ReInit(new FastCharStream(new StringReader(query))); + try { + // TopLevelQuery is a Query followed by the end-of-input (EOF) + Query res = TopLevelQuery(field); + return res!=null ? 
res : newBooleanQuery(false); + } + catch (ParseException tme) { + // rethrow to include the original query: + ParseException e = new ParseException("Cannot parse '" +query+ "': " + tme.getMessage()); + e.initCause(tme); + throw e; + } + catch (TokenMgrError tme) { + ParseException e = new ParseException("Cannot parse '" +query+ "': " + tme.getMessage()); + e.initCause(tme); + throw e; + } + catch (BooleanQuery.TooManyClauses tmc) { + ParseException e = new ParseException("Cannot parse '" +query+ "': too many boolean clauses"); + e.initCause(tmc); + throw e; + } + } + + /** + * @return Returns the analyzer. + */ + public Analyzer getAnalyzer() { + return analyzer; + } + + /** + * @return Returns the field. + */ + public String getField() { + return field; + } + + /** + * Get the minimal similarity for fuzzy queries. + */ + public float getFuzzyMinSim() { + return fuzzyMinSim; + } + + /** + * Set the minimum similarity for fuzzy queries. + * Default is 0.5f. + */ + public void setFuzzyMinSim(float fuzzyMinSim) { + this.fuzzyMinSim = fuzzyMinSim; + } + + /** + * Get the prefix length for fuzzy queries. + * @return Returns the fuzzyPrefixLength. + */ + public int getFuzzyPrefixLength() { + return fuzzyPrefixLength; + } + + /** + * Set the prefix length for fuzzy queries. Default is 0. + * @param fuzzyPrefixLength The fuzzyPrefixLength to set. + */ + public void setFuzzyPrefixLength(int fuzzyPrefixLength) { + this.fuzzyPrefixLength = fuzzyPrefixLength; + } + + /** + * Sets the default slop for phrases. If zero, then exact phrase matches + * are required. Default value is zero. + */ + public void setPhraseSlop(int phraseSlop) { + this.phraseSlop = phraseSlop; + } + + /** + * Gets the default slop for phrases. + */ + public int getPhraseSlop() { + return phraseSlop; + } + + + /** + * Set to true to allow leading wildcard characters. + *
+ * <p>
+ * When set, * or ? are allowed as
+ * the first character of a PrefixQuery and WildcardQuery.
+ * Note that this can produce very slow
+ * queries on big indexes.
+ * <p>
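+ * For example (sketch; assumes an existing parser instance <code>qp</code>):
+ * <pre>
+ *   qp.setAllowLeadingWildcard(true);
+ *   Query q = qp.parse("*ellis");   // leading wildcard now accepted
+ * </pre>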
+ * Default: false. + */ + public void setAllowLeadingWildcard(boolean allowLeadingWildcard) { + this.allowLeadingWildcard = allowLeadingWildcard; + } + + /** + * @see #setAllowLeadingWildcard(boolean) + */ + public boolean getAllowLeadingWildcard() { + return allowLeadingWildcard; + } + + /** + * Set to true to enable position increments in result query. + *
+ * <p>
+ * When set, result phrase and multi-phrase queries will
+ * be aware of position increments.
+ * Useful when e.g. a StopFilter increases the position increment of
+ * the token that follows an omitted token.
+ * <p>
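+ * For example (sketch; assumes an analyzer whose StopFilter removes "the"):
+ * <pre>
+ *   qp.setEnablePositionIncrements(true);
+ *   Query q = qp.parse("\"walked the dog\"");
+ *   // yields a PhraseQuery with a position gap where "the" was removed
+ * </pre>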
+ * Default: true when the parser is created with {@link Version} 2.9 or
+ * later, false otherwise (see the constructor).
+ */
+  public void setEnablePositionIncrements(boolean enable) {
+    this.enablePositionIncrements = enable;
+  }
+
+  /**
+   * @see #setEnablePositionIncrements(boolean)
+   */
+  public boolean getEnablePositionIncrements() {
+    return enablePositionIncrements;
+  }
+
+  /**
+   * Sets the boolean operator of the InvenioQueryParser.
+   * In default mode (OR_OPERATOR) terms without any modifiers
+   * are considered optional: for example capital of Hungary is equal to
+   * capital OR of OR Hungary.<br/>
+ * In AND_OPERATOR mode terms are considered to be in conjunction: the + * above mentioned query is parsed as capital AND of AND Hungary + */ + public void setDefaultOperator(Operator op) { + this.operator = op; + } + + + /** + * Gets implicit operator setting, which will be either AND_OPERATOR + * or OR_OPERATOR. + */ + public Operator getDefaultOperator() { + return operator; + } + + + /** + * Whether terms of wildcard, prefix, fuzzy and range queries are to be automatically + * lower-cased or not. Default is true. + */ + public void setLowercaseExpandedTerms(boolean lowercaseExpandedTerms) { + this.lowercaseExpandedTerms = lowercaseExpandedTerms; + } + + + /** + * @see #setLowercaseExpandedTerms(boolean) + */ + public boolean getLowercaseExpandedTerms() { + return lowercaseExpandedTerms; + } + + /** + * @deprecated Please use {@link #setMultiTermRewriteMethod} instead. + */ + public void setUseOldRangeQuery(boolean useOldRangeQuery) { + if (useOldRangeQuery) { + setMultiTermRewriteMethod(MultiTermQuery.SCORING_BOOLEAN_QUERY_REWRITE); + } else { + setMultiTermRewriteMethod(MultiTermQuery.CONSTANT_SCORE_AUTO_REWRITE_DEFAULT); + } + } + + + /** + * @deprecated Please use {@link #getMultiTermRewriteMethod} instead. + */ + public boolean getUseOldRangeQuery() { + if (getMultiTermRewriteMethod() == MultiTermQuery.SCORING_BOOLEAN_QUERY_REWRITE) { + return true; + } else { + return false; + } + } + + /** + * By default InvenioQueryParser uses {@link MultiTermQuery#CONSTANT_SCORE_AUTO_REWRITE_DEFAULT} + * when creating a PrefixQuery, WildcardQuery or RangeQuery. This implementation is generally preferable because it + * a) Runs faster b) Does not have the scarcity of terms unduly influence score + * c) avoids any "TooManyBooleanClauses" exception. + * However, if your application really needs to use the + * old-fashioned BooleanQuery expansion rewriting and the above + * points are not relevant then use this to change + * the rewrite method. + */ + public void setMultiTermRewriteMethod(MultiTermQuery.RewriteMethod method) { + multiTermRewriteMethod = method; + } + + + /** + * @see #setMultiTermRewriteMethod + */ + public MultiTermQuery.RewriteMethod getMultiTermRewriteMethod() { + return multiTermRewriteMethod; + } + + /** + * Set locale used by date range parsing. + */ + public void setLocale(Locale locale) { + this.locale = locale; + } + + /** + * Returns current locale, allowing access by subclasses. + */ + public Locale getLocale() { + return locale; + } + + /** + * Sets the default date resolution used by RangeQueries for fields for which no + * specific date resolutions has been set. Field specific resolutions can be set + * with {@link #setDateResolution(String, DateTools.Resolution)}. + * + * @param dateResolution the default date resolution to set + */ + public void setDateResolution(DateTools.Resolution dateResolution) { + this.dateResolution = dateResolution; + } + + /** + * Sets the date resolution used by RangeQueries for a specific field. 
+ * + * @param fieldName field for which the date resolution is to be set + * @param dateResolution date resolution to set + */ + public void setDateResolution(String fieldName, DateTools.Resolution dateResolution) { + if (fieldName == null) { + throw new IllegalArgumentException("Field cannot be null."); + } + + if (fieldToDateResolution == null) { + // lazily initialize HashMap + fieldToDateResolution = new HashMap(); + } + + fieldToDateResolution.put(fieldName, dateResolution); + } + + /** + * Returns the date resolution that is used by RangeQueries for the given field. + * Returns null, if no default or field specific date resolution has been set + * for the given field. + * + */ + public DateTools.Resolution getDateResolution(String fieldName) { + if (fieldName == null) { + throw new IllegalArgumentException("Field cannot be null."); + } + + if (fieldToDateResolution == null) { + // no field specific date resolutions set; return default date resolution instead + return this.dateResolution; + } + + DateTools.Resolution resolution = (DateTools.Resolution) fieldToDateResolution.get(fieldName); + if (resolution == null) { + // no date resolutions set for the given field; return default date resolution instead + resolution = this.dateResolution; + } + + return resolution; + } + + /** + * Sets the collator used to determine index term inclusion in ranges + * for RangeQuerys. + *
+ * <p>
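+ * For example (sketch; locale-aware term inclusion for range endpoints):
+ * <pre>
+ *   qp.setRangeCollator(java.text.Collator.getInstance(Locale.FRENCH));
+ * </pre>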
+ * WARNING: Setting the rangeCollator to a non-null + * collator using this method will cause every single index Term in the + * Field referenced by lowerTerm and/or upperTerm to be examined. + * Depending on the number of index Terms in this Field, the operation could + * be very slow. + * + * @param rc the collator to use when constructing RangeQuerys + */ + public void setRangeCollator(Collator rc) { + rangeCollator = rc; + } + + /** + * @return the collator used to determine index term inclusion in ranges + * for RangeQuerys. + */ + public Collator getRangeCollator() { + return rangeCollator; + } + + /** + * @deprecated use {@link #addClause(List, int, int, Query)} instead. + */ + protected void addClause(Vector clauses, int conj, int mods, Query q) { + addClause((List) clauses, conj, mods, q); + } + + protected void addClause(List clauses, int conj, int mods, Query q) { + boolean required, prohibited; + + // If this term is introduced by AND, make the preceding term required, + // unless it's already prohibited + if (clauses.size() > 0 && conj == CONJ_AND) { + BooleanClause c = (BooleanClause) clauses.get(clauses.size()-1); + if (!c.isProhibited()) + c.setOccur(BooleanClause.Occur.MUST); + } + + if (clauses.size() > 0 && operator == AND_OPERATOR && conj == CONJ_OR) { + // If this term is introduced by OR, make the preceding term optional, + // unless it's prohibited (that means we leave -a OR b but +a OR b-->a OR b) + // notice if the input is a OR b, first term is parsed as required; without + // this modification a OR b would parsed as +a OR b + BooleanClause c = (BooleanClause) clauses.get(clauses.size()-1); + if (!c.isProhibited()) + c.setOccur(BooleanClause.Occur.SHOULD); + } + + // We might have been passed a null query; the term might have been + // filtered away by the analyzer. + if (q == null) + return; + + if (operator == OR_OPERATOR) { + // We set REQUIRED if we're introduced by AND or +; PROHIBITED if + // introduced by NOT or -; make sure not to set both. 
+ prohibited = (mods == MOD_NOT); + required = (mods == MOD_REQ); + if (conj == CONJ_AND && !prohibited) { + required = true; + } + } else { + // We set PROHIBITED if we're introduced by NOT or -; We set REQUIRED + // if not PROHIBITED and not introduced by OR + prohibited = (mods == MOD_NOT); + required = (!prohibited && conj != CONJ_OR); + } + if (required && !prohibited) + clauses.add(newBooleanClause(q, BooleanClause.Occur.MUST)); + else if (!required && !prohibited) + clauses.add(newBooleanClause(q, BooleanClause.Occur.SHOULD)); + else if (!required && prohibited) + clauses.add(newBooleanClause(q, BooleanClause.Occur.MUST_NOT)); + else + throw new RuntimeException("Clause cannot be both required and prohibited"); + } + + + /** + * @exception ParseException throw in overridden method to disallow + */ + protected Query getFieldQuery(String field, String queryText) throws ParseException { + // Use the analyzer to get all the tokens, and then build a TermQuery, + // PhraseQuery, or nothing based on the term count + + TokenStream source; + try { + source = analyzer.reusableTokenStream(field, new StringReader(queryText)); + source.reset(); + } catch (IOException e) { + source = analyzer.tokenStream(field, new StringReader(queryText)); + } + CachingTokenFilter buffer = new CachingTokenFilter(source); + TermAttribute termAtt = null; + PositionIncrementAttribute posIncrAtt = null; + int numTokens = 0; + + boolean success = false; + try { + buffer.reset(); + success = true; + } catch (IOException e) { + // success==false if we hit an exception + } + if (success) { + if (buffer.hasAttribute(TermAttribute.class)) { + termAtt = (TermAttribute) buffer.getAttribute(TermAttribute.class); + } + if (buffer.hasAttribute(PositionIncrementAttribute.class)) { + posIncrAtt = (PositionIncrementAttribute) buffer.getAttribute(PositionIncrementAttribute.class); + } + } + + int positionCount = 0; + boolean severalTokensAtSamePosition = false; + + boolean hasMoreTokens = false; + if (termAtt != null) { + try { + hasMoreTokens = buffer.incrementToken(); + while (hasMoreTokens) { + numTokens++; + int positionIncrement = (posIncrAtt != null) ? 
posIncrAtt.getPositionIncrement() : 1; + if (positionIncrement != 0) { + positionCount += positionIncrement; + } else { + severalTokensAtSamePosition = true; + } + hasMoreTokens = buffer.incrementToken(); + } + } catch (IOException e) { + // ignore + } + } + try { + // rewind the buffer stream + buffer.reset(); + + // close original stream - all tokens buffered + source.close(); + } + catch (IOException e) { + // ignore + } + + if (numTokens == 0) + return null; + else if (numTokens == 1) { + String term = null; + try { + boolean hasNext = buffer.incrementToken(); + assert hasNext == true; + term = termAtt.term(); + } catch (IOException e) { + // safe to ignore, because we know the number of tokens + } + return newTermQuery(new Term(field, term)); + } else { + if (severalTokensAtSamePosition) { + if (positionCount == 1) { + // no phrase query: + BooleanQuery q = newBooleanQuery(true); + for (int i = 0; i < numTokens; i++) { + String term = null; + try { + boolean hasNext = buffer.incrementToken(); + assert hasNext == true; + term = termAtt.term(); + } catch (IOException e) { + // safe to ignore, because we know the number of tokens + } + + Query currentQuery = newTermQuery( + new Term(field, term)); + q.add(currentQuery, BooleanClause.Occur.SHOULD); + } + return q; + } + else { + // phrase query: + MultiPhraseQuery mpq = newMultiPhraseQuery(); + mpq.setSlop(phraseSlop); + List multiTerms = new ArrayList(); + int position = -1; + for (int i = 0; i < numTokens; i++) { + String term = null; + int positionIncrement = 1; + try { + boolean hasNext = buffer.incrementToken(); + assert hasNext == true; + term = termAtt.term(); + if (posIncrAtt != null) { + positionIncrement = posIncrAtt.getPositionIncrement(); + } + } catch (IOException e) { + // safe to ignore, because we know the number of tokens + } + + if (positionIncrement > 0 && multiTerms.size() > 0) { + if (enablePositionIncrements) { + mpq.add((Term[])multiTerms.toArray(new Term[0]),position); + } else { + mpq.add((Term[])multiTerms.toArray(new Term[0])); + } + multiTerms.clear(); + } + position += positionIncrement; + multiTerms.add(new Term(field, term)); + } + if (enablePositionIncrements) { + mpq.add((Term[])multiTerms.toArray(new Term[0]),position); + } else { + mpq.add((Term[])multiTerms.toArray(new Term[0])); + } + return mpq; + } + } + else { + PhraseQuery pq = newPhraseQuery(); + pq.setSlop(phraseSlop); + int position = -1; + + + for (int i = 0; i < numTokens; i++) { + String term = null; + int positionIncrement = 1; + + try { + boolean hasNext = buffer.incrementToken(); + assert hasNext == true; + term = termAtt.term(); + if (posIncrAtt != null) { + positionIncrement = posIncrAtt.getPositionIncrement(); + } + } catch (IOException e) { + // safe to ignore, because we know the number of tokens + } + + if (enablePositionIncrements) { + position += positionIncrement; + pq.add(new Term(field, term),position); + } else { + pq.add(new Term(field, term)); + } + } + return pq; + } + } + } + + + + /** + * Base implementation delegates to {@link #getFieldQuery(String,String)}. + * This method may be overridden, for example, to return + * a SpanNearQuery instead of a PhraseQuery. 
+ * + * @exception ParseException throw in overridden method to disallow + */ + protected Query getFieldQuery(String field, String queryText, int slop) + throws ParseException { + Query query = getFieldQuery(field, queryText); + + if (query instanceof PhraseQuery) { + ((PhraseQuery) query).setSlop(slop); + } + if (query instanceof MultiPhraseQuery) { + ((MultiPhraseQuery) query).setSlop(slop); + } + + return query; + } + + + /** + * @exception ParseException throw in overridden method to disallow + */ + protected Query getRangeQuery(String field, + String part1, + String part2, + boolean inclusive) throws ParseException + { + if (lowercaseExpandedTerms) { + part1 = part1.toLowerCase(); + part2 = part2.toLowerCase(); + } + try { + DateFormat df = DateFormat.getDateInstance(DateFormat.SHORT, locale); + df.setLenient(true); + Date d1 = df.parse(part1); + Date d2 = df.parse(part2); + if (inclusive) { + // The user can only specify the date, not the time, so make sure + // the time is set to the latest possible time of that date to really + // include all documents: + Calendar cal = Calendar.getInstance(locale); + cal.setTime(d2); + cal.set(Calendar.HOUR_OF_DAY, 23); + cal.set(Calendar.MINUTE, 59); + cal.set(Calendar.SECOND, 59); + cal.set(Calendar.MILLISECOND, 999); + d2 = cal.getTime(); + } + DateTools.Resolution resolution = getDateResolution(field); + if (resolution == null) { + // no default or field specific date resolution has been set, + // use deprecated DateField to maintain compatibility with + // pre-1.9 Lucene versions. + part1 = DateField.dateToString(d1); + part2 = DateField.dateToString(d2); + } else { + part1 = DateTools.dateToString(d1, resolution); + part2 = DateTools.dateToString(d2, resolution); + } + } + catch (Exception e) { } + + return newRangeQuery(field, part1, part2, inclusive); + } + + /** + * Builds a new BooleanQuery instance + * @param disableCoord disable coord + * @return new BooleanQuery instance + */ + protected BooleanQuery newBooleanQuery(boolean disableCoord) { + return new BooleanQuery(disableCoord); + } + + /** + * Builds a new BooleanClause instance + * @param q sub query + * @param occur how this clause should occur when matching documents + * @return new BooleanClause instance + */ + protected BooleanClause newBooleanClause(Query q, BooleanClause.Occur occur) { + return new BooleanClause(q, occur); + } + + /** + * Builds a new TermQuery instance + * @param term term + * @return new TermQuery instance + */ + protected Query newTermQuery(Term term){ + return new TermQuery(term); + } + + /** + * Builds a new PhraseQuery instance + * @return new PhraseQuery instance + */ + protected PhraseQuery newPhraseQuery(){ + return new PhraseQuery(); + } + + /** + * Builds a new MultiPhraseQuery instance + * @return new MultiPhraseQuery instance + */ + protected MultiPhraseQuery newMultiPhraseQuery(){ + return new MultiPhraseQuery(); + } + + /** + * Builds a new PrefixQuery instance + * @param prefix Prefix term + * @return new PrefixQuery instance + */ + protected Query newPrefixQuery(Term prefix){ + PrefixQuery query = new PrefixQuery(prefix); + query.setRewriteMethod(multiTermRewriteMethod); + return query; + } + + /** + * Builds a new FuzzyQuery instance + * @param term Term + * @param minimumSimilarity minimum similarity + * @param prefixLength prefix length + * @return new FuzzyQuery Instance + */ + protected Query newFuzzyQuery(Term term, float minimumSimilarity, int prefixLength) { + // FuzzyQuery doesn't yet allow constant score rewrite + return new 
FuzzyQuery(term,minimumSimilarity,prefixLength); + } + + /** + * Builds a new TermRangeQuery instance + * @param field Field + * @param part1 min + * @param part2 max + * @param inclusive true if range is inclusive + * @return new TermRangeQuery instance + */ + protected Query newRangeQuery(String field, String part1, String part2, boolean inclusive) { + final TermRangeQuery query = new TermRangeQuery(field, part1, part2, inclusive, inclusive, rangeCollator); + query.setRewriteMethod(multiTermRewriteMethod); + return query; + } + + /** + * Builds a new MatchAllDocsQuery instance + * @return new MatchAllDocsQuery instance + */ + protected Query newMatchAllDocsQuery() { + return new MatchAllDocsQuery(); + } + + /** + * Builds a new WildcardQuery instance + * @param t wildcard term + * @return new WildcardQuery instance + */ + protected Query newWildcardQuery(Term t) { + WildcardQuery query = new WildcardQuery(t); + query.setRewriteMethod(multiTermRewriteMethod); + return query; + } + + /** + * Factory method for generating query, given a set of clauses. + * By default creates a boolean query composed of clauses passed in. + * + * Can be overridden by extending classes, to modify query being + * returned. + * + * @param clauses List that contains {@link BooleanClause} instances + * to join. + * + * @return Resulting {@link Query} object. + * @exception ParseException throw in overridden method to disallow + * @deprecated use {@link #getBooleanQuery(List)} instead + */ + protected Query getBooleanQuery(Vector clauses) throws ParseException { + return getBooleanQuery((List) clauses, false); + } + + /** + * Factory method for generating query, given a set of clauses. + * By default creates a boolean query composed of clauses passed in. + * + * Can be overridden by extending classes, to modify query being + * returned. + * + * @param clauses List that contains {@link BooleanClause} instances + * to join. + * + * @return Resulting {@link Query} object. + * @exception ParseException throw in overridden method to disallow + */ + protected Query getBooleanQuery(List clauses) throws ParseException { + return getBooleanQuery(clauses, false); + } + + /** + * Factory method for generating query, given a set of clauses. + * By default creates a boolean query composed of clauses passed in. + * + * Can be overridden by extending classes, to modify query being + * returned. + * + * @param clauses List that contains {@link BooleanClause} instances + * to join. + * @param disableCoord true if coord scoring should be disabled. + * + * @return Resulting {@link Query} object. + * @exception ParseException throw in overridden method to disallow + * @deprecated use {@link #getBooleanQuery(List, boolean)} instead + */ + protected Query getBooleanQuery(Vector clauses, boolean disableCoord) + throws ParseException + { + return getBooleanQuery((List) clauses, disableCoord); + } + + /** + * Factory method for generating query, given a set of clauses. + * By default creates a boolean query composed of clauses passed in. + * + * Can be overridden by extending classes, to modify query being + * returned. + * + * @param clauses List that contains {@link BooleanClause} instances + * to join. + * @param disableCoord true if coord scoring should be disabled. + * + * @return Resulting {@link Query} object. 
+ * @exception ParseException throw in overridden method to disallow + */ + protected Query getBooleanQuery(List clauses, boolean disableCoord) + throws ParseException + { + if (clauses.size()==0) { + return null; // all clause words were filtered away by the analyzer. + } + BooleanQuery query = newBooleanQuery(disableCoord); + for (int i = 0; i < clauses.size(); i++) { + query.add((BooleanClause)clauses.get(i)); + } + return query; + } + + /** + * Factory method for generating a query. Called when parser + * parses an input term token that contains one or more wildcard + * characters (? and *), but is not a prefix term token (one + * that has just a single * character at the end) + *
+ * <p>
+ * Depending on settings, prefix term may be lower-cased
+ * automatically. It will not go through the default Analyzer,
+ * however, since normal Analyzers are unlikely to work properly
+ * with wildcard templates.
+ * <p>
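+ * As noted below, subclasses may override this method; a sketch that
+ * disallows wildcard queries entirely:
+ * <pre>
+ *   protected Query getWildcardQuery(String field, String termStr)
+ *       throws ParseException {
+ *     throw new ParseException("Wildcard queries are not supported");
+ *   }
+ * </pre>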
+ * Can be overridden by extending classes, to provide custom handling for + * wildcard queries, which may be necessary due to missing analyzer calls. + * + * @param field Name of the field query will use. + * @param termStr Term token that contains one or more wild card + * characters (? or *), but is not simple prefix term + * + * @return Resulting {@link Query} built for the term + * @exception ParseException throw in overridden method to disallow + */ + protected Query getWildcardQuery(String field, String termStr) throws ParseException + { + if ("*".equals(field)) { + if ("*".equals(termStr)) return newMatchAllDocsQuery(); + } + if (!allowLeadingWildcard && (termStr.startsWith("*") || termStr.startsWith("?"))) + throw new ParseException("'*' or '?' not allowed as first character in WildcardQuery"); + if (lowercaseExpandedTerms) { + termStr = termStr.toLowerCase(); + } + Term t = new Term(field, termStr); + return newWildcardQuery(t); + } + + /** + * Factory method for generating a query (similar to + * {@link #getWildcardQuery}). Called when parser parses an input term + * token that uses prefix notation; that is, contains a single '*' wildcard + * character as its last character. Since this is a special case + * of generic wildcard term, and such a query can be optimized easily, + * this usually results in a different query object. + *
+ * <p>
+ * Depending on settings, a prefix term may be lower-cased
+ * automatically. It will not go through the default Analyzer,
+ * however, since normal Analyzers are unlikely to work properly
+ * with wildcard templates.
+ * <p>
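+ * For example (sketch; default settings lower-case the expanded term):
+ * <pre>
+ *   Query q = qp.parse("title:Luc*");   // PrefixQuery on term title:luc
+ * </pre>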
+ * Can be overridden by extending classes, to provide custom handling for + * wild card queries, which may be necessary due to missing analyzer calls. + * + * @param field Name of the field query will use. + * @param termStr Term token to use for building term for the query + * (without trailing '*' character!) + * + * @return Resulting {@link Query} built for the term + * @exception ParseException throw in overridden method to disallow + */ + protected Query getPrefixQuery(String field, String termStr) throws ParseException + { + if (!allowLeadingWildcard && termStr.startsWith("*")) + throw new ParseException("'*' not allowed as first character in PrefixQuery"); + if (lowercaseExpandedTerms) { + termStr = termStr.toLowerCase(); + } + Term t = new Term(field, termStr); + return newPrefixQuery(t); + } + + /** + * Factory method for generating a query (similar to + * {@link #getWildcardQuery}). Called when parser parses + * an input term token that has the fuzzy suffix (~) appended. + * + * @param field Name of the field query will use. + * @param termStr Term token to use for building term for the query + * + * @return Resulting {@link Query} built for the term + * @exception ParseException throw in overridden method to disallow + */ + protected Query getFuzzyQuery(String field, String termStr, float minSimilarity) throws ParseException + { + if (lowercaseExpandedTerms) { + termStr = termStr.toLowerCase(); + } + Term t = new Term(field, termStr); + return newFuzzyQuery(t, minSimilarity, fuzzyPrefixLength); + } + + /** + * Returns a String where the escape char has been + * removed, or kept only once if there was a double escape. + * + * Supports escaped unicode characters, e. g. translates + * \\u0041 to A. + * + */ + private String discardEscapeChar(String input) throws ParseException { + // Create char array to hold unescaped char sequence + char[] output = new char[input.length()]; + + // The length of the output can be less than the input + // due to discarded escape chars. This variable holds + // the actual length of the output + int length = 0; + + // We remember whether the last processed character was + // an escape character + boolean lastCharWasEscapeChar = false; + + // The multiplier the current unicode digit must be multiplied with. + // E. g. the first digit must be multiplied with 16^3, the second with 16^2... 
+ int codePointMultiplier = 0; + + // Used to calculate the codepoint of the escaped unicode character + int codePoint = 0; + + for (int i = 0; i < input.length(); i++) { + char curChar = input.charAt(i); + if (codePointMultiplier > 0) { + codePoint += hexToInt(curChar) * codePointMultiplier; + codePointMultiplier >>>= 4; + if (codePointMultiplier == 0) { + output[length++] = (char)codePoint; + codePoint = 0; + } + } else if (lastCharWasEscapeChar) { + if (curChar == 'u') { + // found an escaped unicode character + codePointMultiplier = 16 * 16 * 16; + } else { + // this character was escaped + output[length] = curChar; + length++; + } + lastCharWasEscapeChar = false; + } else { + if (curChar == '\u005c\u005c') { + lastCharWasEscapeChar = true; + } else { + output[length] = curChar; + length++; + } + } + } + + if (codePointMultiplier > 0) { + throw new ParseException("Truncated unicode escape sequence."); + } + + if (lastCharWasEscapeChar) { + throw new ParseException("Term can not end with escape character."); + } + + return new String(output, 0, length); + } + + /** Returns the numeric value of the hexadecimal character */ + private static final int hexToInt(char c) throws ParseException { + if ('0' <= c && c <= '9') { + return c - '0'; + } else if ('a' <= c && c <= 'f'){ + return c - 'a' + 10; + } else if ('A' <= c && c <= 'F') { + return c - 'A' + 10; + } else { + throw new ParseException("None-hex character in unicode escape sequence: " + c); + } + } + + /** + * Returns a String where those characters that InvenioQueryParser + * expects to be escaped are escaped by a preceding \. + */ + public static String escape(String s) { + StringBuffer sb = new StringBuffer(); + for (int i = 0; i < s.length(); i++) { + char c = s.charAt(i); + // These characters are part of the query syntax and must be escaped + if (c == '\u005c\u005c' || c == '+' || c == '-' || c == '!' || c == '(' || c == ')' || c == ':' + || c == '^' || c == '[' || c == ']' || c == '\u005c"' || c == '{' || c == '}' || c == '~' + || c == '*' || c == '?' || c == '|' || c == '&') { + sb.append('\u005c\u005c'); + } + sb.append(c); + } + return sb.toString(); + } + + /** + * Command line tool to test InvenioQueryParser, using {@link org.apache.lucene.analysis.SimpleAnalyzer}. + * Usage:
+ * java org.apache.lucene.queryParser.InvenioQueryParser <input> + */ + public static void main(String[] args) throws Exception { + if (args.length == 0) { + System.out.println("Usage: java org.apache.lucene.queryParser.InvenioQueryParser "); + System.exit(0); + } + InvenioQueryParser qp = new InvenioQueryParser(Version.LUCENE_CURRENT, "field", + new org.apache.lucene.analysis.SimpleAnalyzer()); + Query q = qp.parse(args[0]); + System.out.println(q.toString("field")); + } + +// * Query ::= ( Clause )* +// * Clause ::= ["+", "-"] [ ":"] ( | "(" Query ")" ) + final public int Conjunction() throws ParseException { + int ret = CONJ_NONE; + switch ((jj_ntk==-1)?jj_ntk():jj_ntk) { + case AND: + case OR: + switch ((jj_ntk==-1)?jj_ntk():jj_ntk) { + case AND: + jj_consume_token(AND); + ret = CONJ_AND; + break; + case OR: + jj_consume_token(OR); + ret = CONJ_OR; + break; + default: + jj_la1[0] = jj_gen; + jj_consume_token(-1); + throw new ParseException(); + } + break; + default: + jj_la1[1] = jj_gen; + ; + } + {if (true) return ret;} + throw new Error("Missing return statement in function"); + } + + final public int Modifiers() throws ParseException { + int ret = MOD_NONE; + switch ((jj_ntk==-1)?jj_ntk():jj_ntk) { + case NOT: + case PLUS: + case MINUS: + switch ((jj_ntk==-1)?jj_ntk():jj_ntk) { + case PLUS: + jj_consume_token(PLUS); + ret = MOD_REQ; + break; + case MINUS: + jj_consume_token(MINUS); + ret = MOD_NOT; + break; + case NOT: + jj_consume_token(NOT); + ret = MOD_NOT; + break; + default: + jj_la1[2] = jj_gen; + jj_consume_token(-1); + throw new ParseException(); + } + break; + default: + jj_la1[3] = jj_gen; + ; + } + {if (true) return ret;} + throw new Error("Missing return statement in function"); + } + +// This makes sure that there is no garbage after the query string + final public Query TopLevelQuery(String field) throws ParseException { + Query q; + q = Query(field); + jj_consume_token(0); + {if (true) return q;} + throw new Error("Missing return statement in function"); + } + + final public Query Query(String field) throws ParseException { + List clauses = new ArrayList(); + Query q, firstQuery=null; + int conj, mods; + mods = Modifiers(); + q = Clause(field); + addClause(clauses, CONJ_NONE, mods, q); + if (mods == MOD_NONE) + firstQuery=q; + label_1: + while (true) { + switch ((jj_ntk==-1)?jj_ntk():jj_ntk) { + case AND: + case OR: + case NOT: + case PLUS: + case MINUS: + case LPAREN: + case STAR: + case QUOTED: + case QUOTED_PARTIAL: + case TERM: + case PREFIXTERM: + case WILDTERM: + case RANGEIN_START: + case RANGEEX_START: + case REGEX_TERM: + case NUMBER: + ; + break; + default: + jj_la1[4] = jj_gen; + break label_1; + } + conj = Conjunction(); + mods = Modifiers(); + q = Clause(field); + addClause(clauses, conj, mods, q); + } + if (clauses.size() == 1 && firstQuery != null) + {if (true) return firstQuery;} + else { + {if (true) return getBooleanQuery(clauses);} + } + throw new Error("Missing return statement in function"); + } + + final public Query Clause(String field) throws ParseException { + Query q; + Token fieldToken=null, boost=null; + if (jj_2_1(2)) { + switch ((jj_ntk==-1)?jj_ntk():jj_ntk) { + case TERM: + fieldToken = jj_consume_token(TERM); + jj_consume_token(COLON); + field=discardEscapeChar(fieldToken.image); + break; + case STAR: + jj_consume_token(STAR); + jj_consume_token(COLON); + field="*"; + break; + default: + jj_la1[5] = jj_gen; + jj_consume_token(-1); + throw new ParseException(); + } + } else { + ; + } + switch ((jj_ntk==-1)?jj_ntk():jj_ntk) { + case 
STAR: + case QUOTED: + case QUOTED_PARTIAL: + case TERM: + case PREFIXTERM: + case WILDTERM: + case RANGEIN_START: + case RANGEEX_START: + case REGEX_TERM: + case NUMBER: + q = Term(field); + break; + case LPAREN: + jj_consume_token(LPAREN); + q = Query(field); + jj_consume_token(RPAREN); + switch ((jj_ntk==-1)?jj_ntk():jj_ntk) { + case CARAT: + jj_consume_token(CARAT); + boost = jj_consume_token(NUMBER); + break; + default: + jj_la1[6] = jj_gen; + ; + } + break; + default: + jj_la1[7] = jj_gen; + jj_consume_token(-1); + throw new ParseException(); + } + if (boost != null) { + float f = (float)1.0; + try { + f = Float.valueOf(boost.image).floatValue(); + q.setBoost(f); + } catch (Exception ignored) { } + } + {if (true) return q;} + throw new Error("Missing return statement in function"); + } + + final public Query Term(String field) throws ParseException { + Token term, boost=null, fuzzySlop=null, goop1, goop2; + boolean prefix = false; + boolean wildcard = false; + boolean fuzzy = false; + Query q; + switch ((jj_ntk==-1)?jj_ntk():jj_ntk) { + case STAR: + case TERM: + case PREFIXTERM: + case WILDTERM: + case REGEX_TERM: + case NUMBER: + switch ((jj_ntk==-1)?jj_ntk():jj_ntk) { + case TERM: + term = jj_consume_token(TERM); + break; + case STAR: + term = jj_consume_token(STAR); + wildcard=true; + break; + case PREFIXTERM: + term = jj_consume_token(PREFIXTERM); + prefix=true; + break; + case WILDTERM: + term = jj_consume_token(WILDTERM); + wildcard=true; + break; + case NUMBER: + term = jj_consume_token(NUMBER); + break; + case REGEX_TERM: + term = jj_consume_token(REGEX_TERM); + break; + default: + jj_la1[8] = jj_gen; + jj_consume_token(-1); + throw new ParseException(); + } + switch ((jj_ntk==-1)?jj_ntk():jj_ntk) { + case FUZZY_SLOP: + fuzzySlop = jj_consume_token(FUZZY_SLOP); + fuzzy=true; + break; + default: + jj_la1[9] = jj_gen; + ; + } + switch ((jj_ntk==-1)?jj_ntk():jj_ntk) { + case CARAT: + jj_consume_token(CARAT); + boost = jj_consume_token(NUMBER); + switch ((jj_ntk==-1)?jj_ntk():jj_ntk) { + case FUZZY_SLOP: + fuzzySlop = jj_consume_token(FUZZY_SLOP); + fuzzy=true; + break; + default: + jj_la1[10] = jj_gen; + ; + } + break; + default: + jj_la1[11] = jj_gen; + ; + } + String termImage=discardEscapeChar(term.image); + if (wildcard) { + q = getWildcardQuery(field, termImage); + } else if (prefix) { + q = getPrefixQuery(field, + discardEscapeChar(term.image.substring + (0, term.image.length()-1))); + } else if (fuzzy) { + float fms = fuzzyMinSim; + try { + fms = Float.valueOf(fuzzySlop.image.substring(1)).floatValue(); + } catch (Exception ignored) { } + if(fms < 0.0f || fms > 1.0f){ + {if (true) throw new ParseException("Minimum similarity for a FuzzyQuery has to be between 0.0f and 1.0f !");} + } + q = getFuzzyQuery(field, termImage,fms); + } else { + q = getFieldQuery(field, termImage); + } + break; + case RANGEIN_START: + jj_consume_token(RANGEIN_START); + switch ((jj_ntk==-1)?jj_ntk():jj_ntk) { + case RANGEIN_GOOP: + goop1 = jj_consume_token(RANGEIN_GOOP); + break; + case RANGEIN_QUOTED: + goop1 = jj_consume_token(RANGEIN_QUOTED); + break; + default: + jj_la1[12] = jj_gen; + jj_consume_token(-1); + throw new ParseException(); + } + switch ((jj_ntk==-1)?jj_ntk():jj_ntk) { + case RANGEIN_TO: + jj_consume_token(RANGEIN_TO); + break; + default: + jj_la1[13] = jj_gen; + ; + } + switch ((jj_ntk==-1)?jj_ntk():jj_ntk) { + case RANGEIN_GOOP: + goop2 = jj_consume_token(RANGEIN_GOOP); + break; + case RANGEIN_QUOTED: + goop2 = jj_consume_token(RANGEIN_QUOTED); + break; + default: + jj_la1[14] = 
jj_gen; + jj_consume_token(-1); + throw new ParseException(); + } + jj_consume_token(RANGEIN_END); + switch ((jj_ntk==-1)?jj_ntk():jj_ntk) { + case CARAT: + jj_consume_token(CARAT); + boost = jj_consume_token(NUMBER); + break; + default: + jj_la1[15] = jj_gen; + ; + } + if (goop1.kind == RANGEIN_QUOTED) { + goop1.image = goop1.image.substring(1, goop1.image.length()-1); + } + if (goop2.kind == RANGEIN_QUOTED) { + goop2.image = goop2.image.substring(1, goop2.image.length()-1); + } + q = getRangeQuery(field, discardEscapeChar(goop1.image), discardEscapeChar(goop2.image), true); + break; + case RANGEEX_START: + jj_consume_token(RANGEEX_START); + switch ((jj_ntk==-1)?jj_ntk():jj_ntk) { + case RANGEEX_GOOP: + goop1 = jj_consume_token(RANGEEX_GOOP); + break; + case RANGEEX_QUOTED: + goop1 = jj_consume_token(RANGEEX_QUOTED); + break; + default: + jj_la1[16] = jj_gen; + jj_consume_token(-1); + throw new ParseException(); + } + switch ((jj_ntk==-1)?jj_ntk():jj_ntk) { + case RANGEEX_TO: + jj_consume_token(RANGEEX_TO); + break; + default: + jj_la1[17] = jj_gen; + ; + } + switch ((jj_ntk==-1)?jj_ntk():jj_ntk) { + case RANGEEX_GOOP: + goop2 = jj_consume_token(RANGEEX_GOOP); + break; + case RANGEEX_QUOTED: + goop2 = jj_consume_token(RANGEEX_QUOTED); + break; + default: + jj_la1[18] = jj_gen; + jj_consume_token(-1); + throw new ParseException(); + } + jj_consume_token(RANGEEX_END); + switch ((jj_ntk==-1)?jj_ntk():jj_ntk) { + case CARAT: + jj_consume_token(CARAT); + boost = jj_consume_token(NUMBER); + break; + default: + jj_la1[19] = jj_gen; + ; + } + if (goop1.kind == RANGEEX_QUOTED) { + goop1.image = goop1.image.substring(1, goop1.image.length()-1); + } + if (goop2.kind == RANGEEX_QUOTED) { + goop2.image = goop2.image.substring(1, goop2.image.length()-1); + } + + q = getRangeQuery(field, discardEscapeChar(goop1.image), discardEscapeChar(goop2.image), false); + break; + case QUOTED: + term = jj_consume_token(QUOTED); + switch ((jj_ntk==-1)?jj_ntk():jj_ntk) { + case FUZZY_SLOP: + fuzzySlop = jj_consume_token(FUZZY_SLOP); + break; + default: + jj_la1[20] = jj_gen; + ; + } + switch ((jj_ntk==-1)?jj_ntk():jj_ntk) { + case CARAT: + jj_consume_token(CARAT); + boost = jj_consume_token(NUMBER); + break; + default: + jj_la1[21] = jj_gen; + ; + } + int s = phraseSlop; + + if (fuzzySlop != null) { + try { + s = Float.valueOf(fuzzySlop.image.substring(1)).intValue(); + } + catch (Exception ignored) { } + } + q = getFieldQuery(field, discardEscapeChar(term.image.substring(1, term.image.length()-1)), s); + break; + case QUOTED_PARTIAL: + term = jj_consume_token(QUOTED_PARTIAL); + switch ((jj_ntk==-1)?jj_ntk():jj_ntk) { + case FUZZY_SLOP: + fuzzySlop = jj_consume_token(FUZZY_SLOP); + break; + default: + jj_la1[22] = jj_gen; + ; + } + switch ((jj_ntk==-1)?jj_ntk():jj_ntk) { + case CARAT: + jj_consume_token(CARAT); + boost = jj_consume_token(NUMBER); + break; + default: + jj_la1[23] = jj_gen; + ; + } + int partialSlop = 2; + + if (fuzzySlop != null) { + try { + partialSlop = Float.valueOf(fuzzySlop.image.substring(1)).intValue(); + } + catch (Exception ignored) { + partialSlop = 2; + } + } + q = getFieldQuery(field, discardEscapeChar(term.image.substring(1, term.image.length()-1)), partialSlop); + break; + default: + jj_la1[24] = jj_gen; + jj_consume_token(-1); + throw new ParseException(); + } + if (boost != null) { + float f = (float) 1.0; + try { + f = Float.valueOf(boost.image).floatValue(); + } + catch (Exception ignored) { + /* Should this be handled somehow? 
(defaults to "no boost", if + * boost number is invalid) + */ + } + + // avoid boosting null queries, such as those caused by stop words + if (q != null) { + q.setBoost(f); + } + } + {if (true) return q;} + throw new Error("Missing return statement in function"); + } + + private boolean jj_2_1(int xla) { + jj_la = xla; jj_lastpos = jj_scanpos = token; + try { return !jj_3_1(); } + catch(LookaheadSuccess ls) { return true; } + finally { jj_save(0, xla); } + } + + private boolean jj_3R_3() { + if (jj_scan_token(STAR)) return true; + if (jj_scan_token(COLON)) return true; + return false; + } + + private boolean jj_3R_2() { + if (jj_scan_token(TERM)) return true; + if (jj_scan_token(COLON)) return true; + return false; + } + + private boolean jj_3_1() { + Token xsp; + xsp = jj_scanpos; + if (jj_3R_2()) { + jj_scanpos = xsp; + if (jj_3R_3()) return true; + } + return false; + } + + /** Generated Token Manager. */ + public InvenioQueryParserTokenManager token_source; + /** Current token. */ + public Token token; + /** Next token. */ + public Token jj_nt; + private int jj_ntk; + private Token jj_scanpos, jj_lastpos; + private int jj_la; + private int jj_gen; + final private int[] jj_la1 = new int[25]; + static private int[] jj_la1_0; + static private int[] jj_la1_1; + static { + jj_la1_init_0(); + jj_la1_init_1(); + } + private static void jj_la1_init_0() { + jj_la1_0 = new int[] {0xc00,0xc00,0x7000,0x7000,0x3f74fc00,0x440000,0x80000,0x3f748000,0x33440000,0x800000,0x800000,0x80000,0x0,0x40000000,0x0,0x80000,0x0,0x0,0x0,0x80000,0x800000,0x80000,0x800000,0x80000,0x3f740000,}; + } + private static void jj_la1_init_1() { + jj_la1_1 = new int[] {0x0,0x0,0x0,0x0,0x0,0x0,0x0,0x0,0x0,0x0,0x0,0x0,0x3,0x0,0x3,0x0,0x30,0x4,0x30,0x0,0x0,0x0,0x0,0x0,0x0,}; + } + final private JJCalls[] jj_2_rtns = new JJCalls[1]; + private boolean jj_rescan = false; + private int jj_gc = 0; + + /** Constructor with user supplied CharStream. */ + protected InvenioQueryParser(CharStream stream) { + token_source = new InvenioQueryParserTokenManager(stream); + token = new Token(); + jj_ntk = -1; + jj_gen = 0; + for (int i = 0; i < 25; i++) jj_la1[i] = -1; + for (int i = 0; i < jj_2_rtns.length; i++) jj_2_rtns[i] = new JJCalls(); + } + + /** Reinitialise. */ + public void ReInit(CharStream stream) { + token_source.ReInit(stream); + token = new Token(); + jj_ntk = -1; + jj_gen = 0; + for (int i = 0; i < 25; i++) jj_la1[i] = -1; + for (int i = 0; i < jj_2_rtns.length; i++) jj_2_rtns[i] = new JJCalls(); + } + + /** Constructor with generated Token Manager. */ + protected InvenioQueryParser(InvenioQueryParserTokenManager tm) { + token_source = tm; + token = new Token(); + jj_ntk = -1; + jj_gen = 0; + for (int i = 0; i < 25; i++) jj_la1[i] = -1; + for (int i = 0; i < jj_2_rtns.length; i++) jj_2_rtns[i] = new JJCalls(); + } + + /** Reinitialise. 
*/ + public void ReInit(InvenioQueryParserTokenManager tm) { + token_source = tm; + token = new Token(); + jj_ntk = -1; + jj_gen = 0; + for (int i = 0; i < 25; i++) jj_la1[i] = -1; + for (int i = 0; i < jj_2_rtns.length; i++) jj_2_rtns[i] = new JJCalls(); + } + + private Token jj_consume_token(int kind) throws ParseException { + Token oldToken; + if ((oldToken = token).next != null) token = token.next; + else token = token.next = token_source.getNextToken(); + jj_ntk = -1; + if (token.kind == kind) { + jj_gen++; + if (++jj_gc > 100) { + jj_gc = 0; + for (int i = 0; i < jj_2_rtns.length; i++) { + JJCalls c = jj_2_rtns[i]; + while (c != null) { + if (c.gen < jj_gen) c.first = null; + c = c.next; + } + } + } + return token; + } + token = oldToken; + jj_kind = kind; + throw generateParseException(); + } + + static private final class LookaheadSuccess extends java.lang.Error { } + final private LookaheadSuccess jj_ls = new LookaheadSuccess(); + private boolean jj_scan_token(int kind) { + if (jj_scanpos == jj_lastpos) { + jj_la--; + if (jj_scanpos.next == null) { + jj_lastpos = jj_scanpos = jj_scanpos.next = token_source.getNextToken(); + } else { + jj_lastpos = jj_scanpos = jj_scanpos.next; + } + } else { + jj_scanpos = jj_scanpos.next; + } + if (jj_rescan) { + int i = 0; Token tok = token; + while (tok != null && tok != jj_scanpos) { i++; tok = tok.next; } + if (tok != null) jj_add_error_token(kind, i); + } + if (jj_scanpos.kind != kind) return true; + if (jj_la == 0 && jj_scanpos == jj_lastpos) throw jj_ls; + return false; + } + + +/** Get the next Token. */ + final public Token getNextToken() { + if (token.next != null) token = token.next; + else token = token.next = token_source.getNextToken(); + jj_ntk = -1; + jj_gen++; + return token; + } + +/** Get the specific Token. */ + final public Token getToken(int index) { + Token t = token; + for (int i = 0; i < index; i++) { + if (t.next != null) t = t.next; + else t = t.next = token_source.getNextToken(); + } + return t; + } + + private int jj_ntk() { + if ((jj_nt=token.next) == null) + return (jj_ntk = (token.next=token_source.getNextToken()).kind); + else + return (jj_ntk = jj_nt.kind); + } + + private java.util.List jj_expentries = new java.util.ArrayList(); + private int[] jj_expentry; + private int jj_kind = -1; + private int[] jj_lasttokens = new int[100]; + private int jj_endpos; + + private void jj_add_error_token(int kind, int pos) { + if (pos >= 100) return; + if (pos == jj_endpos + 1) { + jj_lasttokens[jj_endpos++] = kind; + } else if (jj_endpos != 0) { + jj_expentry = new int[jj_endpos]; + for (int i = 0; i < jj_endpos; i++) { + jj_expentry[i] = jj_lasttokens[i]; + } + jj_entries_loop: for (java.util.Iterator it = jj_expentries.iterator(); it.hasNext();) { + int[] oldentry = (int[])(it.next()); + if (oldentry.length == jj_expentry.length) { + for (int i = 0; i < jj_expentry.length; i++) { + if (oldentry[i] != jj_expentry[i]) { + continue jj_entries_loop; + } + } + jj_expentries.add(jj_expentry); + break jj_entries_loop; + } + } + if (pos != 0) jj_lasttokens[(jj_endpos = pos) - 1] = kind; + } + } + + /** Generate ParseException. 
*/ + public ParseException generateParseException() { + jj_expentries.clear(); + boolean[] la1tokens = new boolean[38]; + if (jj_kind >= 0) { + la1tokens[jj_kind] = true; + jj_kind = -1; + } + for (int i = 0; i < 25; i++) { + if (jj_la1[i] == jj_gen) { + for (int j = 0; j < 32; j++) { + if ((jj_la1_0[i] & (1< jj_gen) { + jj_la = p.arg; jj_lastpos = jj_scanpos = p.first; + switch (i) { + case 0: jj_3_1(); break; + } + } + p = p.next; + } while (p != null); + } catch(LookaheadSuccess ls) { } + } + jj_rescan = false; + } + + private void jj_save(int index, int xla) { + JJCalls p = jj_2_rtns[index]; + while (p.gen > jj_gen) { + if (p.next == null) { p = p.next = new JJCalls(); break; } + p = p.next; + } + p.gen = jj_gen + xla - jj_la; p.first = token; p.arg = xla; + } + + static final class JJCalls { + int gen; + Token first; + int arg; + JJCalls next; + } + +} diff --git a/src/java/org/apache/lucene/queryParser/InvenioQueryParser.jj b/src/java/org/apache/lucene/queryParser/InvenioQueryParser.jj new file mode 100644 index 000000000..9eff146ad --- /dev/null +++ b/src/java/org/apache/lucene/queryParser/InvenioQueryParser.jj @@ -0,0 +1,1493 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +options { + STATIC=false; + JAVA_UNICODE_ESCAPE=true; + USER_CHAR_STREAM=true; +} + +PARSER_BEGIN(InvenioQueryParser) + +package org.apache.lucene.queryParser; + +import java.io.IOException; +import java.io.StringReader; +import java.text.Collator; +import java.text.DateFormat; +import java.util.ArrayList; +import java.util.Calendar; +import java.util.Date; +import java.util.HashMap; +import java.util.List; +import java.util.Locale; +import java.util.Map; +import java.util.Vector; + +import org.apache.lucene.analysis.Analyzer; +import org.apache.lucene.analysis.CachingTokenFilter; +import org.apache.lucene.analysis.TokenStream; +import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute; +import org.apache.lucene.analysis.tokenattributes.TermAttribute; +import org.apache.lucene.document.DateField; +import org.apache.lucene.document.DateTools; +import org.apache.lucene.index.Term; +import org.apache.lucene.search.BooleanClause; +import org.apache.lucene.search.BooleanQuery; +import org.apache.lucene.search.FuzzyQuery; +import org.apache.lucene.search.MultiTermQuery; +import org.apache.lucene.search.MatchAllDocsQuery; +import org.apache.lucene.search.MultiPhraseQuery; +import org.apache.lucene.search.PhraseQuery; +import org.apache.lucene.search.PrefixQuery; +import org.apache.lucene.search.Query; +import org.apache.lucene.search.TermRangeQuery; +import org.apache.lucene.search.TermQuery; +import org.apache.lucene.search.WildcardQuery; +import org.apache.lucene.util.Parameter; +import org.apache.lucene.util.Version; + +/** + * This class is generated by JavaCC. The most important method is + * {@link #parse(String)}. + * + * The syntax for query strings is as follows: + * A Query is a series of clauses. + * A clause may be prefixed by: + *
+ * <ul>
+ * <li> a plus (+) or a minus (-) sign, indicating
+ * that the clause is required or prohibited respectively; or
+ * <li> a term followed by a colon, indicating the field to be searched.
+ * This enables one to construct queries which search multiple fields.
+ * </ul>
+ *
+ * A clause may be either:
+ * <ul>
+ * <li> a term, indicating all the documents that contain this term; or
+ * <li> a nested query, enclosed in parentheses.  Note that this may be used
+ * with a +/- prefix to require any of a set of
+ * terms.
+ * </ul>
+ *
+ * Thus, in BNF, the query grammar is:
+ * <pre>
+ *   Query  ::= ( Clause )*
+ *   Clause ::= ["+", "-"] [&lt;TERM&gt; ":"] ( &lt;TERM&gt; | "(" Query ")" )
+ * </pre>
+ *
+ * <p>
+ * Examples of appropriately formatted queries can be found in the query syntax
+ * documentation.
+ * </p>
+ *
+ * <p>
+ * In {@link TermRangeQuery}s, InvenioQueryParser tries to detect date values, e.g.
+ * <tt>date:[6/1/2005 TO 6/4/2005]</tt> produces a range query that searches
+ * for "date" fields between 2005-06-01 and 2005-06-04. Note that the format
+ * of the accepted input depends on {@link #setLocale(Locale) the locale}.
+ * By default a date is converted into a search term using the deprecated
+ * {@link DateField} for compatibility reasons.
+ * To use the new {@link DateTools} to convert dates, a
+ * {@link org.apache.lucene.document.DateTools.Resolution} has to be set.
+ * </p>
+ * <p>
+ * The date resolution that shall be used for RangeQueries can be set
+ * using {@link #setDateResolution(DateTools.Resolution)}
+ * or {@link #setDateResolution(String, DateTools.Resolution)}. The former
+ * sets the default date resolution for all fields, whereas the latter can
+ * be used to set field specific date resolutions. Field specific date
+ * resolutions take, if set, precedence over the default date resolution.
+ * </p>
+ * <p>
+ * If you use neither {@link DateField} nor {@link DateTools} in your
+ * index, you can create your own
+ * query parser that inherits InvenioQueryParser and overwrites
+ * {@link #getRangeQuery(String, String, String, boolean)} to
+ * use a different method for date conversion.
+ * </p>
+ *
+ * <p>Note that InvenioQueryParser is <em>not</em> thread-safe.</p>
+ *
+ * <p><b>NOTE</b>: there is a new InvenioQueryParser in contrib, which matches
+ * the same syntax as this class, but is more modular,
+ * enabling substantial customization to how a query is created.
+ *
+ * <p><b>NOTE</b>: You must specify the required {@link Version}
+ * compatibility when creating InvenioQueryParser:
+ * <ul>
+ *   <li> As of 2.9, {@link #setEnablePositionIncrements} is true by default.
+ * </ul>
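+ *
+ * <p>A date-range sketch using an explicit resolution (<code>qp</code> is an
+ * InvenioQueryParser instance and "date" an illustrative field name):
+ * <pre>
+ *   qp.setDateResolution("date", DateTools.Resolution.DAY);
+ *   Query q = qp.parse("date:[6/1/2005 TO 6/4/2005]");
+ * </pre>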
+ */ +public class InvenioQueryParser { + + private static final int CONJ_NONE = 0; + private static final int CONJ_AND = 1; + private static final int CONJ_OR = 2; + + private static final int MOD_NONE = 0; + private static final int MOD_NOT = 10; + private static final int MOD_REQ = 11; + private static final int MOD_SECOND = 12; + + // make it possible to call setDefaultOperator() without accessing + // the nested class: + /** Alternative form of InvenioQueryParser.Operator.AND */ + public static final Operator AND_OPERATOR = Operator.AND; + /** Alternative form of InvenioQueryParser.Operator.OR */ + public static final Operator OR_OPERATOR = Operator.OR; + + /** The actual operator that parser uses to combine query terms */ + private Operator operator = OR_OPERATOR; + + boolean lowercaseExpandedTerms = true; + MultiTermQuery.RewriteMethod multiTermRewriteMethod = MultiTermQuery.CONSTANT_SCORE_AUTO_REWRITE_DEFAULT; + boolean allowLeadingWildcard = false; + boolean enablePositionIncrements = true; + + Analyzer analyzer; + String field; + int phraseSlop = 0; + float fuzzyMinSim = FuzzyQuery.defaultMinSimilarity; + int fuzzyPrefixLength = FuzzyQuery.defaultPrefixLength; + Locale locale = Locale.getDefault(); + + // the default date resolution + DateTools.Resolution dateResolution = null; + // maps field names to date resolutions + Map fieldToDateResolution = null; + + // The collator to use when determining range inclusion, + // for use when constructing RangeQuerys. + Collator rangeCollator = null; + + /** The default operator for parsing queries. + * Use {@link InvenioQueryParser#setDefaultOperator} to change it. + */ + static public final class Operator extends Parameter { + private Operator(String name) { + super(name); + } + static public final Operator OR = new Operator("OR"); + static public final Operator AND = new Operator("AND"); + } + + + /** Constructs a query parser. + * @param f the default field for query terms. + * @param a used to find terms in the query text. + * @deprecated Use {@link #InvenioQueryParser(Version, String, Analyzer)} instead + */ + public InvenioQueryParser(String f, Analyzer a) { + this(Version.LUCENE_24, f, a); + } + + /** Constructs a query parser. + * @param matchVersion Lucene version to match. See {@link above) + * @param f the default field for query terms. + * @param a used to find terms in the query text. + */ + public InvenioQueryParser(Version matchVersion, String f, Analyzer a) { + this(new FastCharStream(new StringReader(""))); + analyzer = a; + field = f; + if (matchVersion.onOrAfter(Version.LUCENE_29)) { + enablePositionIncrements = true; + } else { + enablePositionIncrements = false; + } + } + + /** Parses a query string, returning a {@link org.apache.lucene.search.Query}. + * @param query the query string to be parsed. + * @throws ParseException if the parsing fails + */ + public Query parse(String query) throws ParseException { + ReInit(new FastCharStream(new StringReader(query))); + try { + // TopLevelQuery is a Query followed by the end-of-input (EOF) + Query res = TopLevelQuery(field); + return res!=null ? 
res : newBooleanQuery(false); + } + catch (ParseException tme) { + // rethrow to include the original query: + ParseException e = new ParseException("Cannot parse '" +query+ "': " + tme.getMessage()); + e.initCause(tme); + throw e; + } + catch (TokenMgrError tme) { + ParseException e = new ParseException("Cannot parse '" +query+ "': " + tme.getMessage()); + e.initCause(tme); + throw e; + } + catch (BooleanQuery.TooManyClauses tmc) { + ParseException e = new ParseException("Cannot parse '" +query+ "': too many boolean clauses"); + e.initCause(tmc); + throw e; + } + } + + /** + * @return Returns the analyzer. + */ + public Analyzer getAnalyzer() { + return analyzer; + } + + /** + * @return Returns the field. + */ + public String getField() { + return field; + } + + /** + * Get the minimal similarity for fuzzy queries. + */ + public float getFuzzyMinSim() { + return fuzzyMinSim; + } + + /** + * Set the minimum similarity for fuzzy queries. + * Default is 0.5f. + */ + public void setFuzzyMinSim(float fuzzyMinSim) { + this.fuzzyMinSim = fuzzyMinSim; + } + + /** + * Get the prefix length for fuzzy queries. + * @return Returns the fuzzyPrefixLength. + */ + public int getFuzzyPrefixLength() { + return fuzzyPrefixLength; + } + + /** + * Set the prefix length for fuzzy queries. Default is 0. + * @param fuzzyPrefixLength The fuzzyPrefixLength to set. + */ + public void setFuzzyPrefixLength(int fuzzyPrefixLength) { + this.fuzzyPrefixLength = fuzzyPrefixLength; + } + + /** + * Sets the default slop for phrases. If zero, then exact phrase matches + * are required. Default value is zero. + */ + public void setPhraseSlop(int phraseSlop) { + this.phraseSlop = phraseSlop; + } + + /** + * Gets the default slop for phrases. + */ + public int getPhraseSlop() { + return phraseSlop; + } + + + /** + * Set to true to allow leading wildcard characters. + *

+ * When set, * or ? are allowed as + * the first character of a PrefixQuery and WildcardQuery. + * Note that this can produce very slow + * queries on big indexes. + *
+ * Default: false. + */ + public void setAllowLeadingWildcard(boolean allowLeadingWildcard) { + this.allowLeadingWildcard = allowLeadingWildcard; + } + + /** + * @see #setAllowLeadingWildcard(boolean) + */ + public boolean getAllowLeadingWildcard() { + return allowLeadingWildcard; + } + + /** + * Set to true to enable position increments in result query. + *
+ * When set, result phrase and multi-phrase queries will + * be aware of position increments. + * Useful when e.g. a StopFilter increases the position increment of + * the token that follows an omitted token. + *
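+ * For illustration: if a StopFilter drops "the" from "chair the board",
+ * the resulting phrase query keeps the one-position gap between "chair"
+ * and "board" instead of treating them as adjacent.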
+ * Default: false. + */ + public void setEnablePositionIncrements(boolean enable) { + this.enablePositionIncrements = enable; + } + + /** + * @see #setEnablePositionIncrements(boolean) + */ + public boolean getEnablePositionIncrements() { + return enablePositionIncrements; + } + + /** + * Sets the boolean operator of the InvenioQueryParser. + * In default mode (OR_OPERATOR) terms without any modifiers + * are considered optional: for example capital of Hungary is equal to + * capital OR of OR Hungary.
+ * In AND_OPERATOR mode terms are considered to be in conjunction: the + * above mentioned query is parsed as capital AND of AND Hungary + */ + public void setDefaultOperator(Operator op) { + this.operator = op; + } + + + /** + * Gets implicit operator setting, which will be either AND_OPERATOR + * or OR_OPERATOR. + */ + public Operator getDefaultOperator() { + return operator; + } + + + /** + * Whether terms of wildcard, prefix, fuzzy and range queries are to be automatically + * lower-cased or not. Default is true. + */ + public void setLowercaseExpandedTerms(boolean lowercaseExpandedTerms) { + this.lowercaseExpandedTerms = lowercaseExpandedTerms; + } + + + /** + * @see #setLowercaseExpandedTerms(boolean) + */ + public boolean getLowercaseExpandedTerms() { + return lowercaseExpandedTerms; + } + + /** + * @deprecated Please use {@link #setMultiTermRewriteMethod} instead. + */ + public void setUseOldRangeQuery(boolean useOldRangeQuery) { + if (useOldRangeQuery) { + setMultiTermRewriteMethod(MultiTermQuery.SCORING_BOOLEAN_QUERY_REWRITE); + } else { + setMultiTermRewriteMethod(MultiTermQuery.CONSTANT_SCORE_AUTO_REWRITE_DEFAULT); + } + } + + + /** + * @deprecated Please use {@link #getMultiTermRewriteMethod} instead. + */ + public boolean getUseOldRangeQuery() { + if (getMultiTermRewriteMethod() == MultiTermQuery.SCORING_BOOLEAN_QUERY_REWRITE) { + return true; + } else { + return false; + } + } + + /** + * By default InvenioQueryParser uses {@link MultiTermQuery#CONSTANT_SCORE_AUTO_REWRITE_DEFAULT} + * when creating a PrefixQuery, WildcardQuery or RangeQuery. This implementation is generally preferable because it + * a) Runs faster b) Does not have the scarcity of terms unduly influence score + * c) avoids any "TooManyBooleanClauses" exception. + * However, if your application really needs to use the + * old-fashioned BooleanQuery expansion rewriting and the above + * points are not relevant then use this to change + * the rewrite method. + */ + public void setMultiTermRewriteMethod(MultiTermQuery.RewriteMethod method) { + multiTermRewriteMethod = method; + } + + + /** + * @see #setMultiTermRewriteMethod + */ + public MultiTermQuery.RewriteMethod getMultiTermRewriteMethod() { + return multiTermRewriteMethod; + } + + /** + * Set locale used by date range parsing. + */ + public void setLocale(Locale locale) { + this.locale = locale; + } + + /** + * Returns current locale, allowing access by subclasses. + */ + public Locale getLocale() { + return locale; + } + + /** + * Sets the default date resolution used by RangeQueries for fields for which no + * specific date resolutions has been set. Field specific resolutions can be set + * with {@link #setDateResolution(String, DateTools.Resolution)}. + * + * @param dateResolution the default date resolution to set + */ + public void setDateResolution(DateTools.Resolution dateResolution) { + this.dateResolution = dateResolution; + } + + /** + * Sets the date resolution used by RangeQueries for a specific field. 
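+ * For illustration (hypothetical field name): calling
+ * setDateResolution("modified", DateTools.Resolution.DAY) makes range
+ * queries on that field compare dates at day granularity.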
+ * + * @param fieldName field for which the date resolution is to be set + * @param dateResolution date resolution to set + */ + public void setDateResolution(String fieldName, DateTools.Resolution dateResolution) { + if (fieldName == null) { + throw new IllegalArgumentException("Field cannot be null."); + } + + if (fieldToDateResolution == null) { + // lazily initialize HashMap + fieldToDateResolution = new HashMap(); + } + + fieldToDateResolution.put(fieldName, dateResolution); + } + + /** + * Returns the date resolution that is used by RangeQueries for the given field. + * Returns null, if no default or field specific date resolution has been set + * for the given field. + * + */ + public DateTools.Resolution getDateResolution(String fieldName) { + if (fieldName == null) { + throw new IllegalArgumentException("Field cannot be null."); + } + + if (fieldToDateResolution == null) { + // no field specific date resolutions set; return default date resolution instead + return this.dateResolution; + } + + DateTools.Resolution resolution = (DateTools.Resolution) fieldToDateResolution.get(fieldName); + if (resolution == null) { + // no date resolutions set for the given field; return default date resolution instead + resolution = this.dateResolution; + } + + return resolution; + } + + /** + * Sets the collator used to determine index term inclusion in ranges + * for RangeQuerys. + *
+ * WARNING: Setting the rangeCollator to a non-null + * collator using this method will cause every single index Term in the + * Field referenced by lowerTerm and/or upperTerm to be examined. + * Depending on the number of index Terms in this Field, the operation could + * be very slow. + * + * @param rc the collator to use when constructing RangeQuerys + */ + public void setRangeCollator(Collator rc) { + rangeCollator = rc; + } + + /** + * @return the collator used to determine index term inclusion in ranges + * for RangeQuerys. + */ + public Collator getRangeCollator() { + return rangeCollator; + } + + /** + * @deprecated use {@link #addClause(List, int, int, Query)} instead. + */ + protected void addClause(Vector clauses, int conj, int mods, Query q) { + addClause((List) clauses, conj, mods, q); + } + + protected void addClause(List clauses, int conj, int mods, Query q) { + boolean required, prohibited; + + // If this term is introduced by AND, make the preceding term required, + // unless it's already prohibited + if (clauses.size() > 0 && conj == CONJ_AND) { + BooleanClause c = (BooleanClause) clauses.get(clauses.size()-1); + if (!c.isProhibited()) + c.setOccur(BooleanClause.Occur.MUST); + } + + if (clauses.size() > 0 && operator == AND_OPERATOR && conj == CONJ_OR) { + // If this term is introduced by OR, make the preceding term optional, + // unless it's prohibited (that means we leave -a OR b but +a OR b-->a OR b) + // notice if the input is a OR b, first term is parsed as required; without + // this modification a OR b would parsed as +a OR b + BooleanClause c = (BooleanClause) clauses.get(clauses.size()-1); + if (!c.isProhibited()) + c.setOccur(BooleanClause.Occur.SHOULD); + } + + // We might have been passed a null query; the term might have been + // filtered away by the analyzer. + if (q == null) + return; + + if (operator == OR_OPERATOR) { + // We set REQUIRED if we're introduced by AND or +; PROHIBITED if + // introduced by NOT or -; make sure not to set both. 
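+ // In short, for the default OR mode: NOT/- yields MUST_NOT, + or a
+ // term preceded by AND yields MUST, and a bare term yields SHOULD.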
+ prohibited = (mods == MOD_NOT); + required = (mods == MOD_REQ); + if (conj == CONJ_AND && !prohibited) { + required = true; + } + } else { + // We set PROHIBITED if we're introduced by NOT or -; We set REQUIRED + // if not PROHIBITED and not introduced by OR + prohibited = (mods == MOD_NOT); + required = (!prohibited && conj != CONJ_OR); + } + if (required && !prohibited) + clauses.add(newBooleanClause(q, BooleanClause.Occur.MUST)); + else if (!required && !prohibited) + clauses.add(newBooleanClause(q, BooleanClause.Occur.SHOULD)); + else if (!required && prohibited) + clauses.add(newBooleanClause(q, BooleanClause.Occur.MUST_NOT)); + else + throw new RuntimeException("Clause cannot be both required and prohibited"); + } + + + /** + * @exception ParseException throw in overridden method to disallow + */ + protected Query getFieldQuery(String field, String queryText) throws ParseException { + // Use the analyzer to get all the tokens, and then build a TermQuery, + // PhraseQuery, or nothing based on the term count + + TokenStream source; + try { + source = analyzer.reusableTokenStream(field, new StringReader(queryText)); + source.reset(); + } catch (IOException e) { + source = analyzer.tokenStream(field, new StringReader(queryText)); + } + CachingTokenFilter buffer = new CachingTokenFilter(source); + TermAttribute termAtt = null; + PositionIncrementAttribute posIncrAtt = null; + int numTokens = 0; + + boolean success = false; + try { + buffer.reset(); + success = true; + } catch (IOException e) { + // success==false if we hit an exception + } + if (success) { + if (buffer.hasAttribute(TermAttribute.class)) { + termAtt = (TermAttribute) buffer.getAttribute(TermAttribute.class); + } + if (buffer.hasAttribute(PositionIncrementAttribute.class)) { + posIncrAtt = (PositionIncrementAttribute) buffer.getAttribute(PositionIncrementAttribute.class); + } + } + + int positionCount = 0; + boolean severalTokensAtSamePosition = false; + + boolean hasMoreTokens = false; + if (termAtt != null) { + try { + hasMoreTokens = buffer.incrementToken(); + while (hasMoreTokens) { + numTokens++; + int positionIncrement = (posIncrAtt != null) ? 
posIncrAtt.getPositionIncrement() : 1; + if (positionIncrement != 0) { + positionCount += positionIncrement; + } else { + severalTokensAtSamePosition = true; + } + hasMoreTokens = buffer.incrementToken(); + } + } catch (IOException e) { + // ignore + } + } + try { + // rewind the buffer stream + buffer.reset(); + + // close original stream - all tokens buffered + source.close(); + } + catch (IOException e) { + // ignore + } + + if (numTokens == 0) + return null; + else if (numTokens == 1) { + String term = null; + try { + boolean hasNext = buffer.incrementToken(); + assert hasNext == true; + term = termAtt.term(); + } catch (IOException e) { + // safe to ignore, because we know the number of tokens + } + return newTermQuery(new Term(field, term)); + } else { + if (severalTokensAtSamePosition) { + if (positionCount == 1) { + // no phrase query: + BooleanQuery q = newBooleanQuery(true); + for (int i = 0; i < numTokens; i++) { + String term = null; + try { + boolean hasNext = buffer.incrementToken(); + assert hasNext == true; + term = termAtt.term(); + } catch (IOException e) { + // safe to ignore, because we know the number of tokens + } + + Query currentQuery = newTermQuery( + new Term(field, term)); + q.add(currentQuery, BooleanClause.Occur.SHOULD); + } + return q; + } + else { + // phrase query: + MultiPhraseQuery mpq = newMultiPhraseQuery(); + mpq.setSlop(phraseSlop); + List multiTerms = new ArrayList(); + int position = -1; + for (int i = 0; i < numTokens; i++) { + String term = null; + int positionIncrement = 1; + try { + boolean hasNext = buffer.incrementToken(); + assert hasNext == true; + term = termAtt.term(); + if (posIncrAtt != null) { + positionIncrement = posIncrAtt.getPositionIncrement(); + } + } catch (IOException e) { + // safe to ignore, because we know the number of tokens + } + + if (positionIncrement > 0 && multiTerms.size() > 0) { + if (enablePositionIncrements) { + mpq.add((Term[])multiTerms.toArray(new Term[0]),position); + } else { + mpq.add((Term[])multiTerms.toArray(new Term[0])); + } + multiTerms.clear(); + } + position += positionIncrement; + multiTerms.add(new Term(field, term)); + } + if (enablePositionIncrements) { + mpq.add((Term[])multiTerms.toArray(new Term[0]),position); + } else { + mpq.add((Term[])multiTerms.toArray(new Term[0])); + } + return mpq; + } + } + else { + PhraseQuery pq = newPhraseQuery(); + pq.setSlop(phraseSlop); + int position = -1; + + + for (int i = 0; i < numTokens; i++) { + String term = null; + int positionIncrement = 1; + + try { + boolean hasNext = buffer.incrementToken(); + assert hasNext == true; + term = termAtt.term(); + if (posIncrAtt != null) { + positionIncrement = posIncrAtt.getPositionIncrement(); + } + } catch (IOException e) { + // safe to ignore, because we know the number of tokens + } + + if (enablePositionIncrements) { + position += positionIncrement; + pq.add(new Term(field, term),position); + } else { + pq.add(new Term(field, term)); + } + } + return pq; + } + } + } + + + + /** + * Base implementation delegates to {@link #getFieldQuery(String,String)}. + * This method may be overridden, for example, to return + * a SpanNearQuery instead of a PhraseQuery. 
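+ * For illustration: the query "solar wind"~3 reaches this method with
+ * slop 3, which is then applied to the resulting (Multi)PhraseQuery.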
+ * + * @exception ParseException throw in overridden method to disallow + */ + protected Query getFieldQuery(String field, String queryText, int slop) + throws ParseException { + Query query = getFieldQuery(field, queryText); + + if (query instanceof PhraseQuery) { + ((PhraseQuery) query).setSlop(slop); + } + if (query instanceof MultiPhraseQuery) { + ((MultiPhraseQuery) query).setSlop(slop); + } + + return query; + } + + + /** + * @exception ParseException throw in overridden method to disallow + */ + protected Query getRangeQuery(String field, + String part1, + String part2, + boolean inclusive) throws ParseException + { + if (lowercaseExpandedTerms) { + part1 = part1.toLowerCase(); + part2 = part2.toLowerCase(); + } + try { + DateFormat df = DateFormat.getDateInstance(DateFormat.SHORT, locale); + df.setLenient(true); + Date d1 = df.parse(part1); + Date d2 = df.parse(part2); + if (inclusive) { + // The user can only specify the date, not the time, so make sure + // the time is set to the latest possible time of that date to really + // include all documents: + Calendar cal = Calendar.getInstance(locale); + cal.setTime(d2); + cal.set(Calendar.HOUR_OF_DAY, 23); + cal.set(Calendar.MINUTE, 59); + cal.set(Calendar.SECOND, 59); + cal.set(Calendar.MILLISECOND, 999); + d2 = cal.getTime(); + } + DateTools.Resolution resolution = getDateResolution(field); + if (resolution == null) { + // no default or field specific date resolution has been set, + // use deprecated DateField to maintain compatibility with + // pre-1.9 Lucene versions. + part1 = DateField.dateToString(d1); + part2 = DateField.dateToString(d2); + } else { + part1 = DateTools.dateToString(d1, resolution); + part2 = DateTools.dateToString(d2, resolution); + } + } + catch (Exception e) { } + + return newRangeQuery(field, part1, part2, inclusive); + } + + /** + * Builds a new BooleanQuery instance + * @param disableCoord disable coord + * @return new BooleanQuery instance + */ + protected BooleanQuery newBooleanQuery(boolean disableCoord) { + return new BooleanQuery(disableCoord); + } + + /** + * Builds a new BooleanClause instance + * @param q sub query + * @param occur how this clause should occur when matching documents + * @return new BooleanClause instance + */ + protected BooleanClause newBooleanClause(Query q, BooleanClause.Occur occur) { + return new BooleanClause(q, occur); + } + + /** + * Builds a new TermQuery instance + * @param term term + * @return new TermQuery instance + */ + protected Query newTermQuery(Term term){ + return new TermQuery(term); + } + + /** + * Builds a new PhraseQuery instance + * @return new PhraseQuery instance + */ + protected PhraseQuery newPhraseQuery(){ + return new PhraseQuery(); + } + + /** + * Builds a new MultiPhraseQuery instance + * @return new MultiPhraseQuery instance + */ + protected MultiPhraseQuery newMultiPhraseQuery(){ + return new MultiPhraseQuery(); + } + + /** + * Builds a new PrefixQuery instance + * @param prefix Prefix term + * @return new PrefixQuery instance + */ + protected Query newPrefixQuery(Term prefix){ + PrefixQuery query = new PrefixQuery(prefix); + query.setRewriteMethod(multiTermRewriteMethod); + return query; + } + + /** + * Builds a new FuzzyQuery instance + * @param term Term + * @param minimumSimilarity minimum similarity + * @param prefixLength prefix length + * @return new FuzzyQuery Instance + */ + protected Query newFuzzyQuery(Term term, float minimumSimilarity, int prefixLength) { + // FuzzyQuery doesn't yet allow constant score rewrite + return new 
FuzzyQuery(term,minimumSimilarity,prefixLength); + } + + /** + * Builds a new TermRangeQuery instance + * @param field Field + * @param part1 min + * @param part2 max + * @param inclusive true if range is inclusive + * @return new TermRangeQuery instance + */ + protected Query newRangeQuery(String field, String part1, String part2, boolean inclusive) { + final TermRangeQuery query = new TermRangeQuery(field, part1, part2, inclusive, inclusive, rangeCollator); + query.setRewriteMethod(multiTermRewriteMethod); + return query; + } + + /** + * Builds a new MatchAllDocsQuery instance + * @return new MatchAllDocsQuery instance + */ + protected Query newMatchAllDocsQuery() { + return new MatchAllDocsQuery(); + } + + /** + * Builds a new WildcardQuery instance + * @param t wildcard term + * @return new WildcardQuery instance + */ + protected Query newWildcardQuery(Term t) { + WildcardQuery query = new WildcardQuery(t); + query.setRewriteMethod(multiTermRewriteMethod); + return query; + } + + /** + * Factory method for generating query, given a set of clauses. + * By default creates a boolean query composed of clauses passed in. + * + * Can be overridden by extending classes, to modify query being + * returned. + * + * @param clauses List that contains {@link BooleanClause} instances + * to join. + * + * @return Resulting {@link Query} object. + * @exception ParseException throw in overridden method to disallow + * @deprecated use {@link #getBooleanQuery(List)} instead + */ + protected Query getBooleanQuery(Vector clauses) throws ParseException { + return getBooleanQuery((List) clauses, false); + } + + /** + * Factory method for generating query, given a set of clauses. + * By default creates a boolean query composed of clauses passed in. + * + * Can be overridden by extending classes, to modify query being + * returned. + * + * @param clauses List that contains {@link BooleanClause} instances + * to join. + * + * @return Resulting {@link Query} object. + * @exception ParseException throw in overridden method to disallow + */ + protected Query getBooleanQuery(List clauses) throws ParseException { + return getBooleanQuery(clauses, false); + } + + /** + * Factory method for generating query, given a set of clauses. + * By default creates a boolean query composed of clauses passed in. + * + * Can be overridden by extending classes, to modify query being + * returned. + * + * @param clauses List that contains {@link BooleanClause} instances + * to join. + * @param disableCoord true if coord scoring should be disabled. + * + * @return Resulting {@link Query} object. + * @exception ParseException throw in overridden method to disallow + * @deprecated use {@link #getBooleanQuery(List, boolean)} instead + */ + protected Query getBooleanQuery(Vector clauses, boolean disableCoord) + throws ParseException + { + return getBooleanQuery((List) clauses, disableCoord); + } + + /** + * Factory method for generating query, given a set of clauses. + * By default creates a boolean query composed of clauses passed in. + * + * Can be overridden by extending classes, to modify query being + * returned. + * + * @param clauses List that contains {@link BooleanClause} instances + * to join. + * @param disableCoord true if coord scoring should be disabled. + * + * @return Resulting {@link Query} object. 
+ * @exception ParseException throw in overridden method to disallow + */ + protected Query getBooleanQuery(List clauses, boolean disableCoord) + throws ParseException + { + if (clauses.size()==0) { + return null; // all clause words were filtered away by the analyzer. + } + BooleanQuery query = newBooleanQuery(disableCoord); + for (int i = 0; i < clauses.size(); i++) { + query.add((BooleanClause)clauses.get(i)); + } + return query; + } + + /** + * Factory method for generating a query. Called when parser + * parses an input term token that contains one or more wildcard + * characters (? and *), but is not a prefix term token (one + * that has just a single * character at the end) + *
+ * Depending on settings, prefix term may be lower-cased + * automatically. It will not go through the default Analyzer, + * however, since normal Analyzers are unlikely to work properly + * with wildcard templates. + *
+ * Can be overridden by extending classes, to provide custom handling for + * wildcard queries, which may be necessary due to missing analyzer calls. + * + * @param field Name of the field query will use. + * @param termStr Term token that contains one or more wild card + * characters (? or *), but is not simple prefix term + * + * @return Resulting {@link Query} built for the term + * @exception ParseException throw in overridden method to disallow + */ + protected Query getWildcardQuery(String field, String termStr) throws ParseException + { + if ("*".equals(field)) { + if ("*".equals(termStr)) return newMatchAllDocsQuery(); + } + if (!allowLeadingWildcard && (termStr.startsWith("*") || termStr.startsWith("?"))) + throw new ParseException("'*' or '?' not allowed as first character in WildcardQuery"); + if (lowercaseExpandedTerms) { + termStr = termStr.toLowerCase(); + } + Term t = new Term(field, termStr); + return newWildcardQuery(t); + } + + /** + * Factory method for generating a query (similar to + * {@link #getWildcardQuery}). Called when parser parses an input term + * token that uses prefix notation; that is, contains a single '*' wildcard + * character as its last character. Since this is a special case + * of generic wildcard term, and such a query can be optimized easily, + * this usually results in a different query object. + *
+ * Depending on settings, a prefix term may be lower-cased + * automatically. It will not go through the default Analyzer, + * however, since normal Analyzers are unlikely to work properly + * with wildcard templates. + *
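+ * For illustration: the input app* is handed to this method as the
+ * term text "app" (the trailing '*' already stripped) and, by default,
+ * becomes a PrefixQuery.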
+ * Can be overridden by extending classes, to provide custom handling for + * wild card queries, which may be necessary due to missing analyzer calls. + * + * @param field Name of the field query will use. + * @param termStr Term token to use for building term for the query + * (without trailing '*' character!) + * + * @return Resulting {@link Query} built for the term + * @exception ParseException throw in overridden method to disallow + */ + protected Query getPrefixQuery(String field, String termStr) throws ParseException + { + if (!allowLeadingWildcard && termStr.startsWith("*")) + throw new ParseException("'*' not allowed as first character in PrefixQuery"); + if (lowercaseExpandedTerms) { + termStr = termStr.toLowerCase(); + } + Term t = new Term(field, termStr); + return newPrefixQuery(t); + } + + /** + * Factory method for generating a query (similar to + * {@link #getWildcardQuery}). Called when parser parses + * an input term token that has the fuzzy suffix (~) appended. + * + * @param field Name of the field query will use. + * @param termStr Term token to use for building term for the query + * + * @return Resulting {@link Query} built for the term + * @exception ParseException throw in overridden method to disallow + */ + protected Query getFuzzyQuery(String field, String termStr, float minSimilarity) throws ParseException + { + if (lowercaseExpandedTerms) { + termStr = termStr.toLowerCase(); + } + Term t = new Term(field, termStr); + return newFuzzyQuery(t, minSimilarity, fuzzyPrefixLength); + } + + /** + * Returns a String where the escape char has been + * removed, or kept only once if there was a double escape. + * + * Supports escaped unicode characters, e. g. translates + * \\u0041 to A. + * + */ + private String discardEscapeChar(String input) throws ParseException { + // Create char array to hold unescaped char sequence + char[] output = new char[input.length()]; + + // The length of the output can be less than the input + // due to discarded escape chars. This variable holds + // the actual length of the output + int length = 0; + + // We remember whether the last processed character was + // an escape character + boolean lastCharWasEscapeChar = false; + + // The multiplier the current unicode digit must be multiplied with. + // E. g. the first digit must be multiplied with 16^3, the second with 16^2... 
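+ // Worked example: for the input \\u0041 the four hex digits are
+ // weighted 16^3, 16^2, 16^1 and 16^0, giving 0 + 0 + 64 + 1 = 65,
+ // i.e. the character 'A'.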
+ int codePointMultiplier = 0; + + // Used to calculate the codepoint of the escaped unicode character + int codePoint = 0; + + for (int i = 0; i < input.length(); i++) { + char curChar = input.charAt(i); + if (codePointMultiplier > 0) { + codePoint += hexToInt(curChar) * codePointMultiplier; + codePointMultiplier >>>= 4; + if (codePointMultiplier == 0) { + output[length++] = (char)codePoint; + codePoint = 0; + } + } else if (lastCharWasEscapeChar) { + if (curChar == 'u') { + // found an escaped unicode character + codePointMultiplier = 16 * 16 * 16; + } else { + // this character was escaped + output[length] = curChar; + length++; + } + lastCharWasEscapeChar = false; + } else { + if (curChar == '\\') { + lastCharWasEscapeChar = true; + } else { + output[length] = curChar; + length++; + } + } + } + + if (codePointMultiplier > 0) { + throw new ParseException("Truncated unicode escape sequence."); + } + + if (lastCharWasEscapeChar) { + throw new ParseException("Term can not end with escape character."); + } + + return new String(output, 0, length); + } + + /** Returns the numeric value of the hexadecimal character */ + private static final int hexToInt(char c) throws ParseException { + if ('0' <= c && c <= '9') { + return c - '0'; + } else if ('a' <= c && c <= 'f'){ + return c - 'a' + 10; + } else if ('A' <= c && c <= 'F') { + return c - 'A' + 10; + } else { + throw new ParseException("None-hex character in unicode escape sequence: " + c); + } + } + + /** + * Returns a String where those characters that InvenioQueryParser + * expects to be escaped are escaped by a preceding \. + */ + public static String escape(String s) { + StringBuffer sb = new StringBuffer(); + for (int i = 0; i < s.length(); i++) { + char c = s.charAt(i); + // These characters are part of the query syntax and must be escaped + if (c == '\\' || c == '+' || c == '-' || c == '!' || c == '(' || c == ')' || c == ':' + || c == '^' || c == '[' || c == ']' || c == '\"' || c == '{' || c == '}' || c == '~' + || c == '*' || c == '?' || c == '|' || c == '&') { + sb.append('\\'); + } + sb.append(c); + } + return sb.toString(); + } + + /** + * Command line tool to test InvenioQueryParser, using {@link org.apache.lucene.analysis.SimpleAnalyzer}. + * Usage:
+ * java org.apache.lucene.queryParser.InvenioQueryParser <input> + */ + public static void main(String[] args) throws Exception { + if (args.length == 0) { + System.out.println("Usage: java org.apache.lucene.queryParser.InvenioQueryParser "); + System.exit(0); + } + InvenioQueryParser qp = new InvenioQueryParser(Version.LUCENE_CURRENT, "field", + new org.apache.lucene.analysis.SimpleAnalyzer()); + Query q = qp.parse(args[0]); + System.out.println(q.toString("field")); + } +} + +PARSER_END(InvenioQueryParser) + +/* ***************** */ +/* Token Definitions */ +/* ***************** */ + +<*> TOKEN : { + <#_NUM_CHAR: ["0"-"9"] > +// every character that follows a backslash is considered as an escaped character +| <#_ESCAPED_CHAR: "\\" ~[] > +| <#_TERM_START_CHAR: ( ~[ " ", "\t", "\n", "\r", "\u3000", "+", "-", "!", "(", ")", "^", ":", + "[", "]", "\"", "{", "}", "~", "*", "?", "\\" ] + | <_ESCAPED_CHAR> ) > +| <#_TERM_CHAR: ( <_TERM_START_CHAR> | <_ESCAPED_CHAR> | "-" | "+" ) > +| <#_WHITESPACE: ( " " | "\t" | "\n" | "\r" | "\u3000") > +| <#_QUOTED_CHAR: ( ~[ "\"", "\\" ] | <_ESCAPED_CHAR> ) > +| <#_SECOND_OPERATOR: ( "refersto:" | "citedby:" | "cited:" ) > +| <#_OTHER: ( ~["/"]) > +} + + SKIP : { + < <_WHITESPACE>> +} + + TOKEN : { + +| +| +| +| +| +| +| +| +| : Boost +| )* "\""> +| )* "'"> +| (<_TERM_CHAR>)* > +| )+ ( "." (<_NUM_CHAR>)+ )? )? > +| (<_TERM_CHAR>)* "*" ) > +| | [ "*", "?" ]) (<_TERM_CHAR> | ( [ "*", "?" ] ))* > +| : RangeIn +| : RangeEx +//| > +| | <_ESCAPED_CHAR> )* ("/" | "$")) > +} + + TOKEN : { +)+ ( "." (<_NUM_CHAR>)+ )? > : DEFAULT +} + + TOKEN : { + +| : DEFAULT +| +| +} + + TOKEN : { + +| : DEFAULT +| +| +} + +// * Query ::= ( Clause )* +// * Clause ::= ["+", "-"] [ ":"] ( | "(" Query ")" ) + +int Conjunction() : { + int ret = CONJ_NONE; +} +{ + [ + { ret = CONJ_AND; } + | { ret = CONJ_OR; } + ] + { return ret; } +} + +int Modifiers() : { + int ret = MOD_NONE; +} +{ + [ + { ret = MOD_REQ; } + | { ret = MOD_NOT; } + | { ret = MOD_NOT; } + ] + { return ret; } +} + +// This makes sure that there is no garbage after the query string +Query TopLevelQuery(String field) : +{ + Query q; +} +{ + q=Query(field) + { + return q; + } +} + +Query Query(String field) : +{ + List clauses = new ArrayList(); + Query q, firstQuery=null; + int conj, mods; +} +{ + mods=Modifiers() q=Clause(field) + { + addClause(clauses, CONJ_NONE, mods, q); + if (mods == MOD_NONE) + firstQuery=q; + } + ( + conj=Conjunction() mods=Modifiers() q=Clause(field) + { addClause(clauses, conj, mods, q); } + )* + { + if (clauses.size() == 1 && firstQuery != null) + return firstQuery; + else { + return getBooleanQuery(clauses); + } + } +} + +Query Clause(String field) : { + Query q; + Token fieldToken=null, boost=null; +} +{ + [ + LOOKAHEAD(2) + ( + fieldToken= {field=discardEscapeChar(fieldToken.image);} + | {field="*";} + ) + ] + + ( + q=Term(field) + | q=Query(field) ( boost=)? 
+ + ) + { + if (boost != null) { + float f = (float)1.0; + try { + f = Float.valueOf(boost.image).floatValue(); + q.setBoost(f); + } catch (Exception ignored) { } + } + return q; + } +} + + +Query Term(String field) : { + Token term, boost=null, fuzzySlop=null, goop1, goop2; + boolean prefix = false; + boolean wildcard = false; + boolean fuzzy = false; + Query q; +} +{ + ( + ( + term= + | term= { wildcard=true; } + | term= { prefix=true; } + | term= { wildcard=true; } + | term= + | term= + //| term=< ( <_SECOND_OPERATOR> )+> + ) + [ fuzzySlop= { fuzzy=true; } ] + [ boost= [ fuzzySlop= { fuzzy=true; } ] ] + { + String termImage=discardEscapeChar(term.image); + if (wildcard) { + q = getWildcardQuery(field, termImage); + } else if (prefix) { + q = getPrefixQuery(field, + discardEscapeChar(term.image.substring + (0, term.image.length()-1))); + } else if (fuzzy) { + float fms = fuzzyMinSim; + try { + fms = Float.valueOf(fuzzySlop.image.substring(1)).floatValue(); + } catch (Exception ignored) { } + if(fms < 0.0f || fms > 1.0f){ + throw new ParseException("Minimum similarity for a FuzzyQuery has to be between 0.0f and 1.0f !"); + } + q = getFuzzyQuery(field, termImage,fms); + } else { + q = getFieldQuery(field, termImage); + } + } + | ( ( goop1=|goop1= ) + [ ] ( goop2=|goop2= ) + ) + [ boost= ] + { + if (goop1.kind == RANGEIN_QUOTED) { + goop1.image = goop1.image.substring(1, goop1.image.length()-1); + } + if (goop2.kind == RANGEIN_QUOTED) { + goop2.image = goop2.image.substring(1, goop2.image.length()-1); + } + q = getRangeQuery(field, discardEscapeChar(goop1.image), discardEscapeChar(goop2.image), true); + } + | ( ( goop1=|goop1= ) + [ ] ( goop2=|goop2= ) + ) + [ boost= ] + { + if (goop1.kind == RANGEEX_QUOTED) { + goop1.image = goop1.image.substring(1, goop1.image.length()-1); + } + if (goop2.kind == RANGEEX_QUOTED) { + goop2.image = goop2.image.substring(1, goop2.image.length()-1); + } + + q = getRangeQuery(field, discardEscapeChar(goop1.image), discardEscapeChar(goop2.image), false); + } + | term= + [ fuzzySlop= ] + [ boost= ] + { + int s = phraseSlop; + + if (fuzzySlop != null) { + try { + s = Float.valueOf(fuzzySlop.image.substring(1)).intValue(); + } + catch (Exception ignored) { } + } + q = getFieldQuery(field, discardEscapeChar(term.image.substring(1, term.image.length()-1)), s); + } + | term= + [ fuzzySlop= ] + [ boost= ] + { + int partialSlop = 2; + + if (fuzzySlop != null) { + try { + partialSlop = Float.valueOf(fuzzySlop.image.substring(1)).intValue(); + } + catch (Exception ignored) { + partialSlop = 2; + } + } + q = getFieldQuery(field, discardEscapeChar(term.image.substring(1, term.image.length()-1)), partialSlop); + } + + ) + { + if (boost != null) { + float f = (float) 1.0; + try { + f = Float.valueOf(boost.image).floatValue(); + } + catch (Exception ignored) { + /* Should this be handled somehow? (defaults to "no boost", if + * boost number is invalid) + */ + } + + // avoid boosting null queries, such as those caused by stop words + if (q != null) { + q.setBoost(f); + } + } + return q; + } +} diff --git a/src/java/org/apache/lucene/queryParser/InvenioQueryParserConstants.java b/src/java/org/apache/lucene/queryParser/InvenioQueryParserConstants.java new file mode 100644 index 000000000..c91e28bb1 --- /dev/null +++ b/src/java/org/apache/lucene/queryParser/InvenioQueryParserConstants.java @@ -0,0 +1,137 @@ +/* Generated By:JavaCC: Do not edit this line. InvenioQueryParserConstants.java */ +package org.apache.lucene.queryParser; + + +/** + * Token literal values and constants. 
+ * Generated by org.javacc.parser.OtherFilesGen#start() + */ +public interface InvenioQueryParserConstants { + + /** End of File. */ + int EOF = 0; + /** RegularExpression Id. */ + int _NUM_CHAR = 1; + /** RegularExpression Id. */ + int _ESCAPED_CHAR = 2; + /** RegularExpression Id. */ + int _TERM_START_CHAR = 3; + /** RegularExpression Id. */ + int _TERM_CHAR = 4; + /** RegularExpression Id. */ + int _WHITESPACE = 5; + /** RegularExpression Id. */ + int _QUOTED_CHAR = 6; + /** RegularExpression Id. */ + int _SECOND_OPERATOR = 7; + /** RegularExpression Id. */ + int _OTHER = 8; + /** RegularExpression Id. */ + int AND = 10; + /** RegularExpression Id. */ + int OR = 11; + /** RegularExpression Id. */ + int NOT = 12; + /** RegularExpression Id. */ + int PLUS = 13; + /** RegularExpression Id. */ + int MINUS = 14; + /** RegularExpression Id. */ + int LPAREN = 15; + /** RegularExpression Id. */ + int RPAREN = 16; + /** RegularExpression Id. */ + int COLON = 17; + /** RegularExpression Id. */ + int STAR = 18; + /** RegularExpression Id. */ + int CARAT = 19; + /** RegularExpression Id. */ + int QUOTED = 20; + /** RegularExpression Id. */ + int QUOTED_PARTIAL = 21; + /** RegularExpression Id. */ + int TERM = 22; + /** RegularExpression Id. */ + int FUZZY_SLOP = 23; + /** RegularExpression Id. */ + int PREFIXTERM = 24; + /** RegularExpression Id. */ + int WILDTERM = 25; + /** RegularExpression Id. */ + int RANGEIN_START = 26; + /** RegularExpression Id. */ + int RANGEEX_START = 27; + /** RegularExpression Id. */ + int REGEX_TERM = 28; + /** RegularExpression Id. */ + int NUMBER = 29; + /** RegularExpression Id. */ + int RANGEIN_TO = 30; + /** RegularExpression Id. */ + int RANGEIN_END = 31; + /** RegularExpression Id. */ + int RANGEIN_QUOTED = 32; + /** RegularExpression Id. */ + int RANGEIN_GOOP = 33; + /** RegularExpression Id. */ + int RANGEEX_TO = 34; + /** RegularExpression Id. */ + int RANGEEX_END = 35; + /** RegularExpression Id. */ + int RANGEEX_QUOTED = 36; + /** RegularExpression Id. */ + int RANGEEX_GOOP = 37; + + /** Lexical state. */ + int Boost = 0; + /** Lexical state. */ + int RangeEx = 1; + /** Lexical state. */ + int RangeIn = 2; + /** Lexical state. */ + int DEFAULT = 3; + + /** Literal token values. */ + String[] tokenImage = { + "", + "<_NUM_CHAR>", + "<_ESCAPED_CHAR>", + "<_TERM_START_CHAR>", + "<_TERM_CHAR>", + "<_WHITESPACE>", + "<_QUOTED_CHAR>", + "<_SECOND_OPERATOR>", + "<_OTHER>", + "", + "", + "", + "\"!\"", + "\"+\"", + "", + "\"(\"", + "\")\"", + "\":\"", + "\"*\"", + "\"^\"", + "", + "", + "", + "", + "", + "", + "\"[\"", + "\"{\"", + "", + "", + "\"TO\"", + "\"]\"", + "", + "", + "\"TO\"", + "\"}\"", + "", + "", + }; + +} diff --git a/src/java/org/apache/lucene/queryParser/InvenioQueryParserTokenManager.java b/src/java/org/apache/lucene/queryParser/InvenioQueryParserTokenManager.java new file mode 100644 index 000000000..473766562 --- /dev/null +++ b/src/java/org/apache/lucene/queryParser/InvenioQueryParserTokenManager.java @@ -0,0 +1,1372 @@ +/* Generated By:JavaCC: Do not edit this line. 
InvenioQueryParserTokenManager.java */ +package org.apache.lucene.queryParser; +import java.io.IOException; +import java.io.StringReader; +import java.text.Collator; +import java.text.DateFormat; +import java.util.ArrayList; +import java.util.Calendar; +import java.util.Date; +import java.util.HashMap; +import java.util.List; +import java.util.Locale; +import java.util.Map; +import java.util.Vector; +import org.apache.lucene.analysis.Analyzer; +import org.apache.lucene.analysis.CachingTokenFilter; +import org.apache.lucene.analysis.TokenStream; +import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute; +import org.apache.lucene.analysis.tokenattributes.TermAttribute; +import org.apache.lucene.document.DateField; +import org.apache.lucene.document.DateTools; +import org.apache.lucene.index.Term; +import org.apache.lucene.search.BooleanClause; +import org.apache.lucene.search.BooleanQuery; +import org.apache.lucene.search.FuzzyQuery; +import org.apache.lucene.search.MultiTermQuery; +import org.apache.lucene.search.MatchAllDocsQuery; +import org.apache.lucene.search.MultiPhraseQuery; +import org.apache.lucene.search.PhraseQuery; +import org.apache.lucene.search.PrefixQuery; +import org.apache.lucene.search.Query; +import org.apache.lucene.search.TermRangeQuery; +import org.apache.lucene.search.TermQuery; +import org.apache.lucene.search.WildcardQuery; +import org.apache.lucene.util.Parameter; +import org.apache.lucene.util.Version; + +/** Token Manager. */ +public class InvenioQueryParserTokenManager implements InvenioQueryParserConstants +{ + + /** Debug output. */ + public java.io.PrintStream debugStream = System.out; + /** Set debug output. */ + public void setDebugStream(java.io.PrintStream ds) { debugStream = ds; } +private final int jjStopStringLiteralDfa_3(int pos, long active0) +{ + switch (pos) + { + default : + return -1; + } +} +private final int jjStartNfa_3(int pos, long active0) +{ + return jjMoveNfa_3(jjStopStringLiteralDfa_3(pos, active0), pos + 1); +} +private int jjStopAtPos(int pos, int kind) +{ + jjmatchedKind = kind; + jjmatchedPos = pos; + return pos + 1; +} +private int jjMoveStringLiteralDfa0_3() +{ + switch(curChar) + { + case 33: + return jjStopAtPos(0, 12); + case 40: + return jjStopAtPos(0, 15); + case 41: + return jjStopAtPos(0, 16); + case 42: + return jjStartNfaWithStates_3(0, 18, 55); + case 43: + return jjStopAtPos(0, 13); + case 58: + return jjStopAtPos(0, 17); + case 91: + return jjStopAtPos(0, 26); + case 94: + return jjStopAtPos(0, 19); + case 123: + return jjStopAtPos(0, 27); + default : + return jjMoveNfa_3(0, 0); + } +} +private int jjStartNfaWithStates_3(int pos, int kind, int state) +{ + jjmatchedKind = kind; + jjmatchedPos = pos; + try { curChar = input_stream.readChar(); } + catch(java.io.IOException e) { return pos + 1; } + return jjMoveNfa_3(state, pos + 1); +} +static final long[] jjbitVec0 = { + 0x1L, 0x0L, 0x0L, 0x0L +}; +static final long[] jjbitVec1 = { + 0xfffffffffffffffeL, 0xffffffffffffffffL, 0xffffffffffffffffL, 0xffffffffffffffffL +}; +static final long[] jjbitVec3 = { + 0x0L, 0x0L, 0xffffffffffffffffL, 0xffffffffffffffffL +}; +static final long[] jjbitVec4 = { + 0xfffefffffffffffeL, 0xffffffffffffffffL, 0xffffffffffffffffL, 0xffffffffffffffffL +}; +private int jjMoveNfa_3(int startState, int curPos) +{ + int startsAt = 0; + jjnewStateCnt = 55; + int i = 1; + jjstateSet[0] = startState; + int kind = 0x7fffffff; + for (;;) + { + if (++jjround == 0x7fffffff) + ReInitRounds(); + if (curChar < 64) + { + long l = 1L << 
curChar; + do + { + switch(jjstateSet[--i]) + { + case 0: + if ((0xfbffd4f8ffffd9ffL & l) != 0L) + { + if (kind > 25) + kind = 25; + jjCheckNAddTwoStates(39, 40); + } + else if ((0x100002600L & l) != 0L) + { + if (kind > 9) + kind = 9; + } + else if (curChar == 34) + jjCheckNAddStates(0, 2); + else if (curChar == 45) + { + if (kind > 14) + kind = 14; + } + if ((0x7bffd0f8ffffd9ffL & l) != 0L) + { + if (kind > 22) + kind = 22; + jjCheckNAddStates(3, 7); + } + else if (curChar == 42) + { + if (kind > 24) + kind = 24; + } + if ((0x801000000000L & l) != 0L) + jjCheckNAddStates(8, 10); + else if (curChar == 39) + jjCheckNAddStates(11, 13); + else if (curChar == 38) + jjstateSet[jjnewStateCnt++] = 7; + break; + case 55: + case 39: + if ((0xfbfffcf8ffffd9ffL & l) == 0L) + break; + if (kind > 25) + kind = 25; + jjCheckNAddTwoStates(39, 40); + break; + case 7: + if (curChar == 38 && kind > 10) + kind = 10; + break; + case 8: + if (curChar == 38) + jjstateSet[jjnewStateCnt++] = 7; + break; + case 16: + if (curChar == 45 && kind > 14) + kind = 14; + break; + case 23: + if (curChar == 34) + jjCheckNAddStates(0, 2); + break; + case 24: + if ((0xfffffffbffffffffL & l) != 0L) + jjCheckNAddStates(0, 2); + break; + case 26: + jjCheckNAddStates(0, 2); + break; + case 27: + if (curChar == 34 && kind > 20) + kind = 20; + break; + case 28: + if (curChar == 39) + jjCheckNAddStates(11, 13); + break; + case 29: + if ((0xfffffffbffffffffL & l) != 0L) + jjCheckNAddStates(11, 13); + break; + case 31: + jjCheckNAddStates(11, 13); + break; + case 32: + if (curChar == 39 && kind > 21) + kind = 21; + break; + case 34: + if ((0x3ff000000000000L & l) == 0L) + break; + if (kind > 23) + kind = 23; + jjAddStates(14, 15); + break; + case 35: + if (curChar == 46) + jjCheckNAdd(36); + break; + case 36: + if ((0x3ff000000000000L & l) == 0L) + break; + if (kind > 23) + kind = 23; + jjCheckNAdd(36); + break; + case 37: + if (curChar == 42 && kind > 24) + kind = 24; + break; + case 38: + if ((0xfbffd4f8ffffd9ffL & l) == 0L) + break; + if (kind > 25) + kind = 25; + jjCheckNAddTwoStates(39, 40); + break; + case 41: + if (kind > 25) + kind = 25; + jjCheckNAddTwoStates(39, 40); + break; + case 42: + if ((0x801000000000L & l) != 0L) + jjCheckNAddStates(8, 10); + break; + case 43: + if ((0xffff7fffffffffffL & l) != 0L) + jjCheckNAddStates(8, 10); + break; + case 45: + jjCheckNAddStates(8, 10); + break; + case 46: + if ((0x801000000000L & l) != 0L && kind > 28) + kind = 28; + break; + case 47: + if ((0x7bffd0f8ffffd9ffL & l) == 0L) + break; + if (kind > 22) + kind = 22; + jjCheckNAddStates(3, 7); + break; + case 48: + if ((0x7bfff8f8ffffd9ffL & l) == 0L) + break; + if (kind > 22) + kind = 22; + jjCheckNAddTwoStates(48, 49); + break; + case 50: + if (kind > 22) + kind = 22; + jjCheckNAddTwoStates(48, 49); + break; + case 51: + if ((0x7bfff8f8ffffd9ffL & l) != 0L) + jjCheckNAddStates(16, 18); + break; + case 53: + jjCheckNAddStates(16, 18); + break; + default : break; + } + } while(i != startsAt); + } + else if (curChar < 128) + { + long l = 1L << (curChar & 077); + do + { + switch(jjstateSet[--i]) + { + case 0: + if ((0x97ffffff87ffffffL & l) != 0L) + { + if (kind > 22) + kind = 22; + jjCheckNAddStates(3, 7); + } + else if (curChar == 92) + jjCheckNAddStates(19, 21); + else if (curChar == 126) + { + if (kind > 23) + kind = 23; + jjstateSet[jjnewStateCnt++] = 34; + } + if ((0x97ffffff87ffffffL & l) != 0L) + { + if (kind > 25) + kind = 25; + jjCheckNAddTwoStates(39, 40); + } + if (curChar == 110) + jjstateSet[jjnewStateCnt++] = 21; + else if 
(curChar == 78) + jjstateSet[jjnewStateCnt++] = 18; + else if (curChar == 124) + { + if (kind > 11) + kind = 11; + } + else if (curChar == 111) + jjstateSet[jjnewStateCnt++] = 13; + else if (curChar == 79) + jjstateSet[jjnewStateCnt++] = 9; + else if (curChar == 97) + jjstateSet[jjnewStateCnt++] = 5; + else if (curChar == 65) + jjstateSet[jjnewStateCnt++] = 2; + if (curChar == 124) + jjstateSet[jjnewStateCnt++] = 11; + break; + case 55: + if ((0x97ffffff87ffffffL & l) != 0L) + { + if (kind > 25) + kind = 25; + jjCheckNAddTwoStates(39, 40); + } + else if (curChar == 92) + jjCheckNAddTwoStates(41, 41); + break; + case 1: + if (curChar == 68 && kind > 10) + kind = 10; + break; + case 2: + if (curChar == 78) + jjstateSet[jjnewStateCnt++] = 1; + break; + case 3: + if (curChar == 65) + jjstateSet[jjnewStateCnt++] = 2; + break; + case 4: + if (curChar == 100 && kind > 10) + kind = 10; + break; + case 5: + if (curChar == 110) + jjstateSet[jjnewStateCnt++] = 4; + break; + case 6: + if (curChar == 97) + jjstateSet[jjnewStateCnt++] = 5; + break; + case 9: + if (curChar == 82 && kind > 11) + kind = 11; + break; + case 10: + if (curChar == 79) + jjstateSet[jjnewStateCnt++] = 9; + break; + case 11: + if (curChar == 124 && kind > 11) + kind = 11; + break; + case 12: + if (curChar == 124) + jjstateSet[jjnewStateCnt++] = 11; + break; + case 13: + if (curChar == 114 && kind > 11) + kind = 11; + break; + case 14: + if (curChar == 111) + jjstateSet[jjnewStateCnt++] = 13; + break; + case 15: + if (curChar == 124 && kind > 11) + kind = 11; + break; + case 17: + if (curChar == 84 && kind > 14) + kind = 14; + break; + case 18: + if (curChar == 79) + jjstateSet[jjnewStateCnt++] = 17; + break; + case 19: + if (curChar == 78) + jjstateSet[jjnewStateCnt++] = 18; + break; + case 20: + if (curChar == 116 && kind > 14) + kind = 14; + break; + case 21: + if (curChar == 111) + jjstateSet[jjnewStateCnt++] = 20; + break; + case 22: + if (curChar == 110) + jjstateSet[jjnewStateCnt++] = 21; + break; + case 24: + if ((0xffffffffefffffffL & l) != 0L) + jjCheckNAddStates(0, 2); + break; + case 25: + if (curChar == 92) + jjstateSet[jjnewStateCnt++] = 26; + break; + case 26: + jjCheckNAddStates(0, 2); + break; + case 29: + if ((0xffffffffefffffffL & l) != 0L) + jjCheckNAddStates(11, 13); + break; + case 30: + if (curChar == 92) + jjstateSet[jjnewStateCnt++] = 31; + break; + case 31: + jjCheckNAddStates(11, 13); + break; + case 33: + if (curChar != 126) + break; + if (kind > 23) + kind = 23; + jjstateSet[jjnewStateCnt++] = 34; + break; + case 38: + if ((0x97ffffff87ffffffL & l) == 0L) + break; + if (kind > 25) + kind = 25; + jjCheckNAddTwoStates(39, 40); + break; + case 39: + if ((0x97ffffff87ffffffL & l) == 0L) + break; + if (kind > 25) + kind = 25; + jjCheckNAddTwoStates(39, 40); + break; + case 40: + if (curChar == 92) + jjCheckNAddTwoStates(41, 41); + break; + case 41: + if (kind > 25) + kind = 25; + jjCheckNAddTwoStates(39, 40); + break; + case 43: + case 45: + jjCheckNAddStates(8, 10); + break; + case 44: + if (curChar == 92) + jjstateSet[jjnewStateCnt++] = 45; + break; + case 47: + if ((0x97ffffff87ffffffL & l) == 0L) + break; + if (kind > 22) + kind = 22; + jjCheckNAddStates(3, 7); + break; + case 48: + if ((0x97ffffff87ffffffL & l) == 0L) + break; + if (kind > 22) + kind = 22; + jjCheckNAddTwoStates(48, 49); + break; + case 49: + if (curChar == 92) + jjCheckNAddTwoStates(50, 50); + break; + case 50: + if (kind > 22) + kind = 22; + jjCheckNAddTwoStates(48, 49); + break; + case 51: + if ((0x97ffffff87ffffffL & l) != 0L) + 
jjCheckNAddStates(16, 18); + break; + case 52: + if (curChar == 92) + jjCheckNAddTwoStates(53, 53); + break; + case 53: + jjCheckNAddStates(16, 18); + break; + case 54: + if (curChar == 92) + jjCheckNAddStates(19, 21); + break; + default : break; + } + } while(i != startsAt); + } + else + { + int hiByte = (int)(curChar >> 8); + int i1 = hiByte >> 6; + long l1 = 1L << (hiByte & 077); + int i2 = (curChar & 0xff) >> 6; + long l2 = 1L << (curChar & 077); + do + { + switch(jjstateSet[--i]) + { + case 0: + if (jjCanMove_0(hiByte, i1, i2, l1, l2)) + { + if (kind > 9) + kind = 9; + } + if (jjCanMove_2(hiByte, i1, i2, l1, l2)) + { + if (kind > 25) + kind = 25; + jjCheckNAddTwoStates(39, 40); + } + if (jjCanMove_2(hiByte, i1, i2, l1, l2)) + { + if (kind > 22) + kind = 22; + jjCheckNAddStates(3, 7); + } + break; + case 55: + case 39: + if (!jjCanMove_2(hiByte, i1, i2, l1, l2)) + break; + if (kind > 25) + kind = 25; + jjCheckNAddTwoStates(39, 40); + break; + case 24: + case 26: + if (jjCanMove_1(hiByte, i1, i2, l1, l2)) + jjCheckNAddStates(0, 2); + break; + case 29: + case 31: + if (jjCanMove_1(hiByte, i1, i2, l1, l2)) + jjCheckNAddStates(11, 13); + break; + case 38: + if (!jjCanMove_2(hiByte, i1, i2, l1, l2)) + break; + if (kind > 25) + kind = 25; + jjCheckNAddTwoStates(39, 40); + break; + case 41: + if (!jjCanMove_1(hiByte, i1, i2, l1, l2)) + break; + if (kind > 25) + kind = 25; + jjCheckNAddTwoStates(39, 40); + break; + case 43: + case 45: + if (jjCanMove_1(hiByte, i1, i2, l1, l2)) + jjCheckNAddStates(8, 10); + break; + case 47: + if (!jjCanMove_2(hiByte, i1, i2, l1, l2)) + break; + if (kind > 22) + kind = 22; + jjCheckNAddStates(3, 7); + break; + case 48: + if (!jjCanMove_2(hiByte, i1, i2, l1, l2)) + break; + if (kind > 22) + kind = 22; + jjCheckNAddTwoStates(48, 49); + break; + case 50: + if (!jjCanMove_1(hiByte, i1, i2, l1, l2)) + break; + if (kind > 22) + kind = 22; + jjCheckNAddTwoStates(48, 49); + break; + case 51: + if (jjCanMove_2(hiByte, i1, i2, l1, l2)) + jjCheckNAddStates(16, 18); + break; + case 53: + if (jjCanMove_1(hiByte, i1, i2, l1, l2)) + jjCheckNAddStates(16, 18); + break; + default : break; + } + } while(i != startsAt); + } + if (kind != 0x7fffffff) + { + jjmatchedKind = kind; + jjmatchedPos = curPos; + kind = 0x7fffffff; + } + ++curPos; + if ((i = jjnewStateCnt) == (startsAt = 55 - (jjnewStateCnt = startsAt))) + return curPos; + try { curChar = input_stream.readChar(); } + catch(java.io.IOException e) { return curPos; } + } +} +private final int jjStopStringLiteralDfa_1(int pos, long active0) +{ + switch (pos) + { + case 0: + if ((active0 & 0x400000000L) != 0L) + { + jjmatchedKind = 37; + return 6; + } + return -1; + default : + return -1; + } +} +private final int jjStartNfa_1(int pos, long active0) +{ + return jjMoveNfa_1(jjStopStringLiteralDfa_1(pos, active0), pos + 1); +} +private int jjMoveStringLiteralDfa0_1() +{ + switch(curChar) + { + case 84: + return jjMoveStringLiteralDfa1_1(0x400000000L); + case 125: + return jjStopAtPos(0, 35); + default : + return jjMoveNfa_1(0, 0); + } +} +private int jjMoveStringLiteralDfa1_1(long active0) +{ + try { curChar = input_stream.readChar(); } + catch(java.io.IOException e) { + jjStopStringLiteralDfa_1(0, active0); + return 1; + } + switch(curChar) + { + case 79: + if ((active0 & 0x400000000L) != 0L) + return jjStartNfaWithStates_1(1, 34, 6); + break; + default : + break; + } + return jjStartNfa_1(0, active0); +} +private int jjStartNfaWithStates_1(int pos, int kind, int state) +{ + jjmatchedKind = kind; + jjmatchedPos = pos; + try { 
curChar = input_stream.readChar(); } + catch(java.io.IOException e) { return pos + 1; } + return jjMoveNfa_1(state, pos + 1); +} +private int jjMoveNfa_1(int startState, int curPos) +{ + int startsAt = 0; + jjnewStateCnt = 7; + int i = 1; + jjstateSet[0] = startState; + int kind = 0x7fffffff; + for (;;) + { + if (++jjround == 0x7fffffff) + ReInitRounds(); + if (curChar < 64) + { + long l = 1L << curChar; + do + { + switch(jjstateSet[--i]) + { + case 0: + if ((0xfffffffeffffffffL & l) != 0L) + { + if (kind > 37) + kind = 37; + jjCheckNAdd(6); + } + if ((0x100002600L & l) != 0L) + { + if (kind > 9) + kind = 9; + } + else if (curChar == 34) + jjCheckNAddTwoStates(2, 4); + break; + case 1: + if (curChar == 34) + jjCheckNAddTwoStates(2, 4); + break; + case 2: + if ((0xfffffffbffffffffL & l) != 0L) + jjCheckNAddStates(22, 24); + break; + case 3: + if (curChar == 34) + jjCheckNAddStates(22, 24); + break; + case 5: + if (curChar == 34 && kind > 36) + kind = 36; + break; + case 6: + if ((0xfffffffeffffffffL & l) == 0L) + break; + if (kind > 37) + kind = 37; + jjCheckNAdd(6); + break; + default : break; + } + } while(i != startsAt); + } + else if (curChar < 128) + { + long l = 1L << (curChar & 077); + do + { + switch(jjstateSet[--i]) + { + case 0: + case 6: + if ((0xdfffffffffffffffL & l) == 0L) + break; + if (kind > 37) + kind = 37; + jjCheckNAdd(6); + break; + case 2: + jjAddStates(22, 24); + break; + case 4: + if (curChar == 92) + jjstateSet[jjnewStateCnt++] = 3; + break; + default : break; + } + } while(i != startsAt); + } + else + { + int hiByte = (int)(curChar >> 8); + int i1 = hiByte >> 6; + long l1 = 1L << (hiByte & 077); + int i2 = (curChar & 0xff) >> 6; + long l2 = 1L << (curChar & 077); + do + { + switch(jjstateSet[--i]) + { + case 0: + if (jjCanMove_0(hiByte, i1, i2, l1, l2)) + { + if (kind > 9) + kind = 9; + } + if (jjCanMove_1(hiByte, i1, i2, l1, l2)) + { + if (kind > 37) + kind = 37; + jjCheckNAdd(6); + } + break; + case 2: + if (jjCanMove_1(hiByte, i1, i2, l1, l2)) + jjAddStates(22, 24); + break; + case 6: + if (!jjCanMove_1(hiByte, i1, i2, l1, l2)) + break; + if (kind > 37) + kind = 37; + jjCheckNAdd(6); + break; + default : break; + } + } while(i != startsAt); + } + if (kind != 0x7fffffff) + { + jjmatchedKind = kind; + jjmatchedPos = curPos; + kind = 0x7fffffff; + } + ++curPos; + if ((i = jjnewStateCnt) == (startsAt = 7 - (jjnewStateCnt = startsAt))) + return curPos; + try { curChar = input_stream.readChar(); } + catch(java.io.IOException e) { return curPos; } + } +} +private int jjMoveStringLiteralDfa0_0() +{ + return jjMoveNfa_0(0, 0); +} +private int jjMoveNfa_0(int startState, int curPos) +{ + int startsAt = 0; + jjnewStateCnt = 3; + int i = 1; + jjstateSet[0] = startState; + int kind = 0x7fffffff; + for (;;) + { + if (++jjround == 0x7fffffff) + ReInitRounds(); + if (curChar < 64) + { + long l = 1L << curChar; + do + { + switch(jjstateSet[--i]) + { + case 0: + if ((0x3ff000000000000L & l) == 0L) + break; + if (kind > 29) + kind = 29; + jjAddStates(25, 26); + break; + case 1: + if (curChar == 46) + jjCheckNAdd(2); + break; + case 2: + if ((0x3ff000000000000L & l) == 0L) + break; + if (kind > 29) + kind = 29; + jjCheckNAdd(2); + break; + default : break; + } + } while(i != startsAt); + } + else if (curChar < 128) + { + long l = 1L << (curChar & 077); + do + { + switch(jjstateSet[--i]) + { + default : break; + } + } while(i != startsAt); + } + else + { + int hiByte = (int)(curChar >> 8); + int i1 = hiByte >> 6; + long l1 = 1L << (hiByte & 077); + int i2 = (curChar & 0xff) >> 6; + 
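+ // Decompose the character for the shared jjCanMove_* bit-vector
+ // tests: hiByte selects a vector, i2 one of its four longs, l2 the
+ // flag bit. This NFA (the Boost state, digits only) defines no
+ // non-ASCII transitions, so the switch below matches nothing.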
long l2 = 1L << (curChar & 077); + do + { + switch(jjstateSet[--i]) + { + default : break; + } + } while(i != startsAt); + } + if (kind != 0x7fffffff) + { + jjmatchedKind = kind; + jjmatchedPos = curPos; + kind = 0x7fffffff; + } + ++curPos; + if ((i = jjnewStateCnt) == (startsAt = 3 - (jjnewStateCnt = startsAt))) + return curPos; + try { curChar = input_stream.readChar(); } + catch(java.io.IOException e) { return curPos; } + } +} +private final int jjStopStringLiteralDfa_2(int pos, long active0) +{ + switch (pos) + { + case 0: + if ((active0 & 0x40000000L) != 0L) + { + jjmatchedKind = 33; + return 6; + } + return -1; + default : + return -1; + } +} +private final int jjStartNfa_2(int pos, long active0) +{ + return jjMoveNfa_2(jjStopStringLiteralDfa_2(pos, active0), pos + 1); +} +private int jjMoveStringLiteralDfa0_2() +{ + switch(curChar) + { + case 84: + return jjMoveStringLiteralDfa1_2(0x40000000L); + case 93: + return jjStopAtPos(0, 31); + default : + return jjMoveNfa_2(0, 0); + } +} +private int jjMoveStringLiteralDfa1_2(long active0) +{ + try { curChar = input_stream.readChar(); } + catch(java.io.IOException e) { + jjStopStringLiteralDfa_2(0, active0); + return 1; + } + switch(curChar) + { + case 79: + if ((active0 & 0x40000000L) != 0L) + return jjStartNfaWithStates_2(1, 30, 6); + break; + default : + break; + } + return jjStartNfa_2(0, active0); +} +private int jjStartNfaWithStates_2(int pos, int kind, int state) +{ + jjmatchedKind = kind; + jjmatchedPos = pos; + try { curChar = input_stream.readChar(); } + catch(java.io.IOException e) { return pos + 1; } + return jjMoveNfa_2(state, pos + 1); +} +private int jjMoveNfa_2(int startState, int curPos) +{ + int startsAt = 0; + jjnewStateCnt = 7; + int i = 1; + jjstateSet[0] = startState; + int kind = 0x7fffffff; + for (;;) + { + if (++jjround == 0x7fffffff) + ReInitRounds(); + if (curChar < 64) + { + long l = 1L << curChar; + do + { + switch(jjstateSet[--i]) + { + case 0: + if ((0xfffffffeffffffffL & l) != 0L) + { + if (kind > 33) + kind = 33; + jjCheckNAdd(6); + } + if ((0x100002600L & l) != 0L) + { + if (kind > 9) + kind = 9; + } + else if (curChar == 34) + jjCheckNAddTwoStates(2, 4); + break; + case 1: + if (curChar == 34) + jjCheckNAddTwoStates(2, 4); + break; + case 2: + if ((0xfffffffbffffffffL & l) != 0L) + jjCheckNAddStates(22, 24); + break; + case 3: + if (curChar == 34) + jjCheckNAddStates(22, 24); + break; + case 5: + if (curChar == 34 && kind > 32) + kind = 32; + break; + case 6: + if ((0xfffffffeffffffffL & l) == 0L) + break; + if (kind > 33) + kind = 33; + jjCheckNAdd(6); + break; + default : break; + } + } while(i != startsAt); + } + else if (curChar < 128) + { + long l = 1L << (curChar & 077); + do + { + switch(jjstateSet[--i]) + { + case 0: + case 6: + if ((0xffffffffdfffffffL & l) == 0L) + break; + if (kind > 33) + kind = 33; + jjCheckNAdd(6); + break; + case 2: + jjAddStates(22, 24); + break; + case 4: + if (curChar == 92) + jjstateSet[jjnewStateCnt++] = 3; + break; + default : break; + } + } while(i != startsAt); + } + else + { + int hiByte = (int)(curChar >> 8); + int i1 = hiByte >> 6; + long l1 = 1L << (hiByte & 077); + int i2 = (curChar & 0xff) >> 6; + long l2 = 1L << (curChar & 077); + do + { + switch(jjstateSet[--i]) + { + case 0: + if (jjCanMove_0(hiByte, i1, i2, l1, l2)) + { + if (kind > 9) + kind = 9; + } + if (jjCanMove_1(hiByte, i1, i2, l1, l2)) + { + if (kind > 33) + kind = 33; + jjCheckNAdd(6); + } + break; + case 2: + if (jjCanMove_1(hiByte, i1, i2, l1, l2)) + jjAddStates(22, 24); + break; + case 6: + if 
(!jjCanMove_1(hiByte, i1, i2, l1, l2)) + break; + if (kind > 33) + kind = 33; + jjCheckNAdd(6); + break; + default : break; + } + } while(i != startsAt); + } + if (kind != 0x7fffffff) + { + jjmatchedKind = kind; + jjmatchedPos = curPos; + kind = 0x7fffffff; + } + ++curPos; + if ((i = jjnewStateCnt) == (startsAt = 7 - (jjnewStateCnt = startsAt))) + return curPos; + try { curChar = input_stream.readChar(); } + catch(java.io.IOException e) { return curPos; } + } +} +static final int[] jjnextStates = { + 24, 25, 27, 48, 51, 37, 52, 49, 43, 44, 46, 29, 30, 32, 34, 35, + 51, 37, 52, 50, 53, 41, 2, 4, 5, 0, 1, +}; +private static final boolean jjCanMove_0(int hiByte, int i1, int i2, long l1, long l2) +{ + switch(hiByte) + { + case 48: + return ((jjbitVec0[i2] & l2) != 0L); + default : + return false; + } +} +private static final boolean jjCanMove_1(int hiByte, int i1, int i2, long l1, long l2) +{ + switch(hiByte) + { + case 0: + return ((jjbitVec3[i2] & l2) != 0L); + default : + if ((jjbitVec1[i1] & l1) != 0L) + return true; + return false; + } +} +private static final boolean jjCanMove_2(int hiByte, int i1, int i2, long l1, long l2) +{ + switch(hiByte) + { + case 0: + return ((jjbitVec3[i2] & l2) != 0L); + case 48: + return ((jjbitVec1[i2] & l2) != 0L); + default : + if ((jjbitVec4[i1] & l1) != 0L) + return true; + return false; + } +} + +/** Token literal values. */ +public static final String[] jjstrLiteralImages = { +"", null, null, null, null, null, null, null, null, null, null, null, "\41", +"\53", null, "\50", "\51", "\72", "\52", "\136", null, null, null, null, null, null, +"\133", "\173", null, null, "\124\117", "\135", null, null, "\124\117", "\175", null, +null, }; + +/** Lexer state names. */ +public static final String[] lexStateNames = { + "Boost", + "RangeEx", + "RangeIn", + "DEFAULT", +}; + +/** Lex State array. */ +public static final int[] jjnewLexState = { + -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 0, -1, -1, -1, -1, -1, + -1, 2, 1, -1, 3, -1, 3, -1, -1, -1, 3, -1, -1, +}; +static final long[] jjtoToken = { + 0x3ffffffc01L, +}; +static final long[] jjtoSkip = { + 0x200L, +}; +protected CharStream input_stream; +private final int[] jjrounds = new int[55]; +private final int[] jjstateSet = new int[110]; +protected char curChar; +/** Constructor. */ +public InvenioQueryParserTokenManager(CharStream stream){ + input_stream = stream; +} + +/** Constructor. */ +public InvenioQueryParserTokenManager(CharStream stream, int lexState){ + this(stream); + SwitchTo(lexState); +} + +/** Reinitialise parser. */ +public void ReInit(CharStream stream) +{ + jjmatchedPos = jjnewStateCnt = 0; + curLexState = defaultLexState; + input_stream = stream; + ReInitRounds(); +} +private void ReInitRounds() +{ + int i; + jjround = 0x80000001; + for (i = 55; i-- > 0;) + jjrounds[i] = 0x80000000; +} + +/** Reinitialise parser. */ +public void ReInit(CharStream stream, int lexState) +{ + ReInit(stream); + SwitchTo(lexState); +} + +/** Switch to specified lex state. */ +public void SwitchTo(int lexState) +{ + if (lexState >= 4 || lexState < 0) + throw new TokenMgrError("Error: Ignoring invalid lexical state : " + lexState + ". State unchanged.", TokenMgrError.INVALID_LEXICAL_STATE); + else + curLexState = lexState; +} + +protected Token jjFillToken() +{ + final Token t; + final String curTokenImage; + final int beginLine; + final int endLine; + final int beginColumn; + final int endColumn; + String im = jjstrLiteralImages[jjmatchedKind]; + curTokenImage = (im == null) ? 
input_stream.GetImage() : im; + beginLine = input_stream.getBeginLine(); + beginColumn = input_stream.getBeginColumn(); + endLine = input_stream.getEndLine(); + endColumn = input_stream.getEndColumn(); + t = Token.newToken(jjmatchedKind, curTokenImage); + + t.beginLine = beginLine; + t.endLine = endLine; + t.beginColumn = beginColumn; + t.endColumn = endColumn; + + return t; +} + +int curLexState = 3; +int defaultLexState = 3; +int jjnewStateCnt; +int jjround; +int jjmatchedPos; +int jjmatchedKind; + +/** Get the next Token. */ +public Token getNextToken() +{ + Token matchedToken; + int curPos = 0; + + EOFLoop : + for (;;) + { + try + { + curChar = input_stream.BeginToken(); + } + catch(java.io.IOException e) + { + jjmatchedKind = 0; + matchedToken = jjFillToken(); + return matchedToken; + } + + switch(curLexState) + { + case 0: + jjmatchedKind = 0x7fffffff; + jjmatchedPos = 0; + curPos = jjMoveStringLiteralDfa0_0(); + break; + case 1: + jjmatchedKind = 0x7fffffff; + jjmatchedPos = 0; + curPos = jjMoveStringLiteralDfa0_1(); + break; + case 2: + jjmatchedKind = 0x7fffffff; + jjmatchedPos = 0; + curPos = jjMoveStringLiteralDfa0_2(); + break; + case 3: + jjmatchedKind = 0x7fffffff; + jjmatchedPos = 0; + curPos = jjMoveStringLiteralDfa0_3(); + break; + } + if (jjmatchedKind != 0x7fffffff) + { + if (jjmatchedPos + 1 < curPos) + input_stream.backup(curPos - jjmatchedPos - 1); + if ((jjtoToken[jjmatchedKind >> 6] & (1L << (jjmatchedKind & 077))) != 0L) + { + matchedToken = jjFillToken(); + if (jjnewLexState[jjmatchedKind] != -1) + curLexState = jjnewLexState[jjmatchedKind]; + return matchedToken; + } + else + { + if (jjnewLexState[jjmatchedKind] != -1) + curLexState = jjnewLexState[jjmatchedKind]; + continue EOFLoop; + } + } + int error_line = input_stream.getEndLine(); + int error_column = input_stream.getEndColumn(); + String error_after = null; + boolean EOFSeen = false; + try { input_stream.readChar(); input_stream.backup(1); } + catch (java.io.IOException e1) { + EOFSeen = true; + error_after = curPos <= 1 ? "" : input_stream.GetImage(); + if (curChar == '\n' || curChar == '\r') { + error_line++; + error_column = 0; + } + else + error_column++; + } + if (!EOFSeen) { + input_stream.backup(1); + error_after = curPos <= 1 ? "" : input_stream.GetImage(); + } + throw new TokenMgrError(EOFSeen, curLexState, error_line, error_column, error_after, curChar, TokenMgrError.LEXICAL_ERROR); + } +} + +private void jjCheckNAdd(int state) +{ + if (jjrounds[state] != jjround) + { + jjstateSet[jjnewStateCnt++] = state; + jjrounds[state] = jjround; + } +} +private void jjAddStates(int start, int end) +{ + do { + jjstateSet[jjnewStateCnt++] = jjnextStates[start]; + } while (start++ != end); +} +private void jjCheckNAddTwoStates(int state1, int state2) +{ + jjCheckNAdd(state1); + jjCheckNAdd(state2); +} + +private void jjCheckNAddStates(int start, int end) +{ + do { + jjCheckNAdd(jjnextStates[start]); + } while (start++ != end); +} + +} diff --git a/src/java/org/apache/lucene/queryParser/ParseException.java b/src/java/org/apache/lucene/queryParser/ParseException.java new file mode 100644 index 000000000..276fb74aa --- /dev/null +++ b/src/java/org/apache/lucene/queryParser/ParseException.java @@ -0,0 +1,187 @@ +/* Generated By:JavaCC: Do not edit this line. ParseException.java Version 5.0 */ +/* JavaCCOptions:KEEP_LINE_COL=null */ +package org.apache.lucene.queryParser; + +/** + * This exception is thrown when parse errors are encountered. 
+ * You can explicitly create objects of this exception type by + * calling the method generateParseException in the generated + * parser. + * + * You can modify this class to customize your error reporting + * mechanisms so long as you retain the public fields. + */ +public class ParseException extends Exception { + + /** + * The version identifier for this Serializable class. + * Increment only if the serialized form of the + * class changes. + */ + private static final long serialVersionUID = 1L; + + /** + * This constructor is used by the method "generateParseException" + * in the generated parser. Calling this constructor generates + * a new object of this type with the fields "currentToken", + * "expectedTokenSequences", and "tokenImage" set. + */ + public ParseException(Token currentTokenVal, + int[][] expectedTokenSequencesVal, + String[] tokenImageVal + ) + { + super(initialise(currentTokenVal, expectedTokenSequencesVal, tokenImageVal)); + currentToken = currentTokenVal; + expectedTokenSequences = expectedTokenSequencesVal; + tokenImage = tokenImageVal; + } + + /** + * The following constructors are for use by you for whatever + * purpose you can think of. Constructing the exception in this + * manner makes the exception behave in the normal way - i.e., as + * documented in the class "Throwable". The fields "errorToken", + * "expectedTokenSequences", and "tokenImage" do not contain + * relevant information. The JavaCC generated code does not use + * these constructors. + */ + + public ParseException() { + super(); + } + + /** Constructor with message. */ + public ParseException(String message) { + super(message); + } + + + /** + * This is the last token that has been consumed successfully. If + * this object has been created due to a parse error, the token + * followng this token will (therefore) be the first error token. + */ + public Token currentToken; + + /** + * Each entry in this array is an array of integers. Each array + * of integers represents a sequence of tokens (by their ordinal + * values) that is expected at this point of the parse. + */ + public int[][] expectedTokenSequences; + + /** + * This is a reference to the "tokenImage" array of the generated + * parser within which the parse error occurred. This array is + * defined in the generated ...Constants interface. + */ + public String[] tokenImage; + + /** + * It uses "currentToken" and "expectedTokenSequences" to generate a parse + * error message and returns it. If this object has been created + * due to a parse error, and you do not catch it (it gets thrown + * from the parser) the correct error message + * gets displayed. 
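+   * <p>
+   * Illustrative sketch (assumed caller-side usage; "parser" and "input"
+   * are placeholders, not part of the original comment) of reading these
+   * fields directly:
+   * <pre>
+   *   try { parser.parse(input); }
+   *   catch (ParseException pe) {
+   *     Token bad = pe.currentToken.next;  // first unexpected token
+   *     System.err.println("Unexpected '" + bad.image + "' at line "
+   *         + bad.beginLine + ", column " + bad.beginColumn);
+   *   }
+   * </pre>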
+ */ + private static String initialise(Token currentToken, + int[][] expectedTokenSequences, + String[] tokenImage) { + String eol = System.getProperty("line.separator", "\n"); + StringBuffer expected = new StringBuffer(); + int maxSize = 0; + for (int i = 0; i < expectedTokenSequences.length; i++) { + if (maxSize < expectedTokenSequences[i].length) { + maxSize = expectedTokenSequences[i].length; + } + for (int j = 0; j < expectedTokenSequences[i].length; j++) { + expected.append(tokenImage[expectedTokenSequences[i][j]]).append(' '); + } + if (expectedTokenSequences[i][expectedTokenSequences[i].length - 1] != 0) { + expected.append("..."); + } + expected.append(eol).append(" "); + } + String retval = "Encountered \""; + Token tok = currentToken.next; + for (int i = 0; i < maxSize; i++) { + if (i != 0) retval += " "; + if (tok.kind == 0) { + retval += tokenImage[0]; + break; + } + retval += " " + tokenImage[tok.kind]; + retval += " \""; + retval += add_escapes(tok.image); + retval += " \""; + tok = tok.next; + } + retval += "\" at line " + currentToken.next.beginLine + ", column " + currentToken.next.beginColumn; + retval += "." + eol; + if (expectedTokenSequences.length == 1) { + retval += "Was expecting:" + eol + " "; + } else { + retval += "Was expecting one of:" + eol + " "; + } + retval += expected.toString(); + return retval; + } + + /** + * The end of line string for this machine. + */ + protected String eol = System.getProperty("line.separator", "\n"); + + /** + * Used to convert raw characters to their escaped version + * when these raw version cannot be used as part of an ASCII + * string literal. + */ + static String add_escapes(String str) { + StringBuffer retval = new StringBuffer(); + char ch; + for (int i = 0; i < str.length(); i++) { + switch (str.charAt(i)) + { + case 0 : + continue; + case '\b': + retval.append("\\b"); + continue; + case '\t': + retval.append("\\t"); + continue; + case '\n': + retval.append("\\n"); + continue; + case '\f': + retval.append("\\f"); + continue; + case '\r': + retval.append("\\r"); + continue; + case '\"': + retval.append("\\\""); + continue; + case '\'': + retval.append("\\\'"); + continue; + case '\\': + retval.append("\\\\"); + continue; + default: + if ((ch = str.charAt(i)) < 0x20 || ch > 0x7e) { + String s = "0000" + Integer.toString(ch, 16); + retval.append("\\u" + s.substring(s.length() - 4, s.length())); + } else { + retval.append(ch); + } + continue; + } + } + return retval.toString(); + } + +} +/* JavaCC - OriginalChecksum=2e7670d6260cd2ac6c9cbda0075541b7 (do not edit this line) */ diff --git a/src/java/org/apache/lucene/queryParser/QueryParser.jj b/src/java/org/apache/lucene/queryParser/QueryParser.jj new file mode 100755 index 000000000..c739104f7 --- /dev/null +++ b/src/java/org/apache/lucene/queryParser/QueryParser.jj @@ -0,0 +1,1483 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+ * See the License for the specific language governing permissions and + * limitations under the License. + */ + +options { + STATIC=false; + JAVA_UNICODE_ESCAPE=true; + USER_CHAR_STREAM=true; +} + +PARSER_BEGIN(QueryParser) + +package org.apache.lucene.queryParser; + +import java.io.IOException; +import java.io.StringReader; +import java.text.Collator; +import java.text.DateFormat; +import java.util.ArrayList; +import java.util.Calendar; +import java.util.Date; +import java.util.HashMap; +import java.util.List; +import java.util.Locale; +import java.util.Map; +import java.util.Vector; + +import org.apache.lucene.analysis.Analyzer; +import org.apache.lucene.analysis.CachingTokenFilter; +import org.apache.lucene.analysis.TokenStream; +import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute; +import org.apache.lucene.analysis.tokenattributes.TermAttribute; +import org.apache.lucene.document.DateField; +import org.apache.lucene.document.DateTools; +import org.apache.lucene.index.Term; +import org.apache.lucene.search.BooleanClause; +import org.apache.lucene.search.BooleanQuery; +import org.apache.lucene.search.FuzzyQuery; +import org.apache.lucene.search.MultiTermQuery; +import org.apache.lucene.search.MatchAllDocsQuery; +import org.apache.lucene.search.MultiPhraseQuery; +import org.apache.lucene.search.PhraseQuery; +import org.apache.lucene.search.PrefixQuery; +import org.apache.lucene.search.Query; +import org.apache.lucene.search.TermRangeQuery; +import org.apache.lucene.search.TermQuery; +import org.apache.lucene.search.WildcardQuery; +import org.apache.lucene.util.Parameter; +import org.apache.lucene.util.Version; + +/** + * This class is generated by JavaCC. The most important method is + * {@link #parse(String)}. + * + * The syntax for query strings is as follows: + * A Query is a series of clauses. + * A clause may be prefixed by: + *
+ * <ul>
+ * <li> a plus (+) or a minus (-) sign, indicating
+ * that the clause is required or prohibited respectively; or
+ * <li> a term followed by a colon, indicating the field to be searched.
+ * This enables one to construct queries which search multiple fields.
+ * </ul>
+ *
+ * A clause may be either:
+ * <ul>
+ * <li> a term, indicating all the documents that contain this term; or
+ * <li> a nested query, enclosed in parentheses. Note that this may be used
+ * with a +/- prefix to require any of a set of
+ * terms.
+ * </ul>
+ *
+ * Thus, in BNF, the query grammar is:
+ * <pre>
+ *   Query  ::= ( Clause )*
+ *   Clause ::= ["+", "-"] [<TERM> ":"] ( <TERM> | "(" Query ")" )
+ * </pre>
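+ *
+ * For instance (an illustrative query, not part of the original javadoc):
+ * <pre>
+ *   +title:lucene -(solr OR sphinx)
+ * </pre>
+ * parses as one Query with two Clauses: a required term clause on the
+ * title field, and a prohibited nested query.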
+ *
+ * <p>
+ * Examples of appropriately formatted queries can be found in the query syntax
+ * documentation.
+ *
+ * <p>
+ * In {@link TermRangeQuery}s, QueryParser tries to detect date values, e.g.
+ * date:[6/1/2005 TO 6/4/2005] produces a range query that searches
+ * for "date" fields between 2005-06-01 and 2005-06-04. Note that the format
+ * of the accepted input depends on {@link #setLocale(Locale) the locale}.
+ * By default a date is converted into a search term using the deprecated
+ * {@link DateField} for compatibility reasons.
+ * To use the new {@link DateTools} to convert dates, a
+ * {@link org.apache.lucene.document.DateTools.Resolution} has to be set.
+ *
+ * <p>
+ * The date resolution that shall be used for RangeQueries can be set
+ * using {@link #setDateResolution(DateTools.Resolution)}
+ * or {@link #setDateResolution(String, DateTools.Resolution)}. The former
+ * sets the default date resolution for all fields, whereas the latter can
+ * be used to set field specific date resolutions. Field specific date
+ * resolutions take, if set, precedence over the default date resolution.
+ *
+ * <p>
+ * If you use neither {@link DateField} nor {@link DateTools} in your
+ * index, you can create your own
+ * query parser that inherits QueryParser and overwrites
+ * {@link #getRangeQuery(String, String, String, boolean)} to
+ * use a different method for date conversion.
+ *
+ * <p>Note that QueryParser is <em>not</em> thread-safe.</p>
+ *
+ * <p><b>NOTE</b>: there is a new QueryParser in contrib, which matches
+ * the same syntax as this class, but is more modular,
+ * enabling substantial customization to how a query is created.
+ *
+ * <p><b>NOTE</b>: You must specify the required {@link Version}
+ * compatibility when creating QueryParser:
+ *
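+ * <pre>
+ *   // illustrative sketch (assumed usage; "contents" and analyzer are
+ *   // placeholders, not part of the original javadoc):
+ *   QueryParser parser = new QueryParser(Version.LUCENE_29, "contents", analyzer);
+ *   Query query = parser.parse("+montysolr title:solr*");
+ * </pre>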
+ */ +public class QueryParser { + + private static final int CONJ_NONE = 0; + private static final int CONJ_AND = 1; + private static final int CONJ_OR = 2; + + private static final int MOD_NONE = 0; + private static final int MOD_NOT = 10; + private static final int MOD_REQ = 11; + + // make it possible to call setDefaultOperator() without accessing + // the nested class: + /** Alternative form of QueryParser.Operator.AND */ + public static final Operator AND_OPERATOR = Operator.AND; + /** Alternative form of QueryParser.Operator.OR */ + public static final Operator OR_OPERATOR = Operator.OR; + + /** The actual operator that parser uses to combine query terms */ + private Operator operator = OR_OPERATOR; + + boolean lowercaseExpandedTerms = true; + MultiTermQuery.RewriteMethod multiTermRewriteMethod = MultiTermQuery.CONSTANT_SCORE_AUTO_REWRITE_DEFAULT; + boolean allowLeadingWildcard = false; + boolean enablePositionIncrements = true; + + Analyzer analyzer; + String field; + int phraseSlop = 0; + float fuzzyMinSim = FuzzyQuery.defaultMinSimilarity; + int fuzzyPrefixLength = FuzzyQuery.defaultPrefixLength; + Locale locale = Locale.getDefault(); + + // the default date resolution + DateTools.Resolution dateResolution = null; + // maps field names to date resolutions + Map fieldToDateResolution = null; + + // The collator to use when determining range inclusion, + // for use when constructing RangeQuerys. + Collator rangeCollator = null; + + /** The default operator for parsing queries. + * Use {@link QueryParser#setDefaultOperator} to change it. + */ + static public final class Operator extends Parameter { + private Operator(String name) { + super(name); + } + static public final Operator OR = new Operator("OR"); + static public final Operator AND = new Operator("AND"); + } + + + /** Constructs a query parser. + * @param f the default field for query terms. + * @param a used to find terms in the query text. + * @deprecated Use {@link #QueryParser(Version, String, Analyzer)} instead + */ + public QueryParser(String f, Analyzer a) { + this(Version.LUCENE_24, f, a); + } + + /** Constructs a query parser. + * @param matchVersion Lucene version to match. See {@link above) + * @param f the default field for query terms. + * @param a used to find terms in the query text. + */ + public QueryParser(Version matchVersion, String f, Analyzer a) { + this(new FastCharStream(new StringReader(""))); + analyzer = a; + field = f; + if (matchVersion.onOrAfter(Version.LUCENE_29)) { + enablePositionIncrements = true; + } else { + enablePositionIncrements = false; + } + } + + /** Parses a query string, returning a {@link org.apache.lucene.search.Query}. + * @param query the query string to be parsed. + * @throws ParseException if the parsing fails + */ + public Query parse(String query) throws ParseException { + ReInit(new FastCharStream(new StringReader(query))); + try { + // TopLevelQuery is a Query followed by the end-of-input (EOF) + Query res = TopLevelQuery(field); + return res!=null ? 
res : newBooleanQuery(false); + } + catch (ParseException tme) { + // rethrow to include the original query: + ParseException e = new ParseException("Cannot parse '" +query+ "': " + tme.getMessage()); + e.initCause(tme); + throw e; + } + catch (TokenMgrError tme) { + ParseException e = new ParseException("Cannot parse '" +query+ "': " + tme.getMessage()); + e.initCause(tme); + throw e; + } + catch (BooleanQuery.TooManyClauses tmc) { + ParseException e = new ParseException("Cannot parse '" +query+ "': too many boolean clauses"); + e.initCause(tmc); + throw e; + } + } + + /** + * @return Returns the analyzer. + */ + public Analyzer getAnalyzer() { + return analyzer; + } + + /** + * @return Returns the field. + */ + public String getField() { + return field; + } + + /** + * Get the minimal similarity for fuzzy queries. + */ + public float getFuzzyMinSim() { + return fuzzyMinSim; + } + + /** + * Set the minimum similarity for fuzzy queries. + * Default is 0.5f. + */ + public void setFuzzyMinSim(float fuzzyMinSim) { + this.fuzzyMinSim = fuzzyMinSim; + } + + /** + * Get the prefix length for fuzzy queries. + * @return Returns the fuzzyPrefixLength. + */ + public int getFuzzyPrefixLength() { + return fuzzyPrefixLength; + } + + /** + * Set the prefix length for fuzzy queries. Default is 0. + * @param fuzzyPrefixLength The fuzzyPrefixLength to set. + */ + public void setFuzzyPrefixLength(int fuzzyPrefixLength) { + this.fuzzyPrefixLength = fuzzyPrefixLength; + } + + /** + * Sets the default slop for phrases. If zero, then exact phrase matches + * are required. Default value is zero. + */ + public void setPhraseSlop(int phraseSlop) { + this.phraseSlop = phraseSlop; + } + + /** + * Gets the default slop for phrases. + */ + public int getPhraseSlop() { + return phraseSlop; + } + + + /** + * Set to true to allow leading wildcard characters. + *
+ * <p>
+ * When set, * or ? are allowed as + * the first character of a PrefixQuery and WildcardQuery. + * Note that this can produce very slow + * queries on big indexes. + *
+ * <p>
+ * Default: false. + */ + public void setAllowLeadingWildcard(boolean allowLeadingWildcard) { + this.allowLeadingWildcard = allowLeadingWildcard; + } + + /** + * @see #setAllowLeadingWildcard(boolean) + */ + public boolean getAllowLeadingWildcard() { + return allowLeadingWildcard; + } + + /** + * Set to true to enable position increments in result query. + *
+ * <p>
+ * When set, result phrase and multi-phrase queries will + * be aware of position increments. + * Useful when e.g. a StopFilter increases the position increment of + * the token that follows an omitted token. + *
+ * <p>
+ * Default: false. + */ + public void setEnablePositionIncrements(boolean enable) { + this.enablePositionIncrements = enable; + } + + /** + * @see #setEnablePositionIncrements(boolean) + */ + public boolean getEnablePositionIncrements() { + return enablePositionIncrements; + } + + /** + * Sets the boolean operator of the QueryParser. + * In default mode (OR_OPERATOR) terms without any modifiers + * are considered optional: for example capital of Hungary is equal to + * capital OR of OR Hungary.
+ * In AND_OPERATOR mode terms are considered to be in conjunction: the + * above mentioned query is parsed as capital AND of AND Hungary + */ + public void setDefaultOperator(Operator op) { + this.operator = op; + } + + + /** + * Gets implicit operator setting, which will be either AND_OPERATOR + * or OR_OPERATOR. + */ + public Operator getDefaultOperator() { + return operator; + } + + + /** + * Whether terms of wildcard, prefix, fuzzy and range queries are to be automatically + * lower-cased or not. Default is true. + */ + public void setLowercaseExpandedTerms(boolean lowercaseExpandedTerms) { + this.lowercaseExpandedTerms = lowercaseExpandedTerms; + } + + + /** + * @see #setLowercaseExpandedTerms(boolean) + */ + public boolean getLowercaseExpandedTerms() { + return lowercaseExpandedTerms; + } + + /** + * @deprecated Please use {@link #setMultiTermRewriteMethod} instead. + */ + public void setUseOldRangeQuery(boolean useOldRangeQuery) { + if (useOldRangeQuery) { + setMultiTermRewriteMethod(MultiTermQuery.SCORING_BOOLEAN_QUERY_REWRITE); + } else { + setMultiTermRewriteMethod(MultiTermQuery.CONSTANT_SCORE_AUTO_REWRITE_DEFAULT); + } + } + + + /** + * @deprecated Please use {@link #getMultiTermRewriteMethod} instead. + */ + public boolean getUseOldRangeQuery() { + if (getMultiTermRewriteMethod() == MultiTermQuery.SCORING_BOOLEAN_QUERY_REWRITE) { + return true; + } else { + return false; + } + } + + /** + * By default QueryParser uses {@link MultiTermQuery#CONSTANT_SCORE_AUTO_REWRITE_DEFAULT} + * when creating a PrefixQuery, WildcardQuery or RangeQuery. This implementation is generally preferable because it + * a) Runs faster b) Does not have the scarcity of terms unduly influence score + * c) avoids any "TooManyBooleanClauses" exception. + * However, if your application really needs to use the + * old-fashioned BooleanQuery expansion rewriting and the above + * points are not relevant then use this to change + * the rewrite method. + */ + public void setMultiTermRewriteMethod(MultiTermQuery.RewriteMethod method) { + multiTermRewriteMethod = method; + } + + + /** + * @see #setMultiTermRewriteMethod + */ + public MultiTermQuery.RewriteMethod getMultiTermRewriteMethod() { + return multiTermRewriteMethod; + } + + /** + * Set locale used by date range parsing. + */ + public void setLocale(Locale locale) { + this.locale = locale; + } + + /** + * Returns current locale, allowing access by subclasses. + */ + public Locale getLocale() { + return locale; + } + + /** + * Sets the default date resolution used by RangeQueries for fields for which no + * specific date resolutions has been set. Field specific resolutions can be set + * with {@link #setDateResolution(String, DateTools.Resolution)}. + * + * @param dateResolution the default date resolution to set + */ + public void setDateResolution(DateTools.Resolution dateResolution) { + this.dateResolution = dateResolution; + } + + /** + * Sets the date resolution used by RangeQueries for a specific field. 
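+   * <p>
+   * For example (illustrative only, not part of the original comment):
+   * <pre>
+   *   parser.setDateResolution("date", DateTools.Resolution.DAY);
+   * </pre>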
+ * + * @param fieldName field for which the date resolution is to be set + * @param dateResolution date resolution to set + */ + public void setDateResolution(String fieldName, DateTools.Resolution dateResolution) { + if (fieldName == null) { + throw new IllegalArgumentException("Field cannot be null."); + } + + if (fieldToDateResolution == null) { + // lazily initialize HashMap + fieldToDateResolution = new HashMap(); + } + + fieldToDateResolution.put(fieldName, dateResolution); + } + + /** + * Returns the date resolution that is used by RangeQueries for the given field. + * Returns null, if no default or field specific date resolution has been set + * for the given field. + * + */ + public DateTools.Resolution getDateResolution(String fieldName) { + if (fieldName == null) { + throw new IllegalArgumentException("Field cannot be null."); + } + + if (fieldToDateResolution == null) { + // no field specific date resolutions set; return default date resolution instead + return this.dateResolution; + } + + DateTools.Resolution resolution = (DateTools.Resolution) fieldToDateResolution.get(fieldName); + if (resolution == null) { + // no date resolutions set for the given field; return default date resolution instead + resolution = this.dateResolution; + } + + return resolution; + } + + /** + * Sets the collator used to determine index term inclusion in ranges + * for RangeQuerys. + *
+ * <p>
+ * WARNING: Setting the rangeCollator to a non-null + * collator using this method will cause every single index Term in the + * Field referenced by lowerTerm and/or upperTerm to be examined. + * Depending on the number of index Terms in this Field, the operation could + * be very slow. + * + * @param rc the collator to use when constructing RangeQuerys + */ + public void setRangeCollator(Collator rc) { + rangeCollator = rc; + } + + /** + * @return the collator used to determine index term inclusion in ranges + * for RangeQuerys. + */ + public Collator getRangeCollator() { + return rangeCollator; + } + + /** + * @deprecated use {@link #addClause(List, int, int, Query)} instead. + */ + protected void addClause(Vector clauses, int conj, int mods, Query q) { + addClause((List) clauses, conj, mods, q); + } + + protected void addClause(List clauses, int conj, int mods, Query q) { + boolean required, prohibited; + + // If this term is introduced by AND, make the preceding term required, + // unless it's already prohibited + if (clauses.size() > 0 && conj == CONJ_AND) { + BooleanClause c = (BooleanClause) clauses.get(clauses.size()-1); + if (!c.isProhibited()) + c.setOccur(BooleanClause.Occur.MUST); + } + + if (clauses.size() > 0 && operator == AND_OPERATOR && conj == CONJ_OR) { + // If this term is introduced by OR, make the preceding term optional, + // unless it's prohibited (that means we leave -a OR b but +a OR b-->a OR b) + // notice if the input is a OR b, first term is parsed as required; without + // this modification a OR b would parsed as +a OR b + BooleanClause c = (BooleanClause) clauses.get(clauses.size()-1); + if (!c.isProhibited()) + c.setOccur(BooleanClause.Occur.SHOULD); + } + + // We might have been passed a null query; the term might have been + // filtered away by the analyzer. + if (q == null) + return; + + if (operator == OR_OPERATOR) { + // We set REQUIRED if we're introduced by AND or +; PROHIBITED if + // introduced by NOT or -; make sure not to set both. 
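+      // Illustrative summary of this OR_OPERATOR branch (added comment,
+      // not in the original source):
+      //   mods == MOD_NOT                    -> Occur.MUST_NOT
+      //   mods == MOD_REQ or conj == CONJ_AND -> Occur.MUST
+      //   otherwise                           -> Occur.SHOULD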
+ prohibited = (mods == MOD_NOT); + required = (mods == MOD_REQ); + if (conj == CONJ_AND && !prohibited) { + required = true; + } + } else { + // We set PROHIBITED if we're introduced by NOT or -; We set REQUIRED + // if not PROHIBITED and not introduced by OR + prohibited = (mods == MOD_NOT); + required = (!prohibited && conj != CONJ_OR); + } + if (required && !prohibited) + clauses.add(newBooleanClause(q, BooleanClause.Occur.MUST)); + else if (!required && !prohibited) + clauses.add(newBooleanClause(q, BooleanClause.Occur.SHOULD)); + else if (!required && prohibited) + clauses.add(newBooleanClause(q, BooleanClause.Occur.MUST_NOT)); + else + throw new RuntimeException("Clause cannot be both required and prohibited"); + } + + + /** + * @exception ParseException throw in overridden method to disallow + */ + protected Query getFieldQuery(String field, String queryText) throws ParseException { + // Use the analyzer to get all the tokens, and then build a TermQuery, + // PhraseQuery, or nothing based on the term count + + TokenStream source; + try { + source = analyzer.reusableTokenStream(field, new StringReader(queryText)); + source.reset(); + } catch (IOException e) { + source = analyzer.tokenStream(field, new StringReader(queryText)); + } + CachingTokenFilter buffer = new CachingTokenFilter(source); + TermAttribute termAtt = null; + PositionIncrementAttribute posIncrAtt = null; + int numTokens = 0; + + boolean success = false; + try { + buffer.reset(); + success = true; + } catch (IOException e) { + // success==false if we hit an exception + } + if (success) { + if (buffer.hasAttribute(TermAttribute.class)) { + termAtt = (TermAttribute) buffer.getAttribute(TermAttribute.class); + } + if (buffer.hasAttribute(PositionIncrementAttribute.class)) { + posIncrAtt = (PositionIncrementAttribute) buffer.getAttribute(PositionIncrementAttribute.class); + } + } + + int positionCount = 0; + boolean severalTokensAtSamePosition = false; + + boolean hasMoreTokens = false; + if (termAtt != null) { + try { + hasMoreTokens = buffer.incrementToken(); + while (hasMoreTokens) { + numTokens++; + int positionIncrement = (posIncrAtt != null) ? 
posIncrAtt.getPositionIncrement() : 1; + if (positionIncrement != 0) { + positionCount += positionIncrement; + } else { + severalTokensAtSamePosition = true; + } + hasMoreTokens = buffer.incrementToken(); + } + } catch (IOException e) { + // ignore + } + } + try { + // rewind the buffer stream + buffer.reset(); + + // close original stream - all tokens buffered + source.close(); + } + catch (IOException e) { + // ignore + } + + if (numTokens == 0) + return null; + else if (numTokens == 1) { + String term = null; + try { + boolean hasNext = buffer.incrementToken(); + assert hasNext == true; + term = termAtt.term(); + } catch (IOException e) { + // safe to ignore, because we know the number of tokens + } + return newTermQuery(new Term(field, term)); + } else { + if (severalTokensAtSamePosition) { + if (positionCount == 1) { + // no phrase query: + BooleanQuery q = newBooleanQuery(true); + for (int i = 0; i < numTokens; i++) { + String term = null; + try { + boolean hasNext = buffer.incrementToken(); + assert hasNext == true; + term = termAtt.term(); + } catch (IOException e) { + // safe to ignore, because we know the number of tokens + } + + Query currentQuery = newTermQuery( + new Term(field, term)); + q.add(currentQuery, BooleanClause.Occur.SHOULD); + } + return q; + } + else { + // phrase query: + MultiPhraseQuery mpq = newMultiPhraseQuery(); + mpq.setSlop(phraseSlop); + List multiTerms = new ArrayList(); + int position = -1; + for (int i = 0; i < numTokens; i++) { + String term = null; + int positionIncrement = 1; + try { + boolean hasNext = buffer.incrementToken(); + assert hasNext == true; + term = termAtt.term(); + if (posIncrAtt != null) { + positionIncrement = posIncrAtt.getPositionIncrement(); + } + } catch (IOException e) { + // safe to ignore, because we know the number of tokens + } + + if (positionIncrement > 0 && multiTerms.size() > 0) { + if (enablePositionIncrements) { + mpq.add((Term[])multiTerms.toArray(new Term[0]),position); + } else { + mpq.add((Term[])multiTerms.toArray(new Term[0])); + } + multiTerms.clear(); + } + position += positionIncrement; + multiTerms.add(new Term(field, term)); + } + if (enablePositionIncrements) { + mpq.add((Term[])multiTerms.toArray(new Term[0]),position); + } else { + mpq.add((Term[])multiTerms.toArray(new Term[0])); + } + return mpq; + } + } + else { + PhraseQuery pq = newPhraseQuery(); + pq.setSlop(phraseSlop); + int position = -1; + + + for (int i = 0; i < numTokens; i++) { + String term = null; + int positionIncrement = 1; + + try { + boolean hasNext = buffer.incrementToken(); + assert hasNext == true; + term = termAtt.term(); + if (posIncrAtt != null) { + positionIncrement = posIncrAtt.getPositionIncrement(); + } + } catch (IOException e) { + // safe to ignore, because we know the number of tokens + } + + if (enablePositionIncrements) { + position += positionIncrement; + pq.add(new Term(field, term),position); + } else { + pq.add(new Term(field, term)); + } + } + return pq; + } + } + } + + + + /** + * Base implementation delegates to {@link #getFieldQuery(String,String)}. + * This method may be overridden, for example, to return + * a SpanNearQuery instead of a PhraseQuery. 
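+   * <p>
+   * A sketch of such an override (illustrative; toSpanNear is an assumed
+   * helper, not part of this class):
+   * <pre>
+   *   protected Query getFieldQuery(String f, String text, int slop)
+   *       throws ParseException {
+   *     Query q = super.getFieldQuery(f, text, slop);
+   *     return (q instanceof PhraseQuery) ? toSpanNear((PhraseQuery) q, slop) : q;
+   *   }
+   * </pre>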
+ * + * @exception ParseException throw in overridden method to disallow + */ + protected Query getFieldQuery(String field, String queryText, int slop) + throws ParseException { + Query query = getFieldQuery(field, queryText); + + if (query instanceof PhraseQuery) { + ((PhraseQuery) query).setSlop(slop); + } + if (query instanceof MultiPhraseQuery) { + ((MultiPhraseQuery) query).setSlop(slop); + } + + return query; + } + + + /** + * @exception ParseException throw in overridden method to disallow + */ + protected Query getRangeQuery(String field, + String part1, + String part2, + boolean inclusive) throws ParseException + { + if (lowercaseExpandedTerms) { + part1 = part1.toLowerCase(); + part2 = part2.toLowerCase(); + } + try { + DateFormat df = DateFormat.getDateInstance(DateFormat.SHORT, locale); + df.setLenient(true); + Date d1 = df.parse(part1); + Date d2 = df.parse(part2); + if (inclusive) { + // The user can only specify the date, not the time, so make sure + // the time is set to the latest possible time of that date to really + // include all documents: + Calendar cal = Calendar.getInstance(locale); + cal.setTime(d2); + cal.set(Calendar.HOUR_OF_DAY, 23); + cal.set(Calendar.MINUTE, 59); + cal.set(Calendar.SECOND, 59); + cal.set(Calendar.MILLISECOND, 999); + d2 = cal.getTime(); + } + DateTools.Resolution resolution = getDateResolution(field); + if (resolution == null) { + // no default or field specific date resolution has been set, + // use deprecated DateField to maintain compatibility with + // pre-1.9 Lucene versions. + part1 = DateField.dateToString(d1); + part2 = DateField.dateToString(d2); + } else { + part1 = DateTools.dateToString(d1, resolution); + part2 = DateTools.dateToString(d2, resolution); + } + } + catch (Exception e) { } + + return newRangeQuery(field, part1, part2, inclusive); + } + + /** + * Builds a new BooleanQuery instance + * @param disableCoord disable coord + * @return new BooleanQuery instance + */ + protected BooleanQuery newBooleanQuery(boolean disableCoord) { + return new BooleanQuery(disableCoord); + } + + /** + * Builds a new BooleanClause instance + * @param q sub query + * @param occur how this clause should occur when matching documents + * @return new BooleanClause instance + */ + protected BooleanClause newBooleanClause(Query q, BooleanClause.Occur occur) { + return new BooleanClause(q, occur); + } + + /** + * Builds a new TermQuery instance + * @param term term + * @return new TermQuery instance + */ + protected Query newTermQuery(Term term){ + return new TermQuery(term); + } + + /** + * Builds a new PhraseQuery instance + * @return new PhraseQuery instance + */ + protected PhraseQuery newPhraseQuery(){ + return new PhraseQuery(); + } + + /** + * Builds a new MultiPhraseQuery instance + * @return new MultiPhraseQuery instance + */ + protected MultiPhraseQuery newMultiPhraseQuery(){ + return new MultiPhraseQuery(); + } + + /** + * Builds a new PrefixQuery instance + * @param prefix Prefix term + * @return new PrefixQuery instance + */ + protected Query newPrefixQuery(Term prefix){ + PrefixQuery query = new PrefixQuery(prefix); + query.setRewriteMethod(multiTermRewriteMethod); + return query; + } + + /** + * Builds a new FuzzyQuery instance + * @param term Term + * @param minimumSimilarity minimum similarity + * @param prefixLength prefix length + * @return new FuzzyQuery Instance + */ + protected Query newFuzzyQuery(Term term, float minimumSimilarity, int prefixLength) { + // FuzzyQuery doesn't yet allow constant score rewrite + return new 
FuzzyQuery(term,minimumSimilarity,prefixLength); + } + + /** + * Builds a new TermRangeQuery instance + * @param field Field + * @param part1 min + * @param part2 max + * @param inclusive true if range is inclusive + * @return new TermRangeQuery instance + */ + protected Query newRangeQuery(String field, String part1, String part2, boolean inclusive) { + final TermRangeQuery query = new TermRangeQuery(field, part1, part2, inclusive, inclusive, rangeCollator); + query.setRewriteMethod(multiTermRewriteMethod); + return query; + } + + /** + * Builds a new MatchAllDocsQuery instance + * @return new MatchAllDocsQuery instance + */ + protected Query newMatchAllDocsQuery() { + return new MatchAllDocsQuery(); + } + + /** + * Builds a new WildcardQuery instance + * @param t wildcard term + * @return new WildcardQuery instance + */ + protected Query newWildcardQuery(Term t) { + WildcardQuery query = new WildcardQuery(t); + query.setRewriteMethod(multiTermRewriteMethod); + return query; + } + + /** + * Factory method for generating query, given a set of clauses. + * By default creates a boolean query composed of clauses passed in. + * + * Can be overridden by extending classes, to modify query being + * returned. + * + * @param clauses List that contains {@link BooleanClause} instances + * to join. + * + * @return Resulting {@link Query} object. + * @exception ParseException throw in overridden method to disallow + * @deprecated use {@link #getBooleanQuery(List)} instead + */ + protected Query getBooleanQuery(Vector clauses) throws ParseException { + return getBooleanQuery((List) clauses, false); + } + + /** + * Factory method for generating query, given a set of clauses. + * By default creates a boolean query composed of clauses passed in. + * + * Can be overridden by extending classes, to modify query being + * returned. + * + * @param clauses List that contains {@link BooleanClause} instances + * to join. + * + * @return Resulting {@link Query} object. + * @exception ParseException throw in overridden method to disallow + */ + protected Query getBooleanQuery(List clauses) throws ParseException { + return getBooleanQuery(clauses, false); + } + + /** + * Factory method for generating query, given a set of clauses. + * By default creates a boolean query composed of clauses passed in. + * + * Can be overridden by extending classes, to modify query being + * returned. + * + * @param clauses List that contains {@link BooleanClause} instances + * to join. + * @param disableCoord true if coord scoring should be disabled. + * + * @return Resulting {@link Query} object. + * @exception ParseException throw in overridden method to disallow + * @deprecated use {@link #getBooleanQuery(List, boolean)} instead + */ + protected Query getBooleanQuery(Vector clauses, boolean disableCoord) + throws ParseException + { + return getBooleanQuery((List) clauses, disableCoord); + } + + /** + * Factory method for generating query, given a set of clauses. + * By default creates a boolean query composed of clauses passed in. + * + * Can be overridden by extending classes, to modify query being + * returned. + * + * @param clauses List that contains {@link BooleanClause} instances + * to join. + * @param disableCoord true if coord scoring should be disabled. + * + * @return Resulting {@link Query} object. 
+ * @exception ParseException throw in overridden method to disallow + */ + protected Query getBooleanQuery(List clauses, boolean disableCoord) + throws ParseException + { + if (clauses.size()==0) { + return null; // all clause words were filtered away by the analyzer. + } + BooleanQuery query = newBooleanQuery(disableCoord); + for (int i = 0; i < clauses.size(); i++) { + query.add((BooleanClause)clauses.get(i)); + } + return query; + } + + /** + * Factory method for generating a query. Called when parser + * parses an input term token that contains one or more wildcard + * characters (? and *), but is not a prefix term token (one + * that has just a single * character at the end) + *
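+   * <p>
+   * Illustration (assumed behaviour sketch, not from the original comment):
+   * <pre>
+   *   parser.parse("te?t*s");  // by default yields a WildcardQuery
+   *                            // for Term(defaultField, "te?t*s")
+   * </pre>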
+ * <p>
+ * Depending on settings, prefix term may be lower-cased + * automatically. It will not go through the default Analyzer, + * however, since normal Analyzers are unlikely to work properly + * with wildcard templates. + *
+ * <p>
+ * Can be overridden by extending classes, to provide custom handling for + * wildcard queries, which may be necessary due to missing analyzer calls. + * + * @param field Name of the field query will use. + * @param termStr Term token that contains one or more wild card + * characters (? or *), but is not simple prefix term + * + * @return Resulting {@link Query} built for the term + * @exception ParseException throw in overridden method to disallow + */ + protected Query getWildcardQuery(String field, String termStr) throws ParseException + { + if ("*".equals(field)) { + if ("*".equals(termStr)) return newMatchAllDocsQuery(); + } + if (!allowLeadingWildcard && (termStr.startsWith("*") || termStr.startsWith("?"))) + throw new ParseException("'*' or '?' not allowed as first character in WildcardQuery"); + if (lowercaseExpandedTerms) { + termStr = termStr.toLowerCase(); + } + Term t = new Term(field, termStr); + return newWildcardQuery(t); + } + + /** + * Factory method for generating a query (similar to + * {@link #getWildcardQuery}). Called when parser parses an input term + * token that uses prefix notation; that is, contains a single '*' wildcard + * character as its last character. Since this is a special case + * of generic wildcard term, and such a query can be optimized easily, + * this usually results in a different query object. + *
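+   * <p>
+   * Illustration (assumed behaviour sketch, not from the original comment):
+   * <pre>
+   *   parser.parse("app*");  // by default yields a PrefixQuery
+   *                          // for Term(defaultField, "app")
+   * </pre>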
+ * <p>
+ * Depending on settings, a prefix term may be lower-cased + * automatically. It will not go through the default Analyzer, + * however, since normal Analyzers are unlikely to work properly + * with wildcard templates. + *
+ * <p>
+ * Can be overridden by extending classes, to provide custom handling for + * wild card queries, which may be necessary due to missing analyzer calls. + * + * @param field Name of the field query will use. + * @param termStr Term token to use for building term for the query + * (without trailing '*' character!) + * + * @return Resulting {@link Query} built for the term + * @exception ParseException throw in overridden method to disallow + */ + protected Query getPrefixQuery(String field, String termStr) throws ParseException + { + if (!allowLeadingWildcard && termStr.startsWith("*")) + throw new ParseException("'*' not allowed as first character in PrefixQuery"); + if (lowercaseExpandedTerms) { + termStr = termStr.toLowerCase(); + } + Term t = new Term(field, termStr); + return newPrefixQuery(t); + } + + /** + * Factory method for generating a query (similar to + * {@link #getWildcardQuery}). Called when parser parses + * an input term token that has the fuzzy suffix (~) appended. + * + * @param field Name of the field query will use. + * @param termStr Term token to use for building term for the query + * + * @return Resulting {@link Query} built for the term + * @exception ParseException throw in overridden method to disallow + */ + protected Query getFuzzyQuery(String field, String termStr, float minSimilarity) throws ParseException + { + if (lowercaseExpandedTerms) { + termStr = termStr.toLowerCase(); + } + Term t = new Term(field, termStr); + return newFuzzyQuery(t, minSimilarity, fuzzyPrefixLength); + } + + /** + * Returns a String where the escape char has been + * removed, or kept only once if there was a double escape. + * + * Supports escaped unicode characters, e. g. translates + * \\u0041 to A. + * + */ + private String discardEscapeChar(String input) throws ParseException { + // Create char array to hold unescaped char sequence + char[] output = new char[input.length()]; + + // The length of the output can be less than the input + // due to discarded escape chars. This variable holds + // the actual length of the output + int length = 0; + + // We remember whether the last processed character was + // an escape character + boolean lastCharWasEscapeChar = false; + + // The multiplier the current unicode digit must be multiplied with. + // E. g. the first digit must be multiplied with 16^3, the second with 16^2... 
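+    // Illustrative trace (added comment, not in the original source): for the
+    // escaped input \u0041 the 'u' sets codePointMultiplier to 16*16*16 = 4096,
+    // and the digits 0,0,4,1 accumulate 0*4096 + 0*256 + 4*16 + 1*1 = 0x41,
+    // i.e. the sequence decodes to 'A'.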
+ int codePointMultiplier = 0; + + // Used to calculate the codepoint of the escaped unicode character + int codePoint = 0; + + for (int i = 0; i < input.length(); i++) { + char curChar = input.charAt(i); + if (codePointMultiplier > 0) { + codePoint += hexToInt(curChar) * codePointMultiplier; + codePointMultiplier >>>= 4; + if (codePointMultiplier == 0) { + output[length++] = (char)codePoint; + codePoint = 0; + } + } else if (lastCharWasEscapeChar) { + if (curChar == 'u') { + // found an escaped unicode character + codePointMultiplier = 16 * 16 * 16; + } else { + // this character was escaped + output[length] = curChar; + length++; + } + lastCharWasEscapeChar = false; + } else { + if (curChar == '\\') { + lastCharWasEscapeChar = true; + } else { + output[length] = curChar; + length++; + } + } + } + + if (codePointMultiplier > 0) { + throw new ParseException("Truncated unicode escape sequence."); + } + + if (lastCharWasEscapeChar) { + throw new ParseException("Term can not end with escape character."); + } + + return new String(output, 0, length); + } + + /** Returns the numeric value of the hexadecimal character */ + private static final int hexToInt(char c) throws ParseException { + if ('0' <= c && c <= '9') { + return c - '0'; + } else if ('a' <= c && c <= 'f'){ + return c - 'a' + 10; + } else if ('A' <= c && c <= 'F') { + return c - 'A' + 10; + } else { + throw new ParseException("None-hex character in unicode escape sequence: " + c); + } + } + + /** + * Returns a String where those characters that QueryParser + * expects to be escaped are escaped by a preceding \. + */ + public static String escape(String s) { + StringBuffer sb = new StringBuffer(); + for (int i = 0; i < s.length(); i++) { + char c = s.charAt(i); + // These characters are part of the query syntax and must be escaped + if (c == '\\' || c == '+' || c == '-' || c == '!' || c == '(' || c == ')' || c == ':' + || c == '^' || c == '[' || c == ']' || c == '\"' || c == '{' || c == '}' || c == '~' + || c == '*' || c == '?' || c == '|' || c == '&') { + sb.append('\\'); + } + sb.append(c); + } + return sb.toString(); + } + + /** + * Command line tool to test QueryParser, using {@link org.apache.lucene.analysis.SimpleAnalyzer}. + * Usage:
+ * java org.apache.lucene.queryParser.QueryParser <input> + */ + public static void main(String[] args) throws Exception { + if (args.length == 0) { + System.out.println("Usage: java org.apache.lucene.queryParser.QueryParser "); + System.exit(0); + } + QueryParser qp = new QueryParser(Version.LUCENE_CURRENT, "field", + new org.apache.lucene.analysis.SimpleAnalyzer()); + Query q = qp.parse(args[0]); + System.out.println(q.toString("field")); + } +} + +PARSER_END(QueryParser) + +/* ***************** */ +/* Token Definitions */ +/* ***************** */ + +<*> TOKEN : { + <#_NUM_CHAR: ["0"-"9"] > +// every character that follows a backslash is considered as an escaped character +| <#_ESCAPED_CHAR: "\\" ~[] > +| <#_TERM_START_CHAR: ( ~[ " ", "\t", "\n", "\r", "\u3000", "+", "-", "!", "(", ")", ":", "^", + "[", "]", "\"", "{", "}", "~", "*", "?", "\\" ] + | <_ESCAPED_CHAR> ) > +| <#_TERM_CHAR: ( <_TERM_START_CHAR> | <_ESCAPED_CHAR> | "-" | "+" ) > +| <#_WHITESPACE: ( " " | "\t" | "\n" | "\r" | "\u3000") > +| <#_QUOTED_CHAR: ( ~[ "\"", "\\" ] | <_ESCAPED_CHAR> ) > +} + + SKIP : { + < <_WHITESPACE>> +} + + TOKEN : { + +| +| +| +| +| +| +| +| +| : Boost +| )* "\""> +| )* "'"> +| (<_TERM_CHAR>)* > +| )+ ( "." (<_NUM_CHAR>)+ )? )? > +| (<_TERM_CHAR>)* "*" ) > +| | [ "*", "?" ]) (<_TERM_CHAR> | ( [ "*", "?" ] ))* > +| : RangeIn +| : RangeEx +} + + TOKEN : { +)+ ( "." (<_NUM_CHAR>)+ )? > : DEFAULT +} + + TOKEN : { + +| : DEFAULT +| +| +} + + TOKEN : { + +| : DEFAULT +| +| +} + +// * Query ::= ( Clause )* +// * Clause ::= ["+", "-"] [ ":"] ( | "(" Query ")" ) + +int Conjunction() : { + int ret = CONJ_NONE; +} +{ + [ + { ret = CONJ_AND; } + | { ret = CONJ_OR; } + ] + { return ret; } +} + +int Modifiers() : { + int ret = MOD_NONE; +} +{ + [ + { ret = MOD_REQ; } + | { ret = MOD_NOT; } + | { ret = MOD_NOT; } + ] + { return ret; } +} + +// This makes sure that there is no garbage after the query string +Query TopLevelQuery(String field) : +{ + Query q; +} +{ + q=Query(field) + { + return q; + } +} + +Query Query(String field) : +{ + List clauses = new ArrayList(); + Query q, firstQuery=null; + int conj, mods; +} +{ + mods=Modifiers() q=Clause(field) + { + addClause(clauses, CONJ_NONE, mods, q); + if (mods == MOD_NONE) + firstQuery=q; + } + ( + conj=Conjunction() mods=Modifiers() q=Clause(field) + { addClause(clauses, conj, mods, q); } + )* + { + if (clauses.size() == 1 && firstQuery != null) + return firstQuery; + else { + return getBooleanQuery(clauses); + } + } +} + +Query Clause(String field) : { + Query q; + Token fieldToken=null, boost=null; +} +{ + [ + LOOKAHEAD(2) + ( + fieldToken= {field=discardEscapeChar(fieldToken.image);} + | {field="*";} + ) + ] + + ( + q=Term(field) + | q=Query(field) ( boost=)? 
+ + ) + { + if (boost != null) { + float f = (float)1.0; + try { + f = Float.valueOf(boost.image).floatValue(); + q.setBoost(f); + } catch (Exception ignored) { } + } + return q; + } +} + + +Query Term(String field) : { + Token term, boost=null, fuzzySlop=null, goop1, goop2; + boolean prefix = false; + boolean wildcard = false; + boolean fuzzy = false; + Query q; +} +{ + ( + ( + term= + | term= { wildcard=true; } + | term= { prefix=true; } + | term= { wildcard=true; } + | term= + ) + [ fuzzySlop= { fuzzy=true; } ] + [ boost= [ fuzzySlop= { fuzzy=true; } ] ] + { + String termImage=discardEscapeChar(term.image); + if (wildcard) { + q = getWildcardQuery(field, termImage); + } else if (prefix) { + q = getPrefixQuery(field, + discardEscapeChar(term.image.substring + (0, term.image.length()-1))); + } else if (fuzzy) { + float fms = fuzzyMinSim; + try { + fms = Float.valueOf(fuzzySlop.image.substring(1)).floatValue(); + } catch (Exception ignored) { } + if(fms < 0.0f || fms > 1.0f){ + throw new ParseException("Minimum similarity for a FuzzyQuery has to be between 0.0f and 1.0f !"); + } + q = getFuzzyQuery(field, termImage,fms); + } else { + q = getFieldQuery(field, termImage); + } + } + | ( ( goop1=|goop1= ) + [ ] ( goop2=|goop2= ) + ) + [ boost= ] + { + if (goop1.kind == RANGEIN_QUOTED) { + goop1.image = goop1.image.substring(1, goop1.image.length()-1); + } + if (goop2.kind == RANGEIN_QUOTED) { + goop2.image = goop2.image.substring(1, goop2.image.length()-1); + } + q = getRangeQuery(field, discardEscapeChar(goop1.image), discardEscapeChar(goop2.image), true); + } + | ( ( goop1=|goop1= ) + [ ] ( goop2=|goop2= ) + ) + [ boost= ] + { + if (goop1.kind == RANGEEX_QUOTED) { + goop1.image = goop1.image.substring(1, goop1.image.length()-1); + } + if (goop2.kind == RANGEEX_QUOTED) { + goop2.image = goop2.image.substring(1, goop2.image.length()-1); + } + + q = getRangeQuery(field, discardEscapeChar(goop1.image), discardEscapeChar(goop2.image), false); + } + | term= + [ fuzzySlop= ] + [ boost= ] + { + int s = phraseSlop; + + if (fuzzySlop != null) { + try { + s = Float.valueOf(fuzzySlop.image.substring(1)).intValue(); + } + catch (Exception ignored) { } + } + q = getFieldQuery(field, discardEscapeChar(term.image.substring(1, term.image.length()-1)), s); + } + | term= + [ fuzzySlop= ] + [ boost= ] + { + int s = phraseSlop; + + if (fuzzySlop != null) { + try { + s = Float.valueOf(fuzzySlop.image.substring(1)).intValue(); + } + catch (Exception ignored) { } + } + q = getFieldQuery(field, discardEscapeChar(term.image.substring(0, term.image.length())), s); + } + ) + { + if (boost != null) { + float f = (float) 1.0; + try { + f = Float.valueOf(boost.image).floatValue(); + } + catch (Exception ignored) { + /* Should this be handled somehow? (defaults to "no boost", if + * boost number is invalid) + */ + } + + // avoid boosting null queries, such as those caused by stop words + if (q != null) { + q.setBoost(f); + } + } + return q; + } +} diff --git a/src/java/org/apache/lucene/queryParser/Token.java b/src/java/org/apache/lucene/queryParser/Token.java new file mode 100644 index 000000000..2bac4905a --- /dev/null +++ b/src/java/org/apache/lucene/queryParser/Token.java @@ -0,0 +1,131 @@ +/* Generated By:JavaCC: Do not edit this line. Token.java Version 5.0 */ +/* JavaCCOptions:TOKEN_EXTENDS=,KEEP_LINE_COL=null,SUPPORT_CLASS_VISIBILITY_PUBLIC=true */ +package org.apache.lucene.queryParser; + +/** + * Describes the input token stream. 
+ */ + +public class Token implements java.io.Serializable { + + /** + * The version identifier for this Serializable class. + * Increment only if the serialized form of the + * class changes. + */ + private static final long serialVersionUID = 1L; + + /** + * An integer that describes the kind of this token. This numbering + * system is determined by JavaCCParser, and a table of these numbers is + * stored in the file ...Constants.java. + */ + public int kind; + + /** The line number of the first character of this Token. */ + public int beginLine; + /** The column number of the first character of this Token. */ + public int beginColumn; + /** The line number of the last character of this Token. */ + public int endLine; + /** The column number of the last character of this Token. */ + public int endColumn; + + /** + * The string image of the token. + */ + public String image; + + /** + * A reference to the next regular (non-special) token from the input + * stream. If this is the last token from the input stream, or if the + * token manager has not read tokens beyond this one, this field is + * set to null. This is true only if this token is also a regular + * token. Otherwise, see below for a description of the contents of + * this field. + */ + public Token next; + + /** + * This field is used to access special tokens that occur prior to this + * token, but after the immediately preceding regular (non-special) token. + * If there are no such special tokens, this field is set to null. + * When there are more than one such special token, this field refers + * to the last of these special tokens, which in turn refers to the next + * previous special token through its specialToken field, and so on + * until the first special token (whose specialToken field is null). + * The next fields of special tokens refer to other special tokens that + * immediately follow it (without an intervening regular token). If there + * is no such token, this field is null. + */ + public Token specialToken; + + /** + * An optional attribute value of the Token. + * Tokens which are not used as syntactic sugar will often contain + * meaningful values that will be used later on by the compiler or + * interpreter. This attribute value is often different from the image. + * Any subclass of Token that actually wants to return a non-null value can + * override this method as appropriate. + */ + public Object getValue() { + return null; + } + + /** + * No-argument constructor + */ + public Token() {} + + /** + * Constructs a new token for the specified Image. + */ + public Token(int kind) + { + this(kind, null); + } + + /** + * Constructs a new token for the specified Image and Kind. + */ + public Token(int kind, String image) + { + this.kind = kind; + this.image = image; + } + + /** + * Returns the image. + */ + public String toString() + { + return image; + } + + /** + * Returns a new Token object, by default. However, if you want, you + * can create and return subclass objects based on the value of ofKind. + * Simply add the cases to the switch for all those special cases. + * For example, if you have a subclass of Token called IDToken that + * you want to create if ofKind is ID, simply add something like : + * + * case MyParserConstants.ID : return new IDToken(ofKind, image); + * + * to the following switch statement. Then you can cast matchedToken + * variable to the appropriate type and use sit in your lexical actions. 
+ */ + public static Token newToken(int ofKind, String image) + { + switch(ofKind) + { + default : return new Token(ofKind, image); + } + } + + public static Token newToken(int ofKind) + { + return newToken(ofKind, null); + } + +} +/* JavaCC - OriginalChecksum=da95d0ec7daad286fab4e748b17294d8 (do not edit this line) */ diff --git a/src/java/org/apache/lucene/queryParser/TokenMgrError.java b/src/java/org/apache/lucene/queryParser/TokenMgrError.java new file mode 100644 index 000000000..6b2243ab1 --- /dev/null +++ b/src/java/org/apache/lucene/queryParser/TokenMgrError.java @@ -0,0 +1,147 @@ +/* Generated By:JavaCC: Do not edit this line. TokenMgrError.java Version 5.0 */ +/* JavaCCOptions: */ +package org.apache.lucene.queryParser; + +/** Token Manager Error. */ +public class TokenMgrError extends Error +{ + + /** + * The version identifier for this Serializable class. + * Increment only if the serialized form of the + * class changes. + */ + private static final long serialVersionUID = 1L; + + /* + * Ordinals for various reasons why an Error of this type can be thrown. + */ + + /** + * Lexical error occurred. + */ + static final int LEXICAL_ERROR = 0; + + /** + * An attempt was made to create a second instance of a static token manager. + */ + static final int STATIC_LEXER_ERROR = 1; + + /** + * Tried to change to an invalid lexical state. + */ + static final int INVALID_LEXICAL_STATE = 2; + + /** + * Detected (and bailed out of) an infinite loop in the token manager. + */ + static final int LOOP_DETECTED = 3; + + /** + * Indicates the reason why the exception is thrown. It will have + * one of the above 4 values. + */ + int errorCode; + + /** + * Replaces unprintable characters by their escaped (or unicode escaped) + * equivalents in the given string + */ + protected static final String addEscapes(String str) { + StringBuffer retval = new StringBuffer(); + char ch; + for (int i = 0; i < str.length(); i++) { + switch (str.charAt(i)) + { + case 0 : + continue; + case '\b': + retval.append("\\b"); + continue; + case '\t': + retval.append("\\t"); + continue; + case '\n': + retval.append("\\n"); + continue; + case '\f': + retval.append("\\f"); + continue; + case '\r': + retval.append("\\r"); + continue; + case '\"': + retval.append("\\\""); + continue; + case '\'': + retval.append("\\\'"); + continue; + case '\\': + retval.append("\\\\"); + continue; + default: + if ((ch = str.charAt(i)) < 0x20 || ch > 0x7e) { + String s = "0000" + Integer.toString(ch, 16); + retval.append("\\u" + s.substring(s.length() - 4, s.length())); + } else { + retval.append(ch); + } + continue; + } + } + return retval.toString(); + } + + /** + * Returns a detailed message for the Error when it is thrown by the + * token manager to indicate a lexical error. + * Parameters : + * EOFSeen : indicates if EOF caused the lexical error + * curLexState : lexical state in which this error occurred + * errorLine : line number when the error occurred + * errorColumn : column number when the error occurred + * errorAfter : prefix that was seen before this error occurred + * curchar : the offending character + * Note: You can customize the lexical error message by modifying this method. + */ + protected static String LexicalError(boolean EOFSeen, int lexState, int errorLine, int errorColumn, String errorAfter, char curChar) { + return("Lexical error at line " + + errorLine + ", column " + + errorColumn + ". Encountered: " + + (EOFSeen ? 
" " : ("\"" + addEscapes(String.valueOf(curChar)) + "\"") + " (" + (int)curChar + "), ") + + "after : \"" + addEscapes(errorAfter) + "\""); + } + + /** + * You can also modify the body of this method to customize your error messages. + * For example, cases like LOOP_DETECTED and INVALID_LEXICAL_STATE are not + * of end-users concern, so you can return something like : + * + * "Internal Error : Please file a bug report .... " + * + * from this method for such cases in the release version of your parser. + */ + public String getMessage() { + return super.getMessage(); + } + + /* + * Constructors of various flavors follow. + */ + + /** No arg constructor. */ + public TokenMgrError() { + } + + /** Constructor with message and reason. */ + public TokenMgrError(String message, int reason) { + super(message); + errorCode = reason; + } + + /** Full Constructor. */ + public TokenMgrError(boolean EOFSeen, int lexState, int errorLine, int errorColumn, String errorAfter, char curChar, int reason) { + this(LexicalError(EOFSeen, lexState, errorLine, errorColumn, errorAfter, curChar), reason); + } +} +/* JavaCC - OriginalChecksum=03df10dce345f1870429faa756473d14 (do not edit this line) */ diff --git a/src/java/org/apache/solr/handler/InvenioHandler.java b/src/java/org/apache/solr/handler/InvenioHandler.java new file mode 100644 index 000000000..27db6b71f --- /dev/null +++ b/src/java/org/apache/solr/handler/InvenioHandler.java @@ -0,0 +1,47 @@ +package org.apache.solr.handler; + +import java.util.HashMap; +import java.util.Map; + +import org.apache.solr.common.params.CommonParams; +import org.apache.solr.common.params.SolrParams; +import org.apache.solr.handler.component.SearchHandler; +import org.apache.solr.request.SolrQueryRequest; +import org.apache.solr.request.SolrQueryResponse; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import org.apache.solr.util.WebUtils; + + + +public class InvenioHandler extends SearchHandler { + + public static final Logger log = LoggerFactory + .getLogger(InvenioHandler.class); + + + public void handleRequestBody(SolrQueryRequest req, SolrQueryResponse rsp) + throws Exception { + SolrParams params = req.getParams(); + String q = params.get(CommonParams.Q); + + // get the invenio parameters and set them into the request + String invParams = params.get("inv.params"); + Map qs = null; + if (invParams != null) { + qs = WebUtils.parseQueryString(invParams); + } + else { + log.warn("Received no parameters from Invenio (inv.params)"); + qs = new HashMap(); + } + Map context = req.getContext(); + context.put("inv.params", qs); + + + super.handleRequestBody(req, rsp); + } + + +} diff --git a/src/java/org/apache/solr/handler/PythonDiagnosticHandler.java b/src/java/org/apache/solr/handler/PythonDiagnosticHandler.java new file mode 100644 index 000000000..67fa18d0b --- /dev/null +++ b/src/java/org/apache/solr/handler/PythonDiagnosticHandler.java @@ -0,0 +1,156 @@ +package org.apache.solr.handler; + +import invenio.montysolr.jni.PythonMessage; +import invenio.montysolr.jni.MontySolrVM; + +import java.io.ByteArrayOutputStream; +import java.io.IOException; +import java.io.PrintStream; +import java.util.Map; + +import org.apache.solr.common.params.CommonParams; +import org.apache.solr.common.params.ModifiableSolrParams; +import org.apache.solr.common.params.SolrParams; +import org.apache.solr.common.util.NamedList; +import org.apache.solr.core.SolrCore; +import org.apache.solr.handler.component.SearchHandler; +import org.apache.solr.request.SolrQueryRequest; +import 
org.apache.solr.request.SolrQueryResponse; +import org.apache.solr.request.SolrRequestHandler; +import org.apache.solr.search.DocSlice; +import org.apache.solr.util.DictionaryCache; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + + + + +public class PythonDiagnosticHandler extends SearchHandler { + + public static final Logger log = LoggerFactory + .getLogger(PythonDiagnosticHandler.class); + + + public void handleRequestBody(SolrQueryRequest req, SolrQueryResponse rsp) + throws Exception { + SolrParams params = req.getParams(); + String q = params.get(CommonParams.Q); + + log.info("======= start diagnostics ======="); + + PythonMessage message = MontySolrVM.INSTANCE + .createMessage("diagnostic_test") + .setParam("query", q); + + try { + MontySolrVM.INSTANCE.sendMessage(message); + } catch (InterruptedException e) { + e.printStackTrace(); + throw new IOException("Error searching Invenio!"); + } + + Object result = message.getResults(); + if (result != null) { + String res = (String) result; + rsp.add("diagnostic_message", res); + log.info("Diagnostic message: \n" + res); + } + else { + log.info("Diagnostic message: null"); + } + + // run invenio querys + String[] queries = {"boson", "title:boson", "inv_title:boson", "a*", "{!iq iq.mode=maxinv}title:boson", "year:1->99999999"}; + SolrCore core = req.getCore(); + SolrRequestHandler handler = core.getRequestHandler( "/invenio" ); + String qu = null; + Object pyresult = null; + int[] recids = null; + String r1 = null; + String r2 = null; + + Map recidToDocid = DictionaryCache.INSTANCE.getTranslationCache(req.getSearcher().getReader(), + req.getSchema().getUniqueKeyField().getName()); + + for (int i=0; i 0) { + rinfo += "[0]=" + lCache[0] + ", [" + d + "]=" + lCache[d] + ", [" + (lCache.length-1) + "]=" + lCache[lCache.length-1]; + } + else { + rinfo += " the cache is empty. 
You should visit /invenio_update"; + } + rsp.add("recids", rinfo); + log.info(rinfo); + log.info("======== end diagnostics ========"); + + } + + +} diff --git a/src/java/org/apache/solr/handler/component/InvenioFormatter.java b/src/java/org/apache/solr/handler/component/InvenioFormatter.java new file mode 100644 index 000000000..56dde168a --- /dev/null +++ b/src/java/org/apache/solr/handler/component/InvenioFormatter.java @@ -0,0 +1,176 @@ +package org.apache.solr.handler.component; + +import invenio.montysolr.jni.PythonMessage; +import invenio.montysolr.jni.MontySolrVM; + +import java.io.IOException; +import java.util.ArrayList; +import java.util.HashMap; +import java.util.List; +import java.util.Map; + +import org.apache.lucene.search.Query; +import org.apache.solr.common.params.ModifiableSolrParams; +import org.apache.solr.common.params.SolrParams; +import org.apache.solr.common.util.NamedList; +import org.apache.solr.request.SolrQueryRequest; +import org.apache.solr.request.SolrQueryResponse; +import org.apache.solr.search.DocIterator; +import org.apache.solr.search.DocList; +import org.apache.solr.search.DocListAndSet; +import org.apache.solr.search.DocSlice; +import org.apache.solr.search.SolrIndexReader; +import org.apache.solr.search.SolrIndexSearcher.QueryCommand; +import org.apache.solr.search.SortSpec; +import org.apache.solr.util.DictionaryCache; + + +public class InvenioFormatter extends SearchComponent +{ + public static final String COMPONENT_NAME = "invenio-formatter"; + private boolean activated = false; + private Map invParams = null; + + @Override + public void prepare(ResponseBuilder rb) throws IOException { + activated = false; + SolrParams params = rb.req.getParams(); + Map context = rb.req.getContext(); + + if (context.containsKey("inv.params")) { + invParams = (Map) context.get("inv.params"); + if (invParams.containsKey("of")) { + String of = invParams.get("of"); + if (of.equals("hcs")) { // citation summary + ModifiableSolrParams rawParams = new ModifiableSolrParams(rb.req.getParams()); + Integer old_limit = params.getInt("rows", 10); + int max_len = params.getInt("inv.rows", 25000); + rawParams.set("rows", max_len); + rawParams.set("old_rows", old_limit); + rb.req.setParams(rawParams); + SortSpec sortSpec = rb.getSortSpec(); + SortSpec nss = new SortSpec(sortSpec.getSort(), sortSpec.getOffset(), max_len); + rb.setSortSpec(nss); + activated = true; + } + else if(invParams.containsKey("rm") && ((String)invParams.get("rm")).length() > 0) { + activated = true; + } + else if(invParams.containsKey("sf") && ((String)invParams.get("sf")).length() > 0) { + activated = true; + } + } + } + + } + + @Override + public void process(ResponseBuilder rb) throws IOException { + + if ( activated ) { + SolrParams params = rb.req.getParams(); + Integer original_limit = params.getInt("old_rows", params.getInt("rows")); + + SolrQueryRequest req = rb.req; + SolrQueryResponse rsp = rb.rsp; + + DocListAndSet results = rb.getResults(); + DocList dl = results.docList; + + if (dl.size() < 1) { + return; + } + + int[] recids = new int[dl.size()]; + DocIterator it = dl.iterator(); + + SolrIndexReader reader = rb.req.getSearcher().getReader(); + int[] docidMap = DictionaryCache.INSTANCE.getLuceneCache(reader, "id"); + + // translate into Invenio ID's + for (int i=0;it.hasNext();i++) { + recids[i] = docidMap[it.next()]; + } + + PythonMessage message = MontySolrVM.INSTANCE.createMessage("sort_and_format") + .setSender("InvenioFormatter") + .setSolrQueryRequest(req) + .setSolrQueryResponse(rsp) + 
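+ // descriptive note: "recids" holds Invenio record ids translated from
+ // Lucene docids above; "kwargs" carries the raw Invenio request
+ // parameters that InvenioHandler stored in the request context
+ // under "inv.params"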
.setParam("recids", recids) + .setParam("kwargs", invParams); + + try { + MontySolrVM.INSTANCE.sendMessage(message); + } catch (InterruptedException e) { + // TODO Auto-generated catch block + e.printStackTrace(); + } + + Object result = message.getResults(); + String t = (String) message.getParam("rtype"); + if (result != null && t.contains("str")) { + rb.rsp.add("inv_response", (String)result); + + // truncate the number of retrieved documents back into reasonable size + int[] luceneIds = new int[original_limit>dl.size() ? dl.size() : original_limit]; + it = dl.iterator(); + for (int i=0;i recidToDocid = DictionaryCache.INSTANCE + .getTranslationCache(reader, rb.req.getSchema().getUniqueKeyField().getName()); + int[] recs = (int[]) result; + + // truncate the number of retrieved documents back into reasonable size + int[] luceneIds = new int[original_limit>recs.length ? recs.length : original_limit]; + for (int i=0;i ds, Map session) { + super(dataConfig, core, ds, session); + } + + + public boolean isBusy() { + return importLock.isLocked(); + } + + public void doFullImport(SolrWriter writer, RequestParams requestParams) { + LOG.info("Starting Full Import"); + setStatus(Status.RUNNING_FULL_DUMP); + + setIndexStartTime(new Date()); + + try { + docBuilder = new DocBuilder(this, writer, requestParams); + docBuilder.execute(); + if (!requestParams.debug) + cumulativeStatistics.add(docBuilder.importStatistics); + } catch (Throwable t) { + LOG.error("Full Import failed", t); + //docBuilder.rollback(); + } finally { + setStatus(Status.IDLE); + super.getConfig().clearCaches(); + DocBuilder.INSTANCE.set(null); + } + + } + + public void doDeltaImport(SolrWriter writer, RequestParams requestParams) { + LOG.info("Starting Delta Import"); + setStatus(Status.RUNNING_DELTA_DUMP); + + try { + setIndexStartTime(new Date()); + docBuilder = new DocBuilder(this, writer, requestParams); + docBuilder.execute(); + if (!requestParams.debug) + cumulativeStatistics.add(docBuilder.importStatistics); + } catch (Throwable t) { + LOG.error("Delta Import Failed", t); + //docBuilder.rollback(); + } finally { + setStatus(Status.IDLE); + super.getConfig().clearCaches(); + DocBuilder.INSTANCE.set(null); + } + + } + + void runCmd(RequestParams reqParams, SolrWriter sw) { + String command = reqParams.command; + if (command.equals(ABORT_CMD)) { + if (docBuilder != null) { + docBuilder.abort(); + } + return; + } + if (!importLock.tryLock()){ + LOG.warn("Import command failed . another import is running"); + return; + } + try { + if (FULL_IMPORT_CMD.equals(command) || IMPORT_CMD.equals(command)) { + doFullImport(sw, reqParams); + } else if (command.equals(DELTA_IMPORT_CMD)) { + doDeltaImport(sw, reqParams); + } + } finally { + importLock.unlock(); + } + } + + +} diff --git a/src/java/org/apache/solr/handler/dataimport/WaitingDataImportHandler.java b/src/java/org/apache/solr/handler/dataimport/WaitingDataImportHandler.java new file mode 100644 index 000000000..d40e4ba81 --- /dev/null +++ b/src/java/org/apache/solr/handler/dataimport/WaitingDataImportHandler.java @@ -0,0 +1,374 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. 
You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.solr.handler.dataimport; + +import static org.apache.solr.handler.dataimport.DataImporter.IMPORT_CMD; +import org.apache.solr.common.SolrException; +import org.apache.solr.common.SolrInputDocument; +import org.apache.solr.common.params.CommonParams; +import org.apache.solr.common.params.ModifiableSolrParams; +import org.apache.solr.common.params.SolrParams; +import org.apache.solr.common.params.UpdateParams; +import org.apache.solr.common.util.ContentStreamBase; +import org.apache.solr.common.util.NamedList; +import org.apache.solr.common.util.ContentStream; +import org.apache.solr.core.SolrConfig; +import org.apache.solr.core.SolrCore; +import org.apache.solr.core.SolrResourceLoader; +import org.apache.solr.handler.RequestHandlerBase; +import org.apache.solr.handler.RequestHandlerUtils; +import org.apache.solr.request.RawResponseWriter; +import org.apache.solr.request.SolrQueryRequest; +import org.apache.solr.request.SolrQueryResponse; +import org.apache.solr.request.SolrRequestHandler; +import org.apache.solr.update.processor.UpdateRequestProcessor; +import org.apache.solr.update.processor.UpdateRequestProcessorChain; +import org.apache.solr.util.plugin.SolrCoreAware; + +import java.util.*; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +/** + *
+ * Solr Request Handler for data import from databases and REST data sources.
+ * It is configured in solrconfig.xml.
+ *
+ * Refer to http://wiki.apache.org/solr/DataImportHandler
+ * for more details.
+ *
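+ * A minimal registration sketch for solrconfig.xml (the handler name and the
+ * config file name below are illustrative, not prescribed by this class):
+ *
+ *   <requestHandler name="/waitingdataimport"
+ *       class="org.apache.solr.handler.dataimport.WaitingDataImportHandler">
+ *     <lst name="defaults">
+ *       <str name="config">data-config.xml</str>
+ *     </lst>
+ *   </requestHandler>
+ *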
+ * This API is experimental and subject to change + * + * @version $Id: DataImportHandler.java 788580 2009-06-26 05:20:23Z noble $ + * @since solr 1.3 + * + * NOTE: this is a slightly modified DataImportHandler that waits until the importer stops to be + * busy. + * + */ +public class WaitingDataImportHandler extends RequestHandlerBase implements + SolrCoreAware { + + private static final Logger LOG = LoggerFactory.getLogger(DataImportHandler.class); + + private DataImporter importer; + + private Map dataSources = new HashMap(); + + private List debugDocuments; + + private boolean debugEnabled = true; + + private String myName = "dataimport"; + + private Map coreScopeSession = new HashMap(); + + @Override + @SuppressWarnings("unchecked") + public void init(NamedList args) { + super.init(args); + } + + @SuppressWarnings("unchecked") + public void inform(SolrCore core) { + try { + //hack to get the name of this handler + for (Map.Entry e : core.getRequestHandlers().entrySet()) { + SolrRequestHandler handler = e.getValue(); + //this will not work if startup=lazy is set + if( this == handler) { + String name= e.getKey(); + if(name.startsWith("/")){ + myName = name.substring(1); + } + // some users may have '/' in the handler name. replace with '_' + myName = myName.replaceAll("/","_") ; + } + } + String debug = (String) initArgs.get(ENABLE_DEBUG); + if (debug != null && "no".equals(debug)) + debugEnabled = false; + NamedList defaults = (NamedList) initArgs.get("defaults"); + if (defaults != null) { + String configLoc = (String) defaults.get("config"); + if (configLoc != null && configLoc.length() != 0) { + processConfiguration(defaults); + + importer = new NoRollbackDataImporter(SolrWriter.getResourceAsString(core + .getResourceLoader().openResource(configLoc)), core, + dataSources, coreScopeSession); + } + } + } catch (Throwable e) { + SolrConfig.severeErrors.add(e); + LOG.error( DataImporter.MSG.LOAD_EXP, e); + throw new SolrException(SolrException.ErrorCode.SERVER_ERROR, + DataImporter.MSG.INVALID_CONFIG, e); + } + } + + @Override + @SuppressWarnings("unchecked") + public void handleRequestBody(SolrQueryRequest req, SolrQueryResponse rsp) + throws Exception { + rsp.setHttpCaching(false); + SolrParams params = req.getParams(); + DataImporter.RequestParams requestParams = new DataImporter.RequestParams(getParamsMap(params)); + String command = requestParams.command; + Iterable streams = req.getContentStreams(); + if(streams != null){ + for (ContentStream stream : streams) { + requestParams.contentStream = stream; + break; + } + } + if (DataImporter.SHOW_CONF_CMD.equals(command)) { + // Modify incoming request params to add wt=raw + ModifiableSolrParams rawParams = new ModifiableSolrParams(req.getParams()); + rawParams.set(CommonParams.WT, "raw"); + req.setParams(rawParams); + String dataConfigFile = defaults.get("config"); + ContentStreamBase content = new ContentStreamBase.StringStream(SolrWriter + .getResourceAsString(req.getCore().getResourceLoader().openResource( + dataConfigFile))); + rsp.add(RawResponseWriter.CONTENT, content); + return; + } + + rsp.add("initArgs", initArgs); + String message = ""; + + if (command != null) + rsp.add("command", command); + + if (requestParams.debug && (importer == null || !importer.isBusy())) { + // Reload the data-config.xml + importer = null; + if (requestParams.dataConfig != null) { + try { + processConfiguration((NamedList) initArgs.get("defaults")); + importer = new DataImporter(requestParams.dataConfig, req.getCore() + , dataSources, 
coreScopeSession); + } catch (RuntimeException e) { + rsp.add("exception", DebugLogger.getStacktraceString(e)); + importer = null; + return; + } + } else { + inform(req.getCore()); + } + message = DataImporter.MSG.CONFIG_RELOADED; + } + + // If importer is still null + if (importer == null) { + rsp.add("status", DataImporter.MSG.NO_INIT); + return; + } + + if (command != null && DataImporter.ABORT_CMD.equals(command)) { + importer.runCmd(requestParams, null); + } + else { + if (importer.isBusy()) { + while(true) { + Thread.sleep(30); + if (!importer.isBusy()) { + break; + } + } + } + + if (command != null) { + if (DataImporter.FULL_IMPORT_CMD.equals(command) + || DataImporter.DELTA_IMPORT_CMD.equals(command) || + IMPORT_CMD.equals(command)) { + + UpdateRequestProcessorChain processorChain = + req.getCore().getUpdateProcessingChain(params.get(UpdateParams.UPDATE_PROCESSOR)); + UpdateRequestProcessor processor = processorChain.createProcessor(req, rsp); + SolrResourceLoader loader = req.getCore().getResourceLoader(); + SolrWriter sw = getSolrWriter(processor, loader, requestParams); + + if (requestParams.debug) { + if (debugEnabled) { + // Synchronous request for the debug mode + importer.runCmd(requestParams, sw); + rsp.add("mode", "debug"); + rsp.add("documents", debugDocuments); + if (sw.debugLogger != null) + rsp.add("verbose-output", sw.debugLogger.output); + debugDocuments = null; + } else { + message = DataImporter.MSG.DEBUG_NOT_ENABLED; + } + } else { + // Asynchronous request for normal mode + if(requestParams.contentStream == null){ + importer.runAsync(requestParams, sw); + } else { + importer.runCmd(requestParams, sw); + } + } + } else if (DataImporter.RELOAD_CONF_CMD.equals(command)) { + importer = null; + inform(req.getCore()); + message = DataImporter.MSG.CONFIG_RELOADED; + } + } + } + rsp.add("status", importer.isBusy() ? 
"busy" : "idle"); + rsp.add("importResponse", message); + rsp.add("statusMessages", importer.getStatusMessages()); + + RequestHandlerUtils.addExperimentalFormatWarning(rsp); + } + + private Map getParamsMap(SolrParams params) { + Iterator names = params.getParameterNamesIterator(); + Map result = new HashMap(); + while (names.hasNext()) { + String s = names.next(); + String[] val = params.getParams(s); + if (val == null || val.length < 1) + continue; + if (val.length == 1) + result.put(s, val[0]); + else + result.put(s, Arrays.asList(val)); + } + return result; + } + + @SuppressWarnings("unchecked") + private void processConfiguration(NamedList defaults) { + if (defaults == null) { + LOG.info("No configuration specified in solrconfig.xml for DataImportHandler"); + return; + } + + LOG.info("Processing configuration from solrconfig.xml: " + defaults); + + dataSources = new HashMap(); + + int position = 0; + + while (position < defaults.size()) { + if (defaults.getName(position) == null) + break; + + String name = defaults.getName(position); + if (name.equals("datasource")) { + NamedList dsConfig = (NamedList) defaults.getVal(position); + Properties props = new Properties(); + for (int i = 0; i < dsConfig.size(); i++) + props.put(dsConfig.getName(i), dsConfig.getVal(i)); + LOG.info("Adding properties to datasource: " + props); + dataSources.put((String) dsConfig.get("name"), props); + } + position++; + } + } + + private SolrWriter getSolrWriter(final UpdateRequestProcessor processor, + final SolrResourceLoader loader, final DataImporter.RequestParams requestParams) { + + return new SolrWriter(processor, loader.getConfigDir(), myName) { + + @Override + public boolean upload(SolrInputDocument document) { + try { + if (requestParams.debug) { + if (debugDocuments == null) + debugDocuments = new ArrayList(); + debugDocuments.add(document); + } + return super.upload(document); + } catch (RuntimeException e) { + LOG.error( "Exception while adding: " + document, e); + return false; + } + } + }; + } + + @Override + @SuppressWarnings("unchecked") + public NamedList getStatistics() { + if (importer == null) + return super.getStatistics(); + + DocBuilder.Statistics cumulative = importer.cumulativeStatistics; + NamedList result = new NamedList(); + + result.add("Status", importer.getStatus().toString()); + + if (importer.docBuilder != null) { + DocBuilder.Statistics running = importer.docBuilder.importStatistics; + result.add("Documents Processed", running.docCount); + result.add("Requests made to DataSource", running.queryCount); + result.add("Rows Fetched", running.rowsCount); + result.add("Documents Deleted", running.deletedDocCount); + result.add("Documents Skipped", running.skipDocCount); + } + + result.add(DataImporter.MSG.TOTAL_DOC_PROCESSED, cumulative.docCount); + result.add(DataImporter.MSG.TOTAL_QUERIES_EXECUTED, cumulative.queryCount); + result.add(DataImporter.MSG.TOTAL_ROWS_EXECUTED, cumulative.rowsCount); + result.add(DataImporter.MSG.TOTAL_DOCS_DELETED, cumulative.deletedDocCount); + result.add(DataImporter.MSG.TOTAL_DOCS_SKIPPED, cumulative.skipDocCount); + + NamedList requestStatistics = super.getStatistics(); + if (requestStatistics != null) { + for (int i = 0; i < requestStatistics.size(); i++) { + result.add(requestStatistics.getName(i), requestStatistics.getVal(i)); + } + } + + return result; + } + + // //////////////////////SolrInfoMBeans methods ////////////////////// + + @Override + public String getDescription() { + return DataImporter.MSG.JMX_DESC; + } + + @Override + public 
String getSourceId() { + return "$Id: DataImportHandler.java 788580 2009-06-26 05:20:23Z noble $"; + } + + @Override + public String getVersion() { + return "1.0"; + } + + @Override + public String getSource() { + return "$URL: http://svn.apache.org/repos/asf/lucene/solr/tags/release-1.4.1/contrib/dataimporthandler/src/main/java/org/apache/solr/handler/dataimport/DataImportHandler.java $"; + } + + public static final String ENABLE_DEBUG = "enableDebug"; +} diff --git a/src/java/org/apache/solr/schema/FileResolverTextField.java b/src/java/org/apache/solr/schema/FileResolverTextField.java new file mode 100644 index 000000000..9fa88ac33 --- /dev/null +++ b/src/java/org/apache/solr/schema/FileResolverTextField.java @@ -0,0 +1,111 @@ + + +package org.apache.solr.schema; + +import org.apache.lucene.search.SortField; +import org.apache.lucene.document.Field; +import org.apache.lucene.document.Fieldable; +import org.apache.solr.request.XMLWriter; +import org.apache.solr.request.TextResponseWriter; + +import java.util.HashMap; +import java.util.Map; +import java.util.Scanner; +import java.io.File; +import java.io.FileInputStream; +import java.io.FileNotFoundException; +import java.io.IOException; + +/** TextField is the basic type for configurable text analysis. + * Analyzers for field types using this implementation should be defined in the schema. + * @version $Id: TextField.java 764291 2009-04-12 11:03:09Z shalin $ + */ +public class FileResolverTextField extends CompressableField { + protected void init(IndexSchema schema, Map args) { + properties |= TOKENIZED; + if (schema.getVersion()> 1.1f) properties &= ~OMIT_TF_POSITIONS; + + super.init(schema, args); + } + + public SortField getSortField(SchemaField field, boolean reverse) { + return getStringSort(field, reverse); + } + + public void write(XMLWriter xmlWriter, String name, Fieldable f) throws IOException { + xmlWriter.writeStr(name, f.stringValue()); + } + + public void write(TextResponseWriter writer, String name, Fieldable f) throws IOException { + writer.writeStr(name, f.stringValue(), true); + } + + public Field createField(SchemaField field, String externalVal, float boost) { + //System.out.println(externalVal); + //String val = externalVal.toLowerCase() + " Hey!"; //null; + String val = null; + String[] vals = externalVal.split("\\|"); + Map values = new HashMap(); + for (String v: vals) { + if (v.indexOf(':') > 0) { + String[] parts = v.split(":", 0); + String p = parts[1]; + if (p.startsWith("[") && p.endsWith("]")) { + p = p.substring(1, p.length()-1); + } + values.put(parts[0], p); + } + } + if (values.containsKey("src_dir") && values.containsKey("arxiv_id")) { + String[] dirs = values.get("src_dir").split(","); + + String arx = values.get("arxiv_id"); + String fname = null; + String topdir = null; + + if (arx.indexOf('/') > -1) { + String[] arx_parts = arx.split("/", 0); //hep-th/0002162 + topdir = arx_parts[1].substring(0, 4); + fname = arx_parts[0] + arx_parts[1]; + } + else if(arx.indexOf(':') > -1) { + String[] arx_parts = arx.replace("arXiv:", "").split("\\.", 0); //arXiv:0712.0712 + topdir = arx_parts[0]; + fname = arx_parts[0] + '.' 
+ arx_parts[1]; + } + + if (fname != null) { + File f = null; + for (String d: dirs) { + String s = d + "/" + topdir + "/" + fname; + f = new File(s + ".txt"); + if (f.exists()) { + StringBuilder text = new StringBuilder(); + String fEncoding = "UTF-8"; + String NL = System.getProperty("line.separator"); + Scanner scanner; + try { + scanner = new Scanner(new FileInputStream(f), fEncoding ); + try { + while (scanner.hasNextLine()){ + text.append(scanner.nextLine() + NL); + } + } + finally{ + scanner.close(); + val = text.toString(); + } + } catch (FileNotFoundException e) { + // TODO Auto-generated catch block + e.printStackTrace(); + } + + break; + } + } + } + }// value has src_dir and arxiv_id + + return super.createField(field, val, boost); + } +} diff --git a/src/java/org/apache/solr/schema/PythonTextField.java b/src/java/org/apache/solr/schema/PythonTextField.java new file mode 100644 index 000000000..10d00495a --- /dev/null +++ b/src/java/org/apache/solr/schema/PythonTextField.java @@ -0,0 +1,72 @@ + + +package org.apache.solr.schema; + +import org.apache.lucene.search.SortField; +import org.apache.lucene.document.Field; +import org.apache.lucene.document.Fieldable; +import org.apache.solr.request.XMLWriter; +import org.apache.solr.request.TextResponseWriter; + + +import invenio.montysolr.jni.PythonBridge; +import invenio.montysolr.jni.PythonMessage; +import invenio.montysolr.jni.MontySolrVM; + +import java.util.HashMap; +import java.util.Map; +import java.util.Scanner; +import java.io.File; +import java.io.FileInputStream; +import java.io.FileNotFoundException; +import java.io.IOException; + +/** TextField is the basic type for configurable text analysis. + * Analyzers for field types using this implementation should be defined in the schema. 
+ * @version $Id: TextField.java 764291 2009-04-12 11:03:09Z shalin $ + */ +public class PythonTextField extends CompressableField { + protected void init(IndexSchema schema, Map args) { + properties |= TOKENIZED; + if (schema.getVersion()> 1.1f) properties &= ~OMIT_TF_POSITIONS; + + super.init(schema, args); + } + + public SortField getSortField(SchemaField field, boolean reverse) { + return getStringSort(field, reverse); + } + + public void write(XMLWriter xmlWriter, String name, Fieldable f) throws IOException { + xmlWriter.writeStr(name, f.stringValue()); + } + + public void write(TextResponseWriter writer, String name, Fieldable f) throws IOException { + writer.writeStr(name, f.stringValue(), true); + } + + public Field createField(SchemaField field, String externalVal, float boost) { + + //String val = bridge.workoutFieldValue(this.getClass().getName(), field, externalVal, boost); + PythonMessage message = MontySolrVM.INSTANCE.createMessage("workout_field_value") + .setSender("PythonTextField") + .setParam("field", field) + .setParam("externalVal", externalVal) + .setParam("boost", boost); + + try { + MontySolrVM.INSTANCE.sendMessage(message); + if (message.containsKey("result")) { + String val = (String) message.getResults(); + if (val != null) + return super.createField(field, val, boost); + } + } catch (InterruptedException e) { + // pass, we will not access the message object it may be + // in inconsistent state + } + + + return null; + } +} diff --git a/src/java/org/apache/solr/search/CitationQuery.java b/src/java/org/apache/solr/search/CitationQuery.java new file mode 100644 index 000000000..9a3604af3 --- /dev/null +++ b/src/java/org/apache/solr/search/CitationQuery.java @@ -0,0 +1,431 @@ +package org.apache.solr.search; + +import org.apache.jcc.PythonVM; +import org.apache.lucene.search.Collector; +import org.apache.lucene.search.DocIdSet; +import org.apache.lucene.search.DocIdSetIterator; +import org.apache.lucene.search.Explanation; +import org.apache.lucene.search.FieldCache; +import org.apache.lucene.search.Filter; +import org.apache.lucene.search.Query; +import org.apache.lucene.search.Scorer; +import org.apache.lucene.search.Searcher; +import org.apache.lucene.search.Similarity; +import org.apache.lucene.search.TermQuery; +import org.apache.lucene.search.Weight; +import org.apache.lucene.util.OpenBitSet; +import org.apache.lucene.util.ToStringUtils; + +import invenio.montysolr.jni.PythonBridge; +import invenio.montysolr.jni.PythonMessage; +import invenio.montysolr.jni.MontySolrVM; + +import java.io.IOException; + +import java.util.ArrayList; +import java.util.HashMap; +import java.util.List; +import java.util.Map; +import java.util.Map.Entry; +import java.util.Random; +import java.util.Set; +import java.util.BitSet; + +import org.apache.lucene.document.Document; +import org.apache.lucene.document.MapFieldSelector; +import org.apache.lucene.index.IndexReader; +import org.apache.lucene.index.Term; +import org.apache.lucene.index.TermDocs; +import org.apache.solr.common.params.SolrParams; +import org.apache.solr.request.SolrQueryRequest; +import org.apache.solr.util.DictionaryCache; + + + +public class CitationQuery extends Query { + private float boost = 1.0f; // query boost factor + Query query; + SolrParams localParams; + SolrQueryRequest req; + String idField = "id"; //TODO: make it configurable + String dictName = null; + + + public CitationQuery (Query query, SolrQueryRequest req, SolrParams localParams) { + this.query = query; + this.localParams = localParams; + 
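+ // the request is kept so that getDictCache() can reach the searcher and
+ // build the recid <-> docid translation lazily, at query time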
this.req = req; + + String type = localParams.get("rel"); + if (type.contains("refersto")) { + dictName = "citationdict"; + } + else if (type.contains("citedby")) { + dictName = "reversedict"; + } + else { + dictName = "reversedict"; + } + } + + /** + * Sets the boost for this query clause to b. Documents + * matching this clause will (in addition to the normal weightings) have + * their score multiplied by b. + */ + public void setBoost(float b) { + boost = b; + } + + /** + * Gets the boost for this clause. Documents matching this clause will (in + * addition to the normal weightings) have their score multiplied by + * b. The boost is 1.0 by default. + */ + public float getBoost() { + return boost + query.getBoost(); + } + + public Map getDictCache() throws IOException { + try { + return getDictCache(this.dictName); + } catch (InterruptedException e) { + e.printStackTrace(); + // return empty map, that is ok because it will affect only + // this query, the next will get a new cache + return new HashMap(); + } + } + + public Map getDictCache(String dictname) throws IOException, InterruptedException { + + + Map cache = DictionaryCache.INSTANCE.getCache(dictname); + + + if (cache == null) { + + + + // Get mapping lucene_id->invenio_recid + // The simplest would be to load the field with a cache (but the + // field should be integer - and it is not now). The other reason + // for doint this is that we don't create unnecessary cache + + /** + + TermDocs td = reader.termDocs(); //FIXME: .termDocs(new Term(idField)) works not?! + String[] li = {idField}; + MapFieldSelector fieldSelector = new MapFieldSelector(li); + **/ + + SolrIndexSearcher searcher = req.getSearcher(); + SolrIndexReader reader = searcher.getReader(); + int[] idMapping = FieldCache.DEFAULT.getInts(reader, idField); + + Map fromValueToDocid = new HashMap(idMapping.length); + int i = 0; + for (int value: idMapping) { + fromValueToDocid.put(value, i); + i++; + } + + /** + //OpenBitSet bitSet = new OpenBitSet(reader.maxDoc()); + int i; + while (td.next()) { + i = td.doc(); + // not needed when term is null + //if (reader.isDeleted(i)) { + // continue; + //} + Document doc = reader.document(i); + + try { + //bitSet.set(Integer.parseInt(doc.get(idField))); + idMap.put(i, Integer.parseInt(doc.get(idField))); + } catch (Exception e) { + e.printStackTrace(); + } + } + **/ + + // now get the citation dictionary from Invenio + HashMap hm = new HashMap(); + + PythonMessage message = MontySolrVM.INSTANCE.createMessage("get_citation_dict") + .setSender("CitationQuery") + .setParam("dictname", dictName) + .setParam("result", hm); + MontySolrVM.INSTANCE.sendMessage(message); + + + Map citationDict = new HashMap(0); + if (message.containsKey("result")) { + + Map result = (Map) message.getResults(); + citationDict = new HashMap(result.size()); + for (Entry e: result.entrySet()) { + Integer recid = e.getKey(); + if (fromValueToDocid.containsKey(recid)) { + // translate recids into lucene-ids + + int[] recIds = (int[]) e.getValue(); + int[] lucIds = new int[recIds.length]; + for (int x=0;x getDictCacheX() { + HashMap hm = new HashMap(); + + int Min = 1; + int Max = 5000; + int r; + for (int i=0;i<5000; i++) { + r = Min + (int)(Math.random() * ((Max - Min) + 1)); + BitSet bs = new BitSet(r); + int ii = 0; + while (ii < 20) { + r = Min + (int)(Math.random() * ((Max - Min) + 1)); + bs.set(r); + ii += 1; + } + hm.put(i, bs); + } + return hm; + } + + + /** + * Expert: Constructs an appropriate Weight implementation for this query. + * + *
+ * Only implemented by primitive queries, which re-write to themselves. + */ + public Weight createWeight(Searcher searcher) throws IOException { + + final Weight weight = query.createWeight (searcher); + final Similarity similarity = query.getSimilarity(searcher); + + + + return new Weight() { + private float value; + + + // return a filtering scorer + public Scorer scorer(IndexReader indexReader, boolean scoreDocsInOrder, boolean topScorer) + throws IOException { + + final Scorer scorer = weight.scorer(indexReader, true, false); + final IndexReader reader = indexReader; + + if (scorer == null) { + return null; + } + + + // we override the Scorer for the CitationQuery + return new Scorer(similarity) { + + private int doc = -1; + + public void getCache() { + System.out.println(reader); + + } + + // here is the core of the processing + public void score(Collector collector) throws IOException { + collector.setScorer(this); + int doc; + + //TODO: we could as well collect the first matching documents + //and based on them retrieve all the citations. But probably + //that is not correct, because the citation search wants to + //retrieve the documents that are most cited/referred, therefore + //we have to search the whole space + + // get the respective dictionary + Map cache = getDictCache(); + BitSet aHitSet = new BitSet(reader.maxDoc()); + + if (cache.size() == 0) + return; + + // retrieve documents that matched the query and while we go + // collect the documents referenced by/from those docs + while ((doc = nextDoc()) != NO_MORE_DOCS) { + if (cache.containsKey(doc)) { + int[] v = cache.get(doc); + for (int i: v) + aHitSet.set(i); + } + } + + // now collect the big set of citing relations + doc = 0; + while ((doc = aHitSet.nextSetBit(doc)) != -1) { + collector.collect(doc); + doc += 1; + } + } + + public int nextDoc() throws IOException { + return scorer.nextDoc(); + } + + /** @deprecated use {@link #docID()} instead. */ + public int doc() { return scorer.doc(); } + public int docID() { return doc; } + + /** @deprecated use {@link #advance(int)} instead. */ + public boolean skipTo(int i) throws IOException { + return advance(i) != NO_MORE_DOCS; + } + + public int advance(int target) throws IOException { + return scorer.advance(target); + } + + //public float score() throws IOException { return getBoost() * scorer.score(); } + public float score() throws IOException { return getBoost() * 1.0f; } + };// Scorer + }// scorer + + // pass these methods through to enclosed query's weight + public float getValue() { return value; } + + public float sumOfSquaredWeights() throws IOException { + return weight.sumOfSquaredWeights() * getBoost() * getBoost(); + } + + public void normalize (float v) { + weight.normalize(v); + value = weight.getValue() * getBoost(); + } + + public Explanation explain (IndexReader ir, int i) throws IOException { + Explanation inner = weight.explain (ir, i); + if (getBoost()!=1) { + Explanation preBoost = inner; + inner = new Explanation(inner.getValue()*getBoost(),"product of:"); + inner.addDetail(new Explanation(getBoost(),"boost")); + inner.addDetail(preBoost); + } + inner.addDetail(new Explanation(0.0f, "TODO: add citation formula details")); + return inner; + } + + // return this query + public Query getQuery() { return CitationQuery.this; } + + }; //Weight + } + + /** + * Expert: Constructs and initializes a Weight for a top-level query. 
+ */ + public Weight weight(Searcher searcher) throws IOException { + Query query = searcher.rewrite(this); + Weight weight = query.createWeight(searcher); + float sum = weight.sumOfSquaredWeights(); + float norm = getSimilarity(searcher).queryNorm(sum); + if (Float.isInfinite(norm) || Float.isNaN(norm)) + norm = 1.0f; + weight.normalize(norm); + return weight; + } + + /** + * Expert: called to re-write queries into primitive queries. For example, a + * PrefixQuery will be rewritten into a BooleanQuery that consists of + * TermQuerys. + */ + public Query rewrite(IndexReader reader) throws IOException { + Query rewritten = query.rewrite(reader); + if (rewritten != query) { + CitationQuery clone = (CitationQuery)this.clone(); + clone.query = rewritten; + return clone; + } else { + return this; + } + } + + /** + * Expert: called when re-writing queries under MultiSearcher. + * + * Create a single query suitable for use by all subsearchers (in 1-1 + * correspondence with queries). This is an optimization of the OR of all + * queries. We handle the common optimization cases of equal queries and + * overlapping clauses of boolean OR queries (as generated by + * MultiTermQuery.rewrite()). Be careful overriding this method as + * queries[0] determines which method will be called and is not necessarily + * of the same type as the other queries. + */ + public Query combine(Query[] queries) { + return query.combine(queries); + + } + + /** + * Expert: adds all terms occurring in this query to the terms set. Only + * works if this query is in its {@link #rewrite rewritten} form. + * + * @throws UnsupportedOperationException + * if this query is not yet rewritten + */ + public void extractTerms(Set terms) { + query.extractTerms(terms); + } + + + /** + * Expert: Returns the Similarity implementation to be used for this query. + * Subclasses may override this method to specify their own Similarity + * implementation, perhaps one that delegates through that of the Searcher. + * By default the Searcher's Similarity implementation is returned. + */ + public Similarity getSimilarity(Searcher searcher) { + return searcher.getSimilarity(); + } + + + + /** Prints a user-readable version of this query. */ + public String toString (String s) { + StringBuffer buffer = new StringBuffer(); + buffer.append("CitationQuery("); + buffer.append(query.toString(s)); + buffer.append(")->"); + buffer.append(ToStringUtils.boost(getBoost())); + return buffer.toString(); + } + + /** Returns true iff o is equal to this. */ + public boolean equals(Object o) { + if (o instanceof CitationRefersToQuery) { + CitationRefersToQuery fq = (CitationRefersToQuery) o; + return (query.equals(fq.query) && getBoost()==fq.getBoost()); + } + return false; + } + + /** Returns a hash code value for this object. 
*/ + public int hashCode() { + return query.hashCode() ^ Float.floatToRawIntBits(getBoost()); + } +} diff --git a/src/java/org/apache/solr/search/CitationRefersToQParserPlugin.java b/src/java/org/apache/solr/search/CitationRefersToQParserPlugin.java new file mode 100644 index 000000000..91472af83 --- /dev/null +++ b/src/java/org/apache/solr/search/CitationRefersToQParserPlugin.java @@ -0,0 +1,317 @@ +package org.apache.solr.search; + +import java.io.IOException; +import java.util.Map; +import java.util.Set; + +import org.apache.lucene.index.IndexReader; +import org.apache.lucene.queryParser.ParseException; +import org.apache.lucene.queryParser.QueryParser; +import org.apache.lucene.search.Collector; +import org.apache.lucene.search.DocIdSet; +import org.apache.lucene.search.DocIdSetIterator; +import org.apache.lucene.search.Explanation; +import org.apache.lucene.search.Filter; +import org.apache.lucene.search.Query; +import org.apache.lucene.search.Scorer; +import org.apache.lucene.search.Searcher; +import org.apache.lucene.search.Similarity; +import org.apache.lucene.search.Weight; +import org.apache.solr.common.params.CommonParams; +import org.apache.solr.common.params.SolrParams; +import org.apache.solr.common.util.NamedList; +import org.apache.solr.request.SolrQueryRequest; +import org.apache.lucene.search.FilteredQuery; +import org.apache.lucene.util.OpenBitSet; +import org.apache.lucene.util.ToStringUtils; +import org.apache.solr.search.CitationQuery; + +/** + * Parse Invenio's variant on the refersto citation
+ * search.
+ *
+ * Other parameters:
+ * <ul>
+ * <li>q.op - the default operator "OR" or "AND"</li>
+ * <li>df - the default field name</li>
+ * </ul>
+ *
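+ * A full request sketch (hypothetical host, port and field values; the
+ * parser is registered below under the name "refersto"):
+ *
+ *   http://localhost:8983/solr/select?q={!refersto}author:ellis
+ *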
+ * Example: + * {!relation q.op=AND df=author sort='price asc'}coauthor:ellis +bar -baz + */ +public class CitationRefersToQParserPlugin extends QParserPlugin { + public static String NAME = "refersto"; + + @Override + public void init(NamedList args) { + } + + @Override + public QParser createParser(String qstr, SolrParams localParams, + SolrParams params, SolrQueryRequest req) { + return new InvenioRefersToQParser(qstr, localParams, params, req); + } + +} + +class InvenioRefersToQParser extends QParser { + String sortStr; + SolrQueryParser lparser; + + public InvenioRefersToQParser(String qstr, SolrParams localParams, + SolrParams params, SolrQueryRequest req) { + super(qstr, localParams, params, req); + } + + public Query parse() throws ParseException { + String qstr = getString(); + + String defaultField = getParam(CommonParams.DF); + if (defaultField == null) { + defaultField = getReq().getSchema().getDefaultSearchFieldName(); + } + lparser = new SolrQueryParser(this, defaultField); + + // these could either be checked & set here, or in the SolrQueryParser + // constructor + String opParam = getParam(QueryParsing.OP); + if (opParam != null) { + lparser.setDefaultOperator("AND".equals(opParam) ? QueryParser.Operator.AND + : QueryParser.Operator.OR); + } else { + // try to get default operator from schema + QueryParser.Operator operator = getReq().getSchema() + .getSolrQueryParser(null).getDefaultOperator(); + lparser.setDefaultOperator(null == operator ? QueryParser.Operator.OR + : operator); + } + + Query mainq = lparser.parse(qstr); + //Filter qfilter = new CitationRefersToFilter(); + + //return new CitationRefersToQuery(mainq, qfilter); + return new CitationQuery(mainq, req, localParams); + } + + public String[] getDefaultHighlightFields() { + return new String[] { lparser.getField() }; + } + +} + +class CitationRefersToQuery +extends Query { + + Query query; + Filter filter; + + /** + * Constructs a new query which applies a filter to the results of the original query. + * Filter.getDocIdSet() will be called every time this query is used in a search. + * @param query Query to be filtered, cannot be null. + * @param filter Filter to apply to query results, cannot be null. + */ + public CitationRefersToQuery (Query query, Filter filter) { + this.query = query; + this.filter = filter; + } + + /** + * Returns a Weight that applies the filter to the enclosed query's Weight. + * This is accomplished by overriding the Scorer returned by the Weight. 
+ */ + public Weight createWeight(final Searcher searcher) throws IOException { + final Weight weight = query.createWeight (searcher); + final Similarity similarity = query.getSimilarity(searcher); + return new Weight() { + private float value; + + // pass these methods through to enclosed query's weight + public float getValue() { return value; } + public float sumOfSquaredWeights() throws IOException { + return weight.sumOfSquaredWeights() * getBoost() * getBoost(); + } + public void normalize (float v) { + weight.normalize(v); + value = weight.getValue() * getBoost(); + } + public Explanation explain (IndexReader ir, int i) throws IOException { + Explanation inner = weight.explain (ir, i); + if (getBoost()!=1) { + Explanation preBoost = inner; + inner = new Explanation(inner.getValue()*getBoost(),"product of:"); + inner.addDetail(new Explanation(getBoost(),"boost")); + inner.addDetail(preBoost); + } + Filter f = CitationRefersToQuery.this.filter; + DocIdSet docIdSet = f.getDocIdSet(ir); + DocIdSetIterator docIdSetIterator = docIdSet == null ? DocIdSet.EMPTY_DOCIDSET.iterator() : docIdSet.iterator(); + if (docIdSetIterator == null) { + docIdSetIterator = DocIdSet.EMPTY_DOCIDSET.iterator(); + } + if (docIdSetIterator.advance(i) == i) { + return inner; + } else { + Explanation result = new Explanation + (0.0f, "failure to match filter: " + f.toString()); + result.addDetail(inner); + return result; + } + } + + // return this query + public Query getQuery() { return CitationRefersToQuery.this; } + + // return a filtering scorer + public Scorer scorer(IndexReader indexReader, boolean scoreDocsInOrder, boolean topScorer) + throws IOException { + final Scorer scorer = weight.scorer(indexReader, true, false); + if (scorer == null) { + return null; + } + DocIdSet docIdSet = filter.getDocIdSet(indexReader); + if (docIdSet == null) { + return null; + } + final DocIdSetIterator docIdSetIterator = docIdSet.iterator(); + if (docIdSetIterator == null) { + return null; + } + + return new Scorer(similarity) { + + private int doc = -1; + + private int advanceToCommon(int scorerDoc, int disiDoc) throws IOException { + while (scorerDoc != disiDoc) { + if (scorerDoc < disiDoc) { + scorerDoc = scorer.advance(disiDoc); + } else { + disiDoc = docIdSetIterator.advance(scorerDoc); + } + } + return scorerDoc; + } + + public void score(Collector collector) throws IOException { + collector.setScorer(this); + int doc; + while ((doc = nextDoc()) != NO_MORE_DOCS) { + collector.collect(doc); + } + } + + /** @deprecated use {@link #nextDoc()} instead. */ + public boolean next() throws IOException { + return nextDoc() != NO_MORE_DOCS; + } + + public int nextDoc() throws IOException { + int scorerDoc, disiDoc; + return doc = (disiDoc = docIdSetIterator.nextDoc()) != NO_MORE_DOCS + && (scorerDoc = scorer.nextDoc()) != NO_MORE_DOCS + && advanceToCommon(scorerDoc, disiDoc) != NO_MORE_DOCS ? scorer.docID() : NO_MORE_DOCS; + } + + /** @deprecated use {@link #docID()} instead. */ + public int doc() { return scorer.doc(); } + public int docID() { return doc; } + + /** @deprecated use {@link #advance(int)} instead. */ + public boolean skipTo(int i) throws IOException { + return advance(i) != NO_MORE_DOCS; + } + + public int advance(int target) throws IOException { + int disiDoc, scorerDoc; + return doc = (disiDoc = docIdSetIterator.advance(target)) != NO_MORE_DOCS + && (scorerDoc = scorer.advance(disiDoc)) != NO_MORE_DOCS + && advanceToCommon(scorerDoc, disiDoc) != NO_MORE_DOCS ? 
scorer.docID() : NO_MORE_DOCS; + } + + public float score() throws IOException { return getBoost() * scorer.score(); } + + // add an explanation about whether the document was filtered + public Explanation explain (int i) throws IOException { + Explanation exp = scorer.explain(i); + + if (docIdSetIterator.advance(i) == i) { + exp.setDescription ("allowed by filter: "+exp.getDescription()); + exp.setValue(getBoost() * exp.getValue()); + } else { + exp.setDescription ("removed by filter: "+exp.getDescription()); + exp.setValue(0.0f); + } + return exp; + } + }; + } + }; + } + + /** Rewrites the wrapped query. */ + public Query rewrite(IndexReader reader) throws IOException { + Query rewritten = query.rewrite(reader); + if (rewritten != query) { + CitationRefersToQuery clone = (CitationRefersToQuery)this.clone(); + clone.query = rewritten; + return clone; + } else { + return this; + } + } + + public Query getQuery() { + return query; + } + + public Filter getFilter() { + return filter; + } + + // inherit javadoc + public void extractTerms(Set terms) { + getQuery().extractTerms(terms); + } + + /** Prints a user-readable version of this query. */ + public String toString (String s) { + StringBuffer buffer = new StringBuffer(); + buffer.append("filtered("); + buffer.append(query.toString(s)); + buffer.append(")->"); + buffer.append(filter); + buffer.append(ToStringUtils.boost(getBoost())); + return buffer.toString(); + } + + /** Returns true iff o is equal to this. */ + public boolean equals(Object o) { + if (o instanceof CitationRefersToQuery) { + CitationRefersToQuery fq = (CitationRefersToQuery) o; + return (query.equals(fq.query) && filter.equals(fq.filter) && getBoost()==fq.getBoost()); + } + return false; + } + + /** Returns a hash code value for this object. 
*/ + public int hashCode() { + return query.hashCode() ^ filter.hashCode() + Float.floatToRawIntBits(getBoost()); + } + } + +class CitationRefersToFilter extends Filter { + + /** + * This method returns a set of documents that are referring (citing) + * the set of documents we retrieved in the underlying query + */ + @Override + public DocIdSet getDocIdSet(IndexReader reader) throws IOException { + final OpenBitSet bitSet = new OpenBitSet(reader.maxDoc()); + for (int i=0; i < reader.maxDoc(); i++) { + bitSet.set(i); + } + return bitSet; + } + + +} diff --git a/src/java/org/apache/solr/search/InvenioQParserPlugin.java b/src/java/org/apache/solr/search/InvenioQParserPlugin.java new file mode 100644 index 000000000..b772502e4 --- /dev/null +++ b/src/java/org/apache/solr/search/InvenioQParserPlugin.java @@ -0,0 +1,543 @@ +package org.apache.solr.search; + +import java.io.IOException; +import java.util.ArrayList; +import java.util.List; +import java.util.Map; +import java.util.regex.Matcher; +import java.util.regex.Pattern; + +import org.apache.lucene.analysis.WhitespaceAnalyzer; +import org.apache.lucene.analysis.standard.StandardAnalyzer; +import org.apache.lucene.index.Term; +import org.apache.lucene.queryParser.ParseException; +import org.apache.lucene.queryParser.QueryParser; +import org.apache.lucene.queryParser.InvenioQueryParser; +import org.apache.solr.common.SolrException; +import org.apache.solr.common.params.CommonParams; +import org.apache.solr.common.params.DefaultSolrParams; +import org.apache.solr.common.params.SolrParams; +import org.apache.solr.common.util.NamedList; +import org.apache.solr.request.SolrQueryRequest; +import org.apache.solr.schema.FieldType; +import org.apache.solr.schema.IndexSchema; +import org.apache.lucene.search.BooleanClause; +import org.apache.lucene.search.BooleanQuery; +import org.apache.lucene.search.ConstantScoreQuery; +import org.apache.lucene.search.FuzzyQuery; +import org.apache.lucene.search.NumericRangeQuery; +import org.apache.lucene.search.PhraseQuery; +import org.apache.lucene.search.PrefixQuery; +import org.apache.lucene.search.Query; +import org.apache.lucene.search.TermQuery; +import org.apache.lucene.search.TermRangeQuery; +import org.apache.lucene.search.WildcardQuery; +import org.apache.lucene.util.Version; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import org.apache.solr.search.QueryParsing; +import org.apache.solr.util.DictionaryCache; + + + +/** + * Parse query that is made of the solr fields as well as Invenio query syntax, + * the field that are prefixed using the special code inv_ get + * automatically passed to Invenio + * + * Other parameters: + *
+ * <ul>
+ * <li>q.op - the default operator "OR" or "AND"</li>
+ * <li>df - the default field name</li>
+ * </ul>
+ *
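+ * Besides q.op and df, the constructor below also reads the local params
+ * iq.mode, iq.xfields, iq.channel and iq.syntax; iq.xfields may be given
+ * repeatedly or as a single comma-separated list.
+ *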
+ * Example: {!iq mode=maxinv xfields=fulltext}035:arxiv +bar -baz + * + * The example above would query everything as Invenio field, but fulltext will + * be served by Solr. + * + * Example: + * {!iq iq.mode=maxsolr iq.xfields=fulltext,abstract iq.channel=bitset}035:arxiv +bar -baz + * + * The example above will try to map all the fields into the Solr schema, if the + * field exists, it will be served by Solr. The fulltext will be served by + * Invenio no matter if it is defined in schema. And communication between Java + * and Invenio is done using bitset + * + * If the query is written as:inv_field:value the search will be + * always passed to Invenio. + * + */ +public class InvenioQParserPlugin extends QParserPlugin { + public static String NAME = "iq"; + public static String FIELDNAME = "InvenioQuery"; + public static String PREFIX = "inv_"; + public static String IDFIELD = "id"; + + @Override + public void init(NamedList args) { + } + + @Override + public QParser createParser(String qstr, SolrParams localParams, + SolrParams params, SolrQueryRequest req) { + return new InvenioQParser(qstr, localParams, params, req); + } + +} + +class InvenioQParser extends QParser { + + public static final Logger log = LoggerFactory + .getLogger(InvenioQParser.class); + + public static Pattern fieldPattern = Pattern + .compile("\\b([a-zA-Z_0-9]+)\\:"); + + String sortStr; + SolrQueryParser lparser; + ArrayList xfields = null; + + private String operationMode = "maxinvenio"; + private String exchangeType = "ints"; + private String querySyntax = "invenio"; + private IndexSchema schema = null; + + public InvenioQParser(String qstr, SolrParams localParams, + SolrParams params, SolrQueryRequest req) { + super(qstr, localParams, params, req); + + SolrParams solrParams = localParams == null ? 
params : new DefaultSolrParams(localParams, params); + + schema = req.getSchema(); + + String m = solrParams.get("iq.mode"); + if (m != null ) { + if (m.contains("maxinv") && !schema.hasExplicitField("*")) { + throw new SolrException( + null, + "Query parser is configured to pass as many fields to Invenio as possible, for this to work, schema must contain a dynamic field declared as '*'"); + } + operationMode = m; + } + + + xfields = new ArrayList(); + String[] overriden_fields = solrParams.getParams("iq.xfields"); + if (overriden_fields != null) { + for (String f: overriden_fields) { + if (f.indexOf(",") > -1) { + for (String x: f.split(",")) { + xfields.add(x); + } + } + else { + xfields.add(f); + } + } + } + + String eType = solrParams.get("iq.channel", "default"); + if (eType.contains("bitset")) { + exchangeType = "bitset"; + } + + String sType = solrParams.get("iq.syntax", "invenio"); + if (sType.contains("lucene")) { + querySyntax = "lucene"; + } + } + + public Query parse() throws ParseException { + + if (getString() == null) { + throw new ParseException("The query parameter is empty"); + } + + setString(normalizeInvenioQuery(getString())); + String qstr = getString(); + + + + // detect field not in the schema, but only if we are not in the + // all-tracking mode (because in that mode we can do it much smarter and + // without regex) + if (operationMode.equals("maxsolr") && !schema.hasExplicitField("*")) { + String q2 = changeInvenioQuery(req, qstr); + if (!q2.equals(qstr)) { + log.info(qstr + " --> " + q2); + setString(q2); + } + } + + + String defaultField = getParam(CommonParams.DF); + if (defaultField == null) { + defaultField = getReq().getSchema().getDefaultSearchFieldName(); + } + + + // Now use the specific parser to fight with the syntax + Query mainq; + + if (querySyntax.equals("invenio")) { + InvenioQueryParser invParser = new InvenioQueryParser(Version.LUCENE_29, schema.getDefaultSearchFieldName(), schema.getAnalyzer()); + String opParam = getParam(QueryParsing.OP); + if (opParam != null) { + invParser.setDefaultOperator("AND".equals(opParam) ? InvenioQueryParser.Operator.AND + : InvenioQueryParser.Operator.OR); + } else { + // try to get default operator from schema + QueryParser.Operator operator = getReq().getSchema() + .getSolrQueryParser(null).getDefaultOperator(); + invParser.setDefaultOperator(null == operator ? InvenioQueryParser.Operator.OR + : (operator == QueryParser.AND_OPERATOR ? InvenioQueryParser.Operator.AND : InvenioQueryParser.Operator.OR)); + } + mainq = invParser.parse(getString()); + } + else { + + lparser = new SolrQueryParser(this, defaultField); + // these could either be checked & set here, or in the SolrQueryParser + // constructor + String opParam = getParam(QueryParsing.OP); + if (opParam != null) { + lparser.setDefaultOperator("AND".equals(opParam) ? QueryParser.Operator.AND + : QueryParser.Operator.OR); + } else { + // try to get default operator from schema + QueryParser.Operator operator = getReq().getSchema() + .getSolrQueryParser(null).getDefaultOperator(); + lparser.setDefaultOperator(null == operator ? 
QueryParser.Operator.OR + : operator); + } + mainq = lparser.parse(getString()); + } + /** + else { + StandardQueryParser qpHelper = new StandardQueryParser(); + qpHelper.setAllowLeadingWildcard(true); + qpHelper.setAnalyzer(new StandardAnalyzer()); + try { + mainq = qpHelper.parse(getString(), schema.getDefaultSearchFieldName()); + } catch (QueryNodeException e) { + throw new ParseException(); + } + } + **/ + + Query mainq2; + try { + mainq2 = rewriteQuery(mainq, 0); + if (!mainq2.equals(mainq)) { + log.info(getString() + " --> " + mainq2.toString()); + mainq = mainq2; + } + } catch (IOException e) { + throw new ParseException(); + } + + return mainq; + } + + private Pattern weird_or = Pattern.compile("( \\|)([a-zA-Z\"])"); + private String normalizeInvenioQuery(String q) { + try { + Matcher matcher = weird_or.matcher(q); + q = matcher.replaceAll(" || $2"); + } + catch (Exception e) { + System.out.println(q); + } + q = q.replace("refersto:", "refersto\\:"); + q = q.replace("citedby:", "citedby\\:"); + q = q.replace("cited:", "cited\\:"); + q = q.replace("cocitedwith:", "cocitedwith\\:"); + q = q.replace("reportnumber:", "reportnumber\\:"); + q = q.replace("reference:", "reference\\:"); + return q; + + } + + /** + * Help method to change query into invenio fields (if the field is not defined + * in the schema, it is considered to be Invenio). However we use this simplistic + * rewriting only when '*' is not activated and when iq.mode=maxsolr + * @param req + * @param q + * @return + */ + private String changeInvenioQuery(SolrQueryRequest req, String q) { + IndexSchema schema = req.getSchema(); + // SolrQueryParser qparser = new SolrQueryParser(schema, "all"); + // log.info(qparser.escape(q)); + + // leave this to invenio + // q = q.replace("refersto:", "{!relation rel=refersto}"); + // q = q.replace("citedby:", "{!relation rel=citedby}"); + q = q.replace("journal:", "publication:"); + q = q.replace("arXiv:", "reportnumber:"); + + String q2 = q; + + Matcher matcher = fieldPattern.matcher(q); + while (matcher.find()) { + String field = q.substring(matcher.start(), matcher.end() - 1); + try { + if (schema.getFieldType(field) != null) { + continue; + } + } catch (SolrException e) { + // pass - not serious + } + q2 = q2.replace(field + ":", InvenioQParserPlugin.PREFIX + field + + ":"); + } + return q2; + } + + /** + * Returns a field (string) IFF we should pass the query to Invenio. 
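+   * For example, assuming iq.mode=maxsolr and iq.xfields=fulltext: 'title'
+   * (declared in schema.xml) yields null and stays with Solr; 'fulltext'
+   * and any field missing from the schema are returned and go to Invenio.
+   * An 'inv_' prefix forces Invenio in every mode, while in the maxinv
+   * modes the xfields list marks the fields that stay with Solr instead.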
+ * + * @param field + * @return + * @throws ParseException + */ + private String getInvField(String field) throws ParseException { + String v = null; + // always consider it as Invenio field if the prefix is present + if (field.startsWith(InvenioQParserPlugin.PREFIX)) { + v = field.substring(InvenioQParserPlugin.PREFIX.length()); + return v; + } + + + // consider it as solr field if it is in the schema + if (operationMode.equals("maxsolr")) { + if(schema.hasExplicitField(field) && xfields.indexOf(field) == -1) { + return null; + } + return field; // consider it Invenio field + } + else { // pass all fields to Invenio + if (xfields.indexOf(field) > -1) { // besides explicitly solr fields + if (!schema.hasExplicitField(field)) { + throw new ParseException("The field '" + field + "' is not defined for Solr."); + } + return null; + } + return field; + } + + } + + + private Query createInvenioQuery(String field, String value, Map recidToDocid) { + Query newQuery = null; + String newField = field; + if (field.equals(schema.getDefaultSearchFieldName())) { + newField = ""; + } + if (exchangeType.equals("bitset")) { + newQuery = new InvenioQueryBitSet(new TermQuery(new Term(newField, value)), req, localParams, recidToDocid); + } + else { + newQuery = new InvenioQuery(new TermQuery(new Term(newField, value)), req, localParams, recidToDocid); + } + return newQuery; + + } + + + /** @see #QueryParsing.toString(Query,IndexSchema) */ + public Query rewriteQuery(Query query, int flags) throws IOException, + ParseException { + + boolean writeBoost = true; + + Query newQuery = null; + + SolrIndexReader reader = req.getSearcher().getReader(); + Map recidToDocid = null; + try { + recidToDocid = DictionaryCache.INSTANCE.getTranslationCache(reader, + InvenioQParserPlugin.IDFIELD); + } catch (IOException e) { + e.printStackTrace(); + throw new ParseException( + "Invenio translation table recid<->docid is not available!"); + } + + StringBuffer out = new StringBuffer(); + if (query instanceof TermQuery) { + TermQuery q = (TermQuery) query; + Term t = q.getTerm(); + String invf = getInvField(t.field()); + if (invf != null) { + newQuery = createInvenioQuery(invf, t.text(), recidToDocid); + } + + } else if (query instanceof TermRangeQuery) { + TermRangeQuery q = (TermRangeQuery) query; + String invf = getInvField(q.getField()); + if (invf != null) { + String fname = q.getField(); + FieldType ft = QueryParsing.writeFieldName(invf, schema, out, + flags); + out.append(q.includesLower() ? '[' : '{'); + String lt = q.getLowerTerm(); + String ut = q.getUpperTerm(); + if (lt == null) { + out.append('*'); + } else { + QueryParsing.writeFieldVal(lt, ft, out, flags); + } + + out.append(" TO "); + + if (ut == null) { + out.append('*'); + } else { + QueryParsing.writeFieldVal(ut, ft, out, flags); + } + + out.append(q.includesUpper() ? ']' : '}'); + + // newQuery = new + // TermRangeQuery(q.getField().replaceFirst(PREFIX, ""), + // q.getLowerTerm(), q.getUpperTerm(), + // q.includesLower(), q.includesUpper()); + newQuery = createInvenioQuery(invf, out.toString(), recidToDocid); + } + + } else if (query instanceof NumericRangeQuery) { + NumericRangeQuery q = (NumericRangeQuery) query; + String invf = getInvField(q.getField()); + if (invf != null) { + String fname = q.getField(); + FieldType ft = QueryParsing.writeFieldName(invf, schema, out, + flags); + out.append(q.includesMin() ? 
'[' : '{'); + Number lt = q.getMin(); + Number ut = q.getMax(); + if (lt == null) { + out.append('*'); + } else { + out.append(lt.toString()); + } + + out.append(" TO "); + + if (ut == null) { + out.append('*'); + } else { + out.append(ut.toString()); + } + + out.append(q.includesMax() ? ']' : '}'); + newQuery = createInvenioQuery(invf, out.toString(), recidToDocid); + + + // TODO: Invneio is using int ranges only, i think, but we shall + // not hardcode it here + // SchemaField ff = + // schema.getField(q.getField().substring(PREFIX.length())); + // newQuery = NumericRangeQuery.newIntRange(ff.getName(), + // (Integer)q.getMin(), (Integer)q.getMax(), + // q.includesMin(), q.includesMax()); + } + + } else if (query instanceof BooleanQuery) { + BooleanQuery q = (BooleanQuery) query; + newQuery = new BooleanQuery(); + + Listclauses = (List) q.clauses(); + + Query subQuery; + for (int i=0;i 0 ? "'" : "\""); + for (int i=0;i 0 ? "'" : "\""); + newQuery = createInvenioQuery(invf, out.toString(), recidToDocid); //TODO: is this correct? + } + + } else if (query instanceof FuzzyQuery) { + // do nothing + } else if (query instanceof ConstantScoreQuery) { + // do nothing + } else { + // do nothing + } + + if (newQuery != null) { + if (writeBoost && query.getBoost() != 1.0f) { + newQuery.setBoost(query.getBoost()); + } + return newQuery; + } else { + return query; + } + + } + + public String[] getDefaultHighlightFields() { + return new String[] { lparser.getField() }; + } + +} diff --git a/src/java/org/apache/solr/search/InvenioQuery.java b/src/java/org/apache/solr/search/InvenioQuery.java new file mode 100644 index 000000000..ea58fadec --- /dev/null +++ b/src/java/org/apache/solr/search/InvenioQuery.java @@ -0,0 +1,182 @@ +package org.apache.solr.search; + +import org.apache.lucene.search.Query; +import org.apache.lucene.search.Searcher; +import org.apache.lucene.search.Similarity; +import org.apache.lucene.search.TermQuery; +import org.apache.lucene.search.Weight; +import org.apache.lucene.util.ToStringUtils; + +import java.io.IOException; +import java.util.Map; +import java.util.Set; + +import org.apache.lucene.index.IndexReader; +import org.apache.lucene.index.Term; +import org.apache.solr.common.params.SolrParams; +import org.apache.solr.request.SolrQueryRequest; +import org.apache.solr.search.InvenioQParserPlugin; + + +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +public class InvenioQuery extends Query { + + public static final Logger log = LoggerFactory + .getLogger(InvenioQuery.class); + + private float boost = 1.0f; // query boost factor + Query query; + SolrParams localParams; + SolrQueryRequest req; + Map recidToDocid = null; + + public InvenioQuery(TermQuery query, SolrQueryRequest req, + SolrParams localParams, Map recidToDocid) { + this.query = query; + this.localParams = localParams; + this.req = req; + this.recidToDocid = recidToDocid; + + } + + /** + * Sets the boost for this query clause to b. Documents + * matching this clause will (in addition to the normal weightings) have + * their score multiplied by b. + */ + public void setBoost(float b) { + query.setBoost(b); + } + + /** + * Gets the boost for this clause. Documents matching this clause will (in + * addition to the normal weightings) have their score multiplied by + * b. The boost is 1.0 by default. + */ + public float getBoost() { + return query.getBoost(); + } + + /** + * Expert: Constructs an appropriate Weight implementation for this query. + * + *
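+   * Here that Weight is an InvenioWeight: its Scorer fetches the matching
+   * recids from Invenio (through MontySolrVM) instead of reading a Lucene
+   * posting list.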
+ * <p>
+ * Only implemented by primitive queries, which re-write to themselves. + */ + public Weight createWeight(Searcher searcher) throws IOException { + + return new InvenioWeight(this, localParams, req, recidToDocid); + } + + /** + * Expert: Constructs and initializes a Weight for a top-level query. + */ + public Weight weight(Searcher searcher) throws IOException { + Query query = searcher.rewrite(this); + Weight weight = query.createWeight(searcher); + float sum = weight.sumOfSquaredWeights(); + float norm = getSimilarity(searcher).queryNorm(sum); + if (Float.isInfinite(norm) || Float.isNaN(norm)) + norm = 1.0f; + weight.normalize(norm); + return weight; + } + + /** + * Expert: called to re-write queries into primitive queries. For example, a + * PrefixQuery will be rewritten into a BooleanQuery that consists of + * TermQuerys. + */ + public Query rewrite(IndexReader reader) throws IOException { + Query rewritten = query.rewrite(reader); + if (rewritten != query) { + InvenioQuery clone = (InvenioQuery) this.clone(); + clone.query = rewritten; + return clone; + } else { + return this; + } + } + + /** + * Expert: called when re-writing queries under MultiSearcher. + * + * Create a single query suitable for use by all subsearchers (in 1-1 + * correspondence with queries). This is an optimization of the OR of all + * queries. We handle the common optimization cases of equal queries and + * overlapping clauses of boolean OR queries (as generated by + * MultiTermQuery.rewrite()). Be careful overriding this method as + * queries[0] determines which method will be called and is not necessarily + * of the same type as the other queries. + */ + public Query combine(Query[] queries) { + return query.combine(queries); + + } + + /** + * Expert: adds all terms occurring in this query to the terms set. Only + * works if this query is in its {@link #rewrite rewritten} form. + * + * @throws UnsupportedOperationException + * if this query is not yet rewritten + */ + public void extractTerms(Set terms) { + query.extractTerms(terms); + } + + /** + * Expert: Returns the Similarity implementation to be used for this query. + * Subclasses may override this method to specify their own Similarity + * implementation, perhaps one that delegates through that of the Searcher. + * By default the Searcher's Similarity implementation is returned. + */ + public Similarity getSimilarity(Searcher searcher) { + return searcher.getSimilarity(); + } + + /** Prints a user-readable version of this query. */ + public String toString(String s) { + StringBuffer buffer = new StringBuffer(); + buffer.append("<"); + Term t = ((TermQuery) query).getTerm(); + if (t.field().length() > 0 ) { + buffer.append(t.field()); + buffer.append("|"); + } + buffer.append(t.text()); + //buffer.append(query.toString(s)); + buffer.append(">"); + buffer.append(ToStringUtils.boost(getBoost())); + return buffer.toString(); + } + + /** Returns true iff o is equal to this. */ + public boolean equals(Object o) { + if (o instanceof InvenioQuery) { + InvenioQuery fq = (InvenioQuery) o; + return (query.equals(fq.query) && getBoost() == fq.getBoost()); + } + return false; + } + + /** Returns a hash code value for this object. 
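+   * (Only the wrapped query and the boost participate; req, localParams and
+   * the recidToDocid map are left out, matching equals() above.)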
*/ + public int hashCode() { + return query.hashCode() ^ Float.floatToRawIntBits(getBoost()); + } + + public String getInvenioQuery() { + String qfield = ((TermQuery) query).getTerm().field(); + String qval = ((TermQuery) query).getTerm().text(); + if (qfield.length() > 0) { + qval = qfield + ":" + qval; + } + if (qval.substring(0, 1).equals("\"/")) { + qval = qval.substring(1, qval.length()-1); + } + return qval; + } +} + diff --git a/src/java/org/apache/solr/search/InvenioQueryBitSet.java b/src/java/org/apache/solr/search/InvenioQueryBitSet.java new file mode 100644 index 000000000..8a2b9c749 --- /dev/null +++ b/src/java/org/apache/solr/search/InvenioQueryBitSet.java @@ -0,0 +1,38 @@ +package org.apache.solr.search; + +import java.io.IOException; +import java.util.Map; + +import org.apache.lucene.search.Searcher; +import org.apache.lucene.search.TermQuery; +import org.apache.lucene.search.Weight; +import org.apache.solr.common.params.SolrParams; +import org.apache.solr.request.SolrQueryRequest; + + +public class InvenioQueryBitSet extends InvenioQuery { + + private static final long serialVersionUID = -2624111746562481355L; + private float boost = 1.0f; // query boost factor + + public InvenioQueryBitSet(TermQuery query, SolrQueryRequest req, + SolrParams localParams, Map recidToDocid) { + super(query, req, localParams, recidToDocid); + + } + + /** + * Expert: Constructs an appropriate Weight implementation for this query. + * + *
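+   * Unlike the parent class, the returned InvenioWeightBitSet receives the
+   * hits from Invenio as a single zlib-compressed bitset rather than as an
+   * int[] of recids.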
+ * <p>
+ * Only implemented by primitive queries, which re-write to themselves. + */ + public Weight createWeight(Searcher searcher) throws IOException { + + return new InvenioWeightBitSet(this, localParams, req, recidToDocid); + } + + +} + + diff --git a/src/java/org/apache/solr/search/InvenioWeight.java b/src/java/org/apache/solr/search/InvenioWeight.java new file mode 100644 index 000000000..3bae510bb --- /dev/null +++ b/src/java/org/apache/solr/search/InvenioWeight.java @@ -0,0 +1,178 @@ +package org.apache.solr.search; + +import invenio.montysolr.jni.PythonMessage; +import invenio.montysolr.jni.MontySolrVM; + +import java.io.IOException; +import java.util.Map; + +import org.apache.lucene.index.IndexReader; +import org.apache.lucene.search.Collector; +import org.apache.lucene.search.Explanation; +import org.apache.lucene.search.Query; +import org.apache.lucene.search.Scorer; +import org.apache.lucene.search.Similarity; +import org.apache.lucene.search.TermQuery; +import org.apache.lucene.search.Weight; +import org.apache.solr.common.params.SolrParams; +import org.apache.solr.handler.InvenioHandler; +import org.apache.solr.request.SolrQueryRequest; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + + +public class InvenioWeight extends Weight { + + public static final Logger log = LoggerFactory + .getLogger(InvenioWeight.class); + + protected Weight weight; + protected Similarity similarity; + protected InvenioQuery query; + protected TermQuery innerQuery; + protected SolrParams localParams; + protected Map recidToDocid; + protected float value; + + private int searcherCounter; + + public InvenioWeight(InvenioQuery query, SolrParams localParams, + SolrQueryRequest req, Map recidToDocid) + throws IOException { + SolrIndexSearcher searcher = req.getSearcher(); + this.innerQuery = (TermQuery) query.query; + this.weight = innerQuery.createWeight(searcher); + this.similarity = innerQuery.getSimilarity(searcher); + this.query = query; + this.localParams = localParams; + this.recidToDocid = recidToDocid; + this.searcherCounter = 0; + } + public Scorer scorer(IndexReader indexReader, boolean scoreDocsInOrder, + boolean topScorer) throws IOException { + + if (searcherCounter > 0) { + return null; + } + searcherCounter++; + + // we override the Scorer for the InvenioQuery + return new Scorer(similarity) { + + private int doc = -1; + private int[] recids = null; + private int recids_counter = -1; + private int max_counter = -1; + + public void score(Collector collector) throws IOException { + collector.setScorer(this); + + int d; + while ((d = nextDoc()) != NO_MORE_DOCS) { + collector.collect(d); + } + } + + private void searchInvenio() throws IOException { + // ask Invenio to give us recids + String qval = query.getInvenioQuery(); + + PythonMessage message = MontySolrVM.INSTANCE + .createMessage("perform_request_search_ints") + .setSender("InvenioQuery").setParam("query", qval); + try { + MontySolrVM.INSTANCE.sendMessage(message); + } catch (InterruptedException e) { + e.printStackTrace(); + throw new IOException("Error searching Invenio!"); + } + + Object result = message.getResults(); + if (result != null) { + recids = (int[]) result; + max_counter = recids.length - 1; + log.info("Invenio returned: " + recids.length + " hits"); + } + else { + log.info("Invenio returned: null"); + } + } + + public int nextDoc() throws IOException { + // this is called only once + if (this.doc == -1) { + searchInvenio(); + if (recids == null || recids.length == 0) { + return doc = NO_MORE_DOCS; + } + } + + 
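+				// walk the recid array returned by Invenio, translating each recid
+				// to a Lucene docid through the recidToDocid map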
recids_counter += 1; + if (recids_counter > max_counter) { + return doc = NO_MORE_DOCS; + } + + try { + doc = recidToDocid.get(recids[recids_counter]); + } + catch (NullPointerException e) { + log.error("Doc with recid=" + recids[recids_counter] + " missing. You should update Invenio recids!"); + throw e; + } + + return doc; + } + + public int docID() { + return doc; + } + + public int advance(int target) throws IOException { + while ((doc = nextDoc()) < target) { + } + return doc; + } + + public float score() throws IOException { + assert doc != -1; + return innerQuery.getBoost() * 1.0f; // TODO: implementation of the + // scoring algorithm + } + };// Scorer + }// scorer + + // pass these methods through to enclosed query's weight + public float getValue() { + return value; + } + + public float sumOfSquaredWeights() throws IOException { + return weight.sumOfSquaredWeights() * innerQuery.getBoost() + * innerQuery.getBoost(); + } + + public void normalize(float v) { + weight.normalize(v); + value = weight.getValue() * innerQuery.getBoost(); + } + + public Explanation explain(IndexReader ir, int i) throws IOException { + Explanation inner = weight.explain(ir, i); + if (innerQuery.getBoost() != 1) { + Explanation preBoost = inner; + inner = new Explanation(inner.getValue() * innerQuery.getBoost(), + "product of:"); + inner.addDetail(new Explanation(innerQuery.getBoost(), "boost")); + inner.addDetail(preBoost); + } + inner.addDetail(new Explanation(0.0f, "TODO: add formula details")); + return inner; + } + + // return this query + public Query getQuery() { + return query; + } + +}; // Weight + diff --git a/src/java/org/apache/solr/search/InvenioWeightBitSet.java b/src/java/org/apache/solr/search/InvenioWeightBitSet.java new file mode 100644 index 000000000..8006822a4 --- /dev/null +++ b/src/java/org/apache/solr/search/InvenioWeightBitSet.java @@ -0,0 +1,124 @@ +package org.apache.solr.search; + +import invenio.montysolr.jni.PythonMessage; +import invenio.montysolr.jni.MontySolrVM; + +import java.io.ByteArrayInputStream; +import java.io.ByteArrayOutputStream; +import java.io.IOException; +import java.io.InputStream; +import java.util.Map; + +import org.ads.solr.InvenioBitSet; +import org.apache.commons.io.IOUtils; +import org.apache.lucene.index.IndexReader; +import org.apache.lucene.search.Collector; +import org.apache.lucene.search.Query; +import org.apache.lucene.search.Scorer; +import org.apache.lucene.search.TermQuery; +import org.apache.solr.common.params.SolrParams; +import org.apache.solr.request.SolrQueryRequest; + + +import com.jcraft.jzlib.ZInputStream; + +public class InvenioWeightBitSet extends InvenioWeight { + + public InvenioWeightBitSet(InvenioQuery query, SolrParams localParams, + SolrQueryRequest req, Map recidToDocid) + throws IOException { + super(query, localParams, req, recidToDocid); + + } + + public Scorer scorer(IndexReader indexReader, boolean scoreDocsInOrder, + boolean topScorer) throws IOException { + + // we override the Scorer for the InvenioQuery + return new Scorer(similarity) { + + private int doc = -1; + private int recid = -1; + private InvenioBitSet bitSet = null; + + public void score(Collector collector) throws IOException { + collector.setScorer(this); + + int d; + while ((d = nextDoc()) != NO_MORE_DOCS) { + collector.collect(d); + } + } + + private void searchInvenio() throws IOException { + // ask Invenio to give us recids + String qval = query.getInvenioQuery(); + + PythonMessage message = MontySolrVM.INSTANCE + 
.createMessage("perform_request_search_bitset") + .setSender("InvenioQuery").setParam("query", qval); + try { + MontySolrVM.INSTANCE.sendMessage(message); + } catch (InterruptedException e) { + e.printStackTrace(); + throw new IOException("Error searching Invenio!"); + } + + Object result = message.getResults(); + + if (result != null) { + // use zlib to read in the data + InputStream is = new ByteArrayInputStream((byte[]) result); + ByteArrayOutputStream bOut = new ByteArrayOutputStream(); + ZInputStream zIn = new ZInputStream(is); + + int bytesCopied = IOUtils.copy(zIn, bOut); + byte[] bitset_bytes = bOut.toByteArray(); + bitSet = new InvenioBitSet(bitset_bytes); + } + } + + public int nextDoc() throws IOException { + // this is called only once + if (this.recid == -1) { + searchInvenio(); + if (bitSet == null || bitSet.isEmpty()) { + return doc = NO_MORE_DOCS; + } + } + + if ((recid = bitSet.nextSetBit(recid)) == -1) { + return doc = NO_MORE_DOCS; + } + + try { + doc = recidToDocid.get(recid); + } + catch (NullPointerException e) { + log.error("Doc with recid=" + recid + " missing. You should update Invenio recids!"); + throw e; + } + + return doc; + } + + public int docID() { + return doc; + } + + public int advance(int target) throws IOException { + while ((doc = nextDoc()) < target) { + } + return doc; + } + + public float score() throws IOException { + assert doc != -1; + return query.getBoost() * 1.0f; // TODO: implementation of the + // scoring algorithm + } + };// Scorer + }// scorer + +}; // Weight + diff --git a/src/java/org/apache/solr/update/InvenioKeepRecidUpdated.java b/src/java/org/apache/solr/update/InvenioKeepRecidUpdated.java new file mode 100644 index 000000000..91085e876 --- /dev/null +++ b/src/java/org/apache/solr/update/InvenioKeepRecidUpdated.java @@ -0,0 +1,166 @@ +package org.apache.solr.update; + +/** + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + + +import invenio.montysolr.jni.PythonMessage; +import invenio.montysolr.jni.MontySolrVM; + +import java.util.ArrayList; +import java.util.HashMap; +import java.util.Map; + +import org.apache.lucene.document.Field; +import org.apache.solr.common.SolrException; +import org.apache.solr.common.SolrInputDocument; +import org.apache.solr.common.params.SolrParams; +import org.apache.solr.core.SolrCore; +import org.apache.solr.handler.RequestHandlerBase; +import org.apache.solr.request.SolrQueryRequest; +import org.apache.solr.request.SolrQueryResponse; +import org.apache.solr.schema.IndexSchema; +import org.apache.solr.util.DictionaryCache; + + + +/** + * Ping solr core + * + * @since solr 1.3 + */ +public class InvenioKeepRecidUpdated extends RequestHandlerBase +{ + @Override + public void handleRequestBody(SolrQueryRequest req, SolrQueryResponse rsp) throws Exception + { + SolrParams params = req.getParams(); + SolrParams required = params.required(); + SolrCore core = req.getCore(); + IndexSchema schema = req.getSchema(); + + UpdateHandler updateHandler = core.getUpdateHandler(); + + + long start = System.currentTimeMillis(); + + AddUpdateCommand addCmd = new AddUpdateCommand(); + addCmd.allowDups = false; + addCmd.overwriteCommitted = false; + addCmd.overwritePending = false; + + + + int last_recid = -1; // -1 means get the first created doc + + if (params.getInt("last_recid") != null) { + last_recid = params.getInt("last_recid"); + } + else { + int[] ids = DictionaryCache.INSTANCE.getLuceneCache(req.getSearcher().getReader(), schema.getUniqueKeyField().getName()); + for(int m: ids) { + if (m > last_recid) { + last_recid = m; + } + } + } + + rsp.add("last_recid", last_recid); + + + Map dictData; + + if (params.getBool("generate", false)) { + Integer max_recid = params.getInt("max_recid", 0); + if (max_recid == 0 || max_recid < last_recid) { + throw new SolrException(SolrException.ErrorCode.BAD_REQUEST, "The max_recid parameter missing!"); + } + + dictData = new HashMap(); + int[] a = new int[max_recid-last_recid]; + for (int i=0, ii=last_recid+1;ii) results; + } + + + if (dictData.containsKey("ADDED")) { + int[] recids = dictData.get("ADDED"); + // create new documentns, they will have only recids, but that's OK (for some + // people), sigh... 
+ if (recids.length > 0) { + SolrInputDocument doc = null; + for (int i=0; i> cache = null; + + private HashMap> + cache = new HashMap>(4); + + private HashMap> + translation_cache = new HashMap>(2); + private HashMap + translation_cache_tracker = new HashMap(2); + + public void setCache(String name, Map value) { + cache.put(name, value); + } + + public Map getCache(String name) { + return cache.get(name); + } + + public int[] getLuceneCache(IndexReader reader, String field) throws IOException { + return FieldCache.DEFAULT.getInts(reader, field); + } + + public Map buildCache(int[] idMapping) throws IOException { + + Map fromFieldToLuceneId = new HashMap(idMapping.length); + int i = 0; + for (int value: idMapping) { + fromFieldToLuceneId.put(value, i); + i++; + } + return fromFieldToLuceneId; + } + + public Map getTranslationCache(IndexReader reader, String field) throws IOException { + int[] idMapping = getLuceneCache(reader, field); + Integer h = idMapping.hashCode(); + Integer old_hash = null; + if (translation_cache_tracker.containsKey(field)) + old_hash = translation_cache_tracker.get(field); + if (!h.equals(old_hash)) { + Map translTable = buildCache(idMapping); + translation_cache.put(field, translTable); + translation_cache_tracker.put(field, h); + } + return translation_cache.get(field); + } + +} diff --git a/src/java/org/apache/solr/util/WebUtils.java b/src/java/org/apache/solr/util/WebUtils.java new file mode 100644 index 000000000..20e8b2538 --- /dev/null +++ b/src/java/org/apache/solr/util/WebUtils.java @@ -0,0 +1,50 @@ +package org.apache.solr.util; + +import java.io.UnsupportedEncodingException; +import java.net.URLDecoder; +import java.util.ArrayList; +import java.util.HashMap; +import java.util.List; +import java.util.Map; +import java.util.StringTokenizer; + +public class WebUtils { + + public static Map> getUrlParameters(String url) + throws UnsupportedEncodingException { + Map> params = new HashMap>(); + String[] urlParts = url.split("\\?"); + if (urlParts.length > 1) { + String query = urlParts[1]; + for (String param : query.split("&")) { + String pair[] = param.split("="); + String key = URLDecoder.decode(pair[0], "UTF-8"); + String value = URLDecoder.decode(pair[1], "UTF-8"); + List values = params.get(key); + if (values == null) { + values = new ArrayList(); + params.put(key, values); + } + values.add(value); + } + } + return params; + } + + + public static Map parseQueryString(String encodedParams) + throws UnsupportedEncodingException { + final Map qps = new HashMap(); + final StringTokenizer pairs = new StringTokenizer(encodedParams, "&"); + while (pairs.hasMoreTokens()) { + final String pair = pairs.nextToken(); + final StringTokenizer parts = new StringTokenizer(pair, "="); + final String key = URLDecoder.decode(parts.nextToken(), "UTF-8"); + final String value = parts.hasMoreTokens() ? 
URLDecoder.decode(parts.nextToken(), "UTF-8") : ""; + + qps.put(key, value); + } + return qps; + } + +} diff --git a/src/python/montysolr/__init__.py b/src/python/montysolr/__init__.py new file mode 100644 index 000000000..e69de29bb diff --git a/src/python/montysolr/examples/__init__.py b/src/python/montysolr/examples/__init__.py new file mode 100644 index 000000000..e69de29bb diff --git a/src/python/montysolr/examples/bigtest.py b/src/python/montysolr/examples/bigtest.py new file mode 100644 index 000000000..7956015ee --- /dev/null +++ b/src/python/montysolr/examples/bigtest.py @@ -0,0 +1,79 @@ +''' +Created on Feb 4, 2011 + +@author: rca +''' + +from montysolr.initvm import montysolr_java as sj +from montysolr.utils import MontySolrTarget + + +import random +import time + +def bigtest(message): + req = message.getSolrQueryRequest() + rsp = message.getSolrQueryResponse() + + params = req.getParams() + action = params.get("action") + + start = time.time() + + if 'recids' in action: + + size = params.getInt("size", 5000) + + if action == 'recids_int': + result = range(0, size) + result = sj.JArray_int(result) + elif action == 'recids_str': + result = [ '%s' % x for x in xrange(size)] + result = sj.JArray_string(result) + elif action == 'recids_hm_strstr': + result = sj.HashMap().of_(sj.String, sj.String) + for x in xrange(size): + result.put(str(x), str(x)) + elif action == 'recids_hm_strint': + result = sj.HashMap().of_(sj.String, sj.Integer) + for x in xrange(size): + result.put(str(x), x) + elif action == 'recids_hm_intint': + result = sj.HashMap().of_(sj.Integer, sj.Integer) + for x in xrange(size): + result.put(x, x) + elif action == 'recids_bitset': + from invenio import intbitset + filled = int(params.getInt('filled').intValue()) + result = intbitset.intbitset(rhs=size) + step = int(size / filled) + for x in xrange(0, size, step): + result.add(x) + result = sj.JArray_byte(result.fastdump()) + else: + result = None + + message.setResults(result) + + else: + help = ''' + action:args:description + recids_int:@size(int):returns array of integers of the given size + recids_str:@size(int):returns array of strings of the given size + recids_hm_strstr:@size(int):returns hashmap of string:string of the given size + recids_hm_strint:@size(int):returns hashmap of string:int of the given size + recids_hm_intint:@size(int):returns hashmap of int:int of the given size + recids_bitset:@size(int) - size of the bitset; @filled(int) - number of elements that are set:returns bit array (uses invenio bitset for the transfer) + ''' + for line in help.split('\n'): + rsp.add('python-message', line.strip()) + + + rsp.add('python-message', 'Python call finished in: %s ms.' 
% (time.time() - start)) + + +def montysolr_targets(): + targets = [ + MontySolrTarget(':bigtest', bigtest), + ] + return targets diff --git a/src/python/montysolr/examples/twitter_test.py b/src/python/montysolr/examples/twitter_test.py new file mode 100644 index 000000000..d1c09e60c --- /dev/null +++ b/src/python/montysolr/examples/twitter_test.py @@ -0,0 +1,62 @@ +''' +Created on Feb 4, 2011 + +@author: rca +''' + +from montysolr.initvm import montysolr_java as sj +from montysolr.utils import MontySolrTarget + +import twitter + +def twitter_api(message): + req = message.getSolrQueryRequest() + rsp = message.getSolrQueryResponse() + + params = req.getParams() + core = sj.SolrCore.cast_(req.getCore()) + schema = sj.IndexSchema.cast_(req.getSchema()) + updateHandler = sj.UpdateHandler.cast_(core.getUpdateHandler()) + + addCmd = sj.AddUpdateCommand() + addCmd.allowDups = False + addCmd.overwriteCommitted = False + addCmd.overwritePending = False + + + action = params.get("action") + + if action == 'search': + term = params.get("term") + + if not term: + rsp.add("python-message", 'Missing search term!') + return + api = twitter.Api() + docs = api.GetSearch(term) + for d in docs: + d = d.AsDict() + doc = sj.SolrInputDocument(); + doc.addField(schema.getUniqueKeyField().getName(), d['id']) + doc.addField("title", d['text']) + doc.addField("source", d['source']) + doc.addField("user", d['user']['screen_name']) + + addCmd.doc = sj.DocumentBuilder.toDocument(doc, schema) + updateHandler.addDoc(addCmd) + + updateCmd = sj.CommitUpdateCommand(True) # coz for demo we want to see it + updateHandler.commit(updateCmd) + + rsp.add('python-message', 'Found and indexed %s docs for term %s from Twitter' % (len(docs), term)) + + else: + rsp.add("python-message", 'Unknown action: %s' % action) + + + +def montysolr_targets(): + targets = [ + MontySolrTarget('TwitterAPIHandler:twitter_api', twitter_api), + ] + return targets diff --git a/src/python/montysolr/handler.py b/src/python/montysolr/handler.py new file mode 100644 index 000000000..a95c1805e --- /dev/null +++ b/src/python/montysolr/handler.py @@ -0,0 +1,182 @@ +''' +Created on Feb 4, 2011 + +@author: rca +''' + +import logging +import traceback +import sys +import imp +import os + +class Handler(object): + '''Handler objects are responsible for passing messages from the MontySolr + bridge towards the real method that knows what to do with them. Because + the handler is potentially expensive to create, they are always singletons. + + Of course, this is the basic class + ''' + def __init__(self): + self._db = {} + self.log = logging + self.init() + + def init(self): + raise NotImplemented("This method must be overriden") + + + def handle_message(self, message): + '''Receives the messages, finds the target of the message + and calls it, passing it the message instance''' + message.threadInfo("handle_message") + target = self.get_target(message) + if target: + target(message) + + + + def get_target(self, message): + """Must return only a callables that receive + a PythonMessage object""" + recipient = message.getReceiver() + sender = message.getSender() + + message_id = (sender or '') + ':' + recipient + if message_id in self._db: + return self._db[message_id] + else: + self.log.error("Unknown target; sender=%s, recipient=%s, message_id=%s" % + (sender, recipient, message_id)) + + def discover_targets(self, places): + '''Queries the different objects for existence of the callable + called montysolr_targets. 
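+        (For example, discover_targets(['montysolr.examples.bigtest']) imports
+        that module and registers the ':bigtest' target it advertises.)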
If that callable is present, it will + get from it the MontySolrTarget instances, which represent the + message_id and target -- for the (PythonMessage) objects + @var places: (list) must be a list of either strings + example: + 'package.module' + package.module has a method 'montysolr_targets' + '/tmp/package/module/someothermodule.py' + we create a new anonymous module and call its + 'montysolr_targets' method + or the object may be a python objects that has a + callble method 'montysolr_targets' + ''' + if not isinstance(places, list): + raise Exception("The argument must be a list") + + for place in places: + if isinstance(place, basestring): + if os.path.exists(place): # it is a module + try: + obj = self.create_module(place) + self.retrieve_targets(obj) + except: + self.log.error(traceback.format_exc()) + else: + obj = self.import_module(place) + self.retrieve_targets(obj) + else: + self.retrieve_targets(place) + + def import_module(self, module_name): + """Import workflow module + @var workflow: string as python import, eg: merkur.workflow.load_x""" + mod = __import__(module_name) + components = module_name.split('.') + for comp in components[1:]: + mod = getattr(mod, comp) + return mod + + def create_module(self, file, anonymous=False, fail_onerror=True): + """ Initializes module into a separate object (not included in sys) """ + name = 'MontySolrTmpModule<%s>' % os.path.basename(file) + x = imp.new_module(name) + x.__file__ = file + x.__id__ = name + x.__builtins__ = __builtins__ + + # XXX - chdir makes our life difficult, especially when + # one workflow wrap another wf and relative paths are used + # in the config. In such cases, the same relative path can + # point to different locations just because location of the + # workflow (parts) are different + # The reason why I was using chdir is because python had + # troubles to import files that containes non-ascii chars + # in their filenames. That is important for macros, but not + # here. 
+ + # old_cwd = os.getcwd() + + try: + #filedir, filename = os.path.split(file) + #os.chdir(filedir) + if anonymous: + execfile(file, x.__dict__) + else: + #execfile(file, globals(), x.locals()) + exec open(file).read() + except Exception, excp: + if fail_onerror: + raise Exception(excp) + else: + self.log.error(excp) + self.log.error(traceback.format_exc()) + return + return x + + def retrieve_targets(self, obj): + if hasattr(obj, 'montysolr_targets'): + db = self._db + for t in obj.montysolr_targets(): + message_id = t.getMessageId() + target = t.getTarget() + if message_id in db: + raise Exception("The message with id '%s' already has a target:" % + (message_id, db[message_id])) + db[message_id] = target + else: + self.log.error("The %s has no method 'montysolr_targets'" % obj) + + if ':diagnostic_test' not in db: + db[':diagnostic_test'] = self._diagnostic_target() + + + def _diagnostic_target(self): + def diagnostic_target(message): + out = [] + out.append("PYTHONPATH: %s" % "\n ".join(sys.path)) + out.append("PYTHONHOME: %s" % os.getenv("PYTHONHOME")) + out.append("PATH: %s" % os.getenv("PATH")) + out.append("LD_LIBRARY_PATH: %s" % os.getenv("LD_LIBRARY_PATH")) + + out.append('---') + out.append('handler: %s' % self) + + + out.append('---') + out.append('current targets: %s' % " \n".join(map(lambda x: '%s --> %s' % x, self._db.items()))) + + out.append('---') + out.append('running diagnostic tests') + for k,v in self._db.items(): + if 'diagnostic_test' in k and k != ':diagnostic_test': + out.append('===================') + out.append(k) + try: + v(message) + res = message.getResults() + out.append(str(res)) + except: + out.append(traceback.format_exc()) + out.append('===================') + + message.setResults('\n'.join(out)) + + return diagnostic_target + + + + diff --git a/src/python/montysolr/initvm.py b/src/python/montysolr/initvm.py new file mode 100644 index 000000000..b3c09b302 --- /dev/null +++ b/src/python/montysolr/initvm.py @@ -0,0 +1,45 @@ +''' +Created on Jan 13, 2011 + +@author: rca +''' +import os +import sys + +import lucene + +try: + import solr_java + import montysolr_java +except: + _d = os.path.abspath(os.path.dirname(__file__) + '/../../build/dist') + if _d not in sys.path and os.path.exists(_d): + sys.stderr.write('Warning: we add the default folder to sys.path:\n') + sys.stderr.write(_d + '\n') + sys.path.append(_d) + import solr_java + import montysolr_java + + +if os.getenv('MONTYSOLR_DEBUG'): + from invenio import remote_debugger + remote_debugger.start('3') #or override '3|ip:192.168.31.1|port:9999' + + +_jvmargs = '' +if os.getenv('MONTYSOLR_JVMARGS_PYTHON'): + _jvmargs = os.getenv('MONTYSOLR_JVMARGS_PYTHON') + +# the distribution may contain a file that lists the jars that weere used +# for comilation, get the and add them to the classpath +_cp = os.path.join(os.path.dirname(montysolr_java.__file__), 'classpath') +_classpath='' +if os.path.exists(_cp): + _classpath = open(_cp, 'r').read() + +if _jvmargs: + montysolr_java.initVM(lucene.CLASSPATH+os.pathsep+montysolr_java.CLASSPATH+os.pathsep+_classpath, vmargs=_jvmargs) +else: + montysolr_java.initVM(lucene.CLASSPATH+os.pathsep+montysolr_java.CLASSPATH+os.pathsep+_classpath) +lucene.initVM() +solr_java.initVM() diff --git a/src/python/montysolr/inveniopie/__init__.py b/src/python/montysolr/inveniopie/__init__.py new file mode 100644 index 000000000..e69de29bb diff --git a/src/python/montysolr/inveniopie/api_calls.py b/src/python/montysolr/inveniopie/api_calls.py new file mode 100644 index 
000000000..e456a92ee --- /dev/null +++ b/src/python/montysolr/inveniopie/api_calls.py @@ -0,0 +1,114 @@ + + +import os +import thread + +from invenio import search_engine +from invenio import search_engine_summarizer +from invenio import dbquery +from invenio import bibrecord +from invenio.intbitset import intbitset + +from invenio.bibrank_citation_searcher import get_citation_dict + + + +from cStringIO import StringIO + +def dispatch(func_name, *args, **kwargs): + """Dispatches the call to the *local* worker + It returns a tuple (ThreadID, result) + """ + tid = thread.get_ident() + out = globals()[func_name](*args, **kwargs) + return [tid, out] + +def get_recids_changes(last_recid, max_recs=10000): + + search_op = '>' + + if last_recid == -1: + l = list(dbquery.run_sql("SELECT id FROM bibrec ORDER BY creation_date ASC LIMIT 1")) + search_op = '>=' + else: + # let's make sure we have a valid recid (or get the close valid one) + l = list(dbquery.run_sql("SELECT id FROM bibrec WHERE id >= %s LIMIT 1", (last_recid,))) + if not len(l): + return + last_recid = l[0][0] + + # there is not api to get this (at least i haven't found it) + mod_date = search_engine.get_modification_date(last_recid, fmt="%Y-%m-%d %H:%i:%S") + if not mod_date: + return + modified_records = list(dbquery.run_sql("SELECT id,modification_date, creation_date FROM bibrec " + "WHERE modification_date " + search_op + "%s LIMIT %s", (mod_date, max_recs ))) + + out = {'DELETED': [], 'CHANGED': [], 'ADDED': []} + for recid, mod_date, create_date in modified_records: + if mod_date == create_date: + out['ADDED'].append(recid) + else: + rec = search_engine.get_record(recid) + status = bibrecord.record_get_field_value(rec, tag='980', code='c') + if status == 'DELETED': + out['DELETED'].append(recid) + else: + out['CHANGED'].append(recid) + return out + +def citation_summary(recids, of, ln, p, f): + out = StringIO() + x = search_engine_summarizer.summarize_records(recids, of, ln, p, f, out) + if x: + output = x + else: + out.seek(0) + output = out.read() + return output + +def search(q, max_len=25): + offset = 0 + #hits = search_engine.search_pattern_parenthesised(None, q) + hits = search_engine.perform_request_search(None, p=q) + total_matches = len(hits) + + if max_len: + return [offset, hits[:max_len], total_matches] + else: + return [offset, hits, total_matches] + +def sort_and_format(hits, kwargs): + + kwargs = search_engine._cleanup_arguments(**kwargs) + t1 = os.times()[4] + req = StringIO() + kwargs['req'] = req + + if 'hosted_colls_actual_or_potential_results_p' not in kwargs: + kwargs['hosted_colls_actual_or_potential_results_p'] = True # this prevents display of the nearest-term box + + # search stage 4 and 5: intersection with collection universe and sorting/limiting + output = search_engine._collect_sort_display(hits, kwargs=kwargs, **kwargs) + if output is not None: + req.seek(0) + return req.read() + output + + t2 = os.times()[4] + cpu_time = t2 - t1 + kwargs['cpu_time'] = cpu_time + + recids = search_engine._rank_results(kwargs=kwargs, **kwargs) + + if 'of' in kwargs and kwargs['of'].startswith('hc'): + output = citation_summary(intbitset(recids), kwargs['of'], kwargs['ln'], kwargs['p'], kwargs['f']) + if output: + return output + + return recids + + + + +if __name__ == '__main__': + print dispatch("get_recids_changes", 85) diff --git a/src/python/montysolr/inveniopie/multiprocess_api_calls.py b/src/python/montysolr/inveniopie/multiprocess_api_calls.py new file mode 100644 index 000000000..bc8c925ff --- /dev/null +++ 
b/src/python/montysolr/inveniopie/multiprocess_api_calls.py @@ -0,0 +1,127 @@ +''' +Created on Feb 13, 2011 + +@author: rca +''' + +from montysolr.inveniopie import api_calls +from invenio.intbitset import intbitset +import os +import multiprocessing + + +POOL = None + +# ====================================================== +# Multiprocess versions of the api_call methods +# ====================================================== + +def citation_summary_local_pre(args, kwargs): + args[0] = args[0].fastdump() + +def citation_summary_remote_pre(args, kwargs): + args[0] = intbitset().fastload(args[0]) + +def search_remote_post_X(result): + if result: + res = result[1] + if len(res) > 0: + result[1] = intbitset(res).fastdump() + return result + +def search_local_post_X(result): + if result: + res = result[1] + if isinstance(res, basestring): + result[1] = intbitset().fastload(res).tolist() + return result + +def sort_and_format_local_pre(args, kwargs): + args[0] = intbitset(args[0]).fastdump() + +def sort_and_format_remote_pre(args, kwargs): + args[0] = intbitset().fastload(args[0]) + + +# ====================================================== +# Start of the multi-processing +# ====================================================== + +def start_multiprocessing(num_proc=None, default=4): + global POOL + if not num_proc: + try: + num_proc = multiprocessing.cpu_count() + except: + num_proc = default + POOL = multiprocessing.Pool(processes=num_proc) + else: + POOL = multiprocessing.Pool(processes=num_proc) + +# ====================================================== +# Some code to execute on lazy-initialization +# ====================================================== + +from invenio import bibrank_citation_searcher as bcs, \ + search_engine_summarizer as ses + +# initialize citation dictionaries in parent (so that forks have them shared) +bcs.get_citation_dict("citationdict") +bcs.get_citation_dict("reversedict") + + +# ====================================================== +# Dispatching code +# ====================================================== + +def dispatch(func_name, *args, **kwargs): + """Dispatches the call to the remote worker""" + g = globals() + func_name_pre = '%s_local_pre' % func_name + func_name_post = '%s_local_post' % func_name + + if func_name_pre in g: + args = list(args) + g[func_name_pre](args, kwargs) + + handle = POOL.apply_async(_dispatch_remote, args=(func_name, args, kwargs)) + (worker_pid, result) = handle.get() + + if func_name_post in g: + result = g[func_name_post](result) + + return(worker_pid, result) + + + +def _dispatch_remote(func_name, args, kwargs): + """This receives the data on the remote side and calls + the actual function that does the job and returns results. 
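+    Pre/post hooks are resolved from globals() by naming convention --
+    e.g. citation_summary_remote_pre for func_name='citation_summary' --
+    so each call can unpack/repack its arguments (such as intbitset dumps)
+    on the worker side.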
+ """ + + g = globals() + func_name_pre = '%s_remote_pre' % func_name + func_name_post = '%s_remote_post' % func_name + + if func_name_pre in g: + args = list(args) + g[func_name_pre](args, kwargs) + + (thread_id, result) = api_calls.dispatch(func_name, *args, **kwargs) + + if func_name_post in g: + result = g[func_name_post](result) + + return (os.getpid(), result) + + +def _dispatch(func_name, *args, **kwargs): + return api_calls.dispatch(func_name, *args, **kwargs) + + + + + + + + diff --git a/src/python/montysolr/inveniopie/targets.py b/src/python/montysolr/inveniopie/targets.py new file mode 100644 index 000000000..ee1d744d1 --- /dev/null +++ b/src/python/montysolr/inveniopie/targets.py @@ -0,0 +1,314 @@ +''' +Created on Feb 4, 2011 + +@author: rca +''' + +from cStringIO import StringIO +from invenio.intbitset import intbitset +from montysolr.initvm import montysolr_java as sj +from montysolr.utils import MontySolrTarget +import logging +import os +import montysolr.inveniopie.multiprocess_api_calls as api_calls + +import time + + + +def format_search_results(message): + req = message.getSolrQueryRequest() + rsp = message.getSolrQueryResponse() + recids = message.getParamArray_int("recids") + start = time.time() + message.threadInfo("start: citation_summary") + c_time = time.time() + iset = intbitset(recids) + message.threadInfo("int[] converted to intbitset in: %s, size=%s" % (time.time() - c_time, len(iset))) + (wid, (output)) = api_calls.dispatch('citation_summary', iset, 'hcs', 'en', '', '') + message.threadInfo("end: citation_summary pid=%s, finished in %s" % (wid, time.time() - start)) + rsp.add("inv_response", output) + +def format_search_results_local(message): + req = message.getSolrQueryRequest() + rsp = message.getSolrQueryResponse() + + recids = message.getParamArray_int("recids") + out = StringIO() + # TODO: pass the ln and other arguments + (wid, (output,)) = api_calls.dispatch("sumarize_records", intbitset(recids), 'hcs', 'en', '', '', out) + if not output: + out.seek(0) + output = out.read() + del out + rsp.add("inv_response", output) + + + +def perform_request_search_bitset(message): + query = unicode(message.getParam("query")).encode("utf8") + #offset, hit_dump, total_matches, searcher_id = searching.multiprocess_search(query, 0) + (wid, (offset, hits, total_matches)) = api_calls.dispatch('search', query, 0) + #message.threadInfo("query=%s, total_hits=%s" % (query, total_matches)) + message.setResults(sj.JArray_byte(intbitset(hits).fastdump())) + +def perform_request_search_ints(message): + query = unicode(message.getParam("query")).encode("utf8") + #offset, hit_list, total_matches, searcher_id = searching.multiprocess_search(query, 0) + (wid, (offset, hits, total_matches)) = api_calls.dispatch('search', query, 0) + if len(hits): + message.setResults(sj.JArray_int(hits)) + else: + message.setResults(sj.JArray_int([])) + + message.setParam("total", total_matches) + +def handle_request_body(message): + req = message.getSolrQueryRequest() + rsp = message.getSolrQueryResponse() + params = message.getParams() + + start = time.time() + q = params.get("q").encode('utf8') #TODO: sj.CommonParams.Q is overshadowed by solr.util.CommonParams or is not wrapped at all + + #offset, hit_list, total_matches, searcher_id = searching.multiprocess_search(str(q)) + (wid, (offset, hit_list, total_matches)) = api_calls.dispatch('search', str(q)) + + t = time.time() - start + #message.threadInfo("Query took: %s s. 
hits=%s and was executed by: %s" % (t, total_matches, searcher_id)) + + reader = req.getSearcher().getReader(); + + # translate invenio recids into lucene docids + transl_table = sj.DictionaryCache.INSTANCE.getTranslationCache(reader, "id") + res = [] + for h in hit_list: + if transl_table.containsKey(h): + res.append(transl_table.get(h)) + + #logging.error(transl_table.size()) + + ds = sj.DocSlice(offset,len(res),res, None, total_matches, 1.0) + rsp.add("response", ds) + +def get_recids_changes(message): + """Retrieves the recids of the last changed documents""" + last_recid = int(sj.Integer.cast_(message.getParam("last_recid")).intValue()) + max_records = 10000 + if message.getParam('max_records'): + mr = int(sj.Integer.cast_(message.getParam("max_records")).intValue()) + if mr < 100001: + max_records = mr + (wid, results) = api_calls.dispatch("get_recids_changes", last_recid, max_records) + if results: + out = sj.HashMap().of_(sj.String, sj.JArray_int) + for k,v in results.items(): + out.put(k, sj.JArray_int(v)) + message.setResults(out) + + + + +def get_citation_dict(message): + + dictname = str(message.getParam('dictname')) + hm = sj.HashMap.cast_(message.getParam('result')) + + # we will call the local module (not dispatched remotely) + cd = api_calls._dispatch("get_citation_dict", dictname) + message.threadInfo("%s: %s" % (dictname, str(len(cd)))) + if cd: + #hm = sj.HashMap().of_(sj.Integer, sj.JArray_int) + + message.threadInfo('creating hashmap') + for k,v in cd.items(): + j_array = sj.JArray_int(v) + hm.put(int(k), j_array) + message.threadInfo('finished') + +def workout_field_value(message): + sender = str(message.getSender()) + if sender in 'PythonTextField': + value = message.getParam('externalVal') + if not value: + return + value = str(value) + #print 'searching for', value + vals = {} + #ret = value.lower() + ' Hey! 
' + ret = None + if value: + parts = value.split('|') + for p in parts: + k, v = p.split(':', 1) + if v[0] == '[' and v[-1] == ']': + v = v[1:-1] + vals[k] = v + if 'arxiv_id' in vals and 'src_dir' in vals: + #print vals + dirs = vals['src_dir'].split(',') + ax = vals['arxiv_id'].split(',')[0].strip() + if ax.find('/') > -1: + arx_parts = ax.split('/') #math-gt/060889 + fname = ''.join(arx_parts) + topdir = arx_parts[1][:4] + elif ax.find('.') > -1: + arx_parts = ax.replace('arXiv:', '').split('.', 1) #arXiv:0712.0712 + topdir = arx_parts[0] + fname = '.'.join(arx_parts) + else: + return ret + + if len(arx_parts) == 2: + + + for d in dirs: + #print (d, topdir, fname + '.txt') + newname = os.path.join(d, topdir, fname + '.txt') + if os.path.exists(newname): + fo = open('/tmp/solr-index.txt', 'a') + fo.write(newname + '\n') + fo.close() + ret = open(newname, 'r').read() + if ret: + message.setResults(ret.decode('utf8')) + else: + fo = open('/tmp/solr-not-found.txt', 'a') + fo.write('%s\t%s\n' % (newname, value)) + fo.close() + break + + + +def sort_and_format(message): + req = message.getSolrQueryRequest() + rsp = message.getSolrQueryResponse() + + recids = intbitset(message.getParamArray_int("recids")) + kwargs = sj.HashMap.cast_(message.getParam('kwargs')) + + kws = {} + + kset = kwargs.keySet().toArray() + vset = kwargs.values().toArray() + max_size = len(vset) + i = 0 + while i < max_size: + v = str(vset[i]) + if v[0:1] in ["'", '[', '{'] : + try: + v = eval(v) + except: + pass + kws[str(kset[i])] = v + i += 1 + + start = time.time() + message.threadInfo("start: citation_summary") + c_time = time.time() + + message.threadInfo("int[] converted to intbitset in: %s, size=%s" % (time.time() - c_time, len(recids))) + (wid, (output)) = api_calls._dispatch('sort_and_format', recids, kws) + + message.threadInfo("end: citation_summary pid=%s, finished in %s" % (wid, time.time() - start)) + + if isinstance(output, list): + message.setResults(sj.JArray_int(output)) + message.setParam("rtype", "int") + else: + message.setResults(output) + message.setParam("rtype", "string") + +def diagnostic_test(message): + out = [] + message.setParam("query", "boson") + perform_request_search_ints(message) + res = sj.JArray_int.cast_(message.getResults()) + out.append('Search for "boson" retrieved: %s hits' % len(res) ) + out.append('Total hits: %s' % sj.Integer.cast_(message.getParam("total"))) + message.setResults('\n'.join(out)) + +''' +def _get_solr(): + # HACK: this should be lazy loaded and in a separate module + from montysolr.python_bridge import JVMBridge + if not hasattr(sj, '__server') and not JVMBridge.hasObj("solr.server"): + initializer = sj.CoreContainer.Initializer() + conf = {'solr_home': '/x/dev/workspace/sandbox/montysolr/example/solr', + 'data_dir': '/x/dev/workspace/sandbox/montysolr/example/solr/data'} + + sj.System.setProperty('solr.solr.home', conf['solr_home']) + sj.System.setProperty('solr.data.dir', conf['data_dir']) + core_container = initializer.initialize() + server = sj.EmbeddedSolrServer(core_container, "") + JVMBridge.setObj("solr.server", server) + JVMBridge.setObj("solr.container", core_container) + sj.__server = server + return server + return sj.__server + return JVMBridge.getObj("solr.server") + + +def search_unit_solr(message): + """Called from search_engine""" + from montysolr.python_bridge import JVMBridge + + sj = JVMBridge.getObjMontySolr() + server = _get_solr() + q = str(message.getParam("query")) #String + + query = sj.SolrQuery() + query.setQuery(q) + 
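# note: restricting fl to "id" keeps the SolrJ response small; the values + # come back as zero-padded strings (e.g. 002800500) and are converted to + # plain ints in the loop below +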
query.setParam("fl", ("id",)) + query_response = server.query(query) + + head_part = query_response.getResponseHeader() + res_part = query_response.getResults() + qtime = query_response.getQTime() + etime = query_response.getElapsedTime() + nf = res_part.getNumFound() + + a_size = res_part.size() + res = sj.JArray_int(a_size) + res_part = res_part.toArray() + if a_size: + #it = res_part.iterator() + #i = 0 + #while it.hasNext(): + for i in xrange(a_size): + #x = it.next() + doc = sj.SolrDocument.cast_(res_part[i]) + # we must do this gymnastics because of the tests + s = str(doc.getFieldValue("id")) # 002800500 + if s[0] == '0': + s = s[3:] # 800500 + res[i] = int(s) + #i += 1 + + message.setParam("QTime", qtime) + message.setParam("ElapsedTime", etime) + message.setResults(res) +''' + +def montysolr_targets(): + targets = [ + MontySolrTarget('PythonTextField:workout_field_value', workout_field_value), + MontySolrTarget('handleRequestBody', handle_request_body), + MontySolrTarget('rca.python.solr.handler.InvenioHandler:handleRequestBody', handle_request_body), + MontySolrTarget('CitationQuery:get_citation_dict', get_citation_dict), + MontySolrTarget('InvenioQuery:perform_request_search_ints', perform_request_search_ints), + MontySolrTarget('InvenioQuery:perform_request_search_bitset', perform_request_search_bitset), + MontySolrTarget('InvenioFormatter:format_search_results', format_search_results), + MontySolrTarget('InvenioKeepRecidUpdated:get_recids_changes', get_recids_changes), + MontySolrTarget('InvenioFormatter:sort_and_format', sort_and_format), + MontySolrTarget('Invenio:diagnostic_test', diagnostic_test), + ] + + + # start multiprocessing with that many processes in the pool + if hasattr(api_calls, "start_multiprocessing"): + if os.getenv('MONTYSOLR_MAX_WORKERS'): + api_calls.start_multiprocessing(int(os.getenv('MONTYSOLR_MAX_WORKERS'))) + else: + api_calls.start_multiprocessing() + return targets diff --git a/src/python/montysolr/java_bridge.py b/src/python/montysolr/java_bridge.py new file mode 100644 index 000000000..3e11e08b9 --- /dev/null +++ b/src/python/montysolr/java_bridge.py @@ -0,0 +1,37 @@ + +from montysolr.initvm import montysolr_java as sj + + + +''' +Created on Jan 13, 2011 + +@author: rca +''' + +DEBUG = False + +class SimpleBridge(sj.MontySolrBridge): + + def __init__(self, handler=None): + if not handler: + import montysolr.sequential_handler as handler_module + handler = handler_module.Handler + super(SimpleBridge, self).__init__() + self._handler = handler + self._handler_module = handler.__module__ + + def receive_message(self, message): + if DEBUG: + # HACK: to remove this whole block + req = message.getSolrQueryRequest() + if req: + params = req.getParams() + if params.get("reload"): + message.threadInfo('Reloading python!', self._handler_module) + self._handler_module = reload(self._handler_module) + self._handler = self._handler_module.Handler + self._handler.handle_message(message) + + def set_handler(self, handler): + self._handler = handler diff --git a/src/python/montysolr/python_bridge.py b/src/python/montysolr/python_bridge.py new file mode 100644 index 000000000..4179471a0 --- /dev/null +++ b/src/python/montysolr/python_bridge.py @@ -0,0 +1,76 @@ + +''' +Created on Feb 7, 2011 + +@author: rca + +This class serves the same purpose as its java counterpart +MontySolrVM - but it is a singleton that contains a reference +to the handlers - we don't need to instantiate it everytime +as is the case when calling from Java. 
Therefore we can +put the VM and the bridge parts together. + +It intentionally has the java-style method names +''' + +from montysolr import initvm +import sys + +class JVMBridge(object): + + def __new__(cls, *args): + if hasattr(initvm.montysolr_java, '_JVMBridge_SINGLETON'): + return getattr(initvm.montysolr_java, '_JVMBridge_SINGLETON') + else: + instance = super(JVMBridge, cls).__new__(cls) + setattr(initvm.montysolr_java, '_JVMBridge_SINGLETON', instance) + instance._store = {} + return instance + + def __del__(self): + #if 'solr.container' in self._store: + # self._store['solr.container'].shutdown() + sys.stderr.write('!!!!!!! - Bridge deleted') + + def __init__(self, handler=None): + + if not handler: #FIXME: we must make the handler configurable outside the code + import montysolr.sequential_handler as handler + handler = handler.Handler + self._handler = handler + self._lucene = initvm.lucene + self._solr = initvm.solr_java + self._sj = initvm.montysolr_java + + + def sendMessage(self, message): + self._sj.getVMEnv().attachCurrentThread() + self._handler.handle_message(message) + + def setHandler(self, handler): + self._handler = handler + + def createMessage(self, receiver): + self._sj.getVMEnv().attachCurrentThread() + return self._sj.PythonMessage(receiver) + + def getObjMontySolr(self): + return self._sj + + def getObjLucene(self): + return self._lucene + + def getObjSolr(self): + return self._solr + + def setObj(self, name, value): + self._store[name] = value + + def getObj(self, name): + return self._store[name] + + def hasObj(self, name): + return name in self._store + +JVMBridge = JVMBridge() + \ No newline at end of file diff --git a/src/python/montysolr/sequential_handler.py b/src/python/montysolr/sequential_handler.py new file mode 100644 index 000000000..423cc3f04 --- /dev/null +++ b/src/python/montysolr/sequential_handler.py @@ -0,0 +1,19 @@ +''' +Created on Feb 4, 2011 + +@author: rca +''' + +from montysolr import handler + + +class Handler(handler.Handler): + '''Simple handler that just calls the methods + ''' + + def init(self): + self.discover_targets(['montysolr.inveniopie.targets', 'montysolr.examples.twitter_test']) + + + +Handler = Handler() diff --git a/src/python/montysolr/tests/__init__.py b/src/python/montysolr/tests/__init__.py new file mode 100644 index 000000000..e69de29bb diff --git a/src/python/montysolr/tests/run_jetty_servlet.py b/src/python/montysolr/tests/run_jetty_servlet.py new file mode 100644 index 000000000..06fede44d --- /dev/null +++ b/src/python/montysolr/tests/run_jetty_servlet.py @@ -0,0 +1,33 @@ +''' +Created on Jan 10, 2011 + +@author: rca +''' + +import montysolr_java +import urllib2 + +def run(): + cp = 
'/x/dev/workspace/apache-solr-1.4.1/lib/commons-codec-1.3.jar:/x/dev/workspace/apache-solr-1.4.1/lib/commons-csv-1.0-SNAPSHOT-r609327.jar:/x/dev/workspace/apache-solr-1.4.1/lib/commons-fileupload-1.2.1.jar:/x/dev/workspace/apache-solr-1.4.1/lib/commons-httpclient-3.1.jar:/x/dev/workspace/apache-solr-1.4.1/lib/commons-io-1.4.jar:/x/dev/workspace/apache-solr-1.4.1/lib/easymock.jar:/x/dev/workspace/apache-solr-1.4.1/lib/geronimo-stax-api_1.0_spec-1.0.1.jar:/x/dev/workspace/apache-solr-1.4.1/lib/jcl-over-slf4j-1.5.5.jar:/x/dev/workspace/apache-solr-1.4.1/lib/junit-4.3.jar:/x/dev/workspace/apache-solr-1.4.1/lib/lucene-analyzers-2.9.3.jar:/x/dev/workspace/apache-solr-1.4.1/lib/lucene-core-2.9.3.jar:/x/dev/workspace/apache-solr-1.4.1/lib/lucene-highlighter-2.9.3.jar:/x/dev/workspace/apache-solr-1.4.1/lib/lucene-memory-2.9.3.jar:/x/dev/workspace/apache-solr-1.4.1/lib/lucene-misc-2.9.3.jar:/x/dev/workspace/apache-solr-1.4.1/lib/lucene-queries-2.9.3.jar:/x/dev/workspace/apache-solr-1.4.1/lib/lucene-snowball-2.9.3.jar:/x/dev/workspace/apache-solr-1.4.1/lib/lucene-spellchecker-2.9.3.jar:/x/dev/workspace/apache-solr-1.4.1/lib/servlet-api-2.4.jar:/x/dev/workspace/apache-solr-1.4.1/lib/slf4j-api-1.5.5.jar:/x/dev/workspace/apache-solr-1.4.1/lib/slf4j-jdk14-1.5.5.jar:/x/dev/workspace/apache-solr-1.4.1/lib/wstx-asl-3.2.7.jar:/x/dev/workspace/apache-solr-1.4.1/dist/apache-solr-cell-1.4.2-dev.jar:/x/dev/workspace/apache-solr-1.4.1/dist/apache-solr-cell-docs-1.4.2-dev.jar:/x/dev/workspace/apache-solr-1.4.1/dist/apache-solr-clustering-1.4.2-dev.jar:/x/dev/workspace/apache-solr-1.4.1/dist/apache-solr-clustering-docs-1.4.2-dev.jar:/x/dev/workspace/apache-solr-1.4.1/dist/apache-solr-core-1.4.2-dev.jar:/x/dev/workspace/apache-solr-1.4.1/dist/apache-solr-core-docs-1.4.2-dev.jar:/x/dev/workspace/apache-solr-1.4.1/dist/apache-solr-dataimporthandler-1.4.2-dev.jar:/x/dev/workspace/apache-solr-1.4.1/dist/apache-solr-dataimporthandler-docs-1.4.2-dev.jar:/x/dev/workspace/apache-solr-1.4.1/dist/apache-solr-dataimporthandler-extras-1.4.2-dev.jar:/x/dev/workspace/apache-solr-1.4.1/dist/apache-solr-solrj-1.4.2-dev.jar:/x/dev/workspace/apache-solr-1.4.1/dist/apache-solr-solrj-docs-1.4.2-dev.jar:/x/dev/workspace/apache-solr-1.4.1/dist/apache-solr-velocity-docs-1.4.2-dev.jar:/x/dev/workspace/apache-solr-1.4.1/dist/solrj-lib/commons-codec-1.3.jar:/x/dev/workspace/apache-solr-1.4.1/dist/solrj-lib/commons-httpclient-3.1.jar:/x/dev/workspace/apache-solr-1.4.1/dist/solrj-lib/commons-io-1.4.jar:/x/dev/workspace/apache-solr-1.4.1/dist/solrj-lib/geronimo-stax-api_1.0_spec-1.0.1.jar:/x/dev/workspace/apache-solr-1.4.1/dist/solrj-lib/jcl-over-slf4j-1.5.5.jar:/x/dev/workspace/apache-solr-1.4.1/dist/solrj-lib/slf4j-api-1.5.5.jar:/x/dev/workspace/apache-solr-1.4.1/dist/solrj-lib/wstx-asl-3.2.7.jar:/x/dev/workspace/apache-solr-1.4.1/example/lib/jetty-6.1.3.jar:/x/dev/workspace/apache-solr-1.4.1/example/lib/jetty-util-6.1.3.jar:/x/dev/workspace/apache-solr-1.4.1/example/lib/jsp-2.1/ant-1.6.5.jar:/x/dev/workspace/apache-solr-1.4.1/example/lib/jsp-2.1/core-3.1.1.jar:/x/dev/workspace/apache-solr-1.4.1/example/lib/jsp-2.1/jsp-2.1.jar:/x/dev/workspace/apache-solr-1.4.1/example/lib/jsp-2.1/jsp-api-2.1.jar:/x/dev/workspace/apache-solr-1.4.1/example/lib/servlet-api-2.5-6.1.3.jar' + montysolr_java.initVM(montysolr_java.CLASSPATH + ':' + cp) + montysolr_java.System.setProperty('solr.solr.home', '/x/dev/workspace/test-solr/solr') + montysolr_java.System.setProperty('solr.data.dir', '/x/dev/workspace/test-solr/solr/data') + # 
montysolr_java.JettyRunner.main(()) + jetty = montysolr_java.JettyRunner() + jetty.start() + + + page = urllib2.urlopen('http://localhost:8983/test/select/?q=*%3A*&version=2.2&start=0&rows=10&indent=on&qt=recids').read() + print page + assert page.find('"numFound">6000') > -1 + start = page.index('name="docs">')+12 + docs = page[start:page.index("", start)].strip() + results = montysolr_java.ResultsCacheSingleton.getInstance().getResults(int(docs)) + + print 'this is printed by python but comes from java' + print results + + jetty.stop() + +if __name__ == '__main__': + run() \ No newline at end of file diff --git a/test/python/montysolr/tests/unittest_run_jetty.py b/test/python/montysolr/tests/unittest_run_jetty.py new file mode 100644 index 000000000..77613d639 --- /dev/null +++ b/test/python/montysolr/tests/unittest_run_jetty.py @@ -0,0 +1,28 @@ +''' +Created on Jan 10, 2011 + +@author: rca +''' +import unittest +import montysolr_java + + +class Test(unittest.TestCase): + + + def setUp(self): + montysolr_java.initVM() + + + def tearDown(self): + pass + + + def test_jetty(self): + '''Tests if we are able to start jetty inside python and request results from the index''' + montysolr_java.JettyRunner.main(('solr.home', '/x/dev/workspace/test-solr/solr')) + + +if __name__ == "__main__": + #import sys;sys.argv = ['', 'Test.testName'] + unittest.main() \ No newline at end of file diff --git a/src/python/montysolr/utils.py b/src/python/montysolr/utils.py new file mode 100644 index 000000000..51f6cfc25 --- /dev/null +++ b/src/python/montysolr/utils.py @@ -0,0 +1,14 @@ +''' +Created on Feb 4, 2011 + +@author: rca +''' + +class MontySolrTarget(object): + def __init__(self, message_id, callable): + self._message_id = message_id + self._target = callable + def getTarget(self): + return self._target + def getMessageId(self): + return self._message_id \ No newline at end of file diff --git a/src/python/utils/attach_fulltexts.py b/src/python/utils/attach_fulltexts.py new file mode 100644 index 000000000..7b6ee7a97 --- /dev/null +++ b/src/python/utils/attach_fulltexts.py @@ -0,0 +1,186 @@ +''' +Created on Feb 3, 2011 + +Searches the folders starting at point X, finds all files matching a certain +pattern and copies them to a different folder.
+ +@author: rca +''' +import sys +import os +import shutil +import logging as log +import subprocess + + +from invenio import search_engine, bibdocfile, bibdocfilecli +log.root.setLevel(5) + +MSG_AFTER = 100 +BATCH_SIZE = 1000 + + +def run(ids_file, src_dir, mode='append', suffix=".pdf"): + """Traverse the source folder, search for files, when found, + copies them to other place - into the properly ordered file-system + """ + + assert os.path.exists(src_dir) is True + + + if mode not in ('append', 'replace'): + raise Exception('Unknown mode ' + mode) + + ids_map, has_filepath = get_prescription(ids_file) + total_counter = [0, 0, 0] + + if has_filepath: + process_afs_folders(total_counter, ids_map, src_dir, append=(mode == 'append')) + else: + process_harvests(total_counter, ids_map, src_dir, append=(mode == 'append')) + + + print 'uploaded: %s, skipped: %s, not-found: %s' % (total_counter[0], total_counter[1], total_counter[2]) + + +def process_harvests(total_counter, ids_map, src_dir, suffix='.pdf', append=True): + files = filter(lambda x: x not in ['.', '..'], os.listdir(src_dir)) + i = 0 + ffts = {} + for name, (recid, arxiv_id) in ids_map.items(): + fullname = name + suffix + if fullname in files: + fullpath = os.path.join(src_dir, fullname) + res = prepare_ffts(ffts, recid, arxiv_id, fullpath, append=append) + if res == False: + total_counter[1] += 1 + else: + total_counter[0] += 1 + else: + total_counter[2] += 1 + + i += 1 + if i % MSG_AFTER == 0: + print 'processed %s out of %s' % (i, len(ids_map)) + if len(ffts) % BATCH_SIZE == 0: + upload_file(ffts) + + if len(ffts): + upload_file(ffts) + + + +def process_afs_folders(total_counter, ids_map, src_dir, suffix='.pdf', append=True): + i = 0 + ffts = [] + for name, (recid, topdir, arxiv_id) in ids_map.items(): + fullpath = os.path.join(src_dir, topdir, name + suffix) + if not os.path.exists(fullpath): + log.error("The file %s not exists" % fullpath) + total_counter[2] += 1 + continue + + res = prepare_ffts(ffts, recid, arxiv_id, fullpath, append=append) + if res == False: + total_counter[1] += 1 + else: + total_counter[0] += 1 + + i += 1 + if i % MSG_AFTER == 0: + print 'processed %s out of %s' % (i, len(ids_map)) + + if len(ffts) % BATCH_SIZE == 0: + upload_file(ffts) + + if len(ffts): + upload_file(ffts) + + + +def get_prescription(ids_file): + ids_map = {} + fi = open(ids_file, 'r') + # read the first line (find it contains filepath) + elems = fi.readline().strip().split('\t') + has_filepath = False + if len(elems) > 2: + has_filepth = True + fi.seek(0) + + if has_filepath: + for line in fi: + line = line.strip() + if not line: + continue + recid, arxiv_id, path = line.split('\t') + name , topdir = split_arxivid(arxiv_id) + if name and topdir: + ids_map[name] = (recid, topdir, arxiv_id) + else: + for line in fi: + line = line.strip() + if not line: + continue + recid, arxiv_id = line.split() + ids_map[arxiv_id.replace('/', '_')] = (recid, arxiv_id) + return (ids_map, has_filepath) + + + +def prepare_ffts(ffts, recid, docname, fullpath, doctype='arXiv', append=False, format='.pdf', options=['HIDDEN']): + recid = int(recid) + docname = 'arXiv:%s' % docname.replace('/', '_') + bibdoc = bibdocfile.BibRecDocs(recid) + + res = subprocess.Popen(['file', fullpath], stdout=subprocess.PIPE).communicate()[0] + if not ('PDF' in res or 'pdf' in res.lower()): + return False + + # check it is an existing recod + if len(bibdoc.display()) and (bibdoc.has_docname_p(docname) and append is not False): + return False + + ffts[recid] = [{ + 'docname' : 
docname, + 'format' : format, + 'url' : fullpath, + 'doctype': doctype, + 'options': options, + }] + + +def upload_file(ffts, append=True): + try: + sys.argv.append('--yes-i-know') + out = bibdocfilecli.bibupload_ffts(ffts, append=append, debug=False) + finally: + sys.argv.pop(-1) + ffts.clear() + + +def split_arxivid(arxiv_id, err=True): + name = topdir = None + if arxiv_id.find('/') > -1: + arx_parts = arxiv_id.split('/') #math-gt/060889 + name = ''.join(arx_parts) + topdir = arx_parts[1][:4] + elif arxiv_id.find('.') > -1: + arx_parts = arxiv_id.split('.', 1) #0712.0712 + topdir = arx_parts[0] + name = ''.join(arx_parts) + else: + if err: + print 'error parsing:', arxiv_id + return name, topdir + + + +if __name__ == '__main__': + if len(sys.argv) == 1 or not os.path.exists(sys.argv[1]): + try: + sys.argv[1] = int(sys.argv[1]) + except: + exit('Usage: find_fulltexts.py ') + print sys.argv[1:] + run(*sys.argv[1:]) diff --git a/src/python/utils/compress_top_folders.py b/src/python/utils/compress_top_folders.py new file mode 100644 index 000000000..7b5fc8feb --- /dev/null +++ b/src/python/utils/compress_top_folders.py @@ -0,0 +1,28 @@ +import sys +import os + +COMPRESS_CMD = 'tar -czf "%s.tgz" "%s"' +REMOVE_CMD = 'rm -fR "%s"' + +def run(src_dir, delete=False): + old_dir = os.getcwd() + os.chdir(src_dir) + files = os.listdir(src_dir) + for f in files: + if os.path.isdir(f): + #fullname = os.path.abspath(os.path.join(src_dir, f)) + fullname = f + cmd = COMPRESS_CMD % (fullname, fullname) + os.system(cmd) + if delete: + cmd = REMOVE_CMD % fullname + os.system(cmd) + print f + os.chdir(old_dir) + +if __name__ == '__main__': + if len(sys.argv) < 1: + exit('usage: program ') + if len(sys.argv) == 2: + sys.argv.append(False) + run(*sys.argv[1:]) \ No newline at end of file diff --git a/src/python/utils/copy_top_folders.py b/src/python/utils/copy_top_folders.py new file mode 100644 index 000000000..28ee2ab4f --- /dev/null +++ b/src/python/utils/copy_top_folders.py @@ -0,0 +1,31 @@ +import sys +import os + +COPY_CMD = 'cp -fR %s %s' +REMOVE_CMD = 'rm -fR %s' + +def run(src_dir, tgt_dir, delete=False): + files = os.listdir(src_dir) + tgt = os.path.abspath(tgt_dir) + for f in files: + if f != '.' 
or f != '..': + fullname = os.path.abspath(os.path.join(src_dir, f)) + + if delete: + existing = os.path.abspath(os.path.join(src_dir, f)) + if os.path.exists(existing): + print 'remove ' + existing + cmd = REMOVE_CMD % existing + os.system(cmd) + + cmd = COPY_CMD % (fullname, tgt) + os.system(cmd) + + print f + +if __name__ == '__main__': + if len(sys.argv) < 1: + exit('usage: program ') + if len(sys.argv) == 3: + sys.argv.append(False) + run(*sys.argv[1:]) \ No newline at end of file diff --git a/src/python/utils/decompress_top_folders.py b/src/python/utils/decompress_top_folders.py new file mode 100644 index 000000000..a0a156e09 --- /dev/null +++ b/src/python/utils/decompress_top_folders.py @@ -0,0 +1,28 @@ +import sys +import os + +DECOMPRESS_CMD = 'tar -xf %s' +REMOVE_CMD = 'rm -fR %s' + +def run(src_dir, delete=False): + old_dir = os.getcwd() + os.chdir(src_dir) + files = os.listdir(src_dir) + for f in files: + if os.path.isfile(f): + #fullname = os.path.abspath(os.path.join(src_dir, f)) + fullname = f + cmd = DECOMPRESS_CMD % (fullname,) + status = os.system(cmd) + if status == 0 and delete: + cmd = REMOVE_CMD % fullname + os.system(cmd) + print f + os.chdir(old_dir) + +if __name__ == '__main__': + if len(sys.argv) < 2: + exit('usage: program ') + if len(sys.argv) == 2: + sys.argv.append(False) + run(*sys.argv[1:]) \ No newline at end of file diff --git a/src/python/utils/dump_dicts.py b/src/python/utils/dump_dicts.py new file mode 100644 index 000000000..ba71b43d4 --- /dev/null +++ b/src/python/utils/dump_dicts.py @@ -0,0 +1,27 @@ +import os +import cPickle +from invenio import bibrank_citation_searcher as bcs +from invenio import intbitset + +'''Utility to dump cached citation dictionary into a filesystem''' + +basedir = '/opt/rchyla/citdicts' + +cit_names = ['citationdict', + 'reversedict', 'selfcitdict', 'selfcitedbydict'] + +for dname in cit_names: + print 'loading: %s' % dname + cd = bcs.get_citation_dict(dname) # load the dictionary + f = os.path.join(basedir, dname) # dump it out + fo = open(f, 'wb') + print 'dumping of %s entries started' % len(cd) + if isinstance(cd, intbitset): + cPickle.dump(cd.fastdump(), fo) + else: + cPickle.dump(cd, fo) + fo.close() + print 'dumped %s into %s' % (dname, f) + + + diff --git a/src/python/utils/extract_queries.py b/src/python/utils/extract_queries.py new file mode 100644 index 000000000..e533b2d1f --- /dev/null +++ b/src/python/utils/extract_queries.py @@ -0,0 +1,194 @@ +import time +import os +import sys +import re + +_d = '/opt/invenio/lib/python' +if _d not in sys.path: + sys.path.insert(0, _d) + +from invenio import search_engine_query_parser + +invenio_qparser = search_engine_query_parser.SearchQueryParenthesisedParser() +invenio_qconverter = search_engine_query_parser.SpiresToInvenioSyntaxConverter() + +def run(searchlog_file): + """This will extract some known query patterns form the + the search logs + """ + + # find the last recid, that we indexed + out_filepath = os.path.join(os.path.dirname(searchlog_file), 'searchlog-%s' % os.path.split(str(searchlog_file))[1]) + + fi = open(searchlog_file, 'r') + fo = open(out_filepath, 'w') + + i = 0 + for line in fi: + # 20110101004545#ss#find a trnka, jaroslav##HEP#16 + line = line.strip() + parts = line.split('#') + if len(parts) != 6: + continue + fdate, fform, fvalue, ffield, fcollection, fresults = parts + q = convert_query(fvalue, ffield) + if q: + fo.write('%s\n' % q) + + i+= 1 + if 1 % 1000 == 0: + print i + continue + fo.close() + + if 0: + val = None + if ffield: #the field was 
specified + if ffield == 'author': + val = format_author(fvalue) + elif ffield == 'exactauthor': + val = format_exactauthor(fvalue) + elif ffield == 'fulltext': + val = format_fulltext(fvalue) + elif ffield == 'journal': + val = format_journal(fvalue) + elif ffield == 'title': + val = format_title(fvalue) + elif ffield == 'keyword': + val = format_keyword(fvalue) + elif ffield == 'year': + val = format_year(fvalue) + + if val: + fo.write('%s\n' % val) + +_regexes = [] +def get_query_regexes(): + global _regexes + if _regexes: + return _regexes + _regexes.extend([ + (re.compile(r'001\s*\:'), 'recid:'), + (re.compile(r'980\s*\:'), 'status:'), + + (re.compile(r'100__u\:'), 'affiliation:'), + (re.compile(r'700__u\:'), 'affiliation:'), + (re.compile(r'902__a\:'), 'affiliation:'), + + (re.compile(r'100\s*\:'), 'author:'), + (re.compile(r'700\s*\:'), 'author:'), + + (re.compile(r'710\s*\:'), 'corporation:'), + + (re.compile(r'773__a\:'), 'doi:'), + (re.compile(r'773\s*\:'), 'publication:'), + + (re.compile(r'037*\:'), 'reportnumber:'), + (re.compile(r'245_*\s*\:'), 'title:'), + + (re.compile(r'035__z\:'), 'other_id:'), + #(re.compile(r':\s*\*'), ':'), #we don't allow asterisk at the start + ] + + ) + inspire_fields = { + 'eprint':'reportnumber', + 'bb':'reportnumber', + 'bbn':'reportnumber', + 'bull':'reportnumber', + 'r':'reportnumber', + 'rn':'reportnumber', + 'cn':'collaboration', + 'a':'author', + 'au':'author', + 'name':'author', + 'ea':'exactauthor', + 'exp':'experiment', + 'expno':'experiment', + 'sd':'experiment', + 'se':'experiment', + 'j':'publication', #was journal + 'kw':'keyword', + 'keywords':'keyword', + 'k':'keyword', + 'au':'author', + 'ti':'title', + 't':'title', + 'irn':'970__a', + 'institution':'affiliation', + 'inst':'affiliation', + 'affil':'affiliation', + 'aff':'affiliation', + 'af':'affiliation', + '902_*.*': 'affiliation', + '695__a':'topic', + 'tp':'695__a', + 'dk':'695__a', #'topic':'695__a','tp':'695__a','dk':'695__a', + 'date':'year', + 'd':'year', + 'date-added':'datecreated', + 'da':'datecreated', + 'dadd':'datecreated', + 'date-updated':'datemodified', + 'dupd':'datemodified', + 'du':'datemodified' + } + for k, v in inspire_fields.items(): + _regexes.append( + (re.compile('\W%s:' % k), '%s:' % v) + ) + return _regexes + + + +def convert_query(p, field=None): + # if the pattern uses SPIRES search syntax, convert it to Invenio syntax + if invenio_qconverter.is_applicable(p): + p = invenio_qconverter.convert_query(p) + p = p.strip() + + # do some basic transformations + _transregex = get_query_regexes() + + field = field.strip() + if field and p[0:len(field)] != field: + p = '%s:%s' % (field, p) + + for regex, replacement in _transregex: + p = regex.sub(replacement, p) + + return p + + +def format_author(s): + s = s.replace('find author ', '').replace('find a ').replace('f k ') + return 'author:(%s)' % s.strip() + +def format_exactauthor(s): + return 'author:"%s"' % format_author(s) + +def format_fulltext(s): + return 'text:%s' % s.strip() + +def format_journal(s): + s = s.replace('find journal ').replace('find f ').replace('f j ') + return 'publication:%s' % s + +def format_title(s): + s = s.replace('find title ').replace('find t ').replace('f t ') + return 'title:%s' % s + +def format_keyword(s): + s = s.replace('find keyword ').replace('find k ').replace('f k ') + return 'title:%s' % s + +def format_year(s): + if s.isalnum(): + return '' + return 'date:(%s)' % s.strip() + + +if __name__ == '__main__': + if len(sys.argv) == 1 or not os.path.exists(sys.argv[1]): + 
exit('Usage: extract_queries.py ') + run(sys.argv[1]) diff --git a/src/python/utils/find_fulltexts.py b/src/python/utils/find_fulltexts.py new file mode 100644 index 000000000..74c371d69 --- /dev/null +++ b/src/python/utils/find_fulltexts.py @@ -0,0 +1,159 @@ +''' +Created on Feb 3, 2011 + +Search the folders starting at point X, finds all the files of a certain +pattern and copy them to a different folders. + +@author: rca +''' +import sys +import os +import shutil + +import logging as log + +log.root.setLevel(5) + + +def run(idsfile, src_dir, tgt_dir, mode, extensions): + """Traverse the source folder, search for files, when found, + copies them to other place - into the properly ordered file-system + """ + + assert os.path.exists(src_dir) is True + assert os.path.exists(tgt_dir) is True + + extensions = extensions.split(',') + assert len(extensions) > 0 + + print 'we will search for these extensions:' + '|'.join(extensions) + + stack = {} + stack['found'] = open(os.path.join('/tmp', 'found.txt'), 'w') + stack['not-found'] = open(os.path.join('/tmp', 'not-found.txt'), 'w') + + if mode not in ('copy', '#copy', 'count'): + raise Exception('Unknown mode ' + mode) + + ids_map = {} + fi = open(idsfile, 'r') + for line in fi: + line = line.strip() + if not line: + continue + recid, arxiv_id = line.split('\t') + name , topdir = split_arxivid(arxiv_id) + if name and topdir: + ids_map[name] = (recid, topdir, line) + + total_counter = [0] + created_target_dirs = [] + + def copy_func(arg, dirname, fnames): + + log.info('inside: %s' % dirname) + to_copy = {} + for f in fnames: + fullpath = os.path.join(dirname, f) + basename, ext = os.path.splitext(f) + if basename and ext: + _e = ext[1:].lower() #remove the leading dot + if extensions and _e in extensions: + name, topdir = split_arxivid(basename, err=False) + found = False + if basename in ids_map: + name = basename + topdir = ids_map[basename][1] + found = True + elif name in ids_map: + name = name + topdir = ids_map[name][1] + found = True + + if found: + if mode[0] == '#': + continue + else: + if mode[0] == '#': + # we are looking for files not in the list + topdir = os.path.split(dirname)[1] + else: + continue + + target = os.path.join(tgt_dir, _e, topdir) + + if _e == 'txt' or _e == 'utf8': + to_copy[name] = (os.path.join(dirname, f), target, f) + elif _e == 'pdf': + if name not in to_copy: # txt files have preference + to_copy[name] = (os.path.join(dirname, f), target, f) + log.info('identified: %s candidates' % len(to_copy)) + + if mode == 'copy' or mode == '#copy': + for name, (source, target, filename) in to_copy.items(): + if target not in created_target_dirs and not os.path.isdir(target): + os.makedirs(target) + created_target_dirs.append(target) + try: + target_file = os.path.join(target, filename) + if not os.path.exists(target_file): + shutil.copy(source, target) + del ids_map[name] + total_counter[0] += 1 + except Exception, msg: + del to_copy[name] + if os.path.isdir(source): + pass + else: + print msg + if len(to_copy): + print 'copied: %s (in total so far: %s)' % (len(to_copy), total_counter[0]) + elif mode == 'count': + for name, (source, target, filename) in to_copy.items(): + record = ids_map.pop(name) + stack['found'].write('%s\t%s\n' % (record[2], source)) + + print '%s: %s' % (dirname, len(to_copy)) + + try: + os.path.walk(src_dir, copy_func, None) + except KeyboardInterrupt: + pass + + print 'found: %s, not-found: %s' % (total_counter[0], len(ids_map)) + fo = stack['not-found'] + for k, (recid, topdir, line) in 
ids_map.items(): + fo.write('%s\n' % line) + fo.close() + stack['found'].close() + + target_f = os.path.join(tgt_dir, 'found.txt') + target_nf = os.path.join(tgt_dir, 'not-found.txt') + + shutil.copyfile(stack['found'].name, target_f) + shutil.copyfile(stack['found'].name, target_nf) + + +def split_arxivid(arxiv_id, err=True): + name = topdir = None + if arxiv_id.find('/') > -1: + arx_parts = arxiv_id.split('/') #math-gt/060889 + name = ''.join(arx_parts) + topdir = arx_parts[1][:4] + elif arxiv_id.find('.') > -1: + arx_parts = arxiv_id.split('.', 1) #0712.0712 + topdir = arx_parts[0] + name = ''.join(arx_parts) + else: + if err: + print 'error parsing:', arxiv_id + return name, topdir + +if __name__ == '__main__': + if len(sys.argv) == 1 or not os.path.exists(sys.argv[1]): + try: + sys.argv[1] = int(sys.argv[1]) + except: + exit('Usage: find_fulltexts.py %s<' % recid) > -1: + + if newdir not in existing_dirs: + if not os.path.exists(newdir): + os.makedirs(newdir) + existing_dirs[newdir] = True + + fo = open(newfile, 'w') + fo.write(text) + fo.close() + recid_file.seek(0) + recid_file.write(recid) + recid_file.flush() + except: + print 'error getting: %s' % u + continue + i += 1 + + + + +if __name__ == '__main__': + if len(sys.argv) == 1 or not os.path.exists(sys.argv[1]): + try: + sys.argv[1] = int(sys.argv[1]) + except: + exit('Usage: run_index.py ') + + if not os.path.exists(str(sys.argv[1])): + try: + x = int(sys.argv[1]) + print 'Harvesting a range: 0-%s' % x + run(x) + except: + run(sys.argv[1]) + else: + run(sys.argv[1]) + + + diff --git a/src/python/utils/import_dicts.py b/src/python/utils/import_dicts.py new file mode 100644 index 000000000..aef0a7537 --- /dev/null +++ b/src/python/utils/import_dicts.py @@ -0,0 +1,54 @@ + +import sys, os +import cPickle +from invenio import bibrank_citation_indexer as bci +from invenio import intbitset +import time + +'''Utility to import pickled cached citation dictionary into a database. +WARNING! This will replace your database entries!!! ''' + +if len(sys.argv) > 1: + basedir = sys.argv[1] +else: + basedir = '/opt/rchyla/citdicts' + +if not os.path.exists(basedir) and not os.path.isdir(basedir): + raise Exception('%s is not a folder' % basedir) + +cit_names = ['citationdict', + 'reversedict', 'selfcitdict', 'selfcitedbydict'] + +def insert_into_cit_db(dic, name): + """an aux thing to avoid repeating code""" + ndate = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime()) + s = bci.serialize_via_marshal(dic) + #check that this column really exists + testres = bci.run_sql("select object_name from rnkCITATIONDATA where object_name = %s", + (name,)) + if testres: + bci.run_sql("UPDATE rnkCITATIONDATA SET object_value = %s where object_name = %s", + (s, name)) + else: + #there was no entry for name, let's force.. + bci.run_sql("INSERT INTO rnkCITATIONDATA(object_name,object_value) values (%s,%s)", + (name,s)) + bci.run_sql("UPDATE rnkCITATIONDATA SET last_updated = %s where object_name = %s", + (ndate,name)) + + +for dname in cit_names: + # load the dictionary + f = os.path.join(basedir, dname) + if os.path.exists(f): + print 'loading %s...' % dname + fi = open(f, 'rb') + cd = cPickle.load(fi) + if isinstance(cd, basestring): + cd = intbitset.intbitset().fastload(cd) + fi.close() + print 'loaded %s made of %s entries' % (dname, len(cd)) + print 'saving into db...' + insert_into_cit_db(cd, dname) + print 'saved!' 
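+ + +# A possible sanity check (just a sketch, not part of the original flow; +# it reuses cit_names and basedir defined above): reload each dump written +# by dump_dicts.py and print its size, mirroring the import loop. Call it +# manually if needed. +def verify_dumps(): + for dname in cit_names: + f = os.path.join(basedir, dname) + if not os.path.exists(f): + continue + fi = open(f, 'rb') + cd = cPickle.load(fi) + fi.close() + if isinstance(cd, basestring): + cd = intbitset.intbitset().fastload(cd) + print 'verify: %s has %s entries' % (dname, len(cd))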
+ diff --git a/src/python/utils/run_index.py b/src/python/utils/run_index.py new file mode 100644 index 000000000..a87487431 --- /dev/null +++ b/src/python/utils/run_index.py @@ -0,0 +1,135 @@ +import urllib2 +import time +import os +import sys +import subprocess + +SOLRURL = 'http://localhost:8984/solr/waiting-dataimport?command=full-import&url=%(docaddr)s&%(extraparam)s' +DOCADDR = 'file://%(datadir)s/metadata/%(topdir)s/%(recid)s' +EXTRAPARAM = 'dirs=%(datadir)s/fulltexts/arXiv' +COMMIT = 1000000 + +BASKET_SIZE = 1000 +CHECK_IN_ADVANCE = 1000 +EXTRACT_CMD = 'tar -C %(datadir)s/metadata -xf %(datadir)s/metadata/%(topdir)s.tgz' +REMOVE_CMD = 'rm -fR %(datadir)s/metadata/%(topdir)s' + +def run(recids_file, datadir): + """This will call the indexer - passing recid on every call, + if the passed in argument is an integer, then we can work + without recids, just using the range (0, recids) + """ + commit_after = COMMIT + + # find the last recid, that we indexed + recid_filepath = os.path.join(datadir, 'last-recid-%s' % os.path.split(str(recids_file))[1]) + #os.remove(recid_filepath) + last_recid = 1 + if os.path.exists(recid_filepath): + recid_file = open(recid_filepath, 'r') + x = recid_file.read().strip() + recid_file.close() + if x: + last_recid = x + + + if isinstance(recids_file, int): + try: + recids = range(int(last_recid), recids_file) + except: + raise Exception('If you want to index a range, last-recid must be a number, not: "%s"' % last_recid) + else: + recids = [] + for r in open(recids_file, 'r'): + recids.append(r.strip()) + if str(last_recid) in recids: + _i = recids.index(str(last_recid)) + recids = recids[_i+1:] + + # we will write the last-id into this file + recid_file = open(recid_filepath, 'w') + + + start_time = time.time() + last_extracted_topdir = None + last_id = recids[-1] + + _for_removal = [] + i = 0 + _success = _failure = 0 + params = {'datadir': datadir} + params['extraparam'] = EXTRAPARAM % params + for recid in recids: + params['topdir'] = int(int(recid) / BASKET_SIZE) + params['recid'] = recid + params['docaddr'] = DOCADDR % params + + u = SOLRURL % params + if i % commit_after == 0 or recid == last_id: + u += '&commit=true' + print u + + # look at the files ahead and if necessary + # extract the archive (without waiting) + if EXTRACT_CMD: + if datadir and last_extracted_topdir is None: + args = EXTRACT_CMD % params + pid = subprocess.Popen(args.split()).pid + if REMOVE_CMD: + _for_removal.insert(0, REMOVE_CMD % params) + last_extracted_topdir = params['topdir'] + _n = i+CHECK_IN_ADVANCE + if _n < len(recids): + next_topdir = int(int(recids[_n]) / BASKET_SIZE) + if next_topdir != last_extracted_topdir: + old_topdir = params['topdir'] + params['topdir'] = next_topdir + args = EXTRACT_CMD % params + # run extraction + pid = subprocess.Popen(args.split()).pid + last_extracted_topdir = next_topdir + if REMOVE_CMD: + _for_removal.insert(0, REMOVE_CMD % params) + if len(_for_removal) > 2: + #remove the folder again + subprocess.Popen(_for_removal.pop().split()).pid + params['topdir'] = old_topdir + + while True: + text = urllib2.urlopen(u).read() + #print text + if text.find('>idle -1: + if text.find('Rolledback') > -1: + print 'not indexed: %s/%s' % (params['topdir'], params['recid']) + _failure += 1 + else: + _success += 1 + break + else: + print 'sleeping' + time.sleep(.1) + i += 1 + total_time = time.time() - start_time + avg_time = total_time / i + if i % 100 == 0: + print '%s.\t%s\t%s/%s\t%s\t%s' % (i, recid, _success, _failure, '%5.3f h.' 
% (total_time / 3600), avg_time) + + recid_file.seek(0) + recid_file.write(str(recid)) + recid_file.flush() + + print '%s.\t%s\t%s/%s\t%s\t%s' % (i, recid, _success, _failure, '%5.3f h.' % (total_time / 3600), avg_time) + recid_file.close() + for r in _for_removal: + subprocess.Popen(r.split()).pid + + + +if __name__ == '__main__': + if len(sys.argv) < 2 or not (os.path.exists(sys.argv[1]) and os.path.exists(sys.argv[2])): + try: + sys.argv[1] = int(sys.argv[1]) + except: + exit('Usage: run_index.py ') + + run(*sys.argv[1:]) diff --git a/test/java/invenio/montysolr/MontySolrTestCase.java b/test/java/invenio/montysolr/MontySolrTestCase.java new file mode 100644 index 000000000..f04727058 --- /dev/null +++ b/test/java/invenio/montysolr/MontySolrTestCase.java @@ -0,0 +1,96 @@ +package invenio.montysolr; + +import invenio.montysolr.jni.PythonBridge; +import invenio.montysolr.jni.MontySolrVM; + +import java.io.PrintStream; +import java.io.File; +import java.util.Arrays; +import java.util.Iterator; +import java.util.Random; + +import junit.framework.TestCase; + + + +/** + * Base class for all Lucene unit tests. + *

+ * Currently the + * only added functionality over JUnit's TestCase is + * asserting that no unhandled exceptions occurred in + * threads launched by ConcurrentMergeScheduler and asserting sane + * FieldCache usage at the moment of tearDown. +

+ *

+ * If you + * override either setUp() or + * tearDown() in your unit test, make sure you + * call super.setUp() and + * super.tearDown(). +
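* A subclass will typically add its own fixtures only after the call to + * super.setUp(), since the setUp() above is what starts the embedded + * Python VM (MontySolrVM). +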

+ * @see #assertSaneFieldCaches + */ +public abstract class MontySolrTestCase extends TestCase { + + protected MontySolrVM VM; + + public MontySolrTestCase() { + super(); + } + + public MontySolrTestCase(String name) { + super(name); + } + + protected void setUp() throws Exception { + MontySolrVM.INSTANCE.start("montysolr_java"); + this.VM = MontySolrVM.INSTANCE; + super.setUp(); + + } + + protected PythonBridge getBridge() { + return MontySolrVM.INSTANCE.getBridge(); + } + + protected String getTestLabel() { + return getClass().getName() + "." + getName(); + } + + protected void tearDown() throws Exception { + super.tearDown(); + } + + + /** + * Convenience method for logging an iterator. + * @param label String logged before/after the items in the iterator + * @param iter Each next() is toString()ed and logged on its own line. If iter is null this is logged differently than an empty iterator. + * @param stream Stream to log messages to. + */ + public static void dumpIterator(String label, Iterator iter, + PrintStream stream) { + stream.println("*** BEGIN "+label+" ***"); + if (null == iter) { + stream.println(" ... NULL ..."); + } else { + while (iter.hasNext()) { + stream.println(iter.next().toString()); + } + } + stream.println("*** END "+label+" ***"); + } + + /** + * Convenience method for logging an array. Wraps the array in an iterator and delegates. + * @see #dumpIterator(String,Iterator,PrintStream) + */ + public static void dumpArray(String label, Object[] objs, + PrintStream stream) { + Iterator iter = (null == objs) ? null : Arrays.asList(objs).iterator(); + dumpIterator(label, iter, stream); + } + +} + diff --git a/test/java/org/apache/solr/search/TestInvenioQueryParser.java b/test/java/org/apache/solr/search/TestInvenioQueryParser.java new file mode 100644 index 000000000..e843fb366 --- /dev/null +++ b/test/java/org/apache/solr/search/TestInvenioQueryParser.java @@ -0,0 +1,312 @@ +package org.apache.solr.search; + +/** + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License.
+ */ + +import org.apache.lucene.queryParser.ParseException; +import org.apache.lucene.search.Query; +import org.apache.solr.common.params.ModifiableSolrParams; +import org.apache.solr.common.params.SolrParams; +import org.apache.solr.core.SolrCore; +import org.apache.solr.util.AbstractSolrTestCase; + +import org.apache.solr.request.SolrQueryRequest; +import org.apache.solr.schema.IndexSchema; +import org.apache.solr.search.InvenioQParserPlugin; + +public class TestInvenioQueryParser extends AbstractSolrTestCase { + + public String getSchemaFile() { return "schema.xml"; } + public String getSolrConfigFile() { return "solrconfig.xml"; } + public String getCoreName() { return "basic"; } + + + public void setUp() throws Exception { + // if you override setUp or tearDown, you better call + // the super classes version + super.setUp(); + } + public void tearDown() throws Exception { + // if you override setUp or tearDown, you better call + // the super classes version + super.tearDown(); + } + + + public void testQueryTypes() { + assertU(adoc("id","1", "v_t","Hello Dude")); + assertU(adoc("id","2", "v_t","Hello Yonik")); + assertU(adoc("id","3", "v_s","{!literal}")); + assertU(adoc("id","4", "v_s","other stuff")); + assertU(adoc("id","5", "v_f","3.14159")); + assertU(adoc("id","6", "v_f","8983")); + assertU(adoc("id","7", "v_f","1.5")); + assertU(adoc("id","8", "v_ti","5")); + assertU(commit()); + + Object[] arr = new Object[] { + "id",999.0 + ,"v_s","wow dude" + ,"v_t","wow" + ,"v_ti",-1 + ,"v_tis",-1 + ,"v_tl",-1234567891234567890L + ,"v_tls",-1234567891234567890L + ,"v_tf",-2.0f + ,"v_tfs",-2.0f + ,"v_td",-2.0 + ,"v_tds",-2.0 + ,"v_tdt","2000-05-10T01:01:01Z" + ,"v_tdts","2002-08-26T01:01:01Z" + }; + + + + SolrCore core = h.getCore(); + SolrQueryRequest req = lrf.makeRequest("q", "*:*"); + SolrParams params = req.getParams(); + ModifiableSolrParams localParams = new ModifiableSolrParams(); + localParams.add("iq.mode", "maxinv"); + localParams.add("iq.xfields", "fulltext,abstract"); + localParams.add("iq.syntax", "lucene"); + + ModifiableSolrParams localParams_2 = new ModifiableSolrParams(); + localParams_2.add("iq.mode", "maxinv"); + localParams_2.add("iq.xfields", "fulltext,abstract"); + localParams_2.add("iq.syntax", "invenio"); + + assertTrue("core is null and it shouldn't be", core != null); + QParserPlugin parserPlugin = core.getQueryPlugin(InvenioQParserPlugin.NAME); + + + String[] queries = { + + "author:hawking and affiliation:\"cambridge u., damtp\" and year:2004->9999", + " + +9999>", + + // test cases + "hey |muon", + " ", + "hey |\"muon muon\"", + " <\"muon muon\">", + "\"and or not AND OR NOT\" and phrase", + "<\"and or not and or not\"> +", + + + // http://inspirebeta.net/help/search-tips + + // find a hawking and aff "cambridge u., damtp" and date > 2004 + "author:hawking and affiliation:\"cambridge u., damtp\" and year:2004->9999", + " + +9999>", + "thomas crewther quark 2002", + " <2002>", + //find j phys.rev.lett.,62,1825 + "journal:phys.rev.lett.,62,1825", + "", + //find j "Phys.Rev.Lett.,105*" or j Phys.Lett. and a thomas + "journal:\"Phys.Rev.Lett.,105*\" or journal:Phys.Lett. 
and author:thomas", + " +", + // find d 1997-11-18 + "year:1997-11-18", + "", + // find da 2011-01-26 and title neutrino* + "datecreated:2011-01-26 and title:neutrino*", + " +", + //find eprint arxiv:0711.2908 or arxiv:0705.4298 or eprint hep-ph/0504227 + "reportnumber:arxiv:0711.2908 or arxiv:0705.4298 or reportnumber:hep-ph/0504227", + "", + //find a unruh or t cauchy not t problem and primarch gr-qc + "author:unruh or title:cauchy not title:problem and 037__c:gr-qc", + " - +<037__c|gr-qc>", + //find a m albrow and j phys.rev.lett. and t quark* cited:200->99999 + "(author:\"albrow, m*\") and journal:phys.rev.lett. and (title:quark* and title:cited:200->99999)", + "", + //find c Phys.Rev.Lett.,28,1421 or c arXiv:0711.4556 + "reference:Phys.Rev.Lett.,28,1421 or reference:arXiv:0711.4556", + "", + //find c "Phys.Rev.Lett.,*" + "reference:\"Phys.Rev.Lett.,*\"", + "", + //citedby:hep-th/9711200 author:cvetic + "citedby:hep-th/9711200 author:cvetic", + " ", + "author:parke citedby:author:witten", + "", + "refersto:hep-th/9711200 title:nucl*", + " ", + "author:witten refersto:author:\"parke, s j\"", + "", + "refersto:author:parke or refersto:author:lykken author:witten", + "", + "affiliation:\"oxford u.\" refersto:title:muon*", + "", + // find af "harvard u." + "affiliation:\"harvard u.\"", + "", + + // http://inspirebeta.net/help/search-guide + + "\"Ellis, J\"", + "<\"ellis, j\">", + "'muon decay'", + "<'muon decay'>", + "'Ellis, J'", + "<'ellis, j'>", + "ellis +muon", + " +", + "ellis muon", + " ", + "ellis and muon", + " +", + "ellis -muon", + " -", + "ellis not muon", + " -", + "ellis |muon", + " ", + "ellis or muon", + " ", + "muon or kaon and ellis", + " +", + "ellis and muon or kaon", + " + ", + "muon or kaon and ellis -decay", + " + -", + "(gravity OR supergravity) AND (ellis OR perelstein)", + "( ) +( )", + "C++", + "", + "O'Shea", + "", + "$e^{+}e^{-}$", + "<$e^{+}e^{-}$>", + "hep-ph/0204133", + "", + "BlaCK hOlEs", + " ", + "пушкин", + "<пушкин>", + "muon*", + "", + "CERN-TH*31", + "", + "a*", + "", + "\"Neutrino mass*\"", + "<\"neutrino mass*\">", + "author:ellis", + "", + "author:ellis title:muon*", + " ", + "experiment:NA60 year:2001", + " ", + "title:/^E.*s$/", + "", + "author:/^Ellis, (J|John)$/", + "", + "title:/dense ([^ l]* )?matter/", + "", //TODO: remove the quotation marks + "collection:PREPRINT -year:/^[0-9]{4}([\\?\\-]|\\-[0-9]{4})?$/", + " -", + "collection:PREPRINT -year:/^[[:digit:]]{4}([\\?\\-]|\\-[[:digit:]]{4})?$/", + " -", + "muon decay year:1983->1992", + " 1992>", + "author:\"Ellis, J\"->\"Ellis, Qqq\"", + "\"ellis, qqq\">", + "refersto:reportnumber:hep-th/0201100", + "", + "citedby:author:klebanov", + "", + "refersto:author:\"Klebanov, I\"", + "", + "refersto:keyword:gravitino", + "", + "author:klebanov AND citedby:author:papadimitriou NOT refersto:author:papadimitriou", + "", + "refersto:/author:\"Klebanov, I\" title:O(N)/", + "", + "author:ellis -muon* +abstract:'dense quark matter' year:200*", + " - +abstract:\"dense quark matter\"~2 ", + "author:ellis -muon* +title:'dense quark matter' year:200*", + " - + ", + "higgs or reference:higgs or fulltext:higgs", + " fulltext:higgs", + "author:lin fulltext:Schwarzschild fulltext:AdS reference:\"Adv. Theor. Math. Phys.\"", + " fulltext:schwarzschild fulltext:ads ", + "author:/^Ellis, (J|John)$/", + "", + "fulltext:e-", + "fulltext:e-", + "muon or fulltext:muon and author:ellis", + " fulltext:muon +", + "reference:hep-ph/0103062", + "", + "reference:giddings reference:ross reference:\"Phys. 
Rev., D\" reference:61 reference:2000", + " ", + "standard model -author:ellis reference:ellis", + " - ", + + }; + + Query q; + Query q2; + QParser qp; + QParser qp2; + int success = 0; + for (int i=0; i" % (sys.argv[0],)) + run(sys.argv[1]) \ No newline at end of file diff --git a/test/python/test_examples_twitter.py b/test/python/test_examples_twitter.py new file mode 100644 index 000000000..993621e7a --- /dev/null +++ b/test/python/test_examples_twitter.py @@ -0,0 +1,47 @@ + +import unittest +from montysolr_testcase import MontySolrTestCase, sj + +import os +import time +import sys + + +class Test(MontySolrTestCase): + + def setUp(self): + self.setSolrHome(os.path.join(self.getBaseDir(), 'examples/twitter/solr')) + self.setDataDir(os.path.join(self.getBaseDir(), 'examples/twitter/solr/data')) + self.setHandler(self.loadHandler('montysolr.examples.twitter_test')) + MontySolrTestCase.setUp(self) + + + def test_twitter(self): + '''Index docs fetched by twitter api''' + + hm = sj.HashMap().of_(sj.String, sj.String) + hm.put('action', 'search') + hm.put('term', 'Feb17') + params = sj.MapSolrParams(hm) + + req = sj.LocalSolrQueryRequest(self.core, params) + rsp = sj.SolrQueryResponse() + + message = sj.PythonMessage('twitter_api') \ + .setSender('TwitterAPIHandler') \ + .setSolrQueryResponse(rsp) \ + .setSolrQueryRequest(req) + + self.bridge.receive_message(message) + + res = sj.JArray_int.cast_(message.getResults()) + res = list(res) + assert len(res) == size + assert res[0] == 0 + assert res[5] == 5 + + + +if __name__ == "__main__": + #import sys;sys.argv = ['', 'Test.test_get_recids_changes4'] + unittest.main() diff --git a/test/python/test_invenio_queries.py b/test/python/test_invenio_queries.py new file mode 100644 index 000000000..f9f5cba9e --- /dev/null +++ b/test/python/test_invenio_queries.py @@ -0,0 +1,55 @@ + +import sys +import os +import solr + +from invenio import search_engine + +s = solr.SolrConnection('http://localhost:8983/solr') +s.select = solr.SearchHandler(s, '/invenio') + + +def run(query_file): + + fi = open(query_file, 'r') + queries = filter(len, map(lambda x: x.strip(), fi.readlines())) + fi.close() + + success = failure = error = 0 + for q in queries: + print '---' + print q + inv_res = len(search_engine.perform_request_search(None, p=q)) + msg = 'NO' + inv_query = '\t\t' + try: + (solr_res, inv_query) = ask_solr(q) + except Exception, e: + solr_res = None + #print e + msg = 'ER' + error += 1 + failure -= 1 + + print inv_query + if inv_res == solr_res: + success += 1 + msg = 'OK' + else: + failure += 1 + + + print "%s invenio=%s montysolr=%s" % (msg, inv_res, solr_res) + + print 'total=%s, success/mismatch/error=%s/%s/%s' % (len(queries), success, failure, error) + +def ask_solr(q): + response = s.query(q, fields=['id']) + num_found = response.numFound + inv_query = response.inv_query + return (num_found, inv_query) + +if __name__ == '__main__': + if len(sys.argv) < 2: + exit('Usage: ') + run(*sys.argv[1:]) diff --git a/test/python/testing_targets.py b/test/python/testing_targets.py new file mode 100644 index 000000000..34f045954 --- /dev/null +++ b/test/python/testing_targets.py @@ -0,0 +1,51 @@ +''' +Created on Feb 4, 2011 + +@author: rca +''' + +from montysolr.utils import MontySolrTarget +import os +from montysolr import initvm + +sj = initvm.montysolr_java + + +from invenio import bibrank_citation_searcher as bcs + + +def handle_request_body(message): + rsp = message.getSolrQueryResponse() + rsp.add("python", 'says hello!') + +def receive_field_value(message): 
+ val = message.getParam('value') + val = sj.JArray_string.cast_(val) + val.append('z') + +def get_citation_dict(message): + dictname = sj.String.cast_(message.getParam('dictname')) + cd = bcs.get_citation_dict(dictname) + if cd: + hm = sj.HashMap().of_(sj.String, sj.JArray_int) + + for k,v in cd.items(): + j_array = sj.JArray_int(v) + hm.put(k, j_array) + + message.put('result', hm) + + + + + + +def montysolr_targets(): + targets = [ + MontySolrTarget('receive_field_value', receive_field_value), + MontySolrTarget('handleRequestBody', handle_request_body), + MontySolrTarget('CitationQuery:get_citation_dict', get_citation_dict), + ] + + return targets + \ No newline at end of file diff --git a/test/python/tmp_run_solr.py b/test/python/tmp_run_solr.py new file mode 100644 index 000000000..bba2c4ec9 --- /dev/null +++ b/test/python/tmp_run_solr.py @@ -0,0 +1,40 @@ + +def run(): + import unittest + import unittest_solr + #fo = open('/tmp/solr-test', 'w') + #suite = unittest.TestLoader().loadTestsFromTestCase(unittest_solr.Test) + #unittest.TextTestRunner(verbosity=2).run(suite) + #fo.write('OK!') + #fo.close() + + sj = unittest_solr.sj + + initializer = sj.CoreContainer.Initializer() + conf = {'solr_home': '/x/dev/workspace/sandbox/montysolr/example/solr', + 'data_dir': '/x/dev/workspace/sandbox/montysolr/example/solr/data-test'} + + sj.System.setProperty('solr.solr.home', conf['solr_home']) + sj.System.setProperty('solr.data.dir', conf['data_dir']) + core_container = initializer.initialize() + server = sj.EmbeddedSolrServer(core_container, "") + + solr_config = sj.SolrConfig() + index_schema = sj.IndexSchema(solr_config, None, None) + q = sj.QueryParsing.parseQuery('*:*', index_schema) + + # create a query + query = sj.SolrQuery() + query.setQuery('*:*') + + query_response = server.query(query) + + head_part = query_response.getResponseHeader() + res_part = query_response.getResults() + qtime = query_response.getQTime() + etime = query_response.getElapsedTime() + + print qtime, etime, head_part, res_part + +if __name__ == '__main__': + run() \ No newline at end of file diff --git a/test/python/unittest_bridge.py b/test/python/unittest_bridge.py new file mode 100644 index 000000000..4707ab5bd --- /dev/null +++ b/test/python/unittest_bridge.py @@ -0,0 +1,44 @@ +''' +Created on Feb 4, 2011 + +@author: rca +''' +import unittest +from montysolr import initvm, java_bridge +from montysolr import handler +import sys +import os + +sj = java_bridge.sj + +class TestHandler(handler.Handler): + def init(self): + self.discover_targets([os.path.join(os.path.basedir(__file__), 'testing_targets.py')]) + +class Test(unittest.TestCase): + + + def setUp(self): + self.bridge = java_bridge.SimpleBridge() + + def tearDown(self): + pass + + def test_basic(self): + b = self.bridge + + + assert b.testReturnString().find('java is printing') > -1 + assert b.getName() is None # the bridge has name only when started from java + + message = sj.PythonMessage('receive_field_value').setParam('value', sj.JArray_string(['x','z'])) + b.receive_message(message) + ret = message.getParam('result') + if ret: + r = list(ret) + assert r == ['x', 'z'] + + +if __name__ == "__main__": + #import sys;sys.argv = ['', 'Test.testName'] + unittest.main() \ No newline at end of file diff --git a/test/python/unittest_examples_bigtest.py b/test/python/unittest_examples_bigtest.py new file mode 100644 index 000000000..e832f6029 --- /dev/null +++ b/test/python/unittest_examples_bigtest.py @@ -0,0 +1,214 @@ +''' +Created on May 11, 2011 + +@author: 
+
+To run this unittest, you will need a lot of memory (if size is big).
+You can do this:
+
+export MONTYSOLR_JVMARGS_PYTHON='-Xmx800m -d32'
+python unittest_examples_bigtest.py Test.test_bigtest01
+'''
+#@PydevCodeAnalysisIgnore
+
+import unittest
+from montysolr import initvm, java_bridge, handler
+import os
+import time
+import sys
+
+sj = initvm.montysolr_java
+
+class TestHandler(handler.Handler):
+    def init(self):
+        #_b = os.path.join(os.path.dirname(__file__), 'testing_targets.py')
+        #self.discover_targets([_b])
+        self.discover_targets(['montysolr.examples.bigtest'])
+test_handler = TestHandler()
+
+class Test(unittest.TestCase):
+
+
+    def setUp(self):
+        self.size = 5000000
+        sj.System.setProperty('solr.solr.home', os.path.join(os.path.abspath(os.path.join(os.path.dirname(initvm.__file__), '../../..')), 'examples/twitter/solr'))
+        self.bridge = java_bridge.SimpleBridge(test_handler)
+        self.core = sj.SolrCore.getSolrCore()
+
+    def tearDown(self):
+        #self.core.close()
+        pass
+
+
+
+    def test_bigtest01(self):
+        '''Get int[]'''
+
+        #req = sj.QueryRequest()
+        size = self.size
+        hm = sj.HashMap().of_(sj.String, sj.String)
+        hm.put('action', 'recids_int')
+        hm.put('size', str(size))
+        params = sj.MapSolrParams(hm)
+        req = sj.LocalSolrQueryRequest(self.core, params)
+
+        rsp = sj.SolrQueryResponse()
+
+        message = sj.PythonMessage('bigtest') \
+            .setSolrQueryResponse(rsp) \
+            .setSolrQueryRequest(req)
+
+        self.bridge.receive_message(message)
+
+        res = sj.JArray_int.cast_(message.getResults())
+        res = list(res)
+        assert len(res) == size
+        assert res[0] == 0
+        assert res[5] == 5
+
+    def test_bigtest02(self):
+        '''Get String[]'''
+
+        #req = sj.QueryRequest()
+        size = self.size
+        hm = sj.HashMap().of_(sj.String, sj.String)
+        hm.put('action', 'recids_str')
+        hm.put('size', str(size))
+        params = sj.MapSolrParams(hm)
+        req = sj.LocalSolrQueryRequest(self.core, params)
+
+        rsp = sj.SolrQueryResponse()
+
+        message = sj.PythonMessage('bigtest') \
+            .setSolrQueryResponse(rsp) \
+            .setSolrQueryRequest(req)
+
+        self.bridge.receive_message(message)
+
+        res = sj.JArray_string.cast_(message.getResults())
+        assert len(res) == size
+        assert res[0] == '0'
+        assert res[5] == '5'
+
+
+    def test_bigtest03(self):
+        '''Get recids_hm_strstr'''
+
+        #req = sj.QueryRequest()
+        size = self.size
+        hm = sj.HashMap().of_(sj.String, sj.String)
+        hm.put('action', 'recids_hm_strstr')
+        hm.put('size', str(size))
+        params = sj.MapSolrParams(hm)
+        req = sj.LocalSolrQueryRequest(self.core, params)
+
+        rsp = sj.SolrQueryResponse()
+
+        message = sj.PythonMessage('bigtest') \
+            .setSolrQueryResponse(rsp) \
+            .setSolrQueryRequest(req)
+
+        self.bridge.receive_message(message)
+
+        res = sj.HashMap.cast_(message.getResults())
+        assert res.size() == size
+        assert str(sj.String.cast_(res.get('0'))) == '0'
+        assert str(sj.String.cast_(res.get('5'))) == '5'
+
+
+    def test_bigtest04(self):
+        '''Get recids_hm_strint'''
+
+        #req = sj.QueryRequest()
+        size = self.size
+        hm = sj.HashMap().of_(sj.String, sj.String)
+        hm.put('action', 'recids_hm_strint')
+        hm.put('size', str(size))
+        params = sj.MapSolrParams(hm)
+        req = sj.LocalSolrQueryRequest(self.core, params)
+
+        rsp = sj.SolrQueryResponse()
+
+        message = sj.PythonMessage('bigtest') \
+            .setSolrQueryResponse(rsp) \
+            .setSolrQueryRequest(req)
+
+        self.bridge.receive_message(message)
+
+        res = sj.HashMap.cast_(message.getResults())
+        assert res.size() == size
+        assert sj.Integer.cast_(res.get('0')).equals(0)
+        assert sj.Integer.cast_(res.get('5')).equals(5)
+
+
+    def test_bigtest05(self):
+        '''Get
recids_hm_intint''' + + #req = sj.QueryRequest() + size = self.size + hm = sj.HashMap().of_(sj.String, sj.String) + hm.put('action', 'recids_hm_intint') + hm.put('size', str(size)) + params = sj.MapSolrParams(hm) + req = sj.LocalSolrQueryRequest(self.core, params) + + rsp = sj.SolrQueryResponse() + + message = sj.PythonMessage('bigtest') \ + .setSolrQueryResponse(rsp) \ + .setSolrQueryRequest(req) + + self.bridge.receive_message(message) + + res = sj.HashMap.cast_(message.getResults()) + assert res.size() == size + assert sj.Integer.cast_(res.get(0)).equals(0) + assert sj.Integer.cast_(res.get(5)).equals(5) + + + def test_bigtest06(self): + '''Get recids_bitset - needs invenio.intbitset''' + + from invenio import intbitset + + #req = sj.QueryRequest() + size = self.size + hm = sj.HashMap().of_(sj.String, sj.String) + hm.put('action', 'recids_bitset') + hm.put('size', str(size)) + filled = int(size * 0.3) + hm.put('filled', str(filled)) + params = sj.MapSolrParams(hm) + req = sj.LocalSolrQueryRequest(self.core, params) + + rsp = sj.SolrQueryResponse() + + message = sj.PythonMessage('bigtest') \ + .setSolrQueryResponse(rsp) \ + .setSolrQueryRequest(req) + + self.bridge.receive_message(message) + res = sj.JArray_byte.cast_(message.getResults()) + ibs = intbitset.intbitset() + ibs = ibs.fastload(res.string_) + + assert len(ibs) > 0 + + def timeit(self): + for x in range(1, 7): + testid = '%02d' % x + times = 10 + test_name = 'test_bigtest%s' % testid + if hasattr(self, test_name): + print test_name, + start = time.time() + test_method = getattr(self, test_name) + for x in xrange(times): + test_method() + end = time.time() - start + print end / times, 's.' + + +if __name__ == "__main__": + #import sys;sys.argv = ['', 'Test.test_get_recids_changes4'] + unittest.main() diff --git a/test/python/unittest_invenio.py b/test/python/unittest_invenio.py new file mode 100644 index 000000000..d67b8ee2e --- /dev/null +++ b/test/python/unittest_invenio.py @@ -0,0 +1,232 @@ +''' +Created on Feb 4, 2011 + +@author: rca +''' +#@PydevCodeAnalysisIgnore + +import unittest +from montysolr import initvm, java_bridge, handler +import os + +sj = initvm.montysolr_java + +class TestHandler(handler.Handler): + def init(self): + #_b = os.path.join(os.path.dirname(__file__), 'testing_targets.py') + #self.discover_targets([_b]) + self.discover_targets(['montysolr.inveniopie.targets']) +test_handler = TestHandler() + +class Test(unittest.TestCase): + + + def setUp(self): + self.bridge = java_bridge.SimpleBridge(test_handler) + + def tearDown(self): + pass + + + def test_dict_cache(self): + + message = sj.PythonMessage('get_citation_dict') \ + .setSender('CitationQuery') \ + .setParam('dictname', 'citationdict') + self.bridge.receive_message(message) + + result = message.getParam('result') + print 'got result' + + + def test_workout_field_value(self): + + u = 'id:840017|arxiv_id:arXiv:0912.2620|src_dir:/Users/rca/work/indexing/fulltexts/arXiv' + message = sj.PythonMessage('workout_field_value') \ + .setSender('PythonTextField') \ + .setParam('externalVal', u) + self.bridge.receive_message(message) + + result = unicode(message.getParam('result')) + print len(result) + + def test_handle_request_body(self): + + req = sj.QueryRequest() + srp = sj.SolrQueryResponse() + + u = 'id:840017|arxiv_id:arXiv:0912.2620|src_dir:/Users/rca/work/indexing/fulltexts/arXiv' + message = sj.PythonMessage('handleRequestBody') \ + .setSender('rca.python.solr.handler.InvenioHandler') \ + .setParam('externalVal', u) + 
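# the bridge call below routes the message to the Python target presumably
# registered for 'handleRequestBody' in montysolr.inveniopie.targets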
self.bridge.receive_message(message) + + result = unicode(message.getParam('result')) + print len(result) + + def test_format_search_results(self): + + req = sj.QueryRequest() + rsp = sj.SolrQueryResponse() + + message = sj.PythonMessage('format_search_results') \ + .setSender('InvenioFormatter') \ + .setSolrQueryResponse(rsp) \ + .setParam('recids', sj.JArray_int(range(0, 93))) + self.bridge.receive_message(message) + + result = unicode(rsp.getValues()) + assert 'inv_response' in result + assert '

' in result + + def test_get_recids_changes(self): + + req = sj.QueryRequest() + rsp = sj.SolrQueryResponse() + + message = sj.PythonMessage('get_recids_changes') \ + .setSender('InvenioKeepRecidUpdated') \ + .setSolrQueryResponse(rsp) \ + .setParam('last_recid', 30) + self.bridge.receive_message(message) + + results = message.getResults() + out = sj.HashMap.cast_(results) + + added = sj.JArray_int.cast_(out.get('ADDED')) + assert len(added) > 1 + + def test_get_recids_changes2(self): + + req = sj.QueryRequest() + rsp = sj.SolrQueryResponse() + + message = sj.PythonMessage('get_recids_changes') \ + .setSender('InvenioKeepRecidUpdated') \ + .setSolrQueryResponse(rsp) \ + .setParam('last_recid', 0) #test we can deal with extreme cases + self.bridge.receive_message(message) + + results = message.getResults() + out = sj.HashMap.cast_(results) + + added = sj.JArray_int.cast_(out.get('ADDED')) + assert len(added) > 1 + + def test_get_recids_changes3(self): + + req = sj.QueryRequest() + rsp = sj.SolrQueryResponse() + + message = sj.PythonMessage('get_recids_changes') \ + .setSender('InvenioKeepRecidUpdated') \ + .setSolrQueryResponse(rsp) \ + .setParam('last_recid', 9999999) + self.bridge.receive_message(message) + + results = message.getResults() + assert results is None + + + def test_get_recids_changes4(self): + + req = sj.QueryRequest() + rsp = sj.SolrQueryResponse() + + message = sj.PythonMessage('get_recids_changes') \ + .setSender('InvenioKeepRecidUpdated') \ + .setSolrQueryResponse(rsp) \ + .setParam('last_recid', -1) + self.bridge.receive_message(message) + + results = message.getResults() + out = sj.HashMap.cast_(results) + + added = sj.JArray_int.cast_(out.get('ADDED')) + updated = sj.JArray_int.cast_(out.get('CHANGED')) + deleted = sj.JArray_int.cast_(out.get('DELETED')) + + + assert len(added) == 104 + assert len(updated) == 0 + assert len(deleted) == 0 + + + def test_perform_request_search_ints(self): + + req = sj.QueryRequest() + rsp = sj.SolrQueryResponse() + + message = sj.PythonMessage('perform_request_search_ints') \ + .setSender('InvenioQuery') \ + .setSolrQueryResponse(rsp) \ + .setParam('query', 'ellis') + self.bridge.receive_message(message) + + results = message.getResults() + out = sj.JArray_int.cast_(results) + + assert len(out) > 1 + + def test_sort_and_format(self): + + req = sj.QueryRequest() + rsp = sj.SolrQueryResponse() + + kwargs = sj.HashMap() + kwargs.put("of", "hcs") + #kwargs.put("sf", "year") #sort by year + kwargs.put('colls_to_search', """['Articles & Preprints', 'Multimedia & Arts', 'Books & Reports']""") + + message = sj.PythonMessage('sort_and_format') \ + .setSender('InvenioFormatter') \ + .setSolrQueryResponse(rsp) \ + .setParam('recids', sj.JArray_int(range(0, 93))) \ + .setParam("kwargs", kwargs) + + self.bridge.receive_message(message) + + result = unicode(message.getResults()) + assert '

' in result + + def test_sort_and_format2(self): + + req = sj.QueryRequest() + rsp = sj.SolrQueryResponse() + + kwargs = sj.HashMap() + #kwargs.put("of", "hcs") + kwargs.put("sf", "year") #sort by year + kwargs.put('colls_to_search', """['Articles & Preprints', 'Multimedia & Arts', 'Books & Reports']""") + + message = sj.PythonMessage('sort_and_format') \ + .setSender('InvenioFormatter') \ + .setSolrQueryResponse(rsp) \ + .setParam('recids', sj.JArray_int(range(0, 93))) \ + .setParam("kwargs", kwargs) + + self.bridge.receive_message(message) + + result = sj.JArray_int.cast_(message.getResults()) + assert len(result) > 3 + assert result[0] == 77 + + def test_diagnostic_test(self): + + req = sj.QueryRequest() + rsp = sj.SolrQueryResponse() + + message = sj.PythonMessage('diagnostic_test') \ + .setSolrQueryResponse(rsp) \ + .setParam('recids', sj.JArray_int(range(0, 93))) + + self.bridge.receive_message(message) + + res = message.getResults() + print res + + + +if __name__ == "__main__": + #import sys;sys.argv = ['', 'Test.test_get_recids_changes4'] + unittest.main() diff --git a/test/python/unittest_python_bridge.py b/test/python/unittest_python_bridge.py new file mode 100644 index 000000000..e74861701 --- /dev/null +++ b/test/python/unittest_python_bridge.py @@ -0,0 +1,70 @@ +''' +Created on Feb 4, 2011 + +@author: rca +''' +import unittest +from montysolr import handler +from montysolr.python_bridge import JVMBridge +from montysolr.utils import MontySolrTarget +import sys +import os + +sj = JVMBridge.getObjMontySolr() + +class TestingMethods(): + def montysolr_targets(self): + + def test_a(message): + data = sj.JArray_int.cast_(message.getParam("data")) + data = data * 2 + message.setParam("result", sj.JArray_int(data)) + + def test_b(message): + data = sj.JArray_int.cast_(message.getParam("data")) + data = str(data) + message.setParam("result", data) + + return [ + MontySolrTarget(':test_a', test_a), + MontySolrTarget(':test_b', test_b), + ] + +class TestHandler(handler.Handler): + def init(self): + self.discover_targets([TestingMethods()]) + +class Test(unittest.TestCase): + + def setUp(self): + self._handler = JVMBridge._handler + JVMBridge.setHandler(TestHandler()) + + def tearDown(self): + JVMBridge.setHandler(self._handler) + + def test_basic(self): + + sj = JVMBridge.getObjMontySolr() + message = JVMBridge.createMessage("test_a") \ + .setParam('data', sj.JArray_int([0,1,2])) + + JVMBridge.sendMessage(message) + res = list(sj.JArray_int.cast_(message.getParam("result"))) + assert res == [0, 1, 2, 0, 1, 2] + + #lets reuse the message object + message.setReceiver("test_b") + JVMBridge.sendMessage(message) + res = str(message.getParam("result")) + assert res.find("[0, 1, 2]") > -1 + + message = JVMBridge.createMessage("test_b") \ + .setParam('data', sj.JArray_int([0,1,2])) + JVMBridge.sendMessage(message) + res = str(message.getParam("result")) + assert res.find("[0, 1, 2]") > -1 + +if __name__ == "__main__": + #import sys;sys.argv = ['', 'Test.testName'] + unittest.main() \ No newline at end of file diff --git a/test/python/unittest_solr.py b/test/python/unittest_solr.py new file mode 100644 index 000000000..a40a27ff7 --- /dev/null +++ b/test/python/unittest_solr.py @@ -0,0 +1,62 @@ +''' +Created on Feb 4, 2011 + +@author: rca +''' + +import os +# "-Djava.util.logging.config.file=/x/dev/workspace/sandbox/montysolr/example/etc/test.logging.properties" +os.environ['MONTYSOLR_JVMARGS_PYTHON'] = "" + +import unittest +from montysolr import initvm + + +sj = initvm.montysolr_java +solr = 
initvm.solr_java +lu = initvm.lucene + + +class Test(unittest.TestCase): + + + def setUp(self): + + self.initializer = sj.CoreContainer.Initializer() + self.conf = {'solr_home': '/x/dev/workspace/sandbox/montysolr/example/solr', + 'data_dir': '/x/dev/workspace/sandbox/montysolr/example/solr/data-jtest'} + + sj.System.setProperty('solr.solr.home', self.conf['solr_home']) + sj.System.setProperty('solr.data.dir', self.conf['data_dir']) + self.core_container = self.initializer.initialize() + self.server = sj.EmbeddedSolrServer(self.core_container, "") + + solr_config = sj.SolrConfig() + index_schema = sj.IndexSchema(solr_config, None, None) + q = sj.QueryParsing.parseQuery('*:*', index_schema) + + def tearDown(self): + self.core_container.shutdown() + + + def test_solr_all(self): + + server = self.server + + # create a query + query = sj.SolrQuery() + query.setQuery('*:*') + + query_response = server.query(query) + + head_part = query_response.getResponseHeader() + res_part = query_response.getResults() + qtime = query_response.getQTime() + etime = query_response.getElapsedTime() + + print qtime, etime, head_part, res_part + + +if __name__ == "__main__": + #import sys;sys.argv = ['', 'Test.testName'] + unittest.main() \ No newline at end of file diff --git a/test/test-files/README b/test/test-files/README new file mode 100644 index 000000000..10f878acc --- /dev/null +++ b/test/test-files/README @@ -0,0 +1,21 @@ + + +This directory is where any non-transient, non-java files needed +for the execution of tests should live. + +It is used as the CWD when running JUnit tests. diff --git a/test/test-files/invenio-test-queries.result b/test/test-files/invenio-test-queries.result new file mode 100644 index 000000000..1dbb9c954 --- /dev/null +++ b/test/test-files/invenio-test-queries.result @@ -0,0 +1,290 @@ +reportnumber:arxiv:0711.2908 or arxiv:0705.4298 or reportnumber:hep-ph/0504227 + +OK invenio=3 montysolr=3 +--- +reference:Phys.Rev.Lett.,28,1421 or reference:arXiv:0711.4556 + +OK invenio=457 montysolr=457 +--- +author:hawking and affiliation:"cambridge u., damtp" and year:2004->9999 ++ + +9999> +OK invenio=10 montysolr=10 +--- +hey |muon + +OK invenio=29520 montysolr=29520 +--- +hey |"muon muon" + <"muon muon"> +OK invenio=599 montysolr=599 +--- +"and or not AND OR NOT" and phrase ++<"and or not and or not"> + +OK invenio=0 montysolr=0 +--- +author:hawking and affiliation:"cambridge u., damtp" and year:2004->9999 ++ + +9999> +OK invenio=10 montysolr=10 +--- +thomas crewther quark 2002 ++ + + +<2002> +OK invenio=2 montysolr=2 +--- +journal:phys.rev.lett.,62,1825 + +OK invenio=1 montysolr=1 +--- +journal:"Phys.Rev.Lett.,105*" or journal:Phys.Lett. and author:thomas + + + +NO invenio=1134 montysolr=1114 +--- +year:1997-11-18 + +OK invenio=0 montysolr=0 +--- +datecreated:2011-01-26 and title:neutrino* ++ + +OK invenio=0 montysolr=0 +--- +author:unruh or title:cauchy not title:problem and 037__c:gr-qc + - +<037__c|gr-qc> +NO invenio=79 montysolr=23839 +--- +(author:"albrow, m*") and journal:phys.rev.lett. 
and (title:quark* and title:cited:200->99999) ++ + +(+ +99999>) +OK invenio=2 montysolr=2 +--- +reference:"Phys.Rev.Lett.,*" ++ + +NO invenio=280780 montysolr=0 +--- +citedby:hep-th/9711200 author:cvetic ++ + +OK invenio=2 montysolr=2 +--- +author:parke citedby:author:witten ++ + +OK invenio=4 montysolr=4 +--- +refersto:hep-th/9711200 title:nucl* ++ + +OK invenio=28 montysolr=28 +--- +author:witten refersto:author:"parke, s j" ++ + +OK invenio=6 montysolr=6 +--- +refersto:author:parke or refersto:author:lykken author:witten + + +NO invenio=11 montysolr=434 +--- +affiliation:"oxford u." refersto:title:muon* ++ + +OK invenio=1030 montysolr=1030 +--- +affiliation:"harvard u." + +OK invenio=7270 montysolr=7270 +--- +"Ellis, J" +<"ellis, j"> +OK invenio=0 montysolr=0 +--- +'muon decay' +<'muon decay'> +OK invenio=563 montysolr=563 +--- +'Ellis, J' +<'ellis, j'> +OK invenio=936 montysolr=936 +--- +ellis +muon ++ + +OK invenio=217 montysolr=217 +--- +ellis muon ++ + +OK invenio=217 montysolr=217 +--- +ellis and muon ++ + +OK invenio=217 montysolr=217 +--- +ellis -muon ++ - +OK invenio=2263 montysolr=2263 +--- +ellis not muon ++ - +OK invenio=2263 montysolr=2263 +--- +ellis |muon + +OK invenio=31216 montysolr=31216 +--- +ellis or muon + +OK invenio=31216 montysolr=31216 +--- +muon or kaon and ellis + + + +NO invenio=240 montysolr=30 +--- +ellis and muon or kaon ++ +NO invenio=6191 montysolr=2480 +--- +muon or kaon and ellis -decay + + + - +NO invenio=240 montysolr=30 +--- +(gravity OR supergravity) AND (ellis OR perelstein) ++( ) +( ) +OK invenio=201 montysolr=201 +--- +C++ + +OK invenio=226 montysolr=226 +--- +O'Shea + +OK invenio=550 montysolr=550 +--- +$e^{+}e^{-}$ +<$e^{+}e^{-}$> +OK invenio=124 montysolr=124 +--- +hep-ph/0204133 + +OK invenio=1 montysolr=1 +--- +BlaCK hOlEs ++ + +OK invenio=26165 montysolr=26165 +--- +пушкин +<пушкин> +OK invenio=0 montysolr=0 +--- +muon* + +OK invenio=29811 montysolr=29811 +--- +CERN-TH*31 + +OK invenio=62 montysolr=62 +--- +a* + +OK invenio=510300 montysolr=510300 +--- +"Neutrino mass*" +<"neutrino mass*"> +OK invenio=1217 montysolr=1217 +--- +author:ellis + +OK invenio=2234 montysolr=2234 +--- +author:ellis title:muon* ++ + +OK invenio=44 montysolr=44 +--- +experiment:NA60 year:2001 ++ + +OK invenio=0 montysolr=0 +--- +title:/^E.*s$/ + +OK invenio=16410 montysolr=16410 +--- +author:/^Ellis, (J|John)$/ + +NO invenio=1018 montysolr=0 +--- +title:/dense ([^ l]* )?matter/ + +OK invenio=0 montysolr=0 +--- +collection:PREPRINT -year:/^[0-9]{4}([\?\-]|\-[0-9]{4})?$/ ++ - +OK invenio=0 montysolr=0 +--- +collection:PREPRINT -year:/^[[:digit:]]{4}([\?\-]|\-[[:digit:]]{4})?$/ ++ - +OK invenio=0 montysolr=0 +--- +muon decay year:1983->1992 ++ + +1992> +OK invenio=0 montysolr=0 +--- +author:"Ellis, J"->"Ellis, Qqq" ++ -<>> +<"ellis, qqq"> +NO invenio=1437 montysolr=0 +--- +refersto:reportnumber:hep-th/0201100 + +OK invenio=34 montysolr=34 +--- +citedby:author:klebanov + +OK invenio=2022 montysolr=2022 +--- +refersto:author:"Klebanov, I" + +OK invenio=9831 montysolr=9831 +--- +refersto:keyword:gravitino + +OK invenio=17014 montysolr=17014 +--- +author:klebanov AND citedby:author:papadimitriou NOT refersto:author:papadimitriou ++ + - +OK invenio=10 montysolr=10 +--- +refersto:/author:"Klebanov, I" title:O(N)/ ++ + + + +NO invenio=119 montysolr=0 +--- +author:ellis -muon* +abstract:'dense quark matter' year:200* ++ - + + +OK invenio=2 montysolr=2 +--- +author:ellis -muon* +title:'dense quark matter' year:200* ++ - + + +OK invenio=1 montysolr=1 +--- +higgs or 
reference:higgs or fulltext:higgs + fulltext:higgs +NO invenio=37625 montysolr=61090 +--- +author:lin fulltext:Schwarzschild fulltext:AdS reference:"Adv. Theor. Math. Phys." ++ +fulltext:schwarzschild +fulltext:ads + +<"adv. theor. math. phys."> +OK invenio=0 montysolr=0 +--- +author:/^Ellis, (J|John)$/ + +NO invenio=1018 montysolr=0 +--- +fulltext:e- +fulltext:e- +NO invenio=854 montysolr=1187 +--- +muon or fulltext:muon and author:ellis + +fulltext:muon + +NO invenio=267 montysolr=0 +--- +reference:hep-ph/0103062 + +OK invenio=341 montysolr=341 +--- +reference:giddings reference:ross reference:"Phys. Rev., D" reference:61 reference:2000 ++ + + +<"phys. rev., d"> + + +OK invenio=0 montysolr=0 +--- +standard model -author:ellis reference:ellis ++ + - + +OK invenio=0 montysolr=0 + +total=72, success/mismatch/error=58/14/0 + diff --git a/test/test-files/invenio-test-queries.txt b/test/test-files/invenio-test-queries.txt new file mode 100644 index 000000000..e9442a1e1 --- /dev/null +++ b/test/test-files/invenio-test-queries.txt @@ -0,0 +1,73 @@ +reportnumber:arxiv:0711.2908 or arxiv:0705.4298 or reportnumber:hep-ph/0504227 +reference:Phys.Rev.Lett.,28,1421 or reference:arXiv:0711.4556 +author:hawking and affiliation:"cambridge u., damtp" and year:2004->9999 +hey |muon +hey |"muon muon" +"and or not AND OR NOT" and phrase +author:hawking and affiliation:"cambridge u., damtp" and year:2004->9999 +thomas crewther quark 2002 +journal:phys.rev.lett.,62,1825 +journal:"Phys.Rev.Lett.,105*" or journal:Phys.Lett. and author:thomas +year:1997-11-18 +datecreated:2011-01-26 and title:neutrino* +author:unruh or title:cauchy not title:problem and 037__c:gr-qc +(author:"albrow, m*") and journal:phys.rev.lett. and (title:quark* and title:cited:200->99999) +reference:"Phys.Rev.Lett.,*" +citedby:hep-th/9711200 author:cvetic +author:parke citedby:author:witten +refersto:hep-th/9711200 title:nucl* +author:witten refersto:author:"parke, s j" +refersto:author:parke or refersto:author:lykken author:witten +affiliation:"oxford u." refersto:title:muon* +affiliation:"harvard u." +"Ellis, J" +'muon decay' +'Ellis, J' +ellis +muon +ellis muon +ellis and muon +ellis -muon +ellis not muon +ellis |muon +ellis or muon +muon or kaon and ellis +ellis and muon or kaon +muon or kaon and ellis -decay +(gravity OR supergravity) AND (ellis OR perelstein) +C++ +O'Shea +$e^{+}e^{-}$ +hep-ph/0204133 +BlaCK hOlEs +пушкин +muon* +CERN-TH*31 +a* +"Neutrino mass*" +author:ellis +author:ellis title:muon* +experiment:NA60 year:2001 +title:/^E.*s$/ +author:/^Ellis, (J|John)$/ +title:/dense ([^ l]* )?matter/ +collection:PREPRINT -year:/^[0-9]{4}([\?\-]|\-[0-9]{4})?$/ +collection:PREPRINT -year:/^[[:digit:]]{4}([\?\-]|\-[[:digit:]]{4})?$/ +muon decay year:1983->1992 +author:"Ellis, J"->"Ellis, Qqq" +refersto:reportnumber:hep-th/0201100 +citedby:author:klebanov +refersto:author:"Klebanov, I" +refersto:keyword:gravitino +author:klebanov AND citedby:author:papadimitriou NOT refersto:author:papadimitriou +refersto:/author:"Klebanov, I" title:O(N)/ +author:ellis -muon* +abstract:'dense quark matter' year:200* +author:ellis -muon* +title:'dense quark matter' year:200* +higgs or reference:higgs or fulltext:higgs +author:lin fulltext:Schwarzschild fulltext:AdS reference:"Adv. Theor. Math. Phys." +author:/^Ellis, (J|John)$/ +fulltext:e- +muon or fulltext:muon and author:ellis +reference:hep-ph/0103062 +reference:giddings reference:ross reference:"Phys. 
Rev., D" reference:61 reference:2000 +standard model -author:ellis reference:ellis + diff --git a/test/test-files/solr/conf/data-config-test-java.xml b/test/test-files/solr/conf/data-config-test-java.xml new file mode 100644 index 000000000..3dec8773e --- /dev/null +++ b/test/test-files/solr/conf/data-config-test-java.xml @@ -0,0 +1,53 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/test/test-files/solr/conf/data-config.xml b/test/test-files/solr/conf/data-config.xml new file mode 100644 index 000000000..3a0080d5a --- /dev/null +++ b/test/test-files/solr/conf/data-config.xml @@ -0,0 +1,91 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/test/test-files/solr/conf/elevate.xml b/test/test-files/solr/conf/elevate.xml new file mode 100755 index 000000000..9b4caec69 --- /dev/null +++ b/test/test-files/solr/conf/elevate.xml @@ -0,0 +1,36 @@ + + + + + + + + + + + + + + + + + + diff --git a/test/test-files/solr/conf/protwords.txt b/test/test-files/solr/conf/protwords.txt new file mode 100755 index 000000000..1dfc0abec --- /dev/null +++ b/test/test-files/solr/conf/protwords.txt @@ -0,0 +1,21 @@ +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +#----------------------------------------------------------------------- +# Use a protected word file to protect against the stemmer reducing two +# unrelated words to the same base word. + +# Some non-words that normally won't be encountered, +# just to test that they won't be stemmed. 
+dontstems +zwhacky + diff --git a/test/test-files/solr/conf/schema.xml b/test/test-files/solr/conf/schema.xml new file mode 100755 index 000000000..3b63345be --- /dev/null +++ b/test/test-files/solr/conf/schema.xml @@ -0,0 +1,666 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + id + + + all + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + diff --git a/test/test-files/solr/conf/solrconfig.xml b/test/test-files/solr/conf/solrconfig.xml new file mode 100755 index 000000000..c9e597df1 --- /dev/null +++ b/test/test-files/solr/conf/solrconfig.xml @@ -0,0 +1,1115 @@ + + + + + + ${solr.abortOnConfigurationError:true} + + + + + + + + + + + + + + + + ${solr.data.dir:./solr/data} + + + + + + false + + 10 + + + + + 32 + + 10000 + 1000 + 10000 + + + + + + + + + + + + + native + + + + + + + false + 32 + 10 + + + + + + + + false + + + true + + + + + + + + 1 + + 0 + + + + + false + + + + + + + + + + + + + + + + + + + + + + + + + + + 1024 + + + + + + + + + + + + + + + + true + + + + + + + + 20 + + + 200 + + + + + + + + + + + + + solr rocks010 + static firstSearcher warming query from solrconfig.xml + {!iq}inv_refersto:"recid:100"010invenio + {!iq}inv_citedby:"recid:100"010invenio + + + + + false + + + 2 + + + + + + + + + + + + + + + + + + + + + + + explicit + false + + + + + + + + explicit + 10000 + + + + + + query + invenio-formatter + facet + mlt + highlight + stats + debug + + + + explicit + + + + + + + + + + + + + dismax + explicit + 0.01 + + text^0.5 features^1.0 name^1.2 sku^1.5 id^10.0 manu^1.1 cat^1.4 + + + text^0.2 features^1.1 name^1.5 manu^1.4 manu_exact^1.9 + + + popularity^0.5 recip(price,1,1000,1000)^0.3 + + + id,name,price,score + + + 2<-1 5<-2 6<90% + + 100 + *:* + + text features name + + 0 + + name + regex + + + + + + + dismax + explicit + text^0.5 features^1.0 name^1.2 sku^1.5 id^10.0 + 2<-1 5<-2 6<90% + + incubationdate_dt:[* TO NOW/DAY-1MONTH]^2.2 + + + + inStock:true + + + + cat + manu_exact + price:[* TO 500] + price:[500 TO *] + + + + + + + + + + textSpell + + + default + name + ./spellchecker + + + + + + + + + + + + + + + + false + + false + + 1 + + + spellcheck + + + + + + + + true + + + tvComponent + + + + + + + + + default + + org.carrot2.clustering.lingo.LingoClusteringAlgorithm + + 20 + + + stc + org.carrot2.clustering.stc.STCClusteringAlgorithm + + + + + true + default + true + + name + id + + features + + true + + + + false + + + clusteringComponent + + + + + + + + text + true + ignored_ + + + true + links + ignored_ + + + + + + + + + + true + + + termsComponent + + + + + + + + string + elevate.xml + + + + + + explicit + + + elevator + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + standard + solrpingquery + all + + + + + + + explicit + true + + + + + + + + + 100 + + + + + + + + 70 + + 0.5 + + [-\w ,/\n\"']{20,200} + + + + + + + ]]> + ]]> + + + + + + + + + + + + + + 5 + + + + + + + + + + + + + solr + + 
+ + + + + + + data-config.xml + false + false + + + + + data-config.xml + false + false + + + + + data-config-test-java.xml + false + false + + + + + + last_modified + ignored_ + + + + + diff --git a/test/test-files/solr/conf/spellings.txt b/test/test-files/solr/conf/spellings.txt new file mode 100755 index 000000000..d7ede6f56 --- /dev/null +++ b/test/test-files/solr/conf/spellings.txt @@ -0,0 +1,2 @@ +pizza +history \ No newline at end of file diff --git a/test/test-files/solr/conf/stopwords.txt b/test/test-files/solr/conf/stopwords.txt new file mode 100755 index 000000000..b5824da32 --- /dev/null +++ b/test/test-files/solr/conf/stopwords.txt @@ -0,0 +1,58 @@ +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +#----------------------------------------------------------------------- +# a couple of test stopwords to test that the words are really being +# configured from this file: +stopworda +stopwordb + +#Standard english stop words taken from Lucene's StopAnalyzer +a +an +and +are +as +at +be +but +by +for +if +in +into +is +it +no +not +of +on +or +s +such +t +that +the +their +then +there +these +they +this +to +was +will +with + diff --git a/test/test-files/solr/conf/synonyms.txt b/test/test-files/solr/conf/synonyms.txt new file mode 100755 index 000000000..b0e31cb7e --- /dev/null +++ b/test/test-files/solr/conf/synonyms.txt @@ -0,0 +1,31 @@ +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +#----------------------------------------------------------------------- +#some test synonym mappings unlikely to appear in real input text +aaa => aaaa +bbb => bbbb1 bbbb2 +ccc => cccc1,cccc2 +a\=>a => b\=>b +a\,a => b\,b +fooaaa,baraaa,bazaaa + +# Some synonym groups specific to this example +GB,gib,gigabyte,gigabytes +MB,mib,megabyte,megabytes +Television, Televisions, TV, TVs +#notice we use "gib" instead of "GiB" so any WordDelimiterFilter coming +#after us won't split it into two words. + +# Synonym mappings can be used for spelling correction too +pixima => pixma +
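#Note on syntax (assuming Solr's default SynonymFilter semantics):
#"a => b" rewrites a to b only, while a plain comma-separated line makes
#all of its members equivalent (each token expands to the whole group).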