yacy_search_server

Commit Graph

Author	SHA1	Message	Date
reger	3897bb4409	added (manual) urldb migration (link on: Index Administration -> Federated Solr Index) - migrates all entries in old urldb. Metadata coordinate (lat / lon) NumberFormatException still occurs relatively often (see excerpt below), - added try/catch for URIMetadataRow (seems not to be needed in URIMetaDataNode, as Solr internally checks for number format) - removed possible type conversion for lat() / lon() comparison with 0.0f, changed to 0.0 (leaving it to the compiler/optimizer to choose number format) current log excerpt for NumberFormatException: W 2013/01/14 00:10:07 StackTrace For input string: "-" java.lang.NumberFormatException: For input string: "-" at sun.misc.FloatingDecimal.readJavaFormatString(Unknown Source) at java.lang.Double.parseDouble(Unknown Source) at net.yacy.kelondro.data.meta.URIMetadataRow$Components.lon(URIMetadataRow.java:525) at net.yacy.kelondro.data.meta.URIMetadataRow.lon(URIMetadataRow.java:279) at net.yacy.search.index.SolrConfiguration.metadata2solr(SolrConfiguration.java:277) at net.yacy.search.index.Fulltext.putMetadata(Fulltext.java:329) at transferURL.respond(transferURL.java:152) ... Caused by: java.lang.NumberFormatException: For input string: "-" at sun.misc.FloatingDecimal.readJavaFormatString(Unknown Source) at java.lang.Double.parseDouble(Unknown Source) at net.yacy.kelondro.data.meta.URIMetadataRow$Components.lon(URIMetadataRow.java:525) at net.yacy.kelondro.data.meta.URIMetadataRow.lon(URIMetadataRow.java:279) at net.yacy.search.index.SolrConfiguration.metadata2solr(SolrConfiguration.java:277) at net.yacy.search.index.Fulltext.putMetadata(Fulltext.java:329) at transferURL.respond(transferURL.java:152)	12 years ago
reger	3b6e08b49f	prevent checking of urldb if empty - disconnect urlIndexFile if empty - add missing lock class in submenuSearchConfiguration	12 years ago
reger	f143804382	fix configuration for search page navigators - added additional config page (ConfigSearchPage_p) for easy setup of search page layout (to not overload ConfigPortal page) - currently redundant setting with part of ConfigPortal page - added missing config for filetype and protocol navigator - adjusted init of SearchEvent to check navigation config setting - renamed RankingProcess.getTopicNavigator to getTopics (to distinguish it from the added SearchEvent.getTopicNavigator)	12 years ago
Michael Peter Christen	becd52a984	added also a re-calculation of reference counts during the post-processing of clickcount calculations. This is a really nice thing to have because the reference count affects ranking.	12 years ago
Michael Peter Christen	38d3feae65	added separate delete commands for the local+remote solr index, the old metadata and old rwi and for the citation index. The important advancement is the separation of the citation index deletion because that index is responsible for the linkdepth calculation. Now a search index can be deleted without the citation index, which should mean that fewer clickdepths must be post-processed.	12 years ago
Michael Peter Christen	6f0baaa309	added the clickdepth post-processing: some links may have 'shortcuts' to already calculated click depths. These are then calculated if the crawl buffer is empty and therefore no new 'shortcuts' can be discovered. The status of the clickdepth stack (to-be-processed) can be seen using a solr search command like this: http://localhost:8090/solr/select?q=process_sxt:[%20TO%20]&start=0&rows=30&fl=sku,clickdepth_i,process_sxt	12 years ago
Michael Peter Christen	0f5b6f38c1	enhanced root-url detection	12 years ago
Michael Peter Christen	5c0c56cfe1	Preparations to produce a click depth attribute in the search index. This attribute can be used for ranking and for other purposes (requested by a customer). The click depth is computed in two steps: - during indexing the current fill-state of the reverse link index is used to backtrack the current page to the root page. The length of that backtrack is the clickdepth. But this does not discover the shortest click depth. To get this, a second process to check again is needed - added a process tag that can be used to do operations on the existing index after a crawl; i.e. calculating the shortest clickpath. Added a field to control this operation but not a method to operate on this. - added a visualization of the clickpath length in the host browser	12 years ago
Michael Peter Christen	6861af87e2	removed warnings	12 years ago
Michael Peter Christen	295884fd54	- Merge commit '168b1d130d9d67b5e8855a0b50c4ba7ad4a416f8' - fixed conflict in htroot/yacysearch.java - removed nedres check because that causes that the remote server is not called at all in most cases (local index has already results but we want more) - fixed a regex bug (a '=' too much)	12 years ago
reger	276e63401e	small sanitary fixes - exclude unix shell scripts in NSIS windows install archive - replace link to env/grafics/yacy.gif to yacy.png (build.nsi) - remove unused code lines (Blacklist_p, Response, WordReferenceVars) - type & xhtml (RankingSolr_p.html)	12 years ago
reger	f301336adf	fix: no results with configuration citation reference index switched off - urlcitationindex != null check added to ResultEntry.referencesCount - plus other places where conflicting procedure was used (and urlcitationindex not already checked != null)	12 years ago
orbiter	fe50702eb0	added a filterscannerfail attribute to QueryParams which controls whether a check of the network scanner fail/success status is used or suppressed for search results. This is a feature that comes with the port scanner.	12 years ago
reger	168b1d130d	Adding heuristic to get search results from configured systems which support opensearch specification - any system supporting opensearch specification can be configured - search query is only forwarded to remote system if not enough results available on local peer - discover function provided, checking the local Solr index for links to opensearchdescription files, to add to the config - sample config file with some general search engines with opensearch support	12 years ago
Michael Peter Christen	eb90d38cd7	added missing extension 'mkv' for navigation	12 years ago
Michael Peter Christen	95712fdc8b	update to pdf parser	12 years ago
Michael Peter Christen	4a9182ae16	use the search configuration to default the cacheStrategy to the value as given in the search configuration	12 years ago
Michael Peter Christen	98819ec3d9	use solr boost configuration to select search fields. At this time it is possible to enter a negative boost value to switch that value off. This might be different in the future with a better input interface.	12 years ago
Michael Peter Christen	e1f89efd0d	- made image search in interactive search using the ViewImage servlet - that enables viewing of images for intranet SMB servers. - added a filter search for protocol, tld and ext again; otherwise p2p search produces a lot of rubbish	12 years ago
Michael Peter Christen	8f3bd0c387	fix for smb crawl situation (lost too many urls)	12 years ago
reger	d456f69381	SeedUpload url : check to reject localhost url included in saveSeedList (same check as in / copied from Seed.isProper() ), to prevent identity change on next startup (due to rejected seeduploadurl).	12 years ago
reger	4987caf1c9	- apply fix for localhost handling (from yacy2solr) also to metadata2solr	12 years ago
reger	0148f1bb8c	fix: exception if default work files don't exist	12 years ago
Michael Peter Christen	9e4033f229	fix for event starter: delete start time when event is removed	12 years ago
Michael Peter Christen	99271ffd13	copy work tables from defaults/data/work if exist there and not in DATA/WORK This can be used to create start-up behavior work scripts in the api.bheap table	12 years ago
Michael Peter Christen	24c9bb35f7	extended the Scheduler: introduced scheduled events - an event type (once, regular) can be selected - for this event type, a fixed time can be selected. This may be either directly after startup or at one of the full hours of a day (==25 options) The main point about this feature is the opportunity to start an action directly after startup. That makes it possible to create YaCy distributions which, when started for the first time, begin to index parts of the intranet/internet by themselves.	12 years ago
Michael Peter Christen	433143ba40	removed protocol, tld, ext from the urlmask and created specific navigation field for these	12 years ago
Michael Peter Christen	84f82541e8	search process enhancements	12 years ago
Michael Peter Christen	02020b590b	- removed all extension types from extension navigation which are not proper/known - automatically show the protocol navigation if there is more than http and https - automatically show the extension navigation if there is some media content	12 years ago
Michael Peter Christen	01200f06cc	using the author field as solr-native facet. this makes it necessary to introduce a copy-field for the author field to be copied to a string field. This field is then used to generate facets. Without this field, the facet would consist only of the words of the author names, not of the full author string.	12 years ago
Michael Peter Christen	2a4c064c89	using the publisher information for the author field if no author is given. This applies to cases where only the copyright field in the html header is filled but not the author field	12 years ago
Michael Peter Christen	bab573361f	- using a filter query for facet restriction - calculating the whole search result in at most two sub-queries from solr	12 years ago
Michael Peter Christen	eac9650b31	added another solr field clickdepth_i which reflects the number of clicks which are necessary to get from the portal of a host to a specific document. At this time, only the start document is flagged with clickdepth '0', all other with '-1'. To get the actual clickdepth, a process must use crawled information to collect the actual number of clicks. This will be added in another/next step.	12 years ago
Michael Peter Christen	1052263af3	- added a new solr field references_i which stores the number of INCOMING links to the corresponding web page. This information is taken from the reverse link index (a 'little sister' of the RWI index). - this field can be of use to enhance the ranking because a web page with more incoming links can be more important than others. But this is not true for typical link pages like menus. Therefore the number of outgoing links is needed. - added a new solr attribute 'bf' to solr queries which is a boost function extension. this field can contain a formula which computes the boost according to given field values. After some experiments the following formula is now default: div(add(1,references_i),pow(add(1,inboundlinkscount_i),1.6))^0.4 This takes the number of references and the inbound links. Further experiments are needed to enhance that formula.	12 years ago
Michael Peter Christen	7c3de8b4cd	- fix for localhost detection - added IPv6 patterns for localhost detection	12 years ago
Michael Peter Christen	34f8786508	removed dependency of vocabulary navigation from Jena and its triplestore; the vocabulary search is now done using generic solr fields which are created on-the-fly during runtime.	12 years ago
reger	ad71747525	fix: set default language to "en"	12 years ago
Michael Peter Christen	9319b90d8a	- fixes for host navigation - fixes for filetype navigation - removed unused code	12 years ago
Michael Peter Christen	cb5cbec14d	distinguishing modified query string and original query string	12 years ago
Michael Peter Christen	fb0fa9a102	- fixed 'delete from subpath' during crawl start which deleted nothing; now works; - changed some crawl start html design details	12 years ago
orbiter	712cc37c40	if maxFileSize < 0 then there is no file size limit.	12 years ago
orbiter	1f33c30d7b	re-integrating useForHost method (lost sometime?) to get the noProxy pattern working again. Without using this method all remote urls including the localhost had been accessed through the configured proxy	12 years ago
reger	f1a9c2e604	fix Servlet template on conditional file include with use of conditional template pattern in included template file (example IndexCreateQueues_p.html) see bug http://bugs.yacy.net/view.php?id=215	12 years ago
orbiter	a4a780b871	- fix for bad url conversion in bookmarks when using smb urls - fix for localhost hosts in solr schema host handling	12 years ago
reger	e80dfeca23	- making blacklist path part case insensitive (solving http://bugs.yacy.net/view.php?id=171 ) - blacklist test adding explicit response text "not blocked" if no blacklist match	12 years ago
reger	e2d499be9e	remove NOT NEEDED reference to solr.YaCySchema from ConfigurationSet to be able to use ConfigurationSet for other conf files (than solr.keys.default.list).	12 years ago
Michael Peter Christen	a3cd3852ab	introduced a better place to update the lastacc time value in latency	12 years ago
Michael Peter Christen	864abcd33d	removed Latency update after URL selection because that causes a completely wrong behaviour when cache fresh cases appear. Makes re-crawling MUCH faster!	12 years ago
Michael Peter Christen	dd241d03bb	latency fix: only set last-visit time if access was actually by the robot	12 years ago
Michael Peter Christen	118233a7e6	fix for bad xml in gsa result when doing a query with quotes	12 years ago
Michael Peter Christen	1e002ab18e	added another blacklist-cleaner into balancer	12 years ago
Michael Peter Christen	10527e28ae	fix for wrong display of error urls in HostBrowser	12 years ago
Michael Peter Christen	756772fbd3	fix for waitingtime computation for intranet configuration	12 years ago
Michael Peter Christen	fa27e5820f	- check blacklist (again) when taking urls from the crawl stack because the blacklist may get extended during crawling - removed debug output	12 years ago
Michael Peter Christen	adfecc6ba8	more robustness during shutdown	12 years ago
Michael Peter Christen	d4bfe9339e	Brute-force attempt to start solr in case of a memory problem. I don't actually know if this is correct. It is a desperate try to get YaCy running on production servers which must get alive even with strange hacks like this. This is also related to a forum posting in http://forum.yacy-websuche.de/viewtopic.php?t=4528&p=27135#p27135	12 years ago
Michael Peter Christen	8aa08261a7	update to Solr Boost handling	12 years ago
Michael Peter Christen	908ad2f174	Added a new servlet to configure the solr ranking using field boosts	12 years ago
Michael Peter Christen	a01e47b992	enhanced exists()-method for solr; should reduce a lot of IO during DHT target selection	12 years ago
Michael Peter Christen	72f165d58b	added a Boost class which stores solr query boost values. The class can be configured using the yacy.init file. The boost information is taken from the configuration each time when a query to solr is done.	12 years ago
Michael Peter Christen	b5ee88c6af	added more logging to get info which url causes performance problems	12 years ago
reger	1faa045dc1	fix: prevent regex pattern compile error for blacklist import for path '' (extend it to '.')	12 years ago
reger	6cf33f899c	prevent Solr "version conflict" on update by set Solr "_version_" field to 0 (=no version check)	12 years ago
Michael Peter Christen	acd98bebb7	improvements in GSA result writer	12 years ago
Michael Peter Christen	3de784c8dd	replaced more split and replaceAll calls that lacked pattern pre-compilation with pre-compiled patterns	12 years ago
Michael Peter Christen	8fc3679c66	using more pre-compiled patterns for split methods	12 years ago
Michael Peter Christen	d48e9788d2	enhanced search result processing behavior - query less at one time; query more often - in between the small queries, evaluate results - remove fields from search results which are not needed	12 years ago
Michael Peter Christen	bf512e6350	Merge branch 'master' of git://gitorious.org/~reger/yacy/bbyacy-rc1	12 years ago
reger	469efcdb9d	fix: display and calculate authors and namespace search navigator only if configured (otherwise skip overhead) (leaves the hosts and topics navigators, and the filetype and protocol navigators that are not included in ConfigPortal, untouched)	12 years ago
Michael Peter Christen	eca68fa197	added debug code to crawler monitor	12 years ago
Michael Peter Christen	205f8b222b	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	12 years ago
orbiter	ee612e8b93	start the local search only if this peer is doing a remote search or when it is doing a local search and the peer is old	12 years ago
Michael Peter Christen	d465773a37	- removed multi-add of documents (not used) - inserted specialized code for size request	12 years ago
Michael Peter Christen	a1a4d9aa94	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git Conflicts: source/net/yacy/cora/federate/solr/connector/MirrorSolrConnector.java	12 years ago
Michael Peter Christen	b7004043ea	- added a field cache for solr queries which call only for a single value - fixed a version conflict exception within a solr add request	12 years ago
orbiter	5aa5202adf	fixes for filesystem indexing	12 years ago
Michael Peter Christen	efd2c4622d	added a new fail type attribute for the index to distinguish two separate fail types: network fail and forced exclusion (i.e. by robots or forwarding rules).	12 years ago
Michael Peter Christen	5e182a566f	- added another enumeration method in kelondro data structure to get a more random access to data for the balancer - added random access inside the balancer	12 years ago
Michael Peter Christen	4eab3aae60	removed overhead by preventing generation of full search results when only the url is requested	12 years ago
Michael Peter Christen	a114bb23bb	- using edismax in gsa interface - generating less field data for gsa search results - using a boost query in gsa interface to move double content to the end of the result list	12 years ago
Michael Peter Christen	d6b82840f8	added a feature to find similarities in documents. This uses an enhanced version of the Nutch/Solr TextProfileSignature. As a result, a signature of the document is written to the solr search index. Additionally, each time a signature is written, it is checked whether the signature already exists in the index. If the signature does not exist, the document is marked as unique. The unique attribute can now be used to sort document lists and bring duplicates to the end of a result list. To enable this, a large portion of the search api to Solr had to be changed. This affected mainly caching of 'exists' searches to enhance the check for existing signatures and do this without actually doing a solr query. Because this is the first time a long number is used as a value in the Solr store, the value naming in the YaCySchema also had to be adapted and normalized. As a result, many files had to be changed.	12 years ago
Michael Peter Christen	f5ca5cea44	- added field options to all solr queries. This can be used to restrict the actual data which is fetched from solr. - used the new field options to reduce generic options like getting the load date or the count of search results. should increase overall speed - used the new field options to reduce overhead in the host browser during acquisition of links. - used the field options to make checking of links in crawler faster - if the crawler is paused, the crawl queue is not cleaned	12 years ago
Michael Peter Christen	46be4af5b9	Merge commit '2bb8f045cc92f31fc7e720cc30b38af417563890'	12 years ago
Michael Peter Christen	832eead998	Merge remote-tracking branch 'regerdev/master'	12 years ago
Michael Peter Christen	952e143580	FINALLY YaCy can now search for full strings using double- or singlequoted strings in the search query line!!!	12 years ago
orbiter	5dfd6359cb	redesign of the QueryParams class: introduced QueryGoal which holds the query string parser. This shall be used to create a proper full-string matching which is handled then by QueryGoal.	12 years ago
cominch	2bb8f045cc	content control: use up-to-date definitions	12 years ago
Michael Peter Christen	5fd3b93661	added deletion of hosts during crawl start if deleteold option was given	12 years ago
Michael Peter Christen	d64445c3cb	because we have the inurl:<term> - searchmodifier, we don't actually need regular expressions as search attributes. They have now been removed from the advanced search page while they are still created internally. The filter is then expressed against solr as regular expression filter query. If the expression selects a specific protocol, host or filetype this is then translated into a faceted query.	12 years ago
cominch	a67ff1c8ac	SMW Import: replaced JSON import routines with stable ones	12 years ago
cominch	d2a94cc55e	refactor package	12 years ago
cominch	05742b4562	remove old SMW importer which was part of the ymarks package	12 years ago
cominch	21df1ad9e0	update and generalization of the SMW import and content control routines	12 years ago
Michael Peter Christen	842faf96a2	fixed media search	12 years ago
Michael Peter Christen	93001586a0	removed warnings, removed too-fast pausing of crawls	12 years ago
Michael Peter Christen	8041742e48	added matching of path to query pattern	12 years ago
Michael Peter Christen	8b1c9cba3d	fixed a problem with non-terminating crawls	12 years ago
Michael Peter Christen	61a1d32356	fix to ftp client	12 years ago
Michael Peter Christen	5105256927	update to search result logging (this was a remaining issue from the solr 4.0.0 migration)	12 years ago
Michael Peter Christen	570e42c4e3	fix for filetype navigator	12 years ago
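
Commit 1052263af3 above gives the default boost function as div(add(1,references_i),pow(add(1,inboundlinkscount_i),1.6))^0.4. The following sketch evaluates that formula in plain Java to show how the two counts interact. It assumes the usual Solr convention that a trailing ^0.4 on a function query is a weight multiplied onto the function value, not an exponent. The class and method names are hypothetical and are not taken from the YaCy source.

```java
// Illustrative only: a stand-alone evaluation of the boost formula quoted in
// commit 1052263af3. Names are hypothetical, not YaCy's actual classes.
final class BoostSketch {

    /**
     * Computes 0.4 * (1 + references) / (1 + inboundLinks)^1.6.
     * Assumption: the trailing ^0.4 in the Solr 'bf' parameter is a weight
     * applied to the function value.
     */
    static double boost(int references, int inboundLinks) {
        double functionValue = (1.0 + references) / Math.pow(1.0 + inboundLinks, 1.6);
        return 0.4 * functionValue;
    }

    public static void main(String[] args) {
        // A page with many incoming references and few links of its own scores high;
        // a menu-like page with many links of its own is damped by the denominator.
        System.out.println("refs=0,  links=0  -> " + boost(0, 0));
        System.out.println("refs=20, links=2  -> " + boost(20, 2));
        System.out.println("refs=20, links=80 -> " + boost(20, 80));
    }
}
```

The exponent 1.6 in the denominator makes the penalty for link-heavy pages grow faster than the reward for incoming references, which matches the commit's remark about typical link pages such as menus.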

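Commit 3897bb4409 above reports a NumberFormatException for the input string "-" when parsing lat/lon metadata and adds a try/catch around it. A minimal sketch of that kind of guard follows. The helper name and the choice of 0.0 as the fallback value are assumptions made for illustration; the actual handling in URIMetadataRow may differ.

```java
// Illustrative only: a defensive coordinate parser in the spirit of the
// try/catch described in commit 3897bb4409. Not YaCy's actual code.
final class CoordinateParseSketch {

    /**
     * Parses a latitude or longitude string.
     * Returns 0.0 (assumed here to mean "no coordinate") when the value is
     * missing or malformed, e.g. the "-" seen in the log excerpt.
     */
    static double parseCoordinate(String raw) {
        if (raw == null) {
            return 0.0;
        }
        try {
            return Double.parseDouble(raw);
        } catch (NumberFormatException e) {
            // Double.parseDouble rejects inputs such as "-" or "".
            return 0.0;
        }
    }

    public static void main(String[] args) {
        System.out.println(parseCoordinate("13.4"));
        System.out.println(parseCoordinate("-"));
        System.out.println(parseCoordinate(null));
    }
}
```
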

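Commits 5c0c56cfe1 and 6f0baaa309 above note that backtracking through the reverse link index during indexing does not necessarily find the shortest click depth, so a second pass is needed. The sketch below is not the YaCy implementation. It swaps in a plain breadth-first search over forward links, which is one standard way to obtain the shortest number of clicks from a root page. The link map and all names are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

// Illustrative only: shortest click depth via breadth-first search.
// This is a substitute technique, not the reverse-index backtracking
// that the commits describe.
final class ClickDepthSketch {

    /** Returns the shortest click depth for every page reachable from root. */
    static Map<String, Integer> clickDepths(String root, Map<String, List<String>> outlinks) {
        Map<String, Integer> depth = new HashMap<>();
        Queue<String> queue = new ArrayDeque<>();
        depth.put(root, 0); // the start document has clickdepth 0
        queue.add(root);
        while (!queue.isEmpty()) {
            String page = queue.remove();
            int next = depth.get(page) + 1;
            for (String target : outlinks.getOrDefault(page, List.of())) {
                // The first visit in BFS order is always along a shortest path.
                if (!depth.containsKey(target)) {
                    depth.put(target, next);
                    queue.add(target);
                }
            }
        }
        return depth; // pages missing from the map were not reachable
    }

    public static void main(String[] args) {
        Map<String, List<String>> links = Map.of(
                "/", List.of("/a", "/b"),
                "/a", List.of("/c"),
                "/b", List.of("/c", "/d"),
                "/c", List.of("/d"));
        System.out.println(clickDepths("/", links));
    }
}
```
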
6147 Commits (c37d718f16da30a567375936a90aaa939f5f91f6)