yacy_search_server

Commit Graph

Author	SHA1	Message	Date
Michael Peter Christen	6e1dc444c3	added a snippet test function in ViewFile: you can now search for a specific word on the document; the servlet returns the snippet in the same way as it would be shown in a search result.	10 years ago
reger	29d1945c16	fix double &query parameter (index.html) ?query=word&query=	10 years ago
Michael Peter Christen	542c20a597	changed handling of crawl profile field crawlingIfOlder: this should be filled with the date, when the url is recognized as to be outdated. That field was partly misinterpreted and the time interval was filled in. In case that all the urls which are in the index shall be treated as outdated, the field is filled now with Long.MAX_VALUE because then all crawl dates are before that date and therefore outdated.	10 years ago
reger	7f0e757bb5	fix bookmark.rss - channel end tag position - link with html entity	10 years ago
orbiter	e441831a24	reverted toString() change in AnchorURL to prevent mistakenly used toString(). This fixes also the update link bug.	10 years ago
reger	697b9743e7	Add link to RemoteCrawl_p suggestion http://mantis.tokeek.de/view.php?id=277	10 years ago
reger	47f201a6b8	Add Solr default query fields (&qf) to select servlet according to the ranking profiles boost fields defined by the peer (if df/qf is not specified in query). This allows for pretty simple queries ( q=word) without the need to know about the specific index configuration. Making sure all relevant fields (as determined by the index owner) are searched, still maintaining the option to query specific fields and does not rely on the duplication of text to text_t. - add author to reset-default boost fields (support results for author nav)	10 years ago
reger	8004cfc961	fix input boostfield factor of 0.0 in RankingSolr - input was accepted and stored but not editable (added check factor >0.0 during edit) - make use of some more predefined solr constants	10 years ago
reger	a2cb366b25	Combine /heuristic search modifier with opensearch configured targets - with search modifier /heuristic a request is sent to all configured opensearch target systems (old /heuristic/blekko modifier no longer valid) - this allows to use opensearch heuristic on individual search request (in contrast to configuration HEURISTIC_OPENSEARCH=true which sends an osd request on all global searches) - the index.html searchoption text adjusted to be displayed only if option configured - add Archive-It to predefined systems	10 years ago
Michael Peter Christen	2de159719b	added an option to set 'obey nofollow' for links with rel="nofollow" attribute in the <a> tag for each crawl. This introduces a lot of changes because it extends the usage of the AnchorURL Object type which now also has a different toString method than the underlying DigestURL.toString. It is therefore not advised to use .toString at all for urls, just use toNormalform(false) instead.	10 years ago
Michael Peter Christen	87f8118108	added option to delete documents from the webgraph	10 years ago
Michael Peter Christen	32a2ff925c	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	10 years ago
Michael Peter Christen	d07cdd8c3b	added SolrCloud access mode and configuration	10 years ago
Michael Peter Christen	8514bffc22	enhanced postprocessing status report	10 years ago
reger	f99f3d5cf2	fix button (clear list) text color in CrawlResults	10 years ago
Michael Peter Christen	b5fc2b63ea	removed exist() retrieval functions from error cache and replaced it with metadata retrieval from connectors directly. This should cause better usage of the cache. Automatically increase the metadata cache if more memory is available.	10 years ago
Michael Peter Christen	62c72360ee	cleanup of checkAcceptanceInitially in CrawlStacker, should avoid double-calling of solr	11 years ago
orbiter	dab9a0786a	Merge branch 'master' of git@gitorious.org:yacy/rc1.git	11 years ago
orbiter	51bf5c85b0	Renamed the transmission cloud to buffer in dispatcher since the name 'cloud' was a bad idea. Changed also the accumulation process for peer targets so that every dht chunk is not assigned the set of redundant targets but they are assigned to redundant targets individually. This enhances the granularity of the target accumulation and should enhance the efficiency of the process. Finally the dht protocol client was enriched with the ability to remove the 'accept remote index' flag from peers or remove peers completely if they do not answer at all.	11 years ago
reger	7057e0b3e2	catch input file not found in Mediawiki import	11 years ago
Michael Peter Christen	f384fd624b	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	11 years ago
reger	ba5a59a28d	make search result also avail. as atom feed via /yacysearch.atom - fix logo in rss feed	11 years ago
orbiter	59160984cc	timeline performance update	11 years ago
orbiter	54bea96e67	Merge branch 'master' of git@gitorious.org:yacy/rc1.git	11 years ago
Michael Peter Christen	15b2fad6a2	reverted latest change for reindexing because that works actually only for internal Solr indexes. This is mainly caused by the fact that an external Solr may be also a SolrCloud which does not support LukeRequests, which are needed to request the old Schema.	11 years ago
Michael Peter Christen	841cc77391	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	11 years ago
Michael Peter Christen	e09218129c	remove check for local solr. This check was made during a time when Solr was optional and another alternative metadata store was available. Since that store is now removed, Solr is always available (internally or externally)	11 years ago
orbiter	2073e69034	fix for long periods in timeline	11 years ago
reger	1f94df29e7	fix NPE in solr rss where snippet contains only the title text and adjusted xslt, for solr snippets (&hl=true) to decode the xml encoded html <b> tag by adding disable-output-escaping (still open item description may be double as dc: tag and rss.description tag)	11 years ago
Michael Peter Christen	8c52f0651b	refactoring of AccessTracker events & timeline fix	11 years ago
Michael Peter Christen	1b279d7a7e	fixed external link	11 years ago
Michael Peter Christen	74206a10c7	refactoring	11 years ago
Michael Peter Christen	36e623d8bf	enhanced metadata enrichment for media file type search: - Web servers may now deliver YaCy-specific http header field with a title and keywords. The new http header fields are: X-YaCy-Media-Title - to be used for media (image, audio, video) titles X-YaCy-Media-Keywords - to be used for media (image, audio, video) keywords - both fields are written to document fields title and keywords and are searched also during image search. - to make the usage of arbitrary http header fields (including these new fields) possible in the /api/push_p.json servlet, a new POST argument is also introduced to push http header fields. The new POST attribute is named "responseHeader-X" (where X is the counter). It is allowed to use this attribute as multi-attribute several times, each can be filled with a http header line. - see /api/push_p.html for examples	11 years ago
reger	a88ea14e09	harmonize use of style for "delete" button - apply the mostly used btn-danger class	11 years ago
Michael Peter Christen	8fd72b5e8b	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	11 years ago
Michael Peter Christen	81d0f01a6f	added 'synchronous' and 'commit' flags in push api	11 years ago
reger	5043eff33a	move page navigation below results (image search) force page navigation to be displayed below results in image search for any number of displayed images instead of being displayed to the right of last image.	11 years ago
Marc Nause	f443cfa32d	Improvements and bugfixes for recording actions of blacklist API.	11 years ago
Michael Peter Christen	0ba6b98d5b	fix for broken json	11 years ago
orbiter	4177c9cf05	fix for crawl start check	11 years ago
orbiter	0bbb5040b8	Merge branch 'master' of git@gitorious.org:yacy/rc1.git	11 years ago
orbiter	9d5d86cd03	Added filter query options to the ranking servlet /RankingSolr_p.html. Filter queries are not actually related to ranking, but user requests have pointed out that specific boost queries to move results to the end of the result list are not sufficient. Such boost filters may be better executed as actual filter and therefore such a filter can now be statically applied to every search request. A typical use could be the expression "http_unique_b:true AND www_unique_b:true" which uses the recently introduced fields http_unique_b and www_unique_b which are true only for one of the alternatives with/without http(s) and with/without prefix 'www.' in host names.	11 years ago
Michael Peter Christen	d2151857f1	Added collection navigation: The collection field (can be filled e.g. in Crawl Start) can be used to add categories to YaCy index entries. The usage of that field was restricted to solr searches and post argument filters as implemented in commit `f7571386a3`. This commit extends collections to a full navigation option in the standard YaCy search interface. The field is not active by default but can be activated easily in the /ConfigSearchPage_p.html servlet (just check the 'Collection' facet field). Collections can now be used for (at least) two purposes: - to provide search tenants (through post argument collection) - to provide self-made category navigation Search requests may now have (independently from switched on or off collection facet) a "collection:<collection-name>" modifier attached; furthermore collection names may use disjunctions using the '|' pipe symbol. For example, this is a valid search request: www collection:user|proxy	11 years ago
Michael Peter Christen	74c249288a	added a push api to make it possible to upload files directly without crawling to the YaCy indexer. Files are uploaded using POST multipart requests; multiple file uploads are possible as well. Each file has attached the file date and mime type which is used to get the right parser for the submitted data. Also an url is submitted which is assigned to the document. The CrawlSwitchboard has a new option for default Crawl Profiles which are assigned dynamically from the new push interface.	11 years ago
reger	c798a9d1bb	fix unresolved pattern in yacysearch.rss title and rss xml error due to html & encoding in url entries	11 years ago
Michael Peter Christen	e64be5dcad	in case that the network is switched to any other than freeworld, RWIs are disabled. This is a temporary fix. There must be a better way to determine if RWIs are to be switched on or off.	11 years ago
Michael Peter Christen	87f171675b	doing index deletions using a get string which makes it easier to copy-paste deletion examples (see: #EuGH :( )	11 years ago
Michael Peter Christen	a2f800cd8f	fix for bad String conversion	11 years ago
Michael Peter Christen	b3b174e2b8	fixed webgraph postprocessing and status display in Crawler_p servlet	11 years ago
reger	7a52a6ba3f	add links to port config in status panel - pom upd to match javadoc location	11 years ago
reger	c3e40c82fe	make https port setting changeable via front end somewhere (chosen Http Networking page /Settings_p.html?page=http )	11 years ago
Michael Peter Christen	698f053658	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	11 years ago
Michael Peter Christen	f23c4142e0	added option to configure a custom user agent within allip networks	11 years ago
reger	8e233e2eb4	- fix typo in Message_p (defaultpath) - use more existing switchboardconstants for getproperties - replace deprecated call defaultservlet	11 years ago
Michael Peter Christen	8ad41a882c	fixed several problems with postprocessing: - unique-postprocessing was destroying results from other postprocessings; removed cross-updates as they had been not necessary - unique-postprocessing did not restrict on same protocol - inefficient concurrent update cache was redesigned completely - increased limits for concurrent blocking queues to prevent early time-out	11 years ago
Michael Peter Christen	640b684bb6	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	11 years ago
Michael Peter Christen	2f5477ea59	a try to fix the mixed up terms 'Active' -> 'Senior' and 'Passive' -> 'Junior'	11 years ago
reger	ca5437dd50	fix crawl of file:// , also http://mantis.tokeek.de/view.php?id=149 local files can be crawled (intranet mode) url parsing fixed according to RFC 1738 (for unix and windows) for win like file:///c:/tmp or file://localhost/c:/tmp for linux like file:///tmp or file://localhost/tmp Host is ignored and path must be absolute	11 years ago
reger	66f6797f52	make config search page layout closer to actual page appearance	11 years ago
sixcooler	5b1c4ef191	Monitoring and limit connection-count for Jetty	11 years ago
orbiter	ce1dbfeb0f	fix appearance of image search thumbnails.	11 years ago
orbiter	6daae59479	switch on core.service.rwi when switching back from portal mode to p2p mode	11 years ago
Michael Peter Christen	f0db501630	better handling of ranking parameters and new default values for date navigation which is done using ranking in solr.	11 years ago
Michael Peter Christen	2520590b45	migrated from pdfbox 1.8.4 to 1.8.5. They have a very long bugfix list for that update: http://www.apache.org/dist/pdfbox/1.8.5/RELEASE-NOTES.txt	11 years ago
Michael Peter Christen	6634b5b737	debug code for index distribution testing	11 years ago
Michael Peter Christen	89e13fa34e	fixed bug in test function	11 years ago
Marc Nause	4723329e29	Improved blacklist XML/JSON API.	11 years ago
reger	f91b2f51ae	fix: load_Rss remove feed: too many parameters for GET, use form POST method	11 years ago
orbiter	c028ae9b09	Merge branch 'master' of git@gitorious.org:yacy/rc1.git	11 years ago
reger	e31493e139	"Use remote proxy for yacy" has no function, remove option and related config item see/fix bug http://mantis.tokeek.de/view.php?id=23 http://mantis.tokeek.de/view.php?id=189	11 years ago
reger	89e2c5e884	fix: allow enable of CrawlStartExpert.html #file	11 years ago
reger	1b37b12998	fix: CrawlStartExpert.html # From File with missing filename - crawlName must not be empty - crawlingFile must not be empty	11 years ago
orbiter	0d8072aa99	removed warnings	11 years ago
orbiter	be7c99dbe8	switched menu position of ConfigPortal.html and ConfigSearchBox.html	11 years ago
Michael Peter Christen	a1ac4c3b76	automatically clear graphics cache	11 years ago
reger	f87ac716f3	improve IndexDeletion by query adding transparently text_t as pseudo default search field if no fieldname (no : ) is included. addressing bug report http://mantis.tokeek.de/view.php?id=274	11 years ago
reger	e9060d31bd	update to Jetty 9 besides adjustments in code it makes the servlet settings in web.xml significant. This applies to solr, gsa and proxy servlet. There is no longer a default setup in code during init (as jetty 9 checks for double definition).	11 years ago
orbiter	b9c1a61814	added a peername=<peername> property in the seedlist API	11 years ago
orbiter	c637955e67	fix for navigation steering / p2p mode see also: http://forum.yacy-websuche.de/viewtopic.php?f=5&t=5198&p=29958#p29958	11 years ago
Marc Nause	f98ccf952f	Improved Blacklist API: *) added JSON support *) fixed Exception in case of missing parameters *) renamed parameter for items in "add entry" and "delete entry" from "entry" to "item" to match term in XML	11 years ago
reger	91bd384cf6	fix input-group layout on index.html see bug http://mantis.tokeek.de/view.php?id=391	11 years ago
Marc Nause	0d88f292dc	Key for parameter "blacklist name" is "list" in all servlets now.	11 years ago
reger	80e0ee92e5	adjust search page layout - search box to current style	11 years ago
reger	a81dfc27eb	remove obsolete css class bookmarkfieldset	11 years ago
Michael Peter Christen	0898f0be17	input-group for main search input window	11 years ago
Michael Peter Christen	9bb616d778	enhanced HostBrowser buttons and fixed text input alignment	11 years ago
Michael Peter Christen	4a818ad72c	fix for strange fail reason	11 years ago
Michael Peter Christen	a2fba6584f	use submitted default userAgent if cloning a crawl	11 years ago
Marc Nause	e0822fa008	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	11 years ago
Marc Nause	c97da1a0d8	First draft of a blacklist API.	11 years ago
reger	312972c586	add display filter (active/disabled) to IndexSchema_p.html config for easier overview of schema fields	11 years ago
Michael Peter Christen	d79d7dde55	fix for result display	11 years ago
Michael Peter Christen	362c988c05	design fixes to better use the new colours	11 years ago
Michael Peter Christen	bbadccbd8d	better buttons	11 years ago
Michael Peter Christen	a9963d5c95	bootstrap update	11 years ago
reger	4e57000a40	remove redundant javascript & id in index.html to set focus to query field in IE11	11 years ago
reger	121d25be38	recover sax fatal error on OAI-PMH import of xml with entity error this allows to continue loading next resumptionToken even if import file caused sax parser error fix http://mantis.tokeek.de/view.php?id=63	11 years ago
reger	81dc2aa536	add current css to HTMLResponseWriter to fix metadata view (using css from metas.template except js links)	11 years ago
orbiter	c6f0bd05f8	better removal of stored urls when doing a crawl start	11 years ago
orbiter	469e0a62f1	added new button to terminate all crawls	11 years ago
orbiter	4ee4ba1576	fix for NPE in IndexCreateParserErrors_p.html caused by bad handling of lazy value instantiation of 0-value in crawldepth_i	11 years ago
reger	727dfb5875	refactor URIMetadataNode to further unify interaction with index - URIMetadataNode extending SolrDocument - use language as stored (String), reducing conversion to string - optimize debug code in transferIndex	11 years ago
reger	2dabe2009d	- remove unused manual http KeepAlive config (reducing references to obsolete httpdemon) - add port info to settings_http	11 years ago
Michael Peter Christen	10cf8215bd	added crawl depth for failed documents	11 years ago
Michael Peter Christen	b4b0d14c04	fix for display bug	11 years ago
Michael Peter Christen	9a5ab4e2c1	removed clickdepth_i field and related postprocessing. This information is now available in the crawldepth_i field which is identical to clickdepth_i because of a specific crawler strategy.	11 years ago
Michael Peter Christen	da86f150ab	- added a new Crawler Balancer: HostBalancer and HostQueues: This organizes all urls to be loaded in separate queues for each host. Each host separates the crawl depth into its own queue. The primary rule for urls taken from any queue is, that the crawl depth is minimal. This produces a crawl depth which is identical to the clickdepth. Furthermore, the crawl is able to create a much better balancing over all hosts which is fair to all hosts that are in the queue. This process will create a very large number of files for wide crawls in the QUEUES folder: for each host a directory, for each crawl depth a file inside the directory. A crawl with maxdepth = 4 will be able to create 10.000s of files. To be able to use that many file readers, it was necessary to implement a new index data structure which opens the file only if an access is wanted (OnDemandOpenFileIndex). The usage of such on-demand file reader shall prevent that the number of file pointers is over the system limit, which is usually about 10.000 open files. Some parts of YaCy had to be adapted to handle the crawl depth number correctly. The logging and the IndexCreateQueues servlet had to be adapted to show the crawl queues differently, because the host name is attached to the port on the host to differentiate between http, https, and ftp services.	11 years ago
Michael Peter Christen	dd12dd392f	introduction of a data structure for HyperlinkEdges which should use less memory as it does no double-storage of source links for each edge of the graph.	11 years ago
Michael Peter Christen	a37d067692	refactoring	11 years ago
orbiter	95780eed32	Merge branch 'master' of git@gitorious.org:yacy/rc1.git	11 years ago
Michael Peter Christen	6bd8c6f195	fix for wrong status codes of error pages	11 years ago
Michael Peter Christen	9e503b3376	also delete the robots.txt file from the cache when a new crawl is started	11 years ago
orbiter	67501c9dda	Merge branch 'master' of git@gitorious.org:yacy/rc1.git	11 years ago
Michael Peter Christen	1c21b3256d	fix for robots.txt handling: delete old entry before starting a new crawl.	11 years ago
orbiter	c250fac9f4	linkstructure refactoring to get more options for clickdepth analysis	11 years ago
Michael Peter Christen	bd886054cb	new structure and enhancements for link graph computation: - added order option to solr queries to be able to retrieve document lists in specific order, here: link length - added HyperlinkEdge class which manages the link structure - integrated the HyperlinkEdge class into clickdepth computation - extended the linkstructure.json servlet to show also the clickdepth and other statistic information	11 years ago
Michael Peter Christen	c8d4a63604	eliminating the word 'Facet' from the interface because it is ugly. If people do not know what search navigation is, then they also do not know what a 'facet' is.	11 years ago
Michael Peter Christen	e8ddd415a8	enhanced the new link structure graph	11 years ago
Michael Peter Christen	8443255e18	better link structure limit calibration	11 years ago
Michael Peter Christen	7f5733638b	fix for linkstructure computation: now also detecting dead links	11 years ago
orbiter	18f9c40302	moved Edge class out of linkstructure servlet as this does not work on non-eclipse driven environments (all non-dev cases)	11 years ago
Michael Peter Christen	a6bb9be97e	- added d3.js for visualizations using embedded svg - added a servlet api/linkstructure.json which generates a link graph information in json - added a javascript link graph renderer hypertree.js using d3 and the new servlet linkstructure.json - embedded the new link graph in the crawler monitor and the host browser	11 years ago
Michael Peter Christen	c64c10ef00	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	11 years ago
Michael Peter Christen	48fbfa60c1	bugfix to inbound/outbound identification	11 years ago
reger	227c42bc96	eliminate obsolete URIMetaDataRow class by joining it with/into URIMetaDataNode.	11 years ago
Michael Peter Christen	cca851a417	introduced new solr field crawldepth_i which records the crawl depth of a document. This is the upper limit for the clickdepth_i value which may be shorter in case that the crawler did not take the shortest path to the document.	11 years ago
Michael Peter Christen	d321b0314e	added missing servlet html	11 years ago
orbiter	b1ba764d81	fix for first start options and added german translation for popup texts	11 years ago
orbiter	043d274af5	fixed crawl start path for cloned crawls	11 years ago
Michael Peter Christen	1b9ec9a1c5	- added popover to p2p/stealth mode button to explain the peer mode and privacy issues. - added popover to first-time use case to explain that specific servlets are only visible after customization and/or crawl starts	11 years ago
Michael Peter Christen	8d35fcb1c7	transition.js is also included in bootstrap.js	11 years ago
Michael Peter Christen	3abc3c4c4c	removed alert.js, modal.js and tooltip.js as these libraries are all included in bootstrap.js	11 years ago
Michael Peter Christen	898f78258e	fix for naming bug	11 years ago
Michael Peter Christen	39b641d6cd	added tutorial mode - some menu items will only appear if you 'qualify' for them. Thus, the first-time user will only see four menu items. The other items will unfold as the user interacts.	11 years ago
Michael Peter Christen	7a49f72480	fix for crawler column width	11 years ago
Michael Peter Christen	46a1a15441	added more bootstrap libraries	11 years ago
Michael Peter Christen	5ccbfeb803	show host list by default in host browser	11 years ago
Michael Peter Christen	ba0e3fb0dc	fixed crawl start links after renaming them in latest commit	11 years ago
orbiter	d29b6db270	made crawl start pages public since they do not reveal individual information and they are also not used as servlet to actually start the crawl (which is Crawler_p.html).	11 years ago
Michael Peter Christen	e41db47cac	added (again) underline to a tags	11 years ago
Michael Peter Christen	ff82a80eb3	Integrated HostBrowser back to administration interface; it can appear with and without navigation bar.	11 years ago
Michael Peter Christen	94366ba2e5	added template for latest commit	11 years ago
Michael Peter Christen	701df02ead	Complete redesign of administration top-level menu. This follows two principles: - provide an easy tutorial-like "what should I do first" menu - provide all elements which are subject to most first questions to YaCy exhibition people on top level: Resource limitation, Parser and Ranking settings I apologize to everyone who is used to the old style and needs to find the menu items (again) after this change. I hope that this will make the interface more usable for new users who see a web indexer/crawler the first time.	11 years ago
Michael Peter Christen	a3b7366aee	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	11 years ago
Michael Peter Christen	6b66bb7109	redesign of search page integration menu structure	11 years ago
reger	92811d7850	fix: 3 more links pointing to old /xml path	11 years ago
reger	c183d66d40	fix: blacklist xml export path to xml template	11 years ago
Michael Peter Christen	656e2ce62a	replacing direct html table cellspacing with css set-up for cellspacing	11 years ago
reger	e11504309f	adding a hint to javascript browser short cut on Url-Proxy page (AugmentedBrowsing_p.html)	11 years ago
reger	7f29eee9ac	fix: cut-off button in WatchWebStructure_p.html (by header css dd height/line-height)	11 years ago

5001 Commits (ce9368246b2ea9bb0a07a5d5de9ebea330931ee2)