yacy_search_server

Commit Graph

Author	SHA1	Message	Date
luccioman	734340c128	Fixed errors for Search portal mode or when peer is not reachable. Same case as reported on issue #87.	8 years ago
luccioman	6e1959f469	Merge branch 'master' of https://github.com/yacy/yacy_search_server.git Conflicts: htroot/yacysearchitem.java source/net/yacy/cora/federate/solr/responsewriter/YJsonResponseWriter.java source/net/yacy/search/schema/CollectionConfiguration.java source/net/yacy/server/serverObjects.java	8 years ago
reger	7c81160f45	correct blacklist export as text url to blacklists_p.txt was using servlet for network access and missing network.unit.name fix for http://mantis.tokeek.de/view.php?id=694 + prevent unresoved_pattern in yacy/list servlet	8 years ago
reger	91ab8a526a	add error msg to api/share.html and skip display of url on nothing uploaded	8 years ago
luccioman	6e96c7341a	Merge remote-tracking branch 'origin/master' Conflicts: htroot/Load_MediawikiWiki.java htroot/Load_PHPBB3.java htroot/ViewImage.java	8 years ago
reger	4e0892962a	fix NPE in citation servlet on empty text field	9 years ago
reger	d9adc2c255	load handler for Transparent Proxy on startup only if feature is activated to save the resources and keep handler chain small if the feature is not used. +add a warning message on settingsack_p page to restart on first activation	9 years ago
Michael Peter Christen	b89465d952	0N - basic dump upload servlet infrastructure, to share index dumps within an experimental new sharing model	9 years ago
Michael Peter Christen	f12a900f3e	harmonization of http post of files for one and several files - this had been done differently - and wrongly for several files. also: base64-encoding for gzipped push files because our data structures currently only support ASCII POST pushes.	9 years ago
luc	8682dfbd5e	Updated getpageinfo outputs to return page icons list.	9 years ago
luc	3cc5619d93	Improved HTML icons indexing and rendering in search results. See http://mantis.tokeek.de/view.php?id=629	9 years ago
luc	571bc55937	Refactoring : use StandardCharsets constants instead of hard-coded charset names.	9 years ago
luc	55a4d15775	Added a note on deprecated default search field and operator.	9 years ago
reger	52a9040ae6	Sort out duplicate keywords (dc_subject) early in parsed documents - by directly using Set instead of List - remove unneeded String[] getter	9 years ago
reger	a60b1fb6c2	differentiate api call getLocalPort() from getConfigInt()	9 years ago
sixcooler	87e4abe393	fight the fieldcache by using DocValues: in Solr-5.x the fieldcache has moved and was not cleared anymore. This results in a huge fieldcache. (http://lucene.apache.org/#highlights-of-the-lucene-release-include https://issues.apache.org/jira/browse/LUCENE-5666) Here I try to use DocValues where it is possible. For this I used the Api-Scheme as new basis for the Solr-Schema. This needs at least a complete optimization of the Solr-Index to get a smaller FieldCache. Everything that is indexed with these settings will not use the Fieldcache at all.	9 years ago
Michael Peter Christen	b43811d38c	added surrogate import process for exported solr dumps. Just throw your solr dump file into DATA/SURROGATES/in/ and it will be imported!	10 years ago
reger	3e742d1e34	Init remote crawler on demand. If the remote crawl option is not activated, skip init of remoteCrawlJob to save the resources of queue and idling thread. Deployment of the remoteCrawlJob is deferred until activation of the option.	10 years ago
reger	609c52e987	refactor getBookmark to consistently check existence by != null (w/o throwing exception on not found)	10 years ago
reger	8a5b8f8789	on bookmarking of search result, remember orig. query in separate bookmark property (instead of using the description field) - adjust display and autosearch - don't overwrite existing bookmark but combine info	10 years ago
Michael Peter Christen	fed26f33a8	enhanced timezone management for indexed data: to support the new time parser and search functions in YaCy a high precision detection of date and time of day is necessary. That requires that the time zone of the document content and the time zone of the user doing a search are detected. The time zone of the search request is detected automatically using the browser's time zone offset, which is delivered to the search request automatically and invisibly to the user. The time zone for the content of web pages cannot be detected automatically and must be an attribute of crawl starts. The advanced crawl start now provides an input field to set the time zone in minutes as an offset number. All parsers must get a time zone offset passed, so this required a change of the parser java api. A lot of other changes have been made which correct the wrong handling of dates in YaCy, which was to add a correction based on the time zone of the server. Now no correction is added and all dates in YaCy are in the UTC/GMT time zone, a normalized time zone for all peers.	10 years ago
reger	7fcf0d0b71	fix missing display of CrawlerMonitor -> robots.txt Monitor revert delete of file api/table_p.html see `3ffe19b85c` (still used in this menu)	10 years ago
Michael Peter Christen	710a0efa1b	generalized time period computations	10 years ago
Michael Peter Christen	974d58b01f	IPv6 Fix for push interface	10 years ago
Michael Peter Christen	69eacdf4eb	applying precompiled CommonPattern.COMMA.split to all places where split(",") was used	10 years ago
Michael Peter Christen	bee5ee7cce	removed some warnings	10 years ago
Michael Peter Christen	6390454652	fix for vocabulary on/off setting	10 years ago
Michael Peter Christen	7db2888336	fixed font size and print page generation in pdf snapshots	10 years ago
reger	198102304b	refactor size() -> filesize() of URIMetadataNode (harmonize with ResultEntry and to not get confused with Collection.size())	10 years ago
Michael Peter Christen	5516819354	preventing the use of no-cache and expires in case that images are generated dynamically which will stay static in the future. This applies mainly to the search result favicon in front of search hits. These icons will now be generated once, but then cached in the browser. There is also a YaCy-internal cache for these icons which had prevented the re-generation of the icons in YaCy, but this cache is now superfluous since the browser should not call the servlet ViewImage again.	10 years ago
Michael Peter Christen	932faafffe	reactivated on-demand snapshot loading	10 years ago
Michael Peter Christen	2362ad7c34	fix for a count issue in snapshot api	10 years ago
Michael Peter Christen	9971e197e0	Added a transaction interface to the snapshots: all documents in the snapshots can now be processed with transactions using commit and rollback commands. Furthermore, a large number of monitoring methods have been added to check the success of transactions. The transactions for snapshots have two main components: an rss search API to get information about latest/oldest entries and a commit/rollback API to move entries away from the rss results. This is done by using two storage locations for the snapshots, INVENTORY and ARCHIVE. New snapshots are placed in INVENTORY, committed snapshots move to ARCHIVE, and rolled-back snapshots move to INVENTORY again.

Normal Workflow: Besides all the options below, it is usually sufficient to process data like this:
- call http://localhost:8090/api/snapshot.rss?state=INVENTORY&order=LATESTFIRST
- process the rss result and use the <guid> value as <urlhash> (see next command)
- for each processed result call http://localhost:8090/api/snapshot.json?command=commit&urlhash=<urlhash>
- then you can call the rss feed again and the committed urls are omitted from the next set of items.

These are the commands to control this:

The rss feed:
http://localhost:8090/api/snapshot.rss?state=INVENTORY&order=LATESTFIRST
http://localhost:8090/api/snapshot.rss?state=INVENTORY&order=OLDESTFIRST
http://localhost:8090/api/snapshot.rss?state=INVENTORY&order=ANY
http://localhost:8090/api/snapshot.rss?state=ARCHIVE&order=LATESTFIRST
http://localhost:8090/api/snapshot.rss?state=ARCHIVE&order=OLDESTFIRST
http://localhost:8090/api/snapshot.rss?state=ARCHIVE&order=ANY
The feed will return a <urlhash> in the <guid> field of the rss. This must be used for commit/rollback.

Commit/Rollback:
http://localhost:8090/api/snapshot.json?command=commit&urlhash=<urlhash>
http://localhost:8090/api/snapshot.json?command=rollback&urlhash=<urlhash>
The json will return a property list containing the property "result" with possible values "success" or "fail", according to the result. If a "fail" occurs, please look into the log for further info.

Monitoring:
http://localhost:8090/api/snapshot.json?command=status
This shows the total number of entries in the INVENTORY and the ARCHIVE.
http://localhost:8090/api/snapshot.json?command=list
This returns a list of all hosts which have snapshots and the number of entries for the hosts. Counts for INVENTORY and ARCHIVE are listed in the properties "count.INVENTORY" and "count.ARCHIVE".
http://localhost:8090/api/snapshot.json?command=list&depth=2
The list can be restricted to entries which have a specific depth. The list then contains the same host names, but the count values change because only documents at that specific crawl depth are listed.
http://localhost:8090/api/snapshot.json?command=list&host=yacy.net.80
This lists all urlhashes for the given host, not only an accumulated list of the number of entries.
http://localhost:8090/api/snapshot.json?command=list&host=yacy.net.80&depth=0
This restricts the list of urlhashes for that host to the given depth.
http://localhost:8090/api/snapshot.json?command=list&state=INVENTORY
http://localhost:8090/api/snapshot.json?command=list&state=ARCHIVE
This selects either the INVENTORY or ARCHIVE for all list commands; the default is ALL, which means that the host information is collected from both snapshot directories and combined. You can use the state option for all the commands listed above.

Detailed Information:
http://localhost:8090/api/snapshot.json?command=metadata&urlhash=upiFJ7Fh1hyQ
This collects metadata information for the given urlhash. This can also be restricted with state=INVENTORY and state=ARCHIVE to test whether the document is in one of these snapshot directories. If a urlhash is not found, an empty result is returned. If an entry was found and the state was not restricted, then the result contains a state property containing the name of the location where the document is, either INVENTORY or ARCHIVE.

Hint: If a very large number of documents is inside INVENTORY, then it could be better to call the rss feed with http://localhost:8090/api/snapshot.rss?state=INVENTORY&order=ANY because that is very efficient.	10 years ago
Michael Peter Christen	c3c2b6999b	fixes on wkhtmltopdf	10 years ago
Michael Peter Christen	8df8ffbb6d	enhanced the snapshot functionality: - snapshots can now also be xml files which are extracted from the solr index and stored as individual xml files in the snapshot directory along with the pdf and jpg images - a transaction layer was placed above the snapshot directory to distinguish snapshots into 'inventory' and 'archive'. This may be used to do transactions of index fragments using archived solr search results between peers. This is currently unfinished, we need a protocol to move snapshots from inventory to archive - the SNAPSHOT directory was renamed to snapshot and now contains two snapshot subdirectories: inventory and archive - snapshots may now be generated by everyone, not only by peers running on a server with wkhtmltopdf installed. The expert crawl start provides the option for snapshots to everyone. PDF snapshots are now optional and the option is only shown if wkhtmltopdf is installed. - the snapshot api now provides the request for historised xml files, i.e. call: http://localhost:8090/api/snapshot.xml?urlhash=Q3dQopFh1hyQ The content of such xml files is identical to solr search results with only one hit. The pdf generation has been moved from the http loading process to the solr document storage process. This may slow down the process a lot and a different version of the process may be needed.	10 years ago
Michael Peter Christen	d97deb5555	npe fix	10 years ago
Michael Peter Christen	4fe4bf29ad	added rss feed output to snapshot servlet which can be used to get a list of latest/oldest entries in the snapshot database. This is an example: http://localhost:8090/api/snapshot.rss?depth=2&order=LATESTFIRST&host=yacy.net&maxcount=100 The properties depth, order, host and maxcount can be omitted. The meanings of the fields are: host: select only urls from this host or all, if not given; depth: select only urls at that crawl depth or all, if not given; maxcount: select at most the given number of urls or 10, if not given; order: either LATESTFIRST to select the youngest entries, OLDESTFIRST to select the first entries or ANY to select any. The rss feed needs administration rights to work; a call to this servlet with rss extension must attach login credentials.	10 years ago
reger	ff18129def	ViewFile servlet: update index if newer, so viewed text and metadata (stored) info is similar - to archive it, use request with profile to allow indexing (defaultglobaltext) and update index (the resource is loaded, parsed anyway, so it's not a expensive operation) Request: remove 2 unused init parameter - number of anchors of the parent - forkfactor sum of anchors of all ancestors	10 years ago
Michael Peter Christen	226aea5914	added a servlet which can create preview images, preview thumbnails and preview pdfs from web pages, i.e.: http://localhost:8090/api/snapshot.png?url=http://yacy.net/en/&width=128&height=128 http://localhost:8090/api/snapshot.jpg?url=http://yacy.net/en/&width=128&height=128 http://localhost:8090/api/snapshot.pdf?url=http://yacy.net/en/ This also supports an on-the-fly generation of the preview documents if the user is an administrator. Otherwise, the servlet fails. To enable this, you must add wkhtmltopdf, imagemagick and (on headless servers) xvfb to your operating system. For detailed instructions, see `97f6089a41`	10 years ago
Michael Peter Christen	0550b54d56	added fix to postprocessing: avoid caching of postprocessing collection to always get fresh lists of documents. This is necessary since the postprocessing changes the same documents which the postprocessing-collection query selects.	10 years ago
Michael Peter Christen	1db476c67e	fix for bad table iteration	10 years ago
orbiter	3ffe19b85c	replaced old /api/table_p.xml servlet with /Tables_p.xml to avoid double code	10 years ago
Michael Peter Christen	07c5b57953	removed warnings	10 years ago
reger	f5967dfedf	add filter to citation page and a on/off button to display only sentences with citations, while maintaining the sentence number. Make the filtered list the default in search result citation link	10 years ago
Michael Peter Christen	0bfc69b29b	more ipv6 bugfixes	10 years ago
Marc Nause	1e6e69bc40	Finished implementation of UPNP: *) will try other ports if YaCy standard ports are not available *) distinguish between internal and external port (not sure if this works 100%) Still to add: property in config to enter own external port (in case of manually configured NAT)	10 years ago
orbiter	3ac31614a3	added option to reverse-sort YaCy tables (internal API change only)	10 years ago
Michael Peter Christen	2a52c6f0f1	using htroot/api/blacklists as source folder: removed package declaration of some classes in that folder	10 years ago
reger	6654d314f1	add rss version to api/feed.rss IE11 reports error without	10 years ago
orbiter	2371d6b8db	target linktexts must be string to enable search facets on these fields	10 years ago
orbiter	22ce4fb4dd	better error handling for remote solr queries and exists-checks	10 years ago
Michael Peter Christen	2de159719b	added an option to set 'obey nofollow' for links with rel="nofollow" attribute in the <a> tag for each crawl. This introduces a lot of changes because it extends the usage of the AnchorURL Object type which now also has a different toString method than the underlying DigestURL.toString. It is therefore not advised to use .toString at all for urls, just use toNormalform(false) instead.	10 years ago
Michael Peter Christen	8514bffc22	enhanced postprocessing status report	10 years ago
orbiter	59160984cc	timeline performance update	11 years ago
orbiter	2073e69034	fix for long periods in timeline	11 years ago
Michael Peter Christen	8c52f0651b	refactoring of AccessTracker events & timeline fix	11 years ago
Michael Peter Christen	74206a10c7	refactoring	11 years ago
Michael Peter Christen	36e623d8bf	enhanced metadata enrichment for media file type search: - Web servers may now deliver YaCy-specific http header fields with a title and keywords. The new http header fields are: X-YaCy-Media-Title - to be used for media (image, audio, video) titles X-YaCy-Media-Keywords - to be used for media (image, audio, video) keywords - both fields are written to document fields title and keywords and are searched also during image search. - to make the usage of arbitrary http header fields (including these new fields) possible in the /api/push_p.json servlet, a new POST argument is also introduced to push http header fields. The new POST attribute is named "responseHeader-X" (where X is the counter). It is allowed to use this attribute as multi-attribute several times, each can be filled with a http header line. - see /api/push_p.html for examples	11 years ago
Michael Peter Christen	8fd72b5e8b	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	11 years ago
Michael Peter Christen	81d0f01a6f	added 'synchronous' and 'commit' flags in push api	11 years ago
Marc Nause	f443cfa32d	Improvements and bugfixes for recording actions of blacklist API.	11 years ago
orbiter	4177c9cf05	fix for crawl start check	11 years ago
Michael Peter Christen	74c249288a	added a push api to make it possible to upload files directly without crawling to the YaCy indexer. Files are uploaded using POST multipart requests; multiple file uploads are possible as well. Each file has attached the file date and mime type which is used to get the right parser for the submitted data. Also an url is submitted which is assigned to the document. The CrawlSwitchboard has a new option for default Crawl Profiles which are assigned dynamically from the new push interface.	11 years ago
Michael Peter Christen	b3b174e2b8	fixed webgraph postprocessing and status display in Crawler_p servlet	11 years ago
Michael Peter Christen	2520590b45	migrated from pdfbox 1.8.4 to 1.8.5. They have a very long bugfix list for that update: http://www.apache.org/dist/pdfbox/1.8.5/RELEASE-NOTES.txt	11 years ago
Marc Nause	4723329e29	Improved blacklist XML/JSON API.	11 years ago
orbiter	0d8072aa99	removed warnings	11 years ago
Marc Nause	f98ccf952f	Improved Blacklist API: *) added JSON support *) fixed Exception in case of missing parameters *) renamed parameter for items in "add entry" and "delete entry" from "entry" to "item" to match term in XML	11 years ago
Marc Nause	0d88f292dc	Key for parameter "blacklist name" is "list" in all servlets now.	11 years ago
Marc Nause	c97da1a0d8	First draft of a blacklist API.	11 years ago
reger	727dfb5875	refactor URIMetadataNode to further unify interaction with index - URIMetadataNode extending SolrDocument - use language as stored (String), reducing conversion to string - optimize debug code in transferIndex	11 years ago
Michael Peter Christen	dd12dd392f	introduction of a data structure for HyperlinkEdges which should use less memory as it does no double-storage of source links for each edge of the graph.	11 years ago
Michael Peter Christen	a37d067692	refactoring	11 years ago
orbiter	c250fac9f4	linkstructure refactoring to get more options for clickdepth analysis	11 years ago
Michael Peter Christen	bd886054cb	new structure and enhancements for link graph computation: - added order option to solr queries to be able to retrieve document lists in specific order, here: link length - added HyperlinkEdge class which manages the link structure - integrated the HyperlinkEdge class into clickdepth computation - extended the linkstructure.json servlet to show also the clickdepth and other statistic information	11 years ago
Michael Peter Christen	e8ddd415a8	enhanced the new link structure graph	11 years ago
Michael Peter Christen	7f5733638b	fix for linkstructure computation: now also detecting dead links	11 years ago
orbiter	18f9c40302	moved Edge class out of linkstructure servlet as this does not work on non-eclipse driven environments (all non-dev cases)	11 years ago
Michael Peter Christen	a6bb9be97e	- added d3.js for visualizations using embedded svg - added a servlet api/linkstructure.json which generates a link graph information in json - added a javascript link graph renderer hypertree.js using d3 and the new servlet linkstructure.json - embedded the new link graph in the crawler monitor and the host browser	11 years ago
Michael Peter Christen	48fbfa60c1	bugfix to inbound/outbound identification	11 years ago
Michael Peter Christen	a3b7366aee	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	11 years ago
reger	92811d7850	fix: 3 more links pointing to old /xml path	11 years ago
Michael Peter Christen	656e2ce62a	replacing direct html table cellspacing with css set-up for cellspacing	11 years ago
orbiter	f8f88d4e81	replaced pdblue-homebrew buttons with bootstrap standard buttons	11 years ago
Michael Peter Christen	85a427ec54	support for multiple sitemaps in robots.txt	11 years ago
Michael Peter Christen	bcd9dd9e1d	enhanced concurrent loading by using a fixed set of concurrent loader processes in favor of throwaway-processes. The control mechanism does less often report a 'queue full' message to the busy loop which then does not perform a long busy waiting; instead all requests are queued and new loader processes are started if necessary up to a given limit (as set before)	11 years ago
Michael Peter Christen	fdaeac374a	- enhanced postprocessing speed and memory footprint (by using HashMaps instead of TreeMaps) - enhanced memory footprint of database indexes (by introduction of optimize calls) - optimize calls shrink the amount of used memory for index sets if they are not changed afterwards any more	11 years ago
Michael Peter Christen	1bbc0fe6d2	added a properties file format for the status_p api to support reading of that information with the java Properties class (very easy for small clients)	11 years ago
Michael Peter Christen	e40511f307	extended the status_p api with disk space information	11 years ago
Michael Peter Christen	0f6b72f24b	do not use luke requests for remote solr servers if the result is different from normal requests. This happens if the remote solr is actually a solrCloud; in such cases the luke request returns only the result of the single solr peer, not the whole cloud. also done: some refactoring.	11 years ago
orbiter	f6e441dd77	refactoring	11 years ago
Michael Peter Christen	6e59ca4ebf	removed jena library and all code that depended on jena. When jena was introduced, it was also used for search facets. The generic search facets are now deduced from generic solr fields which makes jena as tool for facet semantics superfluous.	11 years ago
reger	193b8235c2	remove double jquery-1.3.1.js and adjust header links to jquery-1.3.2	11 years ago
Michael Peter Christen	77531850b5	reverted crawling strategy from latest commit.	11 years ago
Michael Peter Christen	c0da966dfa	enhanced crawler speed	11 years ago
reger	97e84439fb	adjusted ConfigHeuristic and changed QueryGoal.getOriginalQueryString to .getQueryString - since the specific heuristics Twitter & Blekko are no longer available or are redundant with OpenSearchHeuristic, adjusted ConfigHeuristic to use OpensearchHeuristic settings only. For this the default OSD search target list is made available (copied) by default and the other configs are removed. - the return of QueryGoal.getOriginalQueryString includes the queryModifier, which are held separately in a modifier object, but in most (all) cases just the query term is expected, clarified and renamed it to QueryGoal.getQueryString which returns just the search term (if needed a .getOriginalQueryString could be implemented in Queryparameters, adding the modifiers) - started to adjust internal html href references from absolute to relative (currently it is mixed). For future development we should prefer relative href targets (less trouble with context aware servlets)	11 years ago
reger	e05320b776	upd: to open more external links in new browser-tab	11 years ago
Michael Peter Christen	74466d731a	use pre-compiled patterns in ymark	11 years ago
Michael Peter Christen	0db8e34625	enhanced webgraph processing	11 years ago
orbiter	19a051bec8	more monitoring for postprocessing and enhanced layout in Crawler monitor page	11 years ago
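
The "Normal Workflow" described in commit 9971e197e0 above (read the rss feed, take each <guid> value as the urlhash, then call the commit command) could be sketched on the client side roughly as follows. This is only an illustration, not code from the repository: the class name SnapshotWorkflow, its method names and the sample feed text are hypothetical, the HTTP requests and the admin login mentioned in commit 4fe4bf29ad are left out, and the regex-based guid extraction is a shortcut where a real client would probably use an XML parser. Only the URL patterns are taken from the commit messages.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical client-side helper; not part of yacy_search_server.
public class SnapshotWorkflow {

    // Base address as it appears in the commit message examples; a real peer may differ.
    static final String BASE = "http://localhost:8090/api/";

    // Captures the text content of each <guid> element (simplification, see lead-in).
    private static final Pattern GUID =
            Pattern.compile("<guid[^>]*>\\s*([^<\\s]+)\\s*</guid>");

    // Builds the feed URL, e.g. state=INVENTORY and order=LATESTFIRST.
    public static String feedUrl(String state, String order) {
        return BASE + "snapshot.rss?state=" + state + "&order=" + order;
    }

    // Builds a commit or rollback URL for one urlhash.
    // URLEncoder.encode(String, Charset) requires Java 10 or newer.
    public static String commandUrl(String command, String urlhash) {
        return BASE + "snapshot.json?command=" + command
                + "&urlhash=" + URLEncoder.encode(urlhash, StandardCharsets.UTF_8);
    }

    // Collects the <guid> values of a feed; these are used as urlhashes.
    public static List<String> extractUrlHashes(String rss) {
        List<String> hashes = new ArrayList<>();
        Matcher m = GUID.matcher(rss);
        while (m.find()) {
            hashes.add(m.group(1));
        }
        return hashes;
    }

    public static void main(String[] args) {
        // Made-up feed content standing in for a real response.
        String sampleRss = "<rss version=\"2.0\"><channel>"
                + "<item><guid isPermaLink=\"false\">upiFJ7Fh1hyQ</guid></item>"
                + "<item><guid isPermaLink=\"false\">Q3dQopFh1hyQ</guid></item>"
                + "</channel></rss>";

        // Step 1: the feed that would be requested.
        System.out.println("GET " + feedUrl("INVENTORY", "LATESTFIRST"));

        // Steps 2 and 3: one commit call per processed entry.
        for (String hash : extractUrlHashes(sampleRss)) {
            System.out.println("GET " + commandUrl("commit", hash));
        }
    }
}
```

Running main only prints the URLs a client would request; replacing "commit" with "rollback" gives the call that, per the commit message, moves an entry back to INVENTORY.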


484 Commits (7b108dadf77b0e2734570b872ea132b26b4dc474)