yacy_search_server

Commit Graph

Author	SHA1	Message	Date
Michael Peter Christen	fbf85a1561	added temporary debug output in http client	10 years ago
Michael Peter Christen	ff29b0e503	added option to re-index exported xml snapshot dumps to HTCACHE/snapshots by just placing them in the SURROGATES/in path	10 years ago
Michael Peter Christen	6f4fe4b175	revert of `8a7c68e4c7` keeping surrogates after processing is essential for some users. If the space they are taking is too high, please set up an automatic deletion process (like a cronjob).	10 years ago
Michael Peter Christen	97930a6aad	added must-not-match filter to snapshot generation. also: fixed some bugs	10 years ago
Michael Peter Christen	9d8f426890	adding a try-catch to link graph processing to prevent that a single malformed url interrupts the storage process	10 years ago
reger	8a5b8f8789	on bookmarking of search result, remember orig. query in separate bookmark property (instead of using the description field) - adjust display and autosearch - don't overwrite existing bookmark but combine info	10 years ago
reger	7224209486	break out of NormalizeDistributor loop on timeout	10 years ago
reger	47e61f8325	fix typo in image filter query (extra bracket)	10 years ago
reger	4b4ab6799f	fix String out of range in Collection Nav see http://mantis.tokeek.de/view.php?id=573	10 years ago
reger	572cfe8fd4	improve character encoding for urlproxy servlet for non-UTF-8 pages	10 years ago
reger	6bc8a9b11e	make Quality of Service Servlet available to prioritize requests from local host This assigns priorities to incoming requests. Higher priority numbers are served before lower. (disabled by default in defaults/web.xml, uncomment or copy entry to DATA/Settings/web.xml)	10 years ago
Ryszard Goń	ca1a70aec8	fix for Accept '?' URLs column in Crawl Profile List	10 years ago
reger	5408448a56	skip redundant add. of keywords to text search uses keywords as default search field	10 years ago
reger	296e97c78e	put https port in peers dna as we flag if a peer is accessible via https, we need to know the port if we want to use it (e.g. for interYaCy communication) start to provide / transport the port by recording it in peers dna. - add https link on the Network.html lock symbol	10 years ago
Michael Peter Christen	fed26f33a8	enhanced timezone management for indexed data: to support the new time parser and search functions in YaCy a high precision detection of date and time on the day is necessary. That requires that the time zone of the document content and the time zone of the user, doing a search, is detected. The time zone of the search request is done automatically using the browser's time zone offset which is delivered to the search request automatically and invisible to the user. The time zone for the content of web pages cannot be detected automatically and must be an attribute of crawl starts. The advanced crawl start now provides an input field to set the time zone in minutes as an offset number. All parsers must get a time zone offset passed, so this required the change of the parser java api. A lot of other changes had been made which correct the wrong handling of dates in YaCy which was to add a correction based on the time zone of the server. Now no correction is added and all dates in YaCy are UTC/GMT time zone, a normalized time zone for all peers.	10 years ago
Michael Peter Christen	b060ba900d	added parsing of contentprop attribute in html tags for content='startDate' and content='endDate'. The value of these field is now written to new solr fields startDates_dts and endDates_dts.	10 years ago
Michael Peter Christen	4cb4f67f38	added parsing of dd, dt and article html fields. The parsed result is written to special solr fields which are deactivated by default.	10 years ago
reger	1395f10e95	fix typecast for css links	10 years ago
Michael Peter Christen	3288489fd2	more logging during start-up	10 years ago
Michael Peter Christen	abaaaef5f1	fix for filter queries	10 years ago
Michael Peter Christen	4d00175157	<experimental> added parsing of <article> html element. Whenever such an element occurs, the complete content of all article elements replaces the parsed <content> part of documents.	10 years ago
Michael Peter Christen	1df6492019	enhanced suggestions	10 years ago
Michael Peter Christen	ae02c92fd0	logging fix	10 years ago
Michael Peter Christen	5651713134	better debugging of fq	10 years ago
Michael Peter Christen	f5a032f293	split query into filter query and text query to get better ranking results and faster results	10 years ago
Michael Peter Christen	2e88028c1a	when selecting collections in navigation, do show the un-selected collections in search result. When selecting one of them in another search, switch off the previously selected collection. This actually turns the collection navigation modifier into a radio-button like behaviour	10 years ago
Michael Peter Christen	1de9b21c65	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	10 years ago
reger	5f4cd8d6f5	replace deprecated getIP with getIPs in AbstractRemoteHandler	10 years ago
Michael Peter Christen	fa7edc9f7a	refactoring of filter queries (several queries instead only one)	10 years ago
Michael Peter Christen	40389987ec	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	10 years ago
Michael Peter Christen	f9ba50379d	added an expansion option to search facets on result page: - if less or equal of 8 facet options are present, they are shown by default - if more facet options are present, they are hidden To view or hide all facets, just click on the facet header bar	10 years ago
reger	1f0f77bb77	make location facet return results for location nav facet of field coordinate_p does not return results, now using coordinate_p_0_coordinate as alternative to get facet counts. As the actual facet value is not used this should not harm any analysis (even if facet is an incomplete location). If facet value is used in future likely *_geohash field could be introduced (for facet and other ... as transport value)	10 years ago
reger	b1ec0644e5	fix NPE in location search on missing/empty PubDate in underlying rss data	10 years ago
reger	c1dcc8c456	fix display and limit of max server connections after startup (on restart value returned to default=50) This has no effect on Jetty but the limit is still respected.	10 years ago
reger	839b962c20	correct percent encoding for '%' char	10 years ago
Michael Peter Christen	9bf0d7ecb9	added a new collection type 'dht' to all documents from the peer-to-peer interface to distinguish rich and poor document data. This also reverts some changes from commit `796770e070` because the firstSeen database is the wrong method to distinguish these types of data	10 years ago
reger	796770e070	prevent overwrite of crawled or received full documents by (newer) metadata To protect rich index data (full resource) from overwriting by metadata gathered during remote search, the newly introduced "firstSeen" index is used to differentiate between full-resource-doc and metadata, as a "firstSeen" entry is only added on store's of full-resource-docs (during crawl or remote search).	10 years ago
Michael Peter Christen	ee2490ab98	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	10 years ago
reger	431311df42	fix get fresh_date_dt to allow returned value to be date in future	10 years ago
otter	74c7e8b686	Fixes hanging FlushThread (see http://forum.yacy-websuche.de/viewtopic.php?f=5&t=5447) by replacing put() method by the more robust add() to add a merge job to the queue.	10 years ago
reger	f63fff9008	fix snippet containing number with comma as decimal point http://mantis.tokeek.de/view.php?id=344 to keep it as one word (by altering the split regex) - added snippet test case with number - regex for word split to match multiple split chars	10 years ago
reger	b241264632	fix error on *abc query input http://mantis.tokeek.de/view.php?id=486	10 years ago
reger	2ef8ffdb60	apply UTF-8 encoding copied from escape()	10 years ago
reger	7120ea42f1	fix for path with char code > 255 (causing index out of bound exception) + test case for it	10 years ago
reger	1d81bd0687	fix url encoding for path see http://mantis.tokeek.de/view.php?id=559 So far we used same escape procedure for all parts of the url (which includes x-www-form-urlencoded for all url components) Added capability to use different encoding rules for the different url components (through specific bitset for each component). (this is inspired by org.apache.http.client and java.net.uri implementation). - Added test case for http://mantis.tokeek.de/view.php?id=559	10 years ago
reger	62087fb8b2	fix MultiProtocolURL mailto protocol detection	10 years ago
reger	2e8c24e02a	fix link to DeReWo download file	10 years ago
reger	706f75ddc2	try to fix hang on index blob merge on shutdown http://mantis.tokeek.de/view.php?id=505 It happens but not able to reproduce. This change makes sure terminate signal is caught at end of currently running merge jobs	10 years ago
reger	f94e34058c	fix url (path) %-decoding http://mantis.tokeek.de/view.php?id=519 - add test case for this	10 years ago
reger	7e09bff4a1	exclude default search fields from text copy to text_t for metadata index documents (reduce text redundance)	10 years ago
reger	86073a5ba3	For remote crawlReceipt add document abstract/description enhance the returned metadata returned to the originator by description_txt to improve fulltext search result hits.	10 years ago
reger	8af70950d9	harmonize snippet computation to consider description_txt always (solr hl & internal). For now just added desc to text list for computation, could be further equalized with hl computation.	10 years ago
Michael Peter Christen	fd4e2c809a	Show dates in the content of a document in the search result: - if an eventDate is given in the search result, replace the document date with the event date and prefix it with the string "on ". - the document date is omitted if a date from the content is shown Added also the date as fields in the json and rss result sets.	10 years ago
Michael Peter Christen	893889bc7b	added special terms for on: - Date modifier: tomorrow, today; i.e.: search for: "Berlin on:tomorrow" to find events happening tomorrow in Berlin	10 years ago
Michael Peter Christen	710a0efa1b	generalized time period computations	10 years ago
Michael Peter Christen	d9d3111d10	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	10 years ago
Michael Peter Christen	535f1ebe3b	added a new way of content browsing in search results: - date navigation The date is taken from the CONTENT of the documents / web pages, NOT from a date submitted in the context of metadata (i.e. http header or html head form). This makes it possible to search for documents in the future, i.e. when documents contain event descriptions for future events. The date is written to an index field which is now enabled by default. All documents are scanned for contained date mentions. To visualize the dates for a specific search results, a histogram showing the number of documents for each day is displayed. To render these histograms the morris.js library is used. Morris.js requires also raphael.js which is now also integrated in YaCy. The histogram is now also displayed in the index browser by default. To select a specific range from a search result, the following modifiers had been introduced: from:<date> to:<date> These modifiers can be used separately (i.e. only 'from' or only 'to') to describe an open interval or combined to have a closed interval. Both dates are inclusive. To select a specific single date only, use the 'to:' - modifier. The histogram shows blue and green lines; the green lines denote weekend days (Saturday and Sunday). Clicking on bars in the histogram has the following reaction: 1st click: add a from:<date> modifier for the date of the bar 2nd click: add a to:<date> modifier for the date of the bar 3rd click: remove from and date modifier and set a on:<date> for the bar When the on:<date> modifier is used, the histogram shows an unlimited time period. This makes it possible to click again (4th click) which is then interpreted as a 1st click again (sets a from modifier). The display feature is NOT switched on by default; to switch it on use the /ConfigSearchPage_p.html servlet.	10 years ago
reger	d7259419f3	postpone raw snippet html encoding upon use instead of during init of snippet addressing http://mantis.tokeek.de/view.php?id=551	10 years ago
reger	de56d934b2	apply query parameter getQueryFields() to GSA servlet	10 years ago
reger	2d2299f484	fix mimetype of rss items in rss parser - remove self reference as anchor for items	10 years ago
Michael Peter Christen	b432049d59	enhanced date parsing time	10 years ago
reger	9b0de2de64	introduce getQueryFields to return default query fields (query parameter QF) calculated from boostfields config, making sure title, description, keywords and content is always searched. - apply change to solrServlet makes sure every remote query uses at least all locally defined boost fields for search - apply to local solr search - simplify select query by using QF defaults	10 years ago
reger	a0f04db9ea	add extracted description/subject to pptParser	10 years ago
reger	8ec1db76ee	url unescape add check for inconsistent utf8 multibyte parsing If the url contains special chars (like umlauts äöü) it's interpreted as multibyte char and actually not converted at all (removed). Added a check if the multibyte conversion is not complete, just add the char as is. This fixes http://mantis.tokeek.de/view.php?id=200	10 years ago
reger	4b97ddb9ec	stop sending crawl receipts if receiver got offline	10 years ago
reger	7e35518787	add extracted description/subject to docParser	10 years ago
reger	f0a5188e11	replace deprecated HTTPClient setStaleConnectionCheckEnabled with setValidateAfterInactivity()	10 years ago
reger	7b569d2dbe	replace deprecated HTTPClient ALLOW_ALL_HOSTNAME_VERIFIER with NoopHostnameVerifier()	10 years ago
reger	fba34e12ef	fix formatting issue if snippet contains html code replacement for reverted commit `61f42a7928`	10 years ago
reger	e48720a58c	fix NPE in snippet computation	10 years ago
reger	eda0aeaf26	allow/recognize host in file: protocol crawl target This is useful in intranet indexing while crawling an intranet file server accessed via hostname while e.g. under Windows mapped to different drive letters on individual clients. Here you can crawl e.g. file://fileserver/documents having a valid uri in that intranet environment (while e.g. P:/documents might be client dependent).	10 years ago
reger	df83fcc4fc	disable optimistic GC assumption in StandardMemoryStrategy After several tests found that eom is not prevented. Major reason in testing was assumption future GC will free avg of last 5 GC. Disabling this check improved eom exceptions. Added simplest testcase used for verification	10 years ago
Michael Peter Christen	8ff76f8682	the cleanup process experienced a 100% CPU load situation and the loop did not terminate: Occurrences: 100 at java.util.HashMap$KeyIterator.next(HashMap.java:956) at net.yacy.cora.protocol.ConnectionInfo.cleanup(ConnectionInfo.java:300) at net.yacy.cora.protocol.ConnectionInfo.cleanUp(ConnectionInfo.java:293) at net.yacy.search.Switchboard.cleanupJob(Switchboard.java:2212) at sun.reflect.GeneratedMethodAccessor12.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at net.yacy.kelondro.workflow.InstantBusyThread.job(InstantBusyThread.java:105) at net.yacy.kelondro.workflow.AbstractBusyThread.run(AbstractBusyThread.java:215) This tries to fix the problem; the problem should be monitored	10 years ago
Michael Peter Christen	1f5b5c0111	npe fix for latest scraper feature	10 years ago
Michael Peter Christen	ee97302a23	hack to make date detection faster (while it becomes a bit incomplete regarding language alternatives)	10 years ago
Michael Peter Christen	6578ff3ddb	enhanced suggest function	10 years ago
reger	fe6f5a395d	fix Umlaut handling in blekko heuristic search term http://mantis.tokeek.de/view.php?id=169 observation: blekko seems to block xxxbot agents (=0 results)	10 years ago
reger	23924348e2	url with semicolon or comma handling in proxy request apply patch supplied with bugreport http://mantis.tokeek.de/view.php?id=540	10 years ago
reger	9025fe3518	upd error message for proxy fix http://mantis.tokeek.de/view.php?id=539	10 years ago
Michael Peter Christen	97ba5ddbb7	configuration option for maxload limit for remote search	10 years ago
reger	c454ef69c6	add shortMemory check to heuristic search and skip operation on shortMemory (no request to remote opensearch systems)	10 years ago
reger	9e1ec5fec4	refactor: just some more usages of constant for term ":[* TO *]"	10 years ago
reger	8c491f51a5	remove hardcoded initialization of language nav if not used	10 years ago
Michael Peter Christen	b5ac29c9a5	added a html field scraper which reads text from html entities of a given css class and extends a given vocabulary with a term consisting with the text content of the html class tag. Additionally, the term is included into the semantic facet of the document. This allows the creation of faceted search to documents without the pre-creation of vocabularies; instead, the vocabulary is created on-the-fly, possibly for use in other crawls. If any of the term scraping for a specific vocabulary is successful on a document, this vocabulary is excluded for auto-annotation on the page. To use this feature, do the following: - create a vocabulary on /Vocabulary_p.html (if not existent) - in /CrawlStartExpert.html you will now see the vocabularies as column in a table. The second column provides text fields where you can name the class of html entities where the literal of the corresponding vocabulary shall be scraped out - when doing a search, you will see the content of the scraped fields in a navigation facet for the given vocabulary	10 years ago
Michael Peter Christen	1cb290170e	refactoring of autotagging code (combined same code pieces)	10 years ago
Michael Peter Christen	c3b55455fc	enhanced initialization speed of vocabularies by using better normalization and by removal of unused data structures	10 years ago
Michael Peter Christen	68c605d637	replace with CommonPattern.SPACE for split	10 years ago
Michael Peter Christen	de3e373913	using precompiled CommonPattern.TAB for split	10 years ago
Michael Peter Christen	1f5047b15f	using precompiled pattern CommonPattern.SEMICOLON for splits	10 years ago
Michael Peter Christen	a8a2b7a803	persistency for vocabulary facet switch	10 years ago
Michael Peter Christen	efbc9a3561	introducing a new getConfig method which parses comma-separated lists from setting fields; refactoring for all places where such lists are parsed	10 years ago
Michael Peter Christen	69eacdf4eb	applying precompiled CommonPattern.COMMA.split to all places where split(",") was used	10 years ago
Michael Peter Christen	ac19690d30	refactoring with CommonPattern.COMMA	10 years ago
Michael Peter Christen	cf9b22ca5c	do not reindex based on vocabulary fields (there are meanwhile many of them) and some default settings	10 years ago
Michael Peter Christen	5a060c9f26	refactoring of reindexSolr (just replaced constant string)	10 years ago
Michael Peter Christen	b5a55c8b3d	fix for wkhtmltopdf (custom header does not work)	10 years ago
Michael Peter Christen	3d717b749a	fix for urlmaskfilter	10 years ago
Michael Peter Christen	bee5ee7cce	removed some warnings	10 years ago
Michael Peter Christen	783cf6fbc7	the LinkedBlockingQueue is much faster than the ArrayBlockingQueue (strange but this is the result of a test: ArrayBlockingQueue: 39461 lines / second; LinkedBlockingQueue: 60774 lines / second)	10 years ago
Michael Peter Christen	6390454652	fix for vocabulary on/off setting	10 years ago
Michael Peter Christen	a3c5995bde	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	10 years ago
reger	5ca0762179	fix: eom on parsing ico file by genericImageParser trace: java.lang.OutOfMemoryError: Java heap space at java.awt.image.DataBufferInt.<init>(DataBufferInt.java:75) at java.awt.image.Raster.createPackedRaster(Raster.java:467) at java.awt.image.DirectColorModel.createCompatibleWritableRaster(DirectColorModel.java:1032) at java.awt.image.BufferedImage.<init>(BufferedImage.java:331) at net.yacy.document.parser.images.bmpParser$IMAGEMAP.<init>(bmpParser.java:149) at net.yacy.document.parser.images.bmpParser.parse(bmpParser.java:69) at net.yacy.document.parser.images.genericImageParser.parse(genericImageParser.java:116)	10 years ago
Michael Peter Christen	4cd2d68e03	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	10 years ago
Michael Peter Christen	dc5700148f	update to latest code changes from json.org	10 years ago
reger	42b0672be3	Let auto-disabled crawls recover if low resource condition vanished. Analog to autodisabled DHT switch autodisabled crawls back on upon mem ok by remembering the autodisable by conf parameter.	10 years ago
Michael Peter Christen	287c528f46	replaced old JavaApplicationStub for Mac Application framework with new script. Adopted the YaCyApp environment and fixed a problem in the startYACY.sh application wrapper which caused wrong usage of logging option -l which caused that files had been written to the YaCy application folder. As a result of this fix, it is not necessary any more to change path settings in Info.plist if libraries are changed.	10 years ago
Michael Peter Christen	4c9d2a7c64	reverted 'do not show all options' strategy. This is actually confusing new users. Will be activated maybe again if there is an optional tutorial mode which can be switched on for this special purpose of running a tutorial.	10 years ago
Michael Peter Christen	7db2888336	fixed font size and print page generation in pdf snapshots	10 years ago
reger	24f68a4eb7	refactor opensearch heuristic introduce FederateSearchManager handling search heuristic to external systems via specific FederateSearchConnectors, which provide the query() functionality, the translation to YaCy schema .toYaCySchema() and the search() routine to deliver results to searchevents, which is generally implemented in Abstract connector. The manager enforces now a min 15s delay between calls to external systems. Besides the OpensearchConnector a SolrFederateSearchConnector is available. It uses an additional config file for fieldname translation. default heuristicopensearch.conf: - openbdb.com removed - seems no longer to deliver results - config via solrconnector to datacite.org added (large technical library archive)	10 years ago
Michael Peter Christen	3b51636ecb	fix for mediawiki import	10 years ago
Michael Peter Christen	b07afbc115	a test with http://validator.w3.org/feed/#validate_by_input shows that the time format was wrong; we must use RFC-822	10 years ago
Michael Peter Christen	8cafdb989a	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	10 years ago
reger	66839f73fa	remove debug limit from commit before	10 years ago
reger	4214f250d0	Add option for extended search (Autosearch) to Bookmark.html asking all connected peers for the searchterm added as description to the bookmark created by the bookmark icon. Intended for searches/research projects with not sufficient results from local and DHT selected remote target peers. Function: the process checks newly created bookmarks for description starting with "query=..." and takes this to ask every peer for 20 search results and adds it to the local index in a background job. link to start/stop the process added to /Bookmarks.html	10 years ago
reger	8e751d754a	- add javadoc to busythread with hint about the init parameter usage - remove obsolete 10_httpd config parameter	10 years ago
Michael Peter Christen	3e6c3e2237	documents pushed over the api/push_p.html interface will have their unique flag set by default	10 years ago
Michael Peter Christen	35c24608cc	fix for division by zero (rare cases)	10 years ago
Michael Peter Christen	4144c7cc52	do not write frame links to webgraph	10 years ago
reger	4eb89d7f15	revert clickservlet (default was indeed a mistake)	10 years ago
Michael Peter Christen	c9e2128260	please commit new files under your own name, this file was not created by me.	10 years ago
reger	d44d8996d0	Added a “don't store remote search results” option This is intended for peers who want to participate in the P2P network but don't wish to load/fill-up their index with metadata of every received search result. The DHT transfer is not affected by this option (and will work as usual, so that a peer disabling the new store to index switch still receives and holds the metadata according to DHT rules). Downside for the local peer is that search speed will not improve if search terms are only avail. remote or by quick hits in local index. To be able to improve the local index a Click-Servlet option was added additionally. If switched on, all search result links point to this servlet, which forwards the users browser (by html header) to the desired page and feeds the page to the fulltext-index. The servlet accepts a parameter defining the action to perform (see defaults/web.xml, index, crawl, crawllinks) The option check-boxes are placed in ConfigPortal.html	10 years ago
reger	c156548efe	add info text to metadata page (htmlresponsewriter) on no documents found	10 years ago
reger	3ac1d14a21	improve TexParser.mimeOf( fileextension ) by returning 1st defined in supported list. This prevents unusual mapping of supported fileextension -> mimetype (like htm=application/x-tex)	10 years ago
Michael Peter Christen	d2792a43fd	do not write iframe and embed links into webgraph, but use them anyway for crawling	10 years ago
Michael Peter Christen	3cd7deb3b8	do not flush non-errors to stdout because this is a concurrency issue. the flush-call appeared very often in thread dumps with high load, so this hopefully gives some performances	10 years ago
Michael Peter Christen	4e3e2acc69	Merge branch 'master' of gitorious.org:yacy/rc1-fixed_percent-encoding	10 years ago
Michael Peter Christen	ecb6a59e9e	do not translate gif images into png images for thumbnails. Instead, stream the original to the search result thumb viewer. This has two reasons: - animated gifs cause 100% cpu and deadlocks in the jvm gif parser; a known bug which is obviously not yet fixed - animated gifs now appear in the search result also as animation	10 years ago
arucard21	3e9871291f	Applied URL-decoding prior to HTML-encoding. This removes percent-encoding from text shown in HTML	10 years ago
reger	6a04563578	Init Jetty using setDefaultDescriptor (web.xml) to defaults/web.xml so web.xml in defaults dir is applied first and optional DATA/SETTINGS/web.xml loaded on top. By using this Jetty feature (default web.xml) we assure that changes to the default are applied to existing installations and individual addition/changes are still respected.	10 years ago
reger	51ec9c1f44	fix "null" title in response writer for documents with multivalued title	10 years ago
reger	73ba5d8ef7	adjust fieldtype and description of field httpstatus_redirect_s in CollectionSchema - the field is not used (delete candidate)	10 years ago
reger	1f9389396a	fix NPE related 500 (Bad Request) response of UrlProxy on blacklisted urls, by adding parameter HTTPDeamon and removing unused hostAddress lookup code in sendRespondError	10 years ago
reger	f856edecb6	fix proxy redirect (http status 302) response fixes http://mantis.tokeek.de/view.php?id=517 The url given in bug report uses a gzip input stream which causes the HTTPClient.writeto() throw an IOException due to incomplete input stream. This in turn prevents the 302 response to the client browser. By limiting to serve target content just on httpstatus=200 will proxy the header response and client browsers redirect settings can be honored.	10 years ago
Michael Peter Christen	cc090bcb01	enhanced initialization of autotagging	10 years ago
Michael Peter Christen	a0576ec737	fix for pdf sub-page result preparation	10 years ago
Michael Peter Christen	6ad43c4a8b	removed debug code	10 years ago
Michael Peter Christen	407cfff010	fix to wkhtmltopdf usage	10 years ago
Michael Peter Christen	5d321d3dc5	fixes to wkhtmltopdf call	10 years ago
Michael Peter Christen	eb78388a98	changed prefer strategy for http unique in such a way that http is preferred over https. While this is a bad idea from the standpoint of security it is more common applicable for environments where http and https mix and for some domains https is not available. Then the double-check is possible even if no postprocessing is performed.	10 years ago
Michael Peter Christen	9e588944fa	prevent NPE during initialization of very large vocabularies	10 years ago
Michael Peter Christen	aaf7d4775a	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	10 years ago
Michael Peter Christen	8c3e5b7b6d	added experimental pdf splitting which enables YaCy to split pdfs during parsing into individual pages and add them all using different URLs. These constructed urls are generated from the source url with an appended page=<pagenumber> attribute to the url get/post properties. This will distinguish the different page entries. The search result list will then replace the post parameter with a url anchor # mark which causes that the original url is presented in the search result. These URLs can be opened directly on the correct page using pdf.js which is now built-in into firefox. That means: if you find a search hit on page 5 and click on the search result, firefox will open the pdf viewer and shows page 5.	10 years ago
Michael Peter Christen	d14114697c	the miss cache does not seem to work, it sometimes contains urlhashes from documents which actually are inside the index. This can be reproduced using the crawl result table at http://localhost:8090/CrawlResults.html?process=5 The cache is temporarily disabled to remove the bad behaviour, however a later reactivation of that feature may be possible.	10 years ago
reger	deb75a1dbe	fix refactored size() -> filesize() in YMarkMetadata	10 years ago
reger	198102304b	refactor size() -> filesize() of URIMetadataNode (harmonize with ResultEntry and to not get confused with Collection.size())	10 years ago
reger	c6f634a4f2	remove redundant caching of urlhash in URIMetadataNode (is already cached in underlying DigestURL .url) upd pom keyword for maven-antrun-plugin	10 years ago
Michael Peter Christen	5516819354	preventing the use of no-cache and expires in case that images are generated dynamically which will stay static in the future. This applies mainly to the search result favicon in front of search hits. These icons will now be generated once, but then cached in the browser. There is also a YaCy-internal cache for these icons which had prevented the re-generation of the icons in YaCy, but this cache is now superfluous since the browser should not call the servlet ViewImage again.	10 years ago
Michael Peter Christen	d3e71ed070	fixes for searches when initialization of large autotagging libraries have not been finished	10 years ago
Michael Peter Christen	28683530cd	fixes to usage of no-cache: use and recognize also the no-store directive	10 years ago
Michael Peter Christen	c9c700b510	reduction of http requests to YaCy using the correct cache-control, expires and last-modified headers in http response.	10 years ago
reger	13cca2b114	fix missing AppPath upd Maven plugin versionid	10 years ago
Michael Peter Christen	65125439fe	added query modifier 'on'. This makes it possible to search for date occurrences within the (web) page documents (not the document last-modified!). This works only if the solr field dates_in_content_sxt is enabled. A search request may then have the form "term on:<date>", like gift on:24.12.2014 gift on:2014/12/24 * on:2014/12/31 For the date format you may use any kind of human-readable date representation(!yes!) - the on:<date> parser tries to identify language and also knows event names, like: bunny on:eastern .. as long as the date term has no spaces inside (use a dot). Further enhancement will be made to accept also strings encapsulated with quotes.	10 years ago
Michael Peter Christen	1cfddea578	added (very experimental) Solr response writer for snapshot image results	10 years ago
Michael Peter Christen	7287dd764e	added url, date, time and page number on pdf snapshot footer	10 years ago
Michael Peter Christen	8b5d074715	fix for image parser (there is a class missing!)	10 years ago
Michael Peter Christen	932faafffe	reactivated on-demand snapshot loading	10 years ago
Michael Peter Christen	2362ad7c34	fix for a count issue in snapshot api	10 years ago
Michael Peter Christen	3354cd63be	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	10 years ago
Michael Peter Christen	9971e197e0	Added a transaction interface to the snapshots: all documents in the snapshots can now be processed with transactions using commit and rollback commands. Furthermore, a large number of monitoring methods had been added to check the success of transactions. The transactions for snapshots have two main components: a rss search API to get information about latest/oldest entries and a commit/rollback API to move entries away from the rss results. This is done by usage of two storage locations for the snapshots, INVENTORY and ARCHIVE. New snapshots are placed to INVENTORY, committed snapshots move to ARCHIVE, rollback snapshots move to INVENTORY again. Normal Workflow: Besides all the options below, usually it is sufficient to process data like this: - call http://localhost:8090/api/snapshot.rss?state=INVENTORY&order=LATESTFIRST - process the rss result and use the <guid> value as <urlhash> (see next command) - for each processed result call http://localhost:8090/api/snapshot.json?command=commit&urlhash=<urlhash> - then you can call the rss feed again and the committed urls are omitted from the next set of items. These are the commands to control this: The rss feed: http://localhost:8090/api/snapshot.rss?state=INVENTORY&order=LATESTFIRST http://localhost:8090/api/snapshot.rss?state=INVENTORY&order=OLDESTFIRST http://localhost:8090/api/snapshot.rss?state=INVENTORY&order=ANY http://localhost:8090/api/snapshot.rss?state=ARCHIVE&order=LATESTFIRST http://localhost:8090/api/snapshot.rss?state=ARCHIVE&order=OLDESTFIRST http://localhost:8090/api/snapshot.rss?state=ARCHIVE&order=LATESTFIRST The feed will return a <urlhash> in the <guid> field of the rss.
This must be used for commit/rollback: Commit/Rollback: http://localhost:8090/api/snapshot.json?command=commit&urlhash=<urlhash> http://localhost:8090/api/snapshot.json?command=rollback&urlhash=<urlhash> The json will return a property list containing the property "result" with possible values "success" or "fail", according to the result. If a "fail" occurs, please look into the log for further info. Monitoring: http://localhost:8090/api/snapshot.json?command=status This shows the total number of entries in the INVENTORY and the ARCHIVE http://localhost:8090/api/snapshot.json?command=list This will return a list of all hosts which have snapshots and the number of entries for the hosts. Counts for INVENTORY and ARCHIVE are listed in the properties for "count.INVENTORY" and "count.ARCHIVE" http://localhost:8090/api/snapshot.json?command=list&depth=2 The list can be restricted to such which have a specific depth. The list contains then the same host names, but the count values change because only documents at that specific crawl depth are listed http://localhost:8090/api/snapshot.json?command=list&host=yacy.net.80 This lists all urlhashes for the given host, not only an accumulated list of the number of entries http://localhost:8090/api/snapshot.json?command=list&host=yacy.net.80&depth=0 This restricts the list of urlhashes for that host for the given depth http://localhost:8090/api/snapshot.json?command=list&state=INVENTORY http://localhost:8090/api/snapshot.json?command=list&state=ARCHIVE This selects either the INVENTORY or ARCHIVE for all list commands, default is ALL which means that from both snapshot directories the host information is collected and combined. You can use the state option for all the commands as listed above Detailed Information: http://localhost:8090/api/snapshot.json?command=metadata&urlhash=upiFJ7Fh1hyQ This collects metadata information for the given urlhash.
This can also be restricted with state=INVENTORY and state=ARCHIVE to test whether the document is in one of these snapshot directories. If a urlhash is not found, an empty result is returned. If an entry was found and the state was not restricted, then the result contains a state property containing the name of the location where the document is, either INVENTORY or ARCHIVE. Hint: If a very large number of documents is inside of INVENTORY, then it could be better to call the rss feed with http://localhost:8090/api/snapshot.rss?state=INVENTORY&order=ANY because that is very efficient.	10 years ago
reger	63846ddb89	add final SolrQueryRequest.close to SolrServlet	10 years ago
reger	9edc7308aa	update to metadata-extractor-2.7.0.jar add 2 simple JUnit test cases for jpeg and tif parsing	10 years ago
Michael Peter Christen	578ae29f1e	added a note that the servlet is linked using web.xml	10 years ago
reger	6c3f36def1	- fix path to default heuristic.cfg - deprecate unused ProxyServlet	10 years ago
Michael Peter Christen	bbf0ac40c3	add the actual DateDetection class... (missed in latest commit)	10 years ago
Michael Peter Christen	66b5a56976	Added and integrated new date detection class which can identify date notions within the fulltext of a document. This class attempts to identify also dates given abbreviated or with missing year or described with names for special days, like 'Halloween'. In case that a date has no year given, the current year and following years are considered. This process is therefore able to identify a large set of dates to a document, either because there are several dates given in the document or the date is ambiguous. Four new Solr fields are used to store the parsing result: dates_in_content_sxt: if date expressions can be found in the content, these dates are listed here in order of the appearances dates_in_content_count_i: the number of entries in dates_in_content_sxt date_in_content_min_dt: if dates_in_content_sxt is filled, this contains the oldest date from the list of available dates date_in_content_max_dt: if dates_in_content_sxt is filled, this contains the youngest date from the list of available dates, that may also be possibly in the future These fields are deactivated by default because the evaluation of regular expressions to detect the date is yet too CPU intensive. Maybe future enhancements will cause that this is switched on by default. The purpose of these fields is the creation of calendar-like search facets, to be implemented next.	10 years ago
Michael Peter Christen	c3c2b6999b	fixes on wkhtmltopdf	10 years ago
Michael Peter Christen	114f0afc1e	enable sku as anchor in html response writer	10 years ago
Michael Peter Christen	aa80cb1159	enhanced tagging preparation speed which reduces initialization time for very large vocabularies	10 years ago
Michael Peter Christen	6a1865f507	refactoring date -> lastModified	10 years ago
Michael Peter Christen	ab6cc3c88c	added concurrent generation of snapshot pdfs	10 years ago
Michael Peter Christen	413eeefed4	added character set detection library from http://www-archive.mozilla.org/projects/intl/chardet.html	10 years ago
Michael Peter Christen	7bfc5b80cb	added new options to vocabulary editor: - new switch 'isFacet' which causes that the usage of the vocabulary for search facets is enabled or disabled. This shall be used for large vocabularies since searches in solr are extremely slow if facets for a large set of alternative terms are generated - new option to disable auto-enrichment from synonyms - new option to add synonyms from another column when importing from csv - automatically recognize double-occurrences in synonyms and bundling terms for such synonyms	10 years ago
Michael Peter Christen	87b53b3572	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	10 years ago
Michael Peter Christen	8df8ffbb6d	enhanced the snapshot functionality: - snapshots can now also be xml files which are extracted from the solr index and stored as individual xml files in the snapshot directory along the pdf and jpg images - a transaction layer was placed above of the snapshot directory to distinguish snapshots into 'inventory' and 'archive'. This may be used to do transactions of index fragments using archived solr search results between peers. This is currently unfinished, we need a protocol to move snapshots from inventory to archive - the SNAPSHOT directory was renamed to snapshot and contains now two snapshot subdirectories: inventory and archive - snapshots may now be generated by everyone, not only such peers running on a server with wkhtmltopdf installed. The expert crawl start provides the option for snapshots to everyone. PDF snapshots are now optional and the option is only shown if wkhtmltopdf is installed. - the snapshot api now provides the request for historised xml files, i.e. call: http://localhost:8090/api/snapshot.xml?urlhash=Q3dQopFh1hyQ The result of such xml files is identical with solr search results with only one hit. The pdf generation has been moved from the http loading process to the solr document storage process. This may slow down the process a lot and a different version of the process may be needed.	10 years ago
reger	5d67e165d9	remove redundant null check in ResponseHeader.lastModified added a JUnit testcase for ResponseHeader dates (using age()), adjusted age() to pass all tests	10 years ago
reger	5f0bb1214f	modified FieldReIndex to reindex queries with low number of documents first by internally using a score map with number of documents as score and working through the list from low to high.	10 years ago
reger	e52370728a	fix startup stop on missing HTCACHE/SNAPSHOT directory	10 years ago
reger	e5236aa7ca	Merge origin/master	10 years ago
reger	70cf7060a4	coding fixes suggested in http://mantis.tokeek.de/view.php?id=509 http://mantis.tokeek.de/view.php?id=510	10 years ago
Michael Peter Christen	4fe4bf29ad	added rss feed output to snapshot servlet which can be used to get a list of latest/oldest entries in the snapshot database. This is an example: http://localhost:8090/api/snapshot.rss?depth=2&order=LATESTFIRST&host=yacy.net&maxcount=100 The properties depth, order, host and maxcount can be omitted. The meanings of the fields are: host: select only urls from this host or all, if not given depth: select only urls at that crawl depth or all, if not given maxcount: select at most the given number of urls or 10, if not given order: either LATESTFIRST to select the youngest entries, OLDESTFIRST to select the first entries or ANY to select any The rss feed needs administration rights to work, a call to this servlet with rss extension must attach login credentials.	10 years ago
Michael Peter Christen	8b522687e0	added toString() methods to feed classes which makes it possible to export full rss feed files out of the RSSFeed class	10 years ago
reger	568c991405	remove the unused Request variable (fix of prev. commit)	10 years ago
reger	d6539ba597	Merge origin/master	10 years ago
reger	ff18129def	ViewFile servlet: update index if newer, so viewed text and metadata (stored) info is similar - to achieve this, use request with profile to allow indexing (defaultglobaltext) and update index (the resource is loaded, parsed anyway, so it's not an expensive operation) Request: remove 2 unused init parameter - number of anchors of the parent - forkfactor sum of anchors of all ancestors	10 years ago
Michael Peter Christen	a304058840	added Image Events as another option to generate images with a mac if no Ghostscript is available or does not work...	10 years ago
Michael Peter Christen	d83de9ecf5	added another path for the convert command because on older Macs ImageMagick has a different installation location	10 years ago
Michael Peter Christen	226aea5914	added a servlet which can create preview images, preview thumbnails and preview pdfs from web pages, i.e.: http://localhost:8090/api/snapshot.png?url=http://yacy.net/en/&width=128&height=128 http://localhost:8090/api/snapshot.jpg?url=http://yacy.net/en/&width=128&height=128 http://localhost:8090/api/snapshot.pdf?url=http://yacy.net/en/ This supports also an on-the-fly generation of the preview documents if the user is an administrator. Otherwise, the servlet fails. To enable this, you must add wkhtmltopdf, imagemagick and (on headless servers) xvfb to your operating system. for detailed instructions, see `97f6089a41`	10 years ago
reger	28456dfc09	skip creation of unused Bluelist contenttransformer	10 years ago
Michael Peter Christen	321840fde3	Replaced all fixed thread pools with cached thread pools. The cached thread pools will flush their cached (dead) threads after 60 seconds. This will cause that YaCy now runs constantly with about 50 threads, about 100 at peak times. Previously, about 400 threads had been cached and kept in a hibernation state, which caused that the numproc counter in /proc/user_beancounters (exists only in VM-hosted linux) was as high as the cached number of threads. This caused that VM supervisors terminated whole VM sessions if a limit was reached. Many VM providers have limits of numproc=96 which made it virtually impossible to run YaCy on such machines. With this change, it will be possible to run many YaCy instances even on VM hosts.	10 years ago
Michael Peter Christen	7bfab5eb9d	set Busy- and Blocking-Threads to daemon mode (they will now not prevent YaCy from termination if still running)	10 years ago
Michael Peter Christen	e586e423aa	in case that loading from the cache fails, load from wkhtmltopdf without cache using the user agent string given in the crawl profile	10 years ago
Michael Peter Christen	d5bac64421	recognize more html file types for snapshots	10 years ago
Michael Peter Christen	a1ee101079	recognize more html file extensions	10 years ago
Michael Peter Christen	8480641f2d	fix to xvfb-run usage (quotes did not parse in xvfb-run, default values are appropriate)	10 years ago
Michael Peter Christen	68b040e31e	added fail-over missing http proxy service (i.e. overload) and quiet mode	10 years ago
Michael Peter Christen	25a64c51b3	moved snapshot generation out of the html handler to prevent that existing cache entries cause that the handler is not executed	10 years ago
Michael Peter Christen	c35170a305	more logging	10 years ago
Michael Peter Christen	e8be07ec78	grr	10 years ago
Michael Peter Christen	6f81bb756c	wrap wkhtmltopdf with xvfb if necessary	10 years ago
Michael Peter Christen	0119f8665d	more logging when failing to create pdf snapshot	10 years ago
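
The "Normal Workflow" described in commit 9971e197e0 above (read the INVENTORY feed, take each `<guid>` as the urlhash, commit it) can be sketched in Python as follows. This is only an illustration under assumptions: the URLs are taken from the commit message, but the helper names (`feed_url`, `command_url`, `extract_urlhashes`, `commit_all`), the plain RSS 2.0 layout of the feed, and the flat JSON shape `{"result": "success"}` are hypothetical and not confirmed by the source. The `fetch` callable is left to the caller, since the feed needs administrator credentials.

```python
import json
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Host and port as they appear in the commit message; adjust for your peer.
BASE = "http://localhost:8090/api"


def feed_url(state="INVENTORY", order="LATESTFIRST"):
    """Build the rss feed URL for one snapshot location and ordering."""
    return f"{BASE}/snapshot.rss?" + urlencode({"state": state, "order": order})


def command_url(command, urlhash):
    """Build a commit or rollback URL for a single urlhash."""
    return f"{BASE}/snapshot.json?" + urlencode({"command": command, "urlhash": urlhash})


def extract_urlhashes(rss_text):
    """Collect the <guid> values, which the commit message says carry the urlhash.

    Assumption: the feed is un-namespaced RSS 2.0, so <guid> has no XML namespace.
    """
    root = ET.fromstring(rss_text)
    return [g.text.strip() for g in root.iter("guid") if g.text]


def commit_all(fetch, handle):
    """Process every entry in INVENTORY and commit those that were handled.

    fetch(url) -> str   performs the (authenticated) HTTP GET; supplied by the caller.
    handle(urlhash) -> bool   the caller's own processing; False skips the commit.
    Returns a dict mapping urlhash to the reported "result" value.
    """
    results = {}
    for urlhash in extract_urlhashes(fetch(feed_url())):
        if not handle(urlhash):
            continue  # leave the entry in INVENTORY for a later run
        # Assumption: the reply is a flat JSON object with a "result" property.
        reply = json.loads(fetch(command_url("commit", urlhash)))
        results[urlhash] = reply.get("result")
    return results
```

Passing `fetch` in as a parameter keeps the sketch independent of any particular HTTP client or login scheme; a rollback would use `command_url("rollback", urlhash)` in the same way.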

3391 Commits (5445f38070af55ed56a5c826e20838925cfc2519)