yacy_search_server

Commit Graph

Author	SHA1	Message	Date
orbiter	19a051bec8	more monitoring for postprocessing and enhanced layout in Crawler monitor page	11 years ago
Michael Peter Christen	9cf9727685	fix for wrong counter	11 years ago
Michael Peter Christen	fceac8cffd	more monitoring for postprocessing	11 years ago
Michael Peter Christen	6842783761	fixed and enhanced postprocessing	11 years ago
Michael Peter Christen	219d5934a4	fixed termination bug in Solr Connector	11 years ago
Michael Peter Christen	bf1bdd52a6	prevent requesting of 0-facets (which actually exist)	11 years ago
Michael Peter Christen	9d5895f643	enhanced and fixed postprocessing	11 years ago
Michael Peter Christen	f86fe90eda	enhanced mass storage speed to remote solr servers	11 years ago
Michael Peter Christen	6ed9821209	fixed several problems in solr connectors	11 years ago
Michael Peter Christen	191fd3d7e7	added an optimization option to HandleSet mass data storage structure	11 years ago
Michael Peter Christen	94b565ea0d	fixed keepalive min value	11 years ago
reger	b26787dc2d	- DefaultServlet: remove static gzip option (YaCy doesn't use pre-gzip'ed static html pages) - ProxyServlet: remove unneeded procedure - Server init: skip one overlapping servlet context	11 years ago
Michael Peter Christen	24a052ecb9	removed debug code for existsByIds	11 years ago
Michael Peter Christen	087df05e24	added option to Config_Network_p.html to enable remote search while DHT-Receive is switched off.	11 years ago
Michael Peter Christen	1a4a69c226	set more logger to 'final static'	11 years ago
Michael Peter Christen	c60947360d	logger should be static	11 years ago
Michael Peter Christen	69b8d61c47	fix for search requests in GSA interface which contain 'funny' characters (like ':' etc.)	11 years ago
orbiter	b085cb522b	replaced old existsByIds for embedded Solr with obviously much faster new selection method (including still existing debug code to test that this is in fact better)	11 years ago
reger	b29d262e70	implement Jetty8HttpServerImpl.generateSocketAddress (code 1:1 copied from serverCore)	11 years ago
orbiter	4234b0ed6c	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	11 years ago
orbiter	909bbb49d8	added (partly commented) test code for url rewrite methods .. to be completed	11 years ago
reger	066a1ecf0a	add highlight queryparams to solrservlet if missing - modify query params in Solr parameter map (instead of querystring)	11 years ago
Michael Peter Christen	899e7e92b0	added debug code	11 years ago
reger	4684330505	Merge origin/master into jetty Conflicts: source/net/yacy/cora/federate/solr/responsewriter/HTMLResponseWriter.java	11 years ago
reger	1437c45383	merge rc1/master	11 years ago
Michael Peter Christen	87a956e881	calculating and showing the number of files and the average size of a file in the HTCACHE in ConfigHTCache_p.html	11 years ago
Michael Peter Christen	acc1f8a749	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	11 years ago
Michael Peter Christen	81d9e23532	fixed another memory leak in the PDF parser: the class org.apache.pdfbox.pdmodel.font.PDFont occupies 8MB of space which cannot be cleaned if PDFont.clearResources is called. The attempt to clean the class cache therefore causes that the class is loaded and this cache is initialized with some rubbish. I tried to prevent to instantiate this class by usage of a hacked findLoadedClass call to the SystemClassLoader (which is protected ...). Now, without using the PDF parser at all, 8MB of RAM space is not occupied, however, when the first PDF arrives this space will be taken and never given back to GC. WAKE UP YOU LAZY PDFBOX HACKER AND FIX THIS SHIT!	11 years ago
Michael Peter Christen	c152d996e6	reduced footprint of BookmarksDB which can take quite a lot of memory if the number of bookmarks is high (i.e. > 2000 URLs)	11 years ago
Michael Peter Christen	81bb50118e	found and fixed a huge memory leak in solr caching (inside Solr). The not-flushed Solr cache is now handled in this way: - it is smaller by default - an Solr-internal process is started to flush the cache periodically (this does NOT clean the cache, just removes old objects) - a Solr-external process (the standard YaCy cleanup-process) now has direct access to the solr internal cache and flushes them completely. The time frame for such a flush is defined by the cleanup-process frequency, by default 10 minutes.	11 years ago
reger	7b17cdf6dd	add content_type:image/* to image search - see numerous idx entries with content_type image without url_file_ext_s (for various reasons) which should be included in result - try it yourself with following sample query /solr/select?q=content_type:image/* AND -url_file_ext_s:[* TO *]&defType=edismax&fl=sku,url_file_ext_s,content_type - also addresses possible urls without or with deviating extension.	11 years ago
reger	082c9a98c1	move writeHeaders from Jetty8 servlet to YaCyDefaultServlet - after removing Jetty server dependency (of Response using HttpServletResponse only)	11 years ago
sixcooler	987f410011	URL-export:add query and fix for cast-class-exception	11 years ago
Michael Peter Christen	a8253ca49c	added missing unicode transformation in href link contents during parsing	11 years ago
Michael Peter Christen	0cf9e9580b	added clickdepth and CR computation debug code to verify that the process is complete	11 years ago
reger	b85f702f22	add AccessTracker logging to SolrServlet	11 years ago
reger	de1f02420b	implement HtmlResponseWriter to solrServlet (and rss / opensearch responsewriter) as in yacy select servlet. - set contenttype of HTML/GrepHTML-Responsewriter to "text/html" - set a contenttype to GSAsearchServlet	11 years ago
Michael Peter Christen	234a974955	load image only if their parser flag is activated	11 years ago
Michael Peter Christen	b2c329929f	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	11 years ago
Michael Peter Christen	60187a4ec2	fix in html parser	11 years ago
Michael Peter Christen	e1c1e57877	less overhead calling exist() with only one hash	11 years ago
reger	3d5d366f1c	fix html header in Solr HTMLResponseWriter - move 1st body content after </head> tag - add closing <span> tag	11 years ago
reger	bfdb404867	implement a Jetty reconnect to work with Configbasic_p.html port change - instead of shutting down the server it should be sufficient to manipulate the Jetty http connector	11 years ago
Michael Peter Christen	5a02d650ee	avoid cloning	11 years ago
reger	d6760df3e5	fix servlet class exist check to use default path only (in Jetty8YaCyDefaultServlet) - del redundant doget code in yacydefaultservlet - small declaration code opts - del obsolete libt/proxyservlet.java	11 years ago
reger	b38de92a16	Merge origin/master into jetty	11 years ago
Michael Peter Christen	cc39667399	Speed enhancements and less CPU usage during Solr searches when using the embedded Solr (the default). This was obtained by circumventing solrj search encapsulation and the implementation of direct index access methods to Solr. The effect will not only be seen during search, but this has also a strong effect on suggestions (much more) and less CPU power usage during index distribution (which needs many search requests)	11 years ago
Michael Peter Christen	434e13b46d	in host browser also show the properties of failed documents including referrer urls (this is a VERY USEFUL SEO and Web Admin feature!!)	11 years ago
reger	6944225037	- add GSA search /gsa/search servlet for Jetty to Server init - include SecurityHandler check for /gsa/ /solr/ - change one more YaCyDefaultServlet dependency from Jetty to std. javax.Servlet	11 years ago
reger	53cb30a221	reduce logging (by assigning logger to existing logger) - small additional cleanups	11 years ago
reger	332c6d4fe1	reactivate Domain handler for .yacy / .yacyh handling	11 years ago
reger	b1ce70434e	resolve merge conflict - add missing import statement	11 years ago
reger	7869a4c070	Merge origin/master into jetty - merge conflict resolve	11 years ago
reger	f017066197	Merge origin/master into jetty	11 years ago
reger	06da6f517c	add YaCyProxyServlet to handle /proxy.html?url=proxyurl - based on Jetty ProxyServlet - at this time use existing HTTPD ProxyHandler for url rewrite - add jetty-client jar (dependency in Jetty ProxyServlet) reuse ProxyHandler.convertHeaderFromJetty in YaCyDefaultServlet	11 years ago
reger	69599566f9	catch one more malformed url in proxy url rewrite	11 years ago
reger	605530fec5	catch proxy url rewrite exception malformed url (" http:\/\/" ) may cause error response testcase http://localhost:8090/proxy.html?url=http://dictionary.reference.com/browse/test	11 years ago
Michael Peter Christen	9bb7eab389	hacks to prevent storage of data longer than necessary during search and some speed enhancements. This should reduce the memory usage during heavy-load search a bit.	11 years ago
orbiter	3c3cb78555	- removed a lot of garbage and bloated code from GuiHandler. - transformed log lines to String before they are stored because the storage space is about 1:250 (45kb for one line before transformation, 180 bytes afterwards) - this saves up to 10MB RAM so we can increase the number of lines to 1000 again.	11 years ago
Michael Peter Christen	5afa6e3aee	Automatically flush the log cache if a short memory status is reached. For the default of 200 lines this can flush about 10MB.	11 years ago
Michael Peter Christen	030d0776ff	Enhanced crawl start for very, very large crawl lists (i.e. > 5000) which had a problem because of badly used concurrency. This fix also caused a redesign of the whole host deletion process. This should fix bug http://bugs.yacy.net/view.php?id=250	11 years ago
Michael Peter Christen	6aabc4e5c8	reduced logging line memory, 10000 lines had filled up 450MB! grrr. (thank you, a bomb from the past)	11 years ago
Michael Peter Christen	1a8783147b	enhanced computation of number of solr documents.	11 years ago
Michael Peter Christen	4948c39e48	added concurrency for mass crawl check	11 years ago
Michael Peter Christen	1b4fa2947d	- fixed a problem which occurred when a document was not recognized with the right content domain (i.e. identifying that it is an image, text etc.) because it used the file extension and not an existing mime type assignment. - fixed the new setting that images shall be loaded for a better image search. - both fixes together make it now possible to crawl commons.wikimedia.org which makes use of 'funny' document names (i.e. ending with .jpg while the document is html)	11 years ago
Michael Peter Christen	82621bead0	When doing bootstrapping, always accept one seedlist-File without checking the date of the file. This should help to start the peer in case that the user has a completely wrong date setting.	11 years ago
Michael Peter Christen	691d7e70fa	added hint to development/commit rss feed	11 years ago
orbiter	20bbde8665	fix for mustmatch regex computation: result had correct semantic, but may have contained multiple same expressions within the disjunction of domain-restrictions. This fix removes the redundant restrictions and makes the regex shorter.	11 years ago
reger	cb2dbcb843	add graceful Jetty shutdown option - as Jetty stop is not synced, yet - include jetty jars and servlet-3.0 api jar in Eclipse .classpath	11 years ago
reger	f46c723398	allow to choose used http server, YaCy-Anomic or Jetty - defaults to Jetty (in this branch) - add server version info & config option -> Admin Console -> Advanced Settings -> Http Networking	11 years ago
reger	da4ff5aefa	add YaCy HttpCommand "authenticate" check to DefaultServlet	11 years ago
Michael Peter Christen	c833d02cf5	fixed webgraph postprocessing (did nothing and repeated to do this...)	11 years ago
Michael Peter Christen	74d0256e93	enhanced postprocessing: fixed bugs, enable proper postprocessing also without the harvestingkey, remove crawl profiles after postprocessing, speed-up for clickdepth computation.	11 years ago
reger	1adb4b8741	merge rc1/master	11 years ago
reger	77a73c7475	add YaCy HttpCommand "location" check to DefaultServlet	11 years ago
Michael Peter Christen	7b69c438f7	more methods for the table class	11 years ago
Michael Peter Christen	820b896146	Replaced the inframe loading from yacy.net for donations with the loading of this iframe from the local host. To make this more flexible, this iframe is loaded once after startup from yacy.net.	11 years ago
reger	cc223b14a4	remove wrong content mod in SSI parser for virtual path /currentyacypeer/ (is handled on start of request handling)	11 years ago
reger	5606291574	fix last commit (not needed test of GZipInputStream)	11 years ago
reger	f9eed8cb44	add support for gzip encoded multipart forms (needed for transferRWI.html) - quick and dirty reuse of existing HTTPDemon implementation	11 years ago
reger	cf32a92629	- add size check to multipart form data handling of YaCyDefaultServlet (same as in HTTPDemon.parseMultipart) - reduce Jetty logging - give build.run a bit more memory (set to YaCy.default 600m from 512m)	11 years ago
reger	705f147820	- add localpeername.yacy to list of local address detection for AbstractRemoteHandler - use proxy via header info as in legacy proxy handler	11 years ago
reger	0d4efabaa8	fix YaCy version string in proxy headers (config parameter vString no longer used)	11 years ago
reger	2226189743	disable domainhandler due to error - domainhandler causes closed response output stream in following handlers on addresses resolved to local peer (like in hello protocol preventing peer to switch to senior peer)	11 years ago
reger	eea504c117	update Info.plist small DefaultServlet refactoring	11 years ago
reger	a44eede8b8	merge rc1/master	11 years ago
sixcooler	d9a02ed277	NPE fix for my last commit	11 years ago
reger	54a0272338	searchpage javascript (latestinfo) causes reset of search statistic after moving to next page - disabled call via setTimeout in yacysearch.html	11 years ago
sixcooler	61f627eb85	fix for ssl-connections from proxy-usage staying in close-wait-state + some extra 'close' in HttpClient	11 years ago
Michael Peter Christen	d328cc4a83	fix for didyoumean, added also more asian alphabets	11 years ago
Michael Peter Christen	90c8577840	enhanced ranking; patches to replace old ranking	11 years ago
reger	e74f548551	make legacy http server (serverCore) implement YaCyHttpServer interface	11 years ago
reger	71d2655c02	downgrade to Jetty 8 to assure support of JRE 1.6 - introduce a YaCyHttp interface to modularize/separate http server - adjust the Jetty version specific implementation part (in package net.yacy.http) - putting the version specific code in classes starting with Jetty8xxxx - moved existing Jetty9xxx implementation into a test class (to keep the code) - adjust build to the changed jars - make use of the introduced YaCyHttpServer interface in related htroot servlets - adjust other test cases/classes	11 years ago
Michael Peter Christen	1b61bd40ed	- Added new solr field url_file_name_tokens_t which stores the file name tokens. This can be used to enhance the ranking. - Added also a rating_i field as basis for later usage. - enhanced the tokenization process.	11 years ago
orbiter	6efa7532d2	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	11 years ago
orbiter	5f5a97bafc	added the anchor text within web pages to the searchable entities of a web page. This can be of benefit for the ranking if these fields are used for boosts.	11 years ago
orbiter	705b3338ee	list more fields available for search and for ranking boosts	11 years ago
sixcooler	d536092fe4	fix false fill NAME_CACHE_MISS-DNS-Cache in case of a timeout for eg. caused by massive requests when crawl from file	11 years ago
Michael Peter Christen	78e7aadb26	removed unused initialization method	11 years ago
Michael Peter Christen	4fbc4740df	removed warnings	11 years ago
Michael Peter Christen	21aa6a0321	migration to Solr 4.5.0	11 years ago
Michael Peter Christen	ef31d0f279	fix for rss reader, see http://bugs.yacy.net/view.php?id=294	11 years ago
Michael Peter Christen	101a6e6e14	Patch the citation index for links with canonical tags. This shall fulfill the following requirement: If a document A links to B and B contains a 'canonical C', then the citation rank computation shall consider that A links to C and B does not link to C. To do so, we first must collect all canonical links, find all references to them, get the anchor list of the documents and patch the citation reference of these links.	11 years ago
reger	daebeb93aa	add call to AccessTracker to jetty security handler	11 years ago
reger	172aefaeeb	adjust YaCySecurityHandler to Jetty 9 conventions - mainly adjust prepareConstraintInfo to use the RoleInfo.setChecked as in Jetty Source distribution - use constraint check behavior as in ConstraintSecurityHandler see http://git.eclipse.org/c/jetty/org.eclipse.jetty.project.git/tree/jetty-security/src/main/java/org/eclipse/jetty/security/ConstraintSecurityHandler.java?id=jetty-9.0.5.v20130813	11 years ago
reger	6f9ed439d3	- expand localHostName check of AbstractRemoteHandler to prevent the request from being handled as proxy request - make domain handler not rely on included path in resolved .yacy address	11 years ago
reger	561ea135af	fix : forgot adding security handler	11 years ago
reger	c7c706fd9f	merge with rc1/master	11 years ago
reger	272b196d05	update Jetty server init() to activate yacy-domain and transparent proxy handler - adding domain & proxy handler to a context (as it was in initial design) (context required for dispatcher) - make handler context and servlet context parallel available (to allow use of YaCyDefaultServlet to handle legacyServlets) - set transparent proxy request handled after dispatch.forward to skip further handling for .yacy domain requests	11 years ago
reger	fd119deb00	fix NPE on modified since check ( Response.requestHeader allowed to be null)	11 years ago
reger	66145a0410	- add welcome file (index.html) support to YaCyDefaultServlet - change SolrServlet default search field (&df) to text_t	11 years ago
Michael Peter Christen	b28d43decc	added two more fields source_cr_host_norm_i,target_cr_host_norm_i in webgraph and an addition to postprocessing to copy all cr ranking attributes to the link edges associated to the postprocessing documents	11 years ago
Michael Peter Christen	a52f3a597e	fix for canonical-from-http-header feature	11 years ago
Michael Peter Christen	2dd7c5be44	added parsing of http-canonical tags (untested, could not find an example page)	11 years ago
Michael Peter Christen	4476dea5ba	do not fail if a wrong boost key is used; instead, print only a warning See also: http://bugs.yacy.net/view.php?id=293	11 years ago
reger	ab9583d429	add default field (&df) to SolrServlet query if missing	11 years ago
Michael Peter Christen	3bf0104199	fix for crawl domain counter limitation (limit was reached too early)	11 years ago
Michael Peter Christen	82bfd9e00a	- crawl profiles shall be deleted from active and passive stacks if they are deleted to terminate the crawl because otherwise the crawl will go on after the load-from-passive stack policy. - better check if a crawl is terminated using the loader queue.	11 years ago
Michael Peter Christen	1b3d26dd23	hack to remove most of the warning: deprecated messages (but not all, one is left)	11 years ago
Michael Peter Christen	a496313248	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	11 years ago
sixcooler	3c48fc65fd	reverted RemoteInstance to deprecated methods of httpClient-4.2 this should work with current remote-Solr-Instances	11 years ago
Michael Peter Christen	91a875dff5	self-healing of mistakenly deactivated crawl profiles. This fixes a bug which can happen in rare cases when a crawl start and a cleanup process happen at the same time.	11 years ago
Michael Peter Christen	095053a9b4	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	11 years ago
sixcooler	0cae420d8e	some dns-timing changes: since httpclient uses the domain-cache it is useful not to clean the domain cache until crawling is running (domains are filled into this cache) On huge crawl-starts (eg. from file) my DNS did not follow the high rates - so I reduced the rate and give some more time(-out)	11 years ago
sixcooler	15b1bb2513	bump to httpClient-4.3	11 years ago
Michael Peter Christen	4f83d5f18c	added the new field harvestkey_s to the collection index and the webgraph index which is temporarily filled with the crawl profile key. This is used to select a set of documents for post-processing as soon as a crawl is finished. Now the postprocessing for a specific crawl is started when that specific crawl is finished and not at the end of all post-processing steps.	11 years ago
orbiter	14442efa6d	when profiles are cleaned, there shall be first a callback showing which profiles are cleaned. This shall enable a profile-termination-driven postprocessing. To do this, index writings must carry the profile key which will be implemented in another (next) step.	11 years ago
orbiter	0013d0d0bb	removed superfluous class	11 years ago
orbiter	f90d5296cb	Added new data structure to be used by the balancer (not used yet). These data structures will enable the balancer to store the crawl queue into individual queues, one each for a single host.	11 years ago
orbiter	0e8d752462	refactoring	11 years ago
orbiter	8ac2e8c8c9	added location navigator which causes that the image to the map search is visible whenever a location is available in the search result. To activate this, the search.navigation property in yacy.conf must be modified to the new default values.	11 years ago
orbiter	d86d2be5c3	automatically removed Places autotagging if no location library is wanted	11 years ago
orbiter	214a087cdf	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	11 years ago
Michael Peter Christen	96ed0c980e	- added hosthash to all documents (also fail documents which is needed there for deletion), this fixes a problem for the deletion of old documents for new crawl starts - added clickdepth and citation computation for fail documents	11 years ago
Michael Peter Christen	179ad281f9	close include byte buffer after usage	11 years ago
reger	52dd491c04	fix not necessary use of DigestURL	11 years ago
reger	6b9a624808	remove double declaration of TLD_any_zone_filter	11 years ago
reger	5111841e5b	- reduce Jetty debug logging - fix Context path initialization	11 years ago
reger	bc6ebb3c06	adjust to DigestURI changes from master to DigestURL	11 years ago
reger	561cbc7ee2	use more YaCy HeaderFramework constants (instead of Jetty's)	11 years ago
reger	5c4ba9b5db	merge rc1 master	11 years ago
reger	70c51775ae	Merge remote-tracking branch 'origin/master' into jetty	11 years ago
reger	4b77733e59	implement a YaCyDefaultServlet to handle YaCy-servlets within Jetty server - the implementation is inspired by Jetty's DefaultServlet - handles static html content and YaCy servlets - translates between standard servlet request/response and YaCy request/response specification With the implementation of YaCy-servlets as servlet instead via a jetty handler it's closer to servlet standard and carries less jetty specific dependencies.	11 years ago
orbiter	d2effd21db	fix for npe during location search	11 years ago
orbiter	828603e4f1	fix for 100%CPU problem in error cache cleaning process	11 years ago
orbiter	c64b51134e	hack to add all tokens from the url to text_t. This was working for the RWI index (and still is working) but not for solr-only search indexes. Maybe we should find a solution using a separate search field instead.	11 years ago
orbiter	6e8377b8ad	do not check all words with synonym library if the library is empty	11 years ago
orbiter	70ba74b23a	disabled ipv4 preference to enable ipv6-only networks like freifunk	11 years ago
orbiter	f3be1930cb	CPU problem when pushing to the error cache; wrong class, ConcurrentHashMap needed for concurrency	11 years ago
Michael Peter Christen	e40671ddb7	better and consistent deletions for error urls	11 years ago
Michael Peter Christen	2602be8d1e	- removed ZURL data structure; removed also the ZURL data file - replaced load failure logging by information which is stored in Solr - fixed a bug with crawling of feeds: added must-match pattern application to feed urls to filter out such urls which shall not be in a wanted domain - delegatedURLs, which also used ZURLs are now temporary objects in memory	11 years ago
Michael Peter Christen	31920385f7	set anchor rel attribute of all links to "nofollow" if the html meta contains a robots:nofollow or if the http header contains a "X-Robots-Tag: nofollow"	11 years ago
reger	9619b8743c	add Solr Servlet	11 years ago
Michael Peter Christen	57e00baf26	fix for parsing of image links inside of anchor links (image-links)	11 years ago
Michael Peter Christen	61c5e40687	- replaced the properties object in AnchorURL with distinct variables for anchor attributes. - as a consequence, large portions of the parser code had to be adapted as well - added a counter target_order_i for anchor links in webgraph computation	11 years ago
Michael Peter Christen	3ea9bb4427	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	11 years ago
Michael Peter Christen	5e31bad711	- the webgraph shall store all links which appear on a web page, not only the unique links! This made it necessary to adapt a large portion of the parser and link processing classes to carry a different type of link collection, which carries the property attributes attached to web anchors. - introduction of a new URL class, AnchorURL - the other url classes, DigestURI and MultiProtocolURI, have been renamed and refactored to fit into a new document package schema, document.id - cleanup of net.yacy.cora.document package and refactoring	11 years ago
reger	13fc86c960	Merge remote-tracking branch 'origin/master' into jetty	11 years ago
reger	f7f86d8a5d	update to Jetty 9 jars - include javax.servlet 3.0	11 years ago
reger	603368fc3e	remove redundant declaration of USER_AGENT	11 years ago
reger	bd71b14d25	add mandatory p2p parameter to templatePattern	11 years ago
reger	b8da176c5d	adjust setHandled to request of call parameter	11 years ago
reger	127adbf5cf	remove references to 10_http thread (legacy http server) and add needed get/set function to jetty http server wrapper	11 years ago
Michael Peter Christen	1a8c64117f	decreased the responseHeaderDB database which is now flushed more frequently. This will preserve more documents in the cache in case of a crash.	11 years ago
reger	36b7159282	- remove double initialization of jetty - refactor some var assignments	11 years ago
reger	63ed04260a	Merge remote-tracking branch 'origin/master' into jetty	11 years ago
Michael Peter Christen	35ab2cef7b	added parsing of 'date', 'dc:date', 'dc.date' and 'last-modified' in html meta fields to get a correct (or: better) date timestamp. The http:last-modified mostly does not work because it is set to the current date from most CMS.	11 years ago
reger	2ee68f76f6	added read parameter from multi-part form fields (to nasty quick-fix)	11 years ago
Michael Peter Christen	9cc8468b30	added tools to visualize image generation (i.e. during testing)	11 years ago
reger	105cf8f593	changes to adjust jetty to recent code changes	11 years ago
reger	aafef72a8a	merged current rc1/master into jetty branch to allow further development with latest version ServerSideIncludes and servlet return values need further work (for working jetty integration) - TODO: added nasty quickfix to allow SSI - needs further work - TODO: YaCy servlet return values/parameters are not handled	11 years ago
Michael Peter Christen	dbef8ccfcb	forced deletion of ZURL entries for a specific host for each host that appears in the crawl url list	11 years ago
Michael Peter Christen	e137ff4171	refactoring (in preparation for new removeHost method)	11 years ago
Michael Peter Christen	7a5574cd51	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	11 years ago
Michael Peter Christen	85456f46b2	added two new fields, exact_signature_copycount_i and fuzzy_signature_copycount_i, which count the number of copies of non-unique documents and assign this count to each document. Thus, each document is assigned a number which shows how many copies of this document exist. These fields are disabled by default.	11 years ago
orbiter	26366596d9	fix for a problem which occurs when a site is crawled where the start url is redirected.	11 years ago
Michael Peter Christen	a2511b5600	turned images_alt_txt back to images_alt_sxt because it is not necessary to index the alt text. Indexed image Text is in images_text_t	11 years ago
Michael Peter Christen	85b1922244	activated image type navigation for image search	11 years ago
Michael Peter Christen	9e12fdff23	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	11 years ago
Michael Peter Christen	ab1201fdfd	fixed wrong facet count	11 years ago
Michael Peter Christen	049c3b3f2e	added an option to exclude image search results from text search. This is on by default.	11 years ago
Michael Peter Christen	69f85265e1	added an option to put image links to the crawl queue and handle these like normal documents. Using this option (by default on at this moment; this might change soon) it is possible to get the exif data into the search index to be used in image search.	11 years ago
Michael Peter Christen	e8e558a9b7	fix for content domain classification in URIMetadataNode	11 years ago
Michael Peter Christen	a8c5bfcf58	avoid to create unnecessary objects	11 years ago
Michael Peter Christen	5a0de1b77d	moving image description text to image text field	11 years ago
Michael Peter Christen	dc179bd61f	fix for catchall query goal for image search	11 years ago
reger	392174de8c	remove all_words, all_strings lists from QueryGoal - only used for text highlighting in parser text (ViewFile.html) which can be done with include_strings only	11 years ago
Michael Peter Christen	169ef8963d	one more fix for image search	11 years ago
Michael Peter Christen	cb85b22725	redesign of the image search process (with much better results, unfortunately the index schema has changed and p2p image search will not be much better until many people update)	11 years ago
reger	29967102a2	optimized QueryGoal (reducing mem and computation by removing all_hashes) - all_hashes used for text highlighting and word distance computation which can be done with include_hashes only	11 years ago
orbiter	f106345eef	link strings should not be tokenized	11 years ago
orbiter	deadeb406e	image alt tag strings should be tokenized	11 years ago
reger	d0e78082d1	return field names in index instead of in schema for SolrServerConnector.getFields	11 years ago
Michael Peter Christen	1a3e42eca4	index migration to lucene 4.4	11 years ago
Michael Peter Christen	a88a62f7aa	added a feature to set a collection for a crawl result based on a regular expression on the url: the collection attribute for a crawl start may now be either a token or a list of tokens, separated by ',', where a token is either a string or a pair <string,pattern> in which the string is separated from the pattern by a ':' and the string is assigned to the document as collection only if the pattern matches the url.	11 years ago
Michael Peter Christen	3c5abedabf	NPE during shutdown fix	11 years ago
Michael Peter Christen	e4cbe9232d	fixed a crawler bug where a double-occurring url was not re-crawled because the double-check error was written to the error-db and never deleted. Now the error-db is cleared on every start and these double-messages are not written to the error-db any more.	11 years ago
Michael Peter Christen	765943a4b7	Redesign of crawler identification and robots steering. A non-p2p user in intranets and the internet can now choose to appear as Googlebot. This is an essential necessity to be able to compete in the field of commercial search appliances, since most web pages are these days optimized only for Google and no other search platform any more. All commercial search engine providers have a built-in fake-Google User Agent to be able to get the same search index as Google can do. Without the resistance against obeying to robots.txt in this case, no competition is possible any more. YaCy will always obey the robots.txt when it is used for crawling the web in a peer-to-peer network, but to establish a Search Appliance (like a Google Search Appliance, GSA) it is necessary to be able to behave exactly like a Google crawler. With this change, you will be able to switch the user agent when portal or intranet mode is selected on per-crawl-start basis. Every crawl start can have a different user agent.	11 years ago
Michael Peter Christen	0f3d8890db	removed an assert which causes a shortcut call circuit	11 years ago
Michael Peter Christen	6d5fefe060	added missing files :(	11 years ago
Michael Peter Christen	554c0351dd	fix for http://bugs.yacy.net/view.php?id=286	11 years ago
Michael Peter Christen	47b1c81d08	- refactoring - generalized writing of url attributes to solr documents - added more url attributes to error documents	11 years ago
Michael Peter Christen	1c62fa7698	fix for bad snippets in gsa api	11 years ago
Michael Peter Christen	697613170d	less logging for postprocessing (this was a debugging logging with high CPU load)	11 years ago
reger	b4016ff324	- remove possible double initialization of rdfa parser - use ordered list to use preferred parser for mime/extension first (relates to html, rdfa, argument parser) - harmonize xhtml extension config for the 3 html base parsers	11 years ago
reger	f0575bd44b	FieldReIndex: omit active vocabulary fields from reindex detection	11 years ago
reger	a5019bc470	make Vocabulary Navigator tags a hard result entry filter by checking vocabulary tags also for rwi results (currently a filter is applied to the solr query) TODO: as vocabularies are only locally valid, auto-switch to Searchdom.LOCAL could be considered.	11 years ago
reger	a67a4b7d86	improve tld: query modifier filter pattern (to prevent tld:net accepting www.abcinet.org)	11 years ago
reger	02fe8b43ba	Field Re-Indexing: display list of fields in reindex queue change servlet to display statistic on 1st click (instead after refresh)	11 years ago
sixcooler	7f501b7c38	clear some caches before reporting low Memory do not break lines in Network-table-rows	11 years ago
reger	b355dd52c6	Index Administration - Field Re-Indexing: exclude internal Solr _version_ field from obsolete field check	11 years ago
sixcooler	8a96140f92	fix / workaround for http://forum.yacy-websuche.de/viewtopic.php?f=5&t=4750 + Seed.hash should be final	11 years ago
Michael Peter Christen	2857499467	fix to collection schema; bug appeared for _txt fields with empty String as content	11 years ago
Michael Peter Christen	dbfa865700	added a stub of a class for crawler redesign	11 years ago
Michael Peter Christen	76afcccaaf	fix for default boolean post values: the default value MUST NOT be TRUE, because it's normal that a boolean value is missing in the post argument if a checkbox is not selected. Added also some style enhancements to IndexFederated, removed the Solr attachment manual and replaced it with a link to the wiki which explains this in more detail.	11 years ago
orbiter	252c525709	fixed feed api servlet and enhanced RSSReader class	11 years ago
orbiter	d38c3c14d8	fix for CGI test	11 years ago
Michael Peter Christen	31902f54df	fix for NPE which happens within solr code at MultiMapSolrParams.java, line 52 in case that the array arr.length == 0	11 years ago
Michael Peter Christen	f13df9dbb6	migration to solr 4.4.0	11 years ago
Michael Peter Christen	58fe986cca	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	11 years ago
Michael Peter Christen	cf12835f20	replaced the single-text description solr field with a multi-value description_txt text field	11 years ago
sixcooler	7d53ac86a3	fix for Blacklist (-Administration)	11 years ago
reger	f2d99053ed	Field Re-Indexing: prevent endless error loop in ReindexSolrBusyThread on Solr exception (by skipping query causing the exception) (occurred during testing while working on q=store:[* TO *])	11 years ago
reger	92d3f71b16	htmlParser: closes input stream -> changed it to leave it open for a reset (used by AugmentParser - even if this is practically not used), note: stream.close is done by caller (Textparser.parseSource) - removed unnecessary reset in AugmentParser - added stream.mark in tdfatripleimpl. to make stream.reset work here	11 years ago
orbiter	87cfeaa4f3	fix for npe	11 years ago
orbiter	268a36aaff	emergency fix for crawler: this will otherwise cause loss of complete crawl queue if latency of remote system is too low	11 years ago
orbiter	d05e0c5368	wait a bit longer before doing the first peer ping	11 years ago
orbiter	b8f57f7703	don't be noisy when doing background tasks that may be allowed to fail	11 years ago
Roland Haeder	0343f0668c	Fix for NPE: E 2013/07/26 20:29:29 BUSYTHREAD Runtime Error in serverInstantThread.job, thread 'net.yacy.search.Switchboard.cleanupJob': null; target exception: null java.lang.NullPointerException at net.yacy.search.schema.CollectionConfiguration.convergenceStep(CollectionConfiguration.java:1116) at net.yacy.search.schema.CollectionConfiguration.postprocessing(CollectionConfiguration.java:897) at net.yacy.search.Switchboard.cleanupJob(Switchboard.java:2296) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at net.yacy.kelondro.workflow.InstantBusyThread.job(InstantBusyThread.java:107) at net.yacy.kelondro.workflow.AbstractBusyThread.run(AbstractBusyThread.java:165) Conflicts: source/net/yacy/search/schema/CollectionConfiguration.java	11 years ago
Roland Haeder	b58ca8622d	Some cleanups: - added SKINS_PATH_DEFAULT as same as LISTS_PATH_DEFAULT was added - Added 'final' keyword to a string	11 years ago
Roland Haeder	7263bb82fb	Fix for NPE on shutdown: java.lang.NullPointerException at net.yacy.search.Switchboard.storeDocumentIndex(Switchboard.java:2732) at net.yacy.search.Switchboard.access00(Switchboard.java:207) at net.yacy.search.Switchboard.run(Switchboard.java:3049)	11 years ago
Roland Haeder	13433d41a1	Log this exception better Conflicts: source/net/yacy/kelondro/blob/Tables.java	11 years ago
orbiter	080d80c9de	do not write an empty failreason in case that there is no fail. Because of the lazy instantiation rule this value was not actually written, but if lazy instantiation is switched on, then this causes that all crawl starts delete all crawl-start-hosts completely because this looks for filled error reasons.	11 years ago
Michael Peter Christen	4c242f9af9	always use a default value for boolean options to have transparency for the outcome if the attribute is missing in servlets	11 years ago
Michael Peter Christen	61e015268b	fix in forced deletion: forced commit needed	11 years ago
Michael Peter Christen	83e2921b39	new test case for http://bugs.yacy.net/view.php?id=141	11 years ago
Michael Peter Christen	304aacb2cc	fix for http://bugs.yacy.net/view.php?id=267	11 years ago
Michael Peter Christen	c3b2301b2f	fix for http://bugs.yacy.net/view.php?id=268	11 years ago
reger	aa1a1f1d2c	- small adjustment to make sure genericParser is tried last -- for some documents genericParser grabs document instead of specific available parser due to unordered pick of 1st to try parser (like .ps .rdf files and other) - remove redundant file extension registration	11 years ago
orbiter	3e901dcb06	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	11 years ago
orbiter	f50b596e0b	do not run dht distribution if system load is over 2.5	11 years ago
orbiter	056b42f5aa	- added information about segment count to status_p.xml - also moved this information from the old index structure, which is still in use for the RWI/DHT index to that front-end	11 years ago
orbiter	6fb2811e68	fixes for problems with remote solr and non-activated webgraph index	11 years ago
sixcooler	af740f3058	changed optimization to a segment-size of index-size/5.000.000 + one if not idle + one (and force) if postprocessing	11 years ago
Michael Peter Christen	336f86394c	replaced StringBuffer with StringBuilder	11 years ago
Michael Peter Christen	aeac2fb763	replaced more containsKey() -> get() usages by a simple get(), followed by a test for NULL. This should increase the application speed and reduces the lookup time for the affected methods by 50%	11 years ago
orbiter	5364c4dcc9	delayed first peer-ping to send the first ping out after the http got up; if the ping comes before the http is up, it cannot be recognized as senior peer (if at all). See also: http://bugs.yacy.net/view.php?id=266	11 years ago
orbiter	e24016e30a	added the property federated.service.solr.indexing.timeout to yacy.init to provide a configurable time-out for solr; see also: http://bugs.yacy.net/view.php?id=254	11 years ago
orbiter	c124037f19	removed forced non-soft commits to prevent index fragmentation	11 years ago
Michael Peter Christen	31483c47e1	fixed problem with remote luke requests	11 years ago

2439 Commits (8d60d4d56e1b0e1def80ca826582f237bf308dfc)