yacy_search_server

Commit Graph

Author	SHA1	Message	Date
reger	710054bb37	implement gzip input handling directly in defaultservlet (making reference to legacy httpdemon obsolete)	11 years ago
Michael Peter Christen	b4b0d14c04	fix for display bug	11 years ago
Michael Peter Christen	9a5ab4e2c1	removed clickdepth_i field and related postprocessing. This information is now available in the crawldepth_i field which is identical to clickdepth_i because of a specific crawler strategy.	11 years ago
Michael Peter Christen	da86f150ab	- added a new Crawler Balancer: HostBalancer and HostQueues: This organizes all urls to be loaded in separate queues for each host. Each host separates the crawl depth into its own queue. The primary rule for urls taken from any queue is that the crawl depth is minimal. This produces a crawl depth which is identical to the clickdepth. Furthermore, the crawl is able to create a much better balancing over all hosts which is fair to all hosts that are in the queue. This process will create a very large number of files for wide crawls in the QUEUES folder: for each host a directory, for each crawl depth a file inside the directory. A crawl with maxdepth = 4 will be able to create 10,000s of files. To be able to use that many file readers, it was necessary to implement a new index data structure which opens the file only if an access is wanted (OnDemandOpenFileIndex). The usage of such an on-demand file reader should keep the number of open file pointers below the system limit, which is usually about 10,000 open files. Some parts of YaCy had to be adapted to handle the crawl depth number correctly. The logging and the IndexCreateQueues servlet had to be adapted to show the crawl queues differently, because the host name is attached to the port on the host to differentiate between http, https, and ftp services.	11 years ago
Michael Peter Christen	075b6f9278	refactoring of the crawl balancer: the balancer is turned into an interface and the old balancer class is moved into LegacyBalancer to make room for a fresh implementation of a crawl balancer.	11 years ago
Michael Peter Christen	8470dfe3f8	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	11 years ago
reger	46016fa153	autoupdate fails to download latest release (1.71) due to default release blacklist - removed the default version blacklist regex from init (for future versions) !!! left existing update blacklist setting untouched !!! (existing installations wanting autoupdate for 1.71 need to change the blacklist in ConfigUpdate_p.html) - moved old blacklist patch to migration.java	11 years ago
Michael Peter Christen	8aeef73d49	fix for virtual root nodes	11 years ago
Michael Peter Christen	7c7fbb9818	find depth-matches also for edge targets	11 years ago
Michael Peter Christen	dd12dd392f	introduction of a data structure for HyperlinkEdges which should use less memory as it does no double-storage of source links for each edge of the graph.	11 years ago
Michael Peter Christen	6ea8bb7348	using MultiProtocolURL for edge data which is faster (hash computation is now much easier) and smaller in size	11 years ago
Michael Peter Christen	b21c208b4d	enhanced hashcode computation for MultiProtocolURL	11 years ago
Michael Peter Christen	ce1d1b2fa0	fix for maximum tag length in parser	11 years ago
Michael Peter Christen	17e0956312	refactoring of SystemLoad calls (only one backend tool)	11 years ago
Michael Peter Christen	a37d067692	refactoring	11 years ago
orbiter	95780eed32	Merge branch 'master' of git@gitorious.org:yacy/rc1.git	11 years ago
Michael Peter Christen	67beef657f	strong redesign of html parser: object recursion is now made using a stack on html tag objects, not using a recursive parse-again method which may cause bad performance and huge memory allocation. The new method also produced better parsed image objects with exact anchor text references.	11 years ago
Michael Peter Christen	6bd8c6f195	fix for wrong status codes of error pages	11 years ago
Michael Peter Christen	9e503b3376	also delete the robots.txt file from the cache when a new crawl is started	11 years ago
orbiter	67501c9dda	Merge branch 'master' of git@gitorious.org:yacy/rc1.git	11 years ago
Michael Peter Christen	1c21b3256d	fix for robots.txt handling: delete old entry before starting a new crawl.	11 years ago
orbiter	c250fac9f4	linkstructure refactoring to get more options for clickdepth analysis	11 years ago
Michael Peter Christen	8068e68474	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	11 years ago
Michael Peter Christen	bd886054cb	new structure and enhancements for link graph computation: - added order option to solr queries to be able to retrieve document lists in specific order, here: link length - added HyperlinkEdge class which manages the link structure - integrated the HyperlinkEdge class into clickdepth computation - extended the linkstructure.json servlet to show also the clickdepth and other statistic information	11 years ago
reger	f326a67561	fix: typo in default charset in metadata2solr; update pom and NB build to Solr 4.7.1 libs	11 years ago
Michael Peter Christen	df138084c0	do solr optimization independently from memory and load constraints: - not doing an optimization will likely cause a too many files exception - without optimization performance will be even worse which would prevent optimization in the future as well (prevent a deadlock situation)	11 years ago
Michael Peter Christen	ebd44a7080	replaced solr 4.6.1 with solr 4.7.1 and added index migration to lucene_47	11 years ago
Michael Peter Christen	0f3fbae438	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	11 years ago
reger	1a6e0354db	update commons-compress.jar to 1.8	11 years ago
Michael Peter Christen	68417a05c5	different algorithm to test checkalive as it depends less on the existence of wget (or curl) on the OS.	11 years ago
Michael Peter Christen	6b0e62ec59	Emergency bugfix for killYACY.sh as the file yacy00.log does not exist when a too many open files error occurs. In such a case, only the file yacy00.log.lck exists. In the long term a different solution should be found.	11 years ago
Michael Peter Christen	ee92d748b5	test using compound file format, see UseCompoundFile in https://cwiki.apache.org/confluence/display/solr/IndexConfig+in+SolrConfig This appears to be necessary as many times a java.io.FileNotFoundException: (Too many open files) appears. See also: https://issues.apache.org/jira/browse/SOLR-4 and desperate users at http://stackoverflow.com/questions/3828343/too-many-open-file-exception-while-indexin-using-solr We cannot force users to do a "ulimit -n 1000000", so this action seems to be required.	11 years ago
Michael Peter Christen	d2055f3d4b	next development version 1.71 It's nowhere explained or declared, but for some time we have followed the scheme that odd version numbers are used for development versions and even numbers for release versions. That concept may change sometime but this is used at this time to distinguish development from main.	11 years ago
reger	d1b5180dd9	upd version in pom	11 years ago
Michael Peter Christen	d051d2d85f	release 1.7	11 years ago
Michael Peter Christen	0a95fd27f3	update of seed list	11 years ago
Michael Peter Christen	6e84770fd9	Merge branch 'master' of gitorious.org:yacy/icewindxs-rc1 Conflicts: locales/ru.lng	11 years ago
malykhin.dmitry	f509cd4aab	Update russian translation	11 years ago
Michael Peter Christen	f296a529d5	update to german locale	11 years ago
Michael Peter Christen	734778c0c8	fixed a time-out problem in the default servlet which is also a logging problem because the error log showed the wrong reason (file not found) instead of the actual reason (time-out).	11 years ago
Michael Peter Christen	466d90ad42	fixed a problem with resource observer; probably coming from uncaught exceptions within the apache library which appear only in concurrent environments.	11 years ago
Michael Peter Christen	c8d4a63604	eliminating the word 'Facet' from the interface because it is ugly. If people do not know what search navigation is, then they also do not know what a 'facet' is.	11 years ago
Michael Peter Christen	e8ddd415a8	enhanced the new link structure graph	11 years ago
Michael Peter Christen	926d28dd3f	fixed a bug which prevented crawl starts after a network switch	11 years ago
Michael Peter Christen	8443255e18	better link structure limit calibration	11 years ago
Michael Peter Christen	7f5733638b	fix for linkstructure computation: now also detecting dead links	11 years ago
Michael Peter Christen	3ce8eff21b	another fix for inbound/outbound detection	11 years ago
Michael Peter Christen	d4b5c457e4	NPE fix	11 years ago
Michael Peter Christen	36a66b0704	fix for parsing of numeric value in case that boolean values are given	11 years ago
orbiter	41730c8048	better logging in template engine: shows filename of servlets where errors in templates occur	11 years ago


11227 Commits (3562b5e3a42c7c440ff791637863764842db1f23)