yacy_search_server

Commit Graph

Author	SHA1	Message	Date
orbiter	b2fe4b7b1a	added a handling of appearances of yacy bot entries in robots.txt if this entry addresses the yacy peer (directly or indirectly) and it grants a crawl-delay of 0. Then all forced pause mechanisms in YaCy are switched off and the domain is crawled at full speed. crawl delay values can be assigned to either - all yacy peers using the user-agent yacybot - a specific peer with peer name <peer-name>.yacy or - a specific peer with peer hash <peer-hash>.yacyh git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7639 6c8d7289-2bf4-0310-a012-ef5d649a1542	14 years ago
orbiter	7962d35425	- removed file upload function in crawl start and replaced it with an input field for a file path where the crawl start file is loaded. This was necessary to support the API steering for file crawl starts, for two reasons: 1) if the file is changed for a re-crawl this is not reflected in the steering because it would take the previously uploaded crawl start file 2) browsers do not submit the full path of the selected file even if this path is shown in the input field because of security reasons. There is no work-around or hack to make the submission of the full path possible - fixed deletion of crawl start point urls in crawl stack and balancer double-check - fixed a problem with steering self-call (no resolving of localhost) - added more logging for the crawler to supervise why crawl urls are not taken by the loader - added a javascript onload-function to select domain restriction in all cases where a crawl is started from a file or from a url - fixed the restrict-to-domain pattern computation, added a 'www.'-prefix and added this functionality also to a crawl start from file git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7574 6c8d7289-2bf4-0310-a012-ef5d649a1542	14 years ago
low012	c5051c4020	- fixed bug which caused entries to not be deleted when deleting by URL on IndexCreateWWWLocalQueue_p.html (I hope this did not break anything else) - cleaned up code a little bit git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7493 6c8d7289-2bf4-0310-a012-ef5d649a1542	14 years ago
orbiter	4588b5a291	- fixed document number limitation for crawls that restrict the number of documents per domain - some restructuring of the document counting and logging structures was necessary - better abstraction of CrawlProfiles - added deletion of logs to the index deletion option (if the index is deleted using the servlets) which is necessary to reset the domain counters for the page limitation - more refactoring to get the LibraryProvider more clean - some refactoring of the Condenser class git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7478 6c8d7289-2bf4-0310-a012-ef5d649a1542	14 years ago
orbiter	a563b05b60	enhanced crawler: - added a new queue 'noload' which can be filled with urls where it is already known that the content cannot be loaded. This may be because there is no parser available or the file is too big - the noload queue is emptied with the parser process which indexes the file names only - the 'start from file' functionality now also reads from ftp crawler git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7368 6c8d7289-2bf4-0310-a012-ef5d649a1542	14 years ago
orbiter	fffb91447a	fixed crawl queue delete function git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7357 6c8d7289-2bf4-0310-a012-ef5d649a1542	14 years ago
orbiter	65eaf30f77	redesign of crawl profiles data structure. target will be: - permanent storage of auto-dom statistics in profile - storage of profiles in WorkTable data structure not finished yet. No functional change yet. git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7088 6c8d7289-2bf4-0310-a012-ef5d649a1542	14 years ago
orbiter	a82a93f2fc	- better url double check in crawler - more logging for error urls git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7032 6c8d7289-2bf4-0310-a012-ef5d649a1542	14 years ago
orbiter	1a8a134e0c	continuing String-hash - to - byte[]-hash redesign that was started in SVN 6775 and continued in SVN 6790 The result should be less usage of new String() and less memory usage (since a String-encapsulated byte[] has 40 bytes overhead) git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6815 6c8d7289-2bf4-0310-a012-ef5d649a1542	15 years ago
orbiter	25aef069a6	continuing String-hash - to - byte[]-hash redesign that was started in SVN 6775 git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6790 6c8d7289-2bf4-0310-a012-ef5d649a1542	15 years ago
orbiter	1e8e79b9ef	redesign of reference hash (URL-hash) parameter hand-over: pass value as byte[], not as String. This should cause that less byte[] <-> String conversions are made during time-critical tasks. This redesign is not yet complete, more to come .. git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6775 6c8d7289-2bf4-0310-a012-ef5d649a1542	15 years ago
orbiter	362b7a929b	added extensive memory protection logic to avoid out of memory errors that may be caused by the RowCollection memory allocation function git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6521 6c8d7289-2bf4-0310-a012-ef5d649a1542	15 years ago
orbiter	4a5100789f	replaced _all_ size() == 0 with isEmpty() and all size() > 0 with !isEmpty(). The isEmpty() method is much faster in some cases, especially when used to access badly balanced hashtables where a size() operation becomes a large iteration. git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6510 6c8d7289-2bf4-0310-a012-ef5d649a1542	15 years ago
orbiter	f677d534b1	start of a really extensive refactoring which will produce a hierarchical package structure with the domain yacy.net as package root - moved here the logging classes as part of the new net.yacy.kelondro package git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6391 6c8d7289-2bf4-0310-a012-ef5d649a1542	15 years ago
orbiter	573d03c7d7	added configuration to enable ram table copy git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6304 6c8d7289-2bf4-0310-a012-ef5d649a1542	15 years ago
orbiter	ca72ed7526	-removed superfluous crawl cache -refactoring of crawler classes git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6221 6c8d7289-2bf4-0310-a012-ef5d649a1542	16 years ago
orbiter	13c63f4082	a set of small fixes to crawling behaviour git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6216 6c8d7289-2bf4-0310-a012-ef5d649a1542	16 years ago
orbiter	ce1adf9955	serialized all logging using concurrency: high-performance search query situations as seen in yacy-metager integration showed a deadlock situation caused by synchronization effects inside of sun.java code. It appears that the logger is not completely safe against deadlock situations in concurrent calls of the logger. One possible solution would be an outside-synchronization with 'synchronized' statements, but that would further apply blocking on all high-efficient methods that call the logger. It is much better to do a non-blocking hand-over of logging lines and work off log entries with a concurrent log writer. This also disconnects IO operations from logging, which can also cause IO operation when a log is written to a file. This commit not only moves the logger from kelondro to yacy.logging, it also inserts the concurrency methods to realize non-blocking logging. git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6078 6c8d7289-2bf4-0310-a012-ef5d649a1542	16 years ago
orbiter	95e8cbd1c3	new fully redesigned balancer and bugfixes regarding lost profile handles and killed crawls git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6025 6c8d7289-2bf4-0310-a012-ef5d649a1542	16 years ago
orbiter	138422990a	- removed useCell option: the indexCell data structure is now the default index structure; old collection data is still migrated - added some debugging output to balancer to find a bug - removed unused classes for index collection handling - changed some default values for the process handling: more memory needed to prevent OOM git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5856 6c8d7289-2bf4-0310-a012-ef5d649a1542	16 years ago
lotus	ab0030d7a7	allow dht-out for remote-crawl processing peers on default settings git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5834 6c8d7289-2bf4-0310-a012-ef5d649a1542	16 years ago
orbiter	37f892b988	added new concurrent merger class for IndexCell RWI data git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5735 6c8d7289-2bf4-0310-a012-ef5d649a1542	16 years ago
orbiter	67aaffc0a2	- added Latency control to the crawler: because of the strongly enhanced indexing speed when using the new IndexCell RWI data structures (> 2000PPM on my notebook), it is now necessary to control the crawling speed depending on the response time of the target server (which is also YaCy in case of some intranet indexing use cases). The latency factor in crawl delay times is derived from the time that a target host takes to answer http requests. For internet domains, the crawl delay is a minimum of twice the response time, in intranet cases the delay time is now half of the response time. - added API to monitor the latency times of the crawler: a new api at /api/latency_p.xml returns the current response times of domains, the time when the domain was accessed by the crawler the last time and many more attributes. git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5733 6c8d7289-2bf4-0310-a012-ef5d649a1542	16 years ago
orbiter	024da2916b	refactoring of logging git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5544 6c8d7289-2bf4-0310-a012-ef5d649a1542	16 years ago
orbiter	d39d420b39	performance hacks git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5376 6c8d7289-2bf4-0310-a012-ef5d649a1542	16 years ago
orbiter	2d65887723	- fix for bug in new profile handling - added a new feature in ymageChart (cannot be seen yet, just wait... will be used in profiling chart) git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5261 6c8d7289-2bf4-0310-a012-ef5d649a1542	16 years ago
orbiter	ff68f394dd	fix for problem with balancer and lost crawl profiles: if crawl profile is lost, no robots.txt is loaded any more git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5258 6c8d7289-2bf4-0310-a012-ef5d649a1542	16 years ago
orbiter	1bbf362cef	update to the crawl balancer: better organization and better crawl delay prediction git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5176 6c8d7289-2bf4-0310-a012-ef5d649a1542	16 years ago
danielr	621b473b18	* removed some warnings of findbugs (http://findbugs.sf.net ) - removed unnecessary code (unused variables, String.toString) - corrected some calculations (cast int to double or long ;) - improved little performance (using Integer.valueOf() instead of new Integer) - log if some File-actions fail (mkdir(), delete(), ...) and some ignored exceptions - finalized some (more) fields - finally close some streams - made inner classes static if not using environment - generalized some equals (from specificClass to Object) - fixed some potential nullpointer accesses git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5039 6c8d7289-2bf4-0310-a012-ef5d649a1542	17 years ago
danielr	3bb870bfcd	added final where possible git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5030 6c8d7289-2bf4-0310-a012-ef5d649a1542	17 years ago
orbiter	c3d461d191	- removed superfluous copyright statement - updated my email address git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5011 6c8d7289-2bf4-0310-a012-ef5d649a1542	17 years ago
orbiter	3ca98fee42	removed superfluous copyright statement git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5010 6c8d7289-2bf4-0310-a012-ef5d649a1542	17 years ago
orbiter	474659a71f	- modified and enhanced the crawl balancer: better list export, fixing of damaged crawl queue at start-up, re-sorting at start-up to enhance domain order - added option to set minimum crawl delta for domains in balancer - added default values to crawl deltas in yacy.init - added configuration for these deltas in performance queues - enhanced performance setting computation (more time for indexing queue for a faster flush) - remote crawling is now enabled during local crawling if indexer has space and time for more links - added database stub for new distributed file system - refactoring of time computation to get an abstraction level that will be used by a TTL rule in new distributed file system git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@4966 6c8d7289-2bf4-0310-a012-ef5d649a1542	17 years ago
danielr	7feae906aa	- organize imports - removed potential null pointer accesses - removed unnecessary casts git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@4893 6c8d7289-2bf4-0310-a012-ef5d649a1542	17 years ago
orbiter	2f381b8d7a	- fixed at least two causes for a NPE after a use case switch. A large refactoring was necessary - added another crawl start option: automatic restriction to sub-path - removed crawlStartSimple and renamed crawl start expert to crawl start (without expert) - some changes to texts in crawl start - added some more deletions when a web index is deleted: delete also queues and robots cache git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@4881 6c8d7289-2bf4-0310-a012-ef5d649a1542	17 years ago
orbiter	dd75b3cabc	- patch for bad profiles - time-out when deleting profiles git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@4793 6c8d7289-2bf4-0310-a012-ef5d649a1542	17 years ago
orbiter	1689030ee8	refactoring: moved all crawler classes into their own package git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@4768 6c8d7289-2bf4-0310-a012-ef5d649a1542	17 years ago

37 Commits (78d6d6ca0640149bb10645ac142f05dd3bb90794)