Commit Graph

98 Commits (5b3acc12cd4b4343c4e7d7f0a20a1da8ea8d5f6a)

Author SHA1 Message Date
Roland 'Quix0r' Haeder a093ccf5eb Now used synchronization in all close() methods to make sure all objects
13 years ago
Michael Peter Christen f8cd57c92f new indexing strategy: ALL links that appear anywhere are indexed, not
13 years ago
Michael Christen 22f05c83ff fixed default must-match filter for full domain crawls - the old filter
13 years ago
Michael Peter Christen 0cc0290978 bugfix for a must-not-match pattern check. This bug did not make the
13 years ago
Michael Peter Christen c6c61be3f0 fix for http://bugs.yacy.net/view.php?id=148
13 years ago
Michael Peter Christen 9ad1d8dde2 complete redesign of crawl queue monitoring: do not look at a
13 years ago
orbiter 8895d8c1cd removed unnecessary log entries
13 years ago
orbiter 3a807e10cf - added a cache for active crawl profiles to the crawl switchboard
13 years ago
orbiter a7df70221e refactoring
14 years ago
orbiter b250e6466d implemented crawl restrictions for IP pattern and country lists
14 years ago
orbiter d2ea250d99 refactoring:
14 years ago
orbiter 2cba860693 - fix for wrong entries in NOLOAD indexing queue (that caused that urls had been only indexed based on their url and not loaded)
14 years ago
orbiter 4bea3f9714 hack to reduce resource contention caused by massive UTF8 decodings which use java.nio resources:
14 years ago
orbiter 10e2f588f8 - enhanced ybr ranking computation
14 years ago
orbiter bd55dcee50 - commented out experimental distributed ranking loading
14 years ago
orbiter 5b579e21a3 code cleanup
14 years ago
orbiter 6fa439c82b - refactoring of robots
14 years ago
f1ori 0b02083e97 * function for simple crawl of one url
14 years ago
orbiter b77b8cac0c - enhanced html parser: recognized much more details in the content
14 years ago
orbiter 3d5104d357 - fixed a bug in crawl start with file name (npe in new url)
14 years ago
orbiter 4c013d9088 more UTF8 getBytes() performance hacks
14 years ago
orbiter 7962d35425 - removed file upload function in crawl start and replaced it with an input field for a file path where the crawl start file is loaded. This was necessary to support the API steering for file crawl starts, for two reasons:
14 years ago
orbiter cb1f49d0f2 replaced all 'new String' with default encoding (missing) or UTF-8 encoding with a String generation method that uses a pre-defined Charset constant for UTF-8. This avoids a cache-lookup for the Charset object using String hashing of the String 'UTF-8'.
14 years ago
orbiter 4588b5a291 - fixed document number limitation for crawls that restrict the number of documents per domain
14 years ago
orbiter 9b25a33fd9 - fixed numerous bugs
14 years ago
orbiter 56264dcc17 - added CamelCase parser to MultiProtocolURI: generate better to-be-indexed words from urls
14 years ago
orbiter a563b05b60 enhanced crawler:
14 years ago
orbiter c36da90261 added a very fast ftp file list generator to site crawler:
14 years ago
f1ori def4253555 * add option to network definition to provide a domainlist (syntax like in blacklists)
14 years ago
orbiter f6eebb6f99 replaced auto-dom filter with easy-to-understand Site Link-List crawler option
15 years ago
orbiter 65eaf30f77 redesign of crawl profiles data structure. target will be:
15 years ago
orbiter 3197ca42ed preparations to move the HTCache into cora:
15 years ago
orbiter 0d81731e88 fixed crawler bug caused by NPE in logging
15 years ago
orbiter a82a93f2fc - better url double check in crawler
15 years ago
orbiter 171f2bd84e - removed unused network oanet
15 years ago
sixcooler d88b9606d1 fix for http://forum.yacy-websuche.de/viewtopic.php?f=6&t=2923
15 years ago
sixcooler 15e8c13526 ... migrating to HttpComponents-Client-4.x ...
15 years ago
orbiter 22dbbcfa56 better (and corrected) recognition of intranet and internet-addresses. This corrects the isLocal property that is used by network definitions to restrict index ranges to local and global addresses. Address locations (intranet or internet) had been partly identified by the top level domain of the host address. Since intranet addresses can also be addressed using a host name that is in a country domain it is necessary to do a dns resolving for each check. The check is supported by a local dns cache so the intranet/internet check should not affect network traffic too much. To ensure that the cache works properly the cache class was upgraded to better concurrency data structures.
15 years ago
orbiter dcd01698b4 added a 'transition feature' that shall lower the barrier to move from g**gle to yacy (yes!):
15 years ago
orbiter 11639aef35 - added new protocol loader for 'file'-type URLs
15 years ago
orbiter c45117f81f fixed dates in metadata
15 years ago
orbiter 1a8a134e0c continuing String-hash - to - byte[]-hash redesign that was started in SVN 6775 and continued in SVN 6790
15 years ago
orbiter 25aef069a6 continuing String-hash - to - byte[]-hash redesign that was started in SVN 6775
15 years ago
orbiter b88f5fbb4b slightly changed crawling policy
15 years ago
orbiter c09a995930 better logging of double occurrences of urls in the crawler
15 years ago
orbiter 473b11033d fixed network switch process - crawling did not work after a switch before this fix
15 years ago
orbiter dd459281c8 applied code changes that are recommended by PMD
15 years ago
orbiter 4a5100789f replaced _all_ size() == 0 with isEmpty() and all size() > 0 with !isEmpty(). The isEmpty() method is much faster in some cases, especially when used to access badly balanced hashtables where an size() operation becomes a large iteration.
15 years ago
orbiter 4431b9767e added about 450 replacements for printStackTrace() methods to pipe such traces into the log at DATA/LOG/
15 years ago
orbiter a0e891c63d - some redesign in UI menu structure to make room for new 'Content Integration' main menu containing import servlets for Wikimedia Dumps, phpbb3 forum imports and OAI-PMH imports
15 years ago