Commit Graph

36 Commits (d223cf0ae4b5853841f141fec98590497798b1d5)

Author SHA1 Message Date
reger 0c97cc2440 skip unused call parameter for hashSentence()
10 years ago
Michael Peter Christen 2de159719b added an option to set 'obey nofollow' for links with rel="nofollow"
10 years ago
Michael Peter Christen 10cf8215bd added crawl depth for failed documents
11 years ago
Marc Nause 809b4e1fd9 Team added support for URLs with unicode characters in host part to
11 years ago
Michael Peter Christen fdaeac374a - enhanced postprocessing speed and memory footprint (by using HashMaps
11 years ago
Michael Peter Christen 2602be8d1e - removed ZURL data structure; removed also the ZURL data file
11 years ago
Michael Peter Christen 61c5e40687 - replaced the properties object in AnchorURL with distinct variables
11 years ago
Michael Peter Christen 5e31bad711 - the webgraph shall store all links which appear on a web page and not
11 years ago
Michael Peter Christen 765943a4b7 Redesign of crawler identification and robots steering. A non-p2p user
11 years ago
Michael Peter Christen 47b1c81d08 - refactoring
11 years ago
Michael Peter Christen 89c0aa0e74 added collection_sxt to error documents
11 years ago
Michael Peter Christen 5878c1d599 - refactoring of log to ConcurrentLog:
12 years ago
Michael Peter Christen 8f2d3ce2f9 reduced locking situation in crawler: shifted synchronized location and
12 years ago
Michael Peter Christen 788288eb9e added the generation of 50 (!!) new solr field in the core 'webgraph'.
12 years ago
Michael Peter Christen d6b82840f8 added a feature to find similarities in documents.
12 years ago
Michael Peter Christen 21fe8339b4 - enhanced generation of url objects
12 years ago
Michael Peter Christen 5f0ab25382 removed the option to prevent removal of & parts inside of the
12 years ago
Michael Peter Christen a06930662c replaced some more .getBytes() with UTF8/ASCII.getBytes()
12 years ago
Michael Peter Christen 1533bfd63b refactoring
12 years ago
Michael Peter Christen 8219a445f3 refactoring
12 years ago
Michael Peter Christen 00c1c777fa refactoring
12 years ago
Michael Peter Christen 24d9db1613 snippet retrieval loading processes may use a smaller minimum load time
12 years ago
Michael Peter Christen 1687737771 Abstraction of HandleMap and HandleSet
12 years ago
orbiter 0cbda0b2b8 - replaced all length() == 0 and size() == 0 with isEmpty()
13 years ago
Michael Peter Christen 7c1ba99755 removed more unused method parameters
13 years ago
Michael Peter Christen 1825f165b8 better integration of blacklist according to use case
13 years ago
Roland 'Quix0r' Haeder edaa09b9b1 Rewrote all String blacklist types to enum 'BlacklistType', closes bug
13 years ago
Michael Peter Christen 10da7335ea performance hack: use a hash cache for all hashes that are computed by a
13 years ago
Michael Peter Christen c15fcde1c8 add-on to latest commit
13 years ago
Michael Peter Christen cf47d94888 performance hack to parse numbers inside of substrings without actually
13 years ago
Michael Peter Christen 14f67f217c refactoring of ContentDomain: now subclass of Classification
13 years ago
Michael Peter Christen a1a5b015d8 refactoring: moved document Classification to cora package
13 years ago
Michael Peter Christen ef78f22ee1 performance hack
13 years ago
Roland 'Quix0r' Haeder a3083d13bf Blacklist checks are now always turned on, in media searches (e.g. image search) images matching blacklist entries are no longer shown to the user
13 years ago
orbiter 5a55397f99 some last-minute performance hacks
13 years ago
orbiter d2ea250d99 refactoring:
13 years ago