151 Commits (0587077d066efe32ea92133bb5b64bf9f2d5dbea)

Author SHA1 Message Date
orbiter c6f0bd05f8 better removal of stored urls when doing a crawl start
11 years ago
orbiter 469e0a62f1 added new button to terminate all crawls
11 years ago
Michael Peter Christen 10cf8215bd added crawl depth for failed documents
11 years ago
Michael Peter Christen da86f150ab - added a new Crawler Balancer: HostBalancer and HostQueues:
11 years ago
Michael Peter Christen 6bd8c6f195 fix for wrong status codes of error pages
11 years ago
Michael Peter Christen 9e503b3376 also delete the robots.txt file from the cache when a new crawl is
11 years ago
Michael Peter Christen 1c21b3256d fix for robots.txt handling: delete old entry before starting a new
11 years ago
Michael Peter Christen a6bb9be97e - added d3.js for visualizations using embedded svg
11 years ago
Michael Peter Christen bd54b85d46 fix for relative sitemap urls
11 years ago
reger d052bbdfe1 prevent exception on Site Crawl if no start url is given
11 years ago
Michael Peter Christen a86c2fe77d fixed usage of media flag when started by automated process
11 years ago
Michael Peter Christen 6ada0daae9 making latency_factor and maximum number of same hosts in loader queue
11 years ago
reger 41c126978b fix bug: Crawl Start (Expert) crawls "?-URLs" even if told not to do so
11 years ago
Michael Peter Christen 0db8e34625 enhanced webgraph processing
11 years ago
orbiter 19a051bec8 more monitoring for postprocessing and enhanced layout in Crawler
11 years ago
orbiter 74c86a72a0 better default value for crawler user agent
11 years ago
Michael Peter Christen 030d0776ff Enhanced crawl start for very, very large crawl lists (i.e. > 5000)
11 years ago
Michael Peter Christen 1a09771be8 fixed sitemap crawl start
11 years ago
Michael Peter Christen 82bfd9e00a - crawl profiles shall be deleted from active and passive stacks if they
11 years ago
Michael Peter Christen e40671ddb7 better and consistent deletions for error urls
11 years ago
Michael Peter Christen 2602be8d1e - removed ZURL data structure; removed also the ZURL data file
11 years ago
Michael Peter Christen 5e31bad711 - the webgraph shall store all links which appear on a web page and not
11 years ago
Michael Peter Christen dbef8ccfcb forced deletion of ZURL entries for a specific host for each host that
11 years ago
Michael Peter Christen 765943a4b7 Redesign of crawler identification and robots steering. A non-p2p user
11 years ago
Michael Peter Christen e879b97b0a added line to enhance debugging
11 years ago
Michael Peter Christen 76afcccaaf fix for default boolean post values: the default value MUST NOT be TRUE,
11 years ago
Michael Peter Christen 4c242f9af9 always use a default value for boolean options to have transparency for
11 years ago
orbiter 9c681cc00d added segment sizes, postprocessing status and cpu load to crawler
11 years ago
Roland Haeder 841a28ae76 Added 'final' for all exception blocks as this helps the Java compiler
11 years ago
Michael Peter Christen 89c0aa0e74 added collection_sxt to error documents
11 years ago
Michael Peter Christen bcc623a843 refactoring of load_delay: this is a matter of client identification
12 years ago
Michael Peter Christen 5878c1d599 - refactoring of log to ConcurrentLog:
12 years ago
Michael Peter Christen 57ffdfad4c added a crawl option to obey html-meta-robots-noindex. This is on by
12 years ago
Michael Peter Christen f1c5338210 preparation for greedy crawl profiles and refactoring
12 years ago
Michael Peter Christen 8f2d3ce2f9 reduced locking situation in crawler: shifted synchronized location and
12 years ago
Michael Peter Christen f93501e6e0 nice crawl name if crawl is started with file:// (was: null)
12 years ago
Michael Peter Christen b24d1d18e4 removed synchronization and concurrency in Fulltext class, concurrent
12 years ago
Michael Peter Christen e26bdd4a52 fixes to deletion methods (removed unnecessary concurrency and added
12 years ago
Michael Peter Christen cca19d94d4 re-declared some fields to be of type string rather than text which
12 years ago
Michael Peter Christen 25499eead5 - added a new field for the regular expression in crawl start
12 years ago
orbiter 2c3b024196 if the crawl was paused (automatically), show the reason for pausing in
12 years ago
Michael Peter Christen 788288eb9e added the generation of 50 (!!) new solr fields in the core 'webgraph'.
12 years ago
Michael Peter Christen 91a0401d59 introduced a second core named 'webgraph'. This core will hold the link
12 years ago
Michael Peter Christen 0b6566a389 optimizations when starting large crawl requests with many start urls in
12 years ago
Michael Peter Christen be27567b53 allow more links when starting a crawl by file
12 years ago
Michael Peter Christen 0fe7b6fd3b migrated the index export methods from the old metadata to solr. Now
12 years ago
Michael Peter Christen 4735bd47f4 - changed solr commit call and added an optimize option. Since Solr
12 years ago
Michael Peter Christen fb0fa9a102 - fixed 'delete from subpath' during crawl start which deleted nothing;
12 years ago
Michael Peter Christen eca68fa197 added debug code to crawler monitor
12 years ago
Michael Peter Christen 5fd3b93661 added deletion of hosts during crawl start if deleteold option was given
12 years ago