171 Commits (ec759591627fbb3d674594e6886f763020277d83)

Author SHA1 Message Date
Michael Peter Christen 3c4c69adea fix for
10 years ago
Michael Peter Christen 6c2e6f1f37 remove redundant code
10 years ago
Michael Peter Christen 97930a6aad added must-not-match filter to snapshot generation.
10 years ago
Michael Peter Christen fed26f33a8 enhanced timezone management for indexed data:
10 years ago
Michael Peter Christen 710a0efa1b generalized time period computations
10 years ago
Michael Peter Christen b5ac29c9a5 added a html field scraper which reads text from html entities of a
10 years ago
reger 0260d3d800 Allow to hide linkstructure graphic in crawl monitor
10 years ago
Michael Peter Christen 5d4167f977 reactivated clear stacks code for termination of all crawls because this
10 years ago
Michael Peter Christen 8600ea01dd automatically switch on query option in case intranet protocols (smb/ftp)
10 years ago
Michael Peter Christen 8df8ffbb6d enhanced the snapshot functionality:
10 years ago
Michael Peter Christen a95af11050 enhancement for clearing the crawl queue
10 years ago
Michael Peter Christen 97f6089a41 YaCy can now create web page snapshots as pdf documents which can later
10 years ago
Michael Peter Christen ad0da5f246 added new web page snapshot infrastructure which will lead to the
10 years ago
Michael Peter Christen 8c1a89cb34 added another decoration flag to switch off network graphics in crawler
10 years ago
Michael Peter Christen 9bc3e457dd fix for termination of all crawls
10 years ago
Michael Peter Christen 542c20a597 changed handling of crawl profile field crawlingIfOlder: this should be
10 years ago
Michael Peter Christen 2de159719b added an option to set 'obey nofollow' for links with rel="nofollow"
10 years ago
Michael Peter Christen f23c4142e0 added option to configure a custom user agent within allip networks
11 years ago
reger ca5437dd50 fix crawl of file:// , also http://mantis.tokeek.de/view.php?id=149
11 years ago
reger 1b37b12998 fix: CrawlStartExpert.html # From File with missing filename
11 years ago
orbiter c6f0bd05f8 better removal of stored urls when doing a crawl start
11 years ago
orbiter 469e0a62f1 added new button to terminate all crawls
11 years ago
Michael Peter Christen 10cf8215bd added crawl depth for failed documents
11 years ago
Michael Peter Christen da86f150ab - added a new Crawler Balancer: HostBalancer and HostQueues:
11 years ago
Michael Peter Christen 6bd8c6f195 fix for wrong status codes of error pages
11 years ago
Michael Peter Christen 9e503b3376 also delete the robots.txt file from the cache when a new crawl is
11 years ago
Michael Peter Christen 1c21b3256d fix for robots.txt handling: delete old entry before starting a new
11 years ago
Michael Peter Christen a6bb9be97e - added d3.js for visualizations using embedded svg
11 years ago
Michael Peter Christen bd54b85d46 fix for relative sitemap urls
11 years ago
reger d052bbdfe1 prevent exception on Site Crawl if no start url is given
11 years ago
Michael Peter Christen a86c2fe77d fixed usage of media flag when started by automated process
11 years ago
Michael Peter Christen 6ada0daae9 making latency_factor and maximum number of same hosts in loader queue
11 years ago
reger 41c126978b fix bug: Crawl Start (Expert) crawls "?-URLs" even if told not to do so
11 years ago
Michael Peter Christen 0db8e34625 enhanced webgraph processing
11 years ago
orbiter 19a051bec8 more monitoring for postprocessing and enhanced layout in Crawler
11 years ago
orbiter 74c86a72a0 better default value for crawler user agent
11 years ago
Michael Peter Christen 030d0776ff Enhanced crawl start for very, very large crawl lists (i.e. > 5000)
11 years ago
Michael Peter Christen 1a09771be8 fixed sitemap crawl start
11 years ago
Michael Peter Christen 82bfd9e00a - crawl profiles shall be deleted from active and passive stacks if they
11 years ago
Michael Peter Christen e40671ddb7 better and consistent deletions for error urls
11 years ago
Michael Peter Christen 2602be8d1e - removed ZURL data structure; removed also the ZURL data file
11 years ago
Michael Peter Christen 5e31bad711 - the webgraph shall store all links which appear on a web page and not
11 years ago
Michael Peter Christen dbef8ccfcb forced deletion of ZURL entries for a specific host for each host that
11 years ago
Michael Peter Christen 765943a4b7 Redesign of crawler identification and robots steering. A non-p2p user
11 years ago
Michael Peter Christen e879b97b0a added line to enhance debugging
11 years ago
Michael Peter Christen 76afcccaaf fix for default boolean post values: the default value MUST NOT be TRUE,
11 years ago
Michael Peter Christen 4c242f9af9 always use a default value for boolean options to have transparency for
11 years ago
orbiter 9c681cc00d added segment sizes, postprocessing status and cpu load to crawler
11 years ago
Roland Haeder 841a28ae76 Added 'final' for all exception blocks as this helps the Java compiler
11 years ago
Michael Peter Christen 89c0aa0e74 added collection_sxt to error documents
11 years ago