189 Commits (60dc1241a3e69561f28689c6cdfc0ba8a76c1939)

Author SHA1 Message Date
luccioman 6b45cd5799 New optional crawl filter on the URL a doc must match to crawl its links
6 years ago
luccioman fcf6b16db4 Added new crawler attribute for finer control over Media Type detection
6 years ago
luccioman 534f09e92b Added and updated hint messages about remote crawler status
6 years ago
luccioman cced94298a Added a new crawler document filter type using Solr syntax
7 years ago
Michael Christen e0dc632020 removed transformer
7 years ago
luccioman fb3032c530 Added a crawl filtering possibility on documents Media Type (MIME)
7 years ago
luccioman 519fc9a600 Issue #156 : new option to clean up (or not) search cache on crawl start
7 years ago
Michael Peter Christen 25573bd5ab added a crawl filter based on <div> tag class names
7 years ago
luccioman 8da3174867 Ensure lower case conversion consistency with any default locale.
7 years ago
Michael Peter Christen 369b8e0e0b added json(p) endpoint for crawl start
8 years ago
luccioman 89017e17e4 Converted ajax URL to relative and added a check on the response status.
8 years ago
reger 395f2e8946 Make ServletRequest implement the standardized HttpServletRequest interface,
8 years ago
reger 042c2868df del abandoned indexcleaner.html, servlet deleted with commit
8 years ago
luccioman 47af33a04c Advanced Crawl from local file : better processing of large files.
8 years ago
reger b71a60c04b fix NPE in CrawlMonitorRemoteStart servlet due to missing startURL
9 years ago
reger 45b9bd8403 adjust MultiProtocolURL.protocol detection to handle mailto with "://" in parameters,
9 years ago
Michael Peter Christen 225200194a every time a crawl is started, the user expects a different search
9 years ago
Michael Peter Christen 0a37d8af89 in case that a site crawl is started for urls with file:// path, the
9 years ago
Michael Peter Christen 3c4c69adea fix for
10 years ago
Michael Peter Christen 6c2e6f1f37 remove redundant code
10 years ago
Michael Peter Christen 97930a6aad added must-not-match filter to snapshot generation.
10 years ago
Michael Peter Christen fed26f33a8 enhanced timezone management for indexed data:
10 years ago
Michael Peter Christen 710a0efa1b generalized time period computations
10 years ago
Michael Peter Christen b5ac29c9a5 added a html field scraper which reads text from html entities of a
10 years ago
reger 0260d3d800 Allow to hide linkstructure graphic in crawl monitor
10 years ago
Michael Peter Christen 5d4167f977 reactivated clear stacks code for termination of all crawls because this
10 years ago
Michael Peter Christen 8600ea01dd automatically switch on query option in case intranet protocols (smb/ftp)
10 years ago
Michael Peter Christen 8df8ffbb6d enhanced the snapshot functionality:
10 years ago
Michael Peter Christen a95af11050 enhancement for clearing the crawl queue
10 years ago
Michael Peter Christen 97f6089a41 YaCy can now create web page snapshots as pdf documents which can later
10 years ago
Michael Peter Christen ad0da5f246 added new web page snapshot infrastructure which will lead to the
10 years ago
Michael Peter Christen 8c1a89cb34 added another decoration flag to switch off network graphics in crawler
10 years ago
Michael Peter Christen 9bc3e457dd fix for termination of all crawls
10 years ago
Michael Peter Christen 542c20a597 changed handling of crawl profile field crawlingIfOlder: this should be
10 years ago
Michael Peter Christen 2de159719b added an option to set 'obey nofollow' for links with rel="nofollow"
10 years ago
Michael Peter Christen f23c4142e0 added option to configure a custom user agent within allip networks
11 years ago
reger ca5437dd50 fix crawl of file:// , also http://mantis.tokeek.de/view.php?id=149
11 years ago
reger 1b37b12998 fix: CrawlStartExpert.html # From File with missing filename
11 years ago
orbiter c6f0bd05f8 better removal of stored urls when doing a crawl start
11 years ago
orbiter 469e0a62f1 added new button to terminate all crawls
11 years ago
Michael Peter Christen 10cf8215bd added crawl depth for failed documents
11 years ago
Michael Peter Christen da86f150ab - added a new Crawler Balancer: HostBalancer and HostQueues:
11 years ago
Michael Peter Christen 6bd8c6f195 fix for wrong status codes of error pages
11 years ago
Michael Peter Christen 9e503b3376 also delete the robots.txt file from the cache when a new crawl is
11 years ago
Michael Peter Christen 1c21b3256d fix for robots.txt handling: delete old entry before starting a new
11 years ago
Michael Peter Christen a6bb9be97e - added d3.js for visualizations using embedded svg
11 years ago
Michael Peter Christen bd54b85d46 fix for relative sitemap urls
11 years ago
reger d052bbdfe1 prevent exception on Site Crawl if no start url is given
11 years ago
Michael Peter Christen a86c2fe77d fixed usage of media flag when started by automated process
11 years ago
Michael Peter Christen 6ada0daae9 making latency_factor and maximum number of same hosts in loader queue
11 years ago