122 Commits (eb78388a98cdcc407ca705e5ebf84fb5f66e1f1c)

Author SHA1 Message Date
Michael Peter Christen 932faafffe reactivated on-demand snapshot loading
10 years ago
Michael Peter Christen 2362ad7c34 fix for a count issue in snapshot api
10 years ago
Michael Peter Christen 9971e197e0 Added a transaction interface to the snapshots: all documents in the
10 years ago
Michael Peter Christen 66b5a56976 Added and integrated new date detection class which can identify date
10 years ago
Michael Peter Christen ab6cc3c88c added concurrent generation of snapshot pdfs
10 years ago
Michael Peter Christen 8df8ffbb6d enhanced the snapshot functionality:
10 years ago
Michael Peter Christen 4fe4bf29ad added rss feed output to snapshot servlet which can be used to get a
10 years ago
reger ff18129def ViewFile servlet: update index if newer,
10 years ago
Michael Peter Christen 226aea5914 added a servlet which can create preview images, preview thumbnails and
10 years ago
Michael Peter Christen e586e423aa in case that loading from the cache fails, load from wkhtmltopdf without
10 years ago
Michael Peter Christen 97f6089a41 YaCy can now create web page snapshots as pdf documents which can later
10 years ago
Michael Peter Christen ad0da5f246 added new web page snapshot infrastructure which will lead to the
10 years ago
Michael Peter Christen 5bb52f79be reduce number of calls to queue.size() because that may be a bottleneck
10 years ago
Michael Peter Christen d8beafba3a fix for values in CrawlProfileEditor table and xml; now the full profile
10 years ago
Michael Peter Christen ec95dfa2e6 fixed crawl profile xml result which did not show the correct crawl
10 years ago
Michael Peter Christen e1bc768f9d more IPv6 bugfixes
10 years ago
reger fb1fcc2b03 handle noarchive tag, skip writing page to cache
10 years ago
Michael Peter Christen 9ac0c93f17 fix for subpath crawl filter
11 years ago
Michael Peter Christen 66106bdaf0 fix for crawler attribute maxdompages
11 years ago
Michael Peter Christen 98f45c9032 fix for image alt attachment to AnchorURLs in html parser.
11 years ago
Michael Peter Christen 542c20a597 changed handling of crawl profile field crawlingIfOlder: this should be
11 years ago
Michael Peter Christen 2de159719b added an option to set 'obey nofollow' for links with rel="nofollow"
11 years ago
Michael Peter Christen b5fc2b63ea removed exist() retrieval functions from error cache and replaced it
11 years ago
Michael Peter Christen 62c72360ee cleanup of checkAcceptanceInitially in CrawlStacker, should avoid
11 years ago
Michael Peter Christen b5d78ba156 reduced number of solr queries during crawling
11 years ago
orbiter d7d38f9135 made number of open files in crawler configurable and increased default
11 years ago
Michael Peter Christen 9c6228d948 fix for deadlocks in crawler
11 years ago
Michael Peter Christen 10cf8215bd added crawl depth for failed documents
11 years ago
Michael Peter Christen da86f150ab - added a new Crawler Balancer: HostBalancer and HostQueues:
11 years ago
Michael Peter Christen 075b6f9278 refactoring of the crawl balancer: the balancer is turned into an
11 years ago
Michael Peter Christen 6bd8c6f195 fix for wrong status codes of error pages
11 years ago
Michael Peter Christen 926d28dd3f fixed a bug which prevented crawl starts after a network switch
11 years ago
Michael Peter Christen d4b5c457e4 NPE fix
11 years ago
Michael Peter Christen b08375da33 fix for bad/missing values of size_i
11 years ago
Michael Peter Christen e485fbd0ce - let crawl loader jobs die after 10 seconds without new jobs
11 years ago
Michael Peter Christen bcd9dd9e1d enhanced concurrent loading by using a fixed set of concurrent loader
11 years ago
Michael Peter Christen 6ed9c0164e attaching names to all Threads to get a better view in profiling tools
11 years ago
Michael Peter Christen fdaeac374a - enhanced postprocessing speed and memory footprint (by using HashMaps
11 years ago
orbiter ced1a96f9c fixed error cache
11 years ago
Michael Peter Christen 6ada0daae9 making latency_factor and maximum number of same hosts in loader queue
11 years ago
Michael Peter Christen 0168f80c28 new crawling factors can now be changed during runtime
11 years ago
Michael Peter Christen 77531850b5 reverted crawling strategy from latest commit.
11 years ago
Michael Peter Christen c0da966dfa enhanced crawler speed
11 years ago
Michael Peter Christen 0d235a565b cleanup crawl loader jobs
11 years ago
reger 28eae57e8b spend CrawlQueues a fremem routine
11 years ago
Michael Peter Christen 1a4a69c226 set more logger to 'final static'
11 years ago
Michael Peter Christen 87a956e881 calculating and showing the number of files and the average size of a
11 years ago
orbiter 20bbde8665 fix for mustmatch regex computation: result had correct semantic, but
11 years ago
Michael Peter Christen 82bfd9e00a - crawl profiles shall be deleted from active and passive stacks if they
12 years ago
Michael Peter Christen 91a875dff5 self-healing of mistakenly deactivated crawl profiles. This fixes a bug
12 years ago