286 Commits (76cc809f2a2afdfdfa9ff1b36e0a0e0ca3903b33)

Author SHA1 Message Date
Michael Peter Christen ab6cc3c88c added concurrent generation of snapshot pdfs
10 years ago
Michael Peter Christen 8df8ffbb6d enhanced the snapshot functionality:
10 years ago
Michael Peter Christen 4fe4bf29ad added rss feed output to snapshot servlet which can be used to get a
10 years ago
reger 568c991405 remove the unused Request variable
10 years ago
reger ff18129def ViewFile servlet: update index if newer,
10 years ago
Michael Peter Christen 226aea5914 added a servlet which can create preview images, preview thumbnails and
10 years ago
Michael Peter Christen e586e423aa in case that loading from the cache fails, load from wkhtmltopdf without
10 years ago
Michael Peter Christen 25a64c51b3 moved snapshot generation out of the html handler to prevent that
10 years ago
Michael Peter Christen 97f6089a41 YaCy can now create web page snapshots as pdf documents which can later
10 years ago
Michael Peter Christen ad0da5f246 added new web page snapshot infrastructure which will lead to the
10 years ago
Michael Peter Christen 84763126e0 added option to make the YaCy proxy act as the cache is never stale. If
10 years ago
Michael Peter Christen a39419f2ef more stacks shall be considered for on-demand loading, not only
10 years ago
Michael Peter Christen 5bb52f79be reduce number of calls to queue.size() because that may be a bottleneck
10 years ago
Michael Peter Christen a34f837592 better delete all files in path when removing host crawl stack
10 years ago
Michael Peter Christen 10b1db430a if we have many hosts, use on-demand earlier
10 years ago
Michael Peter Christen 6983dff334 explain crawl denial when not switched to intranet mode
10 years ago
Michael Peter Christen d8beafba3a fix for values in CrawlProfileEditor table and xml; now the full profile
10 years ago
Michael Peter Christen ec95dfa2e6 fixed crawl profile xml result which did not show the correct crawl
10 years ago
Michael Peter Christen 9b1958e8ca more ipv6 bugfixes
10 years ago
Michael Peter Christen e1bc768f9d more IPv6 bugfixes
10 years ago
reger fb1fcc2b03 handle noarchive tag, skip writing page to cache
10 years ago
Michael Peter Christen 6491270b3a large IPv6 redesign of peer ping methods!
10 years ago
Michael Peter Christen 67cd4c37bd activated the new apk parser which was already ready but not included in
11 years ago
Michael Peter Christen 025516f682 fix for crawl limit for number of pages fail
11 years ago
orbiter 3ac31614a3 added option to reverse-sort YaCy tables (internal API change only)
11 years ago
Michael Peter Christen bf18a39d0e replaced warning with info
11 years ago
Michael Peter Christen ebd0be2cea fixes and speed updates for search process
11 years ago
Michael Peter Christen a7dd89c4de changed method to write the citation index: do not catch up references
11 years ago
orbiter 4ae7aead28 addon to latest fix
11 years ago
Michael Peter Christen eca9380e3d bugfix for crawler double-check: if an url is redirected, the
11 years ago
Michael Peter Christen 9ac0c93f17 fix for subpath crawl filter
11 years ago
Michael Peter Christen 66106bdaf0 fix for crawler attribute maxdompages
11 years ago
Michael Peter Christen 49d91b94c3 npe fix in crawler
11 years ago
Michael Peter Christen c465b791af typo
11 years ago
Michael Peter Christen 3c23b89823 less logging
11 years ago
Michael Peter Christen 1609763be5 toString fix
11 years ago
Michael Peter Christen 001e05bb80 do not store failure of loading of robots.txt into the index as a fail
11 years ago
Michael Peter Christen 05d58e4df0 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git
11 years ago
Michael Peter Christen 98f45c9032 fix for image alt attachment to AnchorURLs in html parser.
11 years ago
orbiter 22ce4fb4dd better error handling for remote solr queries and exists-checks
11 years ago
orbiter e9163e7e10 fix for malformed hostpath names in crawl balancer
11 years ago
Michael Peter Christen 6e1dc444c3 added a snippet test function in ViewFile: you can now search for a
11 years ago
orbiter 4b06adb751 fix for file urls
11 years ago
Michael Peter Christen 542c20a597 changed handling of crawl profile field crawlingIfOlder: this should be
11 years ago
Michael Peter Christen 4eec1a7452 refactoring (change Metadata name of load time data structure to avoid
11 years ago
Michael Peter Christen 2de159719b added an option to set 'obey nofollow' for links with rel="nofollow"
11 years ago
Michael Peter Christen b5fc2b63ea removed exist() retrieval functions from error cache and replaced it
11 years ago
Michael Peter Christen 62c72360ee cleanup of checkAcceptanceInitially in CrawlStacker, should avoid
11 years ago
Michael Peter Christen b5d78ba156 reduced number of solr queries during crawling
11 years ago
Michael Peter Christen 06ab72d1af enhanced crawler host round-robin strategy
11 years ago
Michael Peter Christen 49886fab08 enhanced debugging
11 years ago
Michael Peter Christen b893c42a0f bugfix for image search
11 years ago
Michael Peter Christen 74c249288a added a push api to make it possible to upload files directly without
11 years ago
Michael Peter Christen ba6ffddefc refactoring
11 years ago
reger 92d1604a31 Crawler hostbalancer does not delete finished queue files,
11 years ago
orbiter d7d38f9135 made number of open files in crawler configurable and increased default
11 years ago
reger ca5437dd50 fix crawl of file:// , also http://mantis.tokeek.de/view.php?id=149
11 years ago
orbiter 97983ba89f fixed generics warnings for generic array instantiation that appeared
11 years ago
reger 1600414450 fix NPE on continuing crawls after YaCy restart
11 years ago
Michael Peter Christen c1c1be8f02 fix for slow crawling and better logging in balancer
11 years ago
Michael Peter Christen 3acf416335 npe fix
11 years ago
orbiter 2f63bd0261 enhanced Host Balancer strategy: fair round robin
11 years ago
Michael Peter Christen 8b32dd5f9e special strategy for balancer: do not remove targets with zero wait time
11 years ago
Michael Peter Christen 9c6228d948 fix for deadlocks in crawler
11 years ago
Michael Peter Christen 10cf8215bd added crawl depth for failed documents
11 years ago
Michael Peter Christen 06afb568e2 new Strategies in Balancer:
11 years ago
Michael Peter Christen da86f150ab - added a new Crawler Balancer: HostBalancer and HostQueues:
11 years ago
Michael Peter Christen 075b6f9278 refactoring of the crawl balancer: the balancer is turned into an
11 years ago
Michael Peter Christen 6bd8c6f195 fix for wrong status codes of error pages
11 years ago
Michael Peter Christen 9e503b3376 also delete the robots.txt file from the cache when a new crawl is
11 years ago
Michael Peter Christen 1c21b3256d fix for robots.txt handling: delete old entry before starting a new
11 years ago
Michael Peter Christen 926d28dd3f fixed a bug which prevented crawl starts after a network switch
11 years ago
Michael Peter Christen d4b5c457e4 NPE fix
11 years ago
Michael Peter Christen 8b44fcf0f4 added missing @Override annotation
11 years ago
Michael Peter Christen 85a427ec54 support for multiple sitemaps in robots.txt
11 years ago
Michael Peter Christen b08375da33 fix for bad/missing values of size_i
11 years ago
reger dd5bf0b71b cleanup old reference to HTTPDemon.setAlternativeResolver
11 years ago
Michael Peter Christen e485fbd0ce - let crawl loader jobs die after 10 seconds without new jobs
11 years ago
Michael Peter Christen bcd9dd9e1d enhanced concurrent loading by using a fixed set of concurrent loader
11 years ago
Michael Peter Christen 6ed9c0164e attaching names to all Threads to get a better view in profiling tools
11 years ago
Michael Peter Christen fdaeac374a - enhanced postprocessing speed and memory footprint (by using HashMaps
11 years ago
orbiter da5d4128bf prevent npe
11 years ago
orbiter a878c7982c prevent npe
11 years ago
orbiter ced1a96f9c fixed error cache
11 years ago
Michael Peter Christen 69391e5d9e changed strategy to test existence of documents in Solr: using the
11 years ago
Michael Peter Christen 8b14e92ba4 added button in host browser to re-load 404/failed documents
11 years ago
Michael Peter Christen 6ada0daae9 making latency_factor and maximum number of same hosts in loader queue
11 years ago
Michael Peter Christen 0168f80c28 new crawling factors can now be changed during runtime
11 years ago
Michael Peter Christen 77531850b5 reverted crawling strategy from latest commit.
11 years ago
Michael Peter Christen c0da966dfa enhanced crawler speed
11 years ago
Michael Peter Christen 0d235a565b cleanup crawl loader jobs
11 years ago
Michael Peter Christen 1ea17bd9f3 - removed old metadata database and all migration code
11 years ago
Michael Peter Christen 022c6d3ce1 do YaCy p2p connections using a timeout-request which covers the http
11 years ago
reger 28eae57e8b spend CrawlQueues a fremem routine
11 years ago
reger 6932aa4d7a use configured admin-username for api calls
11 years ago
orbiter 3cb6c7861f fixed shutdown authentication problem
11 years ago
orbiter f3ac923a7e ftp client shall be able to open non-anonymous ftp servers if login
11 years ago
Michael Peter Christen 82c0525e71 wrong logger fix
11 years ago
Michael Peter Christen 552ef9f18e fix for bad ErrorCache.exists test (bug from latest commit)
11 years ago
Michael Peter Christen 303f5694ba avoid usage of existsByQuery. If a document can be loaded by the ID
11 years ago