Commit Graph

82 Commits (4eddabee4221f0d8deec55a157b520e1e1c140ce)

Author SHA1 Message Date
reger 06d0e2aeb9 result heuristic (also used in greedy learning mode) to use outbound links if result is full index doc. Otherwise use default loader methode.
9 years ago
sixcooler 5cb7ba0dc4 fix for connections not getting closed to get favicon.ico during seach
9 years ago
luc 571bc55937 Refactoring : use StandardCharsets constants instead of hard-coded
9 years ago
reger b7e8358645 make use of header.getContentType where possible (mime is normalized afterwards)
9 years ago
luc 755efac17d Use same max file size when loading all resource bytes or opening stream
9 years ago
luc f01d49c37a Process large or local file images dealing directly with content
9 years ago
luc 5bbb2e1730 Ensure resource is closed when reading a full file InputStream
9 years ago
reger fa08ca207e ! finish running crawls before applying !
9 years ago
reger 141cd80456 correct log msg text
10 years ago
Michael Peter Christen fed26f33a8 enhanced timezone managament for indexed data:
10 years ago
Michael Peter Christen b5ac29c9a5 added a html field scraper which reads text from html entities of a
10 years ago
Michael Peter Christen 783cf6fbc7 the LinkedBlockingQueue is much faster than the ArrayBlockingQueue
10 years ago
Michael Peter Christen 8c3e5b7b6d added experimental pdf splitting which enables YaCy to split pdfs during
10 years ago
Michael Peter Christen 28683530cd fixes to usage of no-cache: use and recognize also the no-store
10 years ago
reger 568c991405 remove the unused Request variable
10 years ago
reger ff18129def ViewFile servlet: update index if newer,
10 years ago
Michael Peter Christen 25a64c51b3 moved snapshot generation out of the html handler to prevent that
10 years ago
Michael Peter Christen 97f6089a41 YaCy can now create web page snapshots as pdf documents which can later
10 years ago
Michael Peter Christen ad0da5f246 added new web page snapshot infrastructure which will lead to the
10 years ago
Michael Peter Christen 84763126e0 added option to make the YaCy proxy act as the cache is never stale. If
10 years ago
Michael Peter Christen 67cd4c37bd activated the new apk parser which was already ready but not included in
10 years ago
Michael Peter Christen ebd0be2cea fixes and speed updates for search process
10 years ago
Michael Peter Christen eca9380e3d bugfix for crawler double-check: if an url is redirected, the
10 years ago
orbiter 22ce4fb4dd better error handling for remote solr queries and exists-checks
10 years ago
orbiter 4b06adb751 fix for file urls
10 years ago
Michael Peter Christen 2de159719b added an option to set 'obey nofollow' for links with rel="nofollow"
10 years ago
Michael Peter Christen b893c42a0f bugfix for image search
11 years ago
Michael Peter Christen ba6ffddefc refactoring
11 years ago
Michael Peter Christen 10cf8215bd added crawl depth for failed documents
11 years ago
Michael Peter Christen da86f150ab - added a new Crawler Balancer: HostBalancer and HostQueues:
11 years ago
Michael Peter Christen 6bd8c6f195 fix for wrong status codes of error pages
11 years ago
Michael Peter Christen b08375da33 fix for bad/missing values of size_i
11 years ago
reger dd5bf0b71b cleanup old reference to HTTPDemon.setAlternativeResolver
11 years ago
Michael Peter Christen bcd9dd9e1d enhanced concurrent loading by using a fixed set of concurrent loader
11 years ago
Michael Peter Christen 69391e5d9e changed strategy to test existence of documents in Solr: using the
11 years ago
Michael Peter Christen 8b14e92ba4 added button in host browser to re-load 404/failed documents
11 years ago
Michael Peter Christen 022c6d3ce1 do YaCy p2p connections using a timeout-request which covers the http
11 years ago
reger 6932aa4d7a use configured admin-username for api calls
11 years ago
orbiter 3cb6c7861f fixed shutdown authenticaton problem
11 years ago
orbiter f3ac923a7e ftp client shall be able to open non-anonymous ftp servers if login
11 years ago
reger fd119deb00 fix NPE on modified since check ( Response.requestHeader allowed to be null)
11 years ago
Michael Peter Christen 91a875dff5 self-healing of mistakenly deactivated crawl profiles. This fixes a bug
11 years ago
Michael Peter Christen 2602be8d1e - removed ZURL data structure; removed also the ZURL data file
11 years ago
Michael Peter Christen 61c5e40687 - replaced the properties object in AnchorURL with distinct variables
11 years ago
Michael Peter Christen 5e31bad711 - the webgraph shall store all links which appear on a web page and not
11 years ago
Michael Peter Christen dbef8ccfcb forced deletion of ZURL entries for a specific host for each host that
11 years ago
orbiter 26366596d9 fix for a problem which ocurres when a site is crawled where the start
11 years ago
Michael Peter Christen a88a62f7aa added a feature to set a collection for a crawl result based on a
11 years ago
Michael Peter Christen 765943a4b7 Redesign of crawler identification and robots steering. A non-p2p user
11 years ago
reger 2b7a38640a extend content type detection on file extension for .tif .tiff .htm
11 years ago