Commit Graph

49 Commits (5b3acc12cd4b4343c4e7d7f0a20a1da8ea8d5f6a)

Author SHA1 Message Date
Michael Peter Christen e7e381d110 added configuration to switch off redirection following in crawler
13 years ago
Michael Peter Christen 659178942f - Redesigned crawler and parser to accept embedded links from the NOLOAD
13 years ago
orbiter aa322bc6d0 fix
13 years ago
orbiter f183d3822c added a default accept header in http requests since some http fraud detection functions check that this header field exist
13 years ago
orbiter d2ea250d99 refactoring:
14 years ago
sixcooler 59b767eebd stop loading via http at defined maximum of bytes - even size is unknown before loading
14 years ago
orbiter 10e2f588f8 - enhanced ybr ranking computation
14 years ago
orbiter 6fa439c82b - refactoring of robots
14 years ago
orbiter d8e934c085 better abstraction of http client identification
14 years ago
orbiter b77b8cac0c - enhanced html parser: recognized much more details in the content
14 years ago
orbiter 96c32e87b0 fixes to crawler and new user-agent crawl-delay handling
14 years ago
orbiter 4588b5a291 - fixed document number limitation for crawls that restrict the number of documents per domain
14 years ago
f1ori 9d2159582f * fix system update if urls are in blacklist (for example for very general blacklists like *.de)
14 years ago
f1ori 741a87a3e9 * make .yacy-domains crawlable (.yacy-domains are local domains, so only in custom networks/peers)
14 years ago
f1ori dca9e16f51 * don't index pages, which redirect, twice
14 years ago
orbiter 2c549ae341 fixed a number of small bugs:
15 years ago
orbiter d2fd93135c - moved yacybot user agent string definition to MultiProtocolURI since there are basic access mechanisms where the bot string is needed
15 years ago
orbiter 5870b13f3a - code cleanup / added debug line for further investigation in HTTPDemon.parseMultipart
15 years ago
sixcooler 17eebd4ef8 counting crawler traffic again:
15 years ago
orbiter 65eaf30f77 redesign of crawl profiles data structure. target will be:
15 years ago
orbiter 3197ca42ed preparations to move the HTCache into cora:
15 years ago
orbiter 844f158686 - removed dependencies in header framework:
15 years ago
orbiter 90531f78ff refactoring of the cora package to get subpackages for http and ftp (smb to come)
15 years ago
sixcooler a6ed6e8cb9 ... migrating to HttpComponents-Client-4.x ...
15 years ago
sixcooler 15e8c13526 ... migrating to HttpComponents-Client-4.x ...
15 years ago
orbiter 87087f12fe - scanned remote search process and enhanced some data structure and synchronizations here and there
15 years ago
orbiter 3f93a0cc8f redesign of remote proxy settings
15 years ago
orbiter 11639aef35 - added new protocol loader for 'file'-type URLs
15 years ago
orbiter 2126c03a62 - removed download-limit that can be given for the crawler for non-crawler download tasks. This was necessary because the same procedure was used for other downloads like for the download of dictionary files where a limit is not useful. The limit still stays for the indexer
15 years ago
orbiter 25aef069a6 continuing String-hash - to - byte[]-hash redesign that was started in SVN 6775
15 years ago
orbiter 3300930fc5 - (almost) fixed FTP crawler
15 years ago
orbiter 2d8f3ee301 some performance hacks
15 years ago
orbiter a0e891c63d - some redesign in UI menu structure to make room for new 'Content Integration' main menu containing import servlets for Wikimedia Dumps, phpbb3 forum imports and OAI-PMH imports
15 years ago
orbiter 5e8038ac4d - refactoring of blacklists
15 years ago
orbiter 3528b970d6 - refactoring
15 years ago
orbiter b79f4f062f refactoring of yacy documents and parsers: they depend now only on the kelondro classes
16 years ago
orbiter e7f18ba24b refactoring
16 years ago
orbiter ce8dc575ca refactoring
16 years ago
orbiter f677d534b1 start of a really extensive refactoring which will produce a hierarchical package structure with the domain yacy.net as package root
16 years ago
orbiter 735e2737e3 * added index segments
16 years ago
orbiter 3671c37989 added experimental oai-pmh reader and integrated it with the existing dublin core parser
16 years ago
low012 f65bfaa9af *) Removed base tag from errror page. This has been added by myself a long time ago as a workaround for some weird behavior of my router, but as it turns out, it does more bad than good in general: If HTTPS is used for communication with YaCy, entering a wrong passwort led to an errror page with a form which would send username and password unencrypted with the user possibly being unaware of this.
16 years ago
orbiter 3e9dcfc204 fix for http://forum.yacy-websuche.de/viewtopic.php?p=17504#p17504
16 years ago
orbiter 44579fa06d - fixed a problem loading images through yacy's document loader,
16 years ago
orbiter 161d2fd2ef redesign of access to the HTCache (now http.client.Cache):
16 years ago
orbiter 4da9042e8a code simplification
16 years ago
orbiter 1d8d51075c refactoring:
16 years ago
orbiter b332dfad67 - inserted request object into response object which carries this now instead generating new objects
16 years ago
orbiter ca72ed7526 -removed superfluous crawl cache
16 years ago