Commit Graph

40 Commits (81737dcb1810374d54cff7810b404b4009f0182b)

Author SHA1 Message Date
Michael Peter Christen 659178942f - Redesigned crawler and parser to accept embedded links from the NOLOAD
13 years ago
Michael Peter Christen a1a5b015d8 refactoring: moved document Classification to cora package
13 years ago
Roland 'Quix0r' Haeder fa08ed5ae5 Fixed a lot of CHMOD rights (no need for execute flag on *.java/*.html) and introduced local/remote crawl size ratio based check
13 years ago
orbiter 49e5ca579f added new configuration property "crawler.embedLinksAsDocuments". If this is switched on (this is default now), then all embedded image, audio and video links from all parsed documents are added to the search index as individual documents. This will increase the search index size dramatically but will also enable us to create a much faster image, audio and video search. If the flag is switched on, the index entries are also stored to a solr index, if this is also enabled.
14 years ago
f1ori 41e146116a fixes size of document in case the server doesn't give the size in the header
14 years ago
sixcooler ce248cc8dd less byte-arrays of response-content, less byte-array <-> stream conversion
14 years ago
orbiter 4bea3f9714 hack to reduce resource contention caused by massive UTF8 decodings which use java.nio resources:
14 years ago
orbiter cb1f49d0f2 replaced all 'new String' with default encoding (missing) or UTF-8 encoding with a String generation method that uses a pre-defined Charset constant for UTF-8. This avoids a cache-lookup for the Charset object using String hashing of the String 'UTF-8'.
14 years ago
orbiter 4588b5a291 - fixed document number limitation for crawls that restrict the number of documents per domain
14 years ago
orbiter 10ae8d961b - cora package has now no dependencies to other yacy packages and becomes a 'base' package (refactoring)
14 years ago
f1ori e4aabaa1c3 * fix negative filelength for files >2G
14 years ago
f1ori ee3cef91e8 * fix filesize in ftp crawls
14 years ago
orbiter 56264dcc17 - added CamelCase parser to MultiProtocolURI: generate better to-be-indexed words from urls
14 years ago
orbiter a563b05b60 enhanced crawler:
14 years ago
f1ori 7d8de34778 * add a bit of documentation to DigestURI, use DigestURI(string) instead of DigestURI(string, null)
14 years ago
orbiter 6a166c2040 patches for bad proxy behaviour
15 years ago
orbiter 65eaf30f77 redesign of crawl profiles data structure. target will be:
15 years ago
orbiter 3197ca42ed preparations to move the HTCache into cora:
15 years ago
orbiter b6fb239e74 redesign of parser interface:
15 years ago
orbiter 777195e8d1 more abstraction for access of LoaderDispatcher and cache
15 years ago
orbiter c45117f81f fixed dates in metadata
15 years ago
orbiter 25aef069a6 continuing String-hash - to - byte[]-hash redesign that was started in SVN 6775
15 years ago
orbiter 3300930fc5 - (almost) fixed FTP crawler
15 years ago
orbiter dff4f95c78 some patches to get the torrent parser working
15 years ago
orbiter 52470d0de4 - fix for xls parser
15 years ago
orbiter 5e8038ac4d - refactoring of blacklists
16 years ago
orbiter 3528b970d6 - refactoring
16 years ago
orbiter b79f4f062f refactoring of yacy documents and parsers: they depend now only on the kelondro classes
16 years ago
orbiter e7f18ba24b refactoring
16 years ago
orbiter ce8dc575ca refactoring
16 years ago
orbiter bea3b99aff moved table and util classes
16 years ago
orbiter 6aa474f529 - better logging for web cache access and fail reasons
16 years ago
orbiter 3671c37989 added experimental oai-pmh reader and integrated it with the existing dublin core parser
16 years ago
low012 f65bfaa9af *) Removed base tag from error page. This was added by myself a long time ago as a workaround for some weird behavior of my router, but as it turns out, it does more harm than good in general: If HTTPS is used for communication with YaCy, entering a wrong password led to an error page with a form which would send username and password unencrypted with the user possibly being unaware of this.
16 years ago
orbiter 161d2fd2ef redesign of access to the HTCache (now http.client.Cache):
16 years ago
orbiter 4da9042e8a code simplification
16 years ago
orbiter 1d8d51075c refactoring:
16 years ago
orbiter 5bb8074150 removed the indexing queue. This queue was superfluous since the introduction of the blocking queues last year, where documents are parsed, analysed and stored in the index with concurrency.
16 years ago
orbiter b332dfad67 - inserted request object into response object which carries this now instead of generating new objects
16 years ago
orbiter ca72ed7526 - removed superfluous crawl cache
16 years ago