Commit Graph

66 Commits (28b30231c32731e393133430a81b65b1d36cc1f0)

Author SHA1 Message Date
Michael Peter Christen b1e7c11fba fix for pattern matcher in html parser
13 years ago
orbiter 7f851d62a7 replaced HashARC with SizeLimited Objects which are less costly
13 years ago
orbiter 78fc3cf8f8 refactoring and new usage of SentenceReader: this class appeared as one
13 years ago
Michael Peter Christen ad09b786bf clean up parser data
13 years ago
Michael Peter Christen 276a66a793 Adding a limit of 1000 links that a parser shall store during indexing.
13 years ago
Michael Peter Christen de903a53a0 parser refactoring & hacks
13 years ago
Michael Peter Christen 508a81b86c added solr field 'refresh_s' which stores the refresh url contained in
13 years ago
Michael Peter Christen f3167def64 do not fill the keywords with title content if keywords do not exist.
13 years ago
Michael Peter Christen 77f795756c fixing redirects and status codes: storing of status code in
13 years ago
Michael Peter Christen be928815fc fixed wrong parsing of style and script
13 years ago
Michael Peter Christen 0284a4d88f more fixes for double precision of coordinates
13 years ago
Michael Peter Christen 9b4c699526 ehanced location search:
13 years ago
Michael Peter Christen c15fcde1c8 add-on to latest commit
13 years ago
Michael Peter Christen ba6aaabc51 refactoring + parser bugfixes
13 years ago
Michael Peter Christen 453010bd68 - solved problems with backpath normalization
13 years ago
Michael Peter Christen 8d63a5887c bugfixes
13 years ago
Michael Peter Christen 9ad1d8dde2 complete redesign of crawl queue monitoring: do not look at a
13 years ago
Michael Peter Christen 7e4e3fe5b6 free some memory after parsing html
13 years ago
Michael Peter Christen 4540174fe0 memory hacks
13 years ago
Michael Peter Christen b7bb84c0bb set a limit to CharBuffer object size to fight against bad/too large
13 years ago
Michael Christen c04bfaa51b refactoring
13 years ago
Michael Christen 1f4afb4dc0 performance hacks
13 years ago
Al Sutton 8993cac4d8 Initial performance improvements
13 years ago
orbiter 5a55397f99 some last-minute performance hacks
13 years ago
orbiter 1c007188ad bugfixes in html parser
14 years ago
orbiter 5dd2efc9a2 - bugfixes in html parser
14 years ago
orbiter 51cf697acd refactoring: moved all score-related classes to new ranking package
14 years ago
orbiter 299af4943c added another memory protection hack
14 years ago
orbiter bda3eec0ff added parsing of canonical link element to html parser
14 years ago
orbiter 9706fc55aa enhanced content scraper (should discover urls much faster in case of very large plain texts)
14 years ago
orbiter 0c1b29f3c9 - applied many small performance hacks
14 years ago
orbiter 3ed4a09368 small features, some bug fixes and performance hacks
14 years ago
orbiter 021840e5ba removed (almost) deadlocks and unnecessary CPU load
14 years ago
orbiter 4e8fa03514 added more attributes to html evaluation
14 years ago
orbiter f6077b3cc0 added more attributes for html parser and enhanced data structures
14 years ago
orbiter b77b8cac0c - enhanced html parser: recognized much more details in the content
14 years ago
orbiter 3d5104d357 - fixed a bug in crawl start with file name (npe in new url)
14 years ago
orbiter 958ff4778e enhanced location search:
14 years ago
orbiter 0430a94eaa the location search shows now not re-evaluated locations but only such locations that are attached as metadata to web pages
14 years ago
orbiter 9b25d07295 - added geo information parsing to html parser
14 years ago
orbiter 78d4c45d09 enhancement during search process: fast fail of search in case that all index feeder have terminated.
14 years ago
orbiter 30aed9824a moved getBytes() to UTF8.getBytes() to use a default String encoding
14 years ago
orbiter cb1f49d0f2 replaced all 'new String' with default encoding (missing) or UTF-8 encoding with a String generation method that uses a pre-defined Charset constant for UTF-8. This avoids a cache-lookup for the Charset object using String hashing of the String 'UTF-8'.
14 years ago
orbiter e717bf74ba more logging, more care about OOMs
14 years ago
orbiter 4588b5a291 - fixed document number limitation for crawls that restrict the number of documents per domain
14 years ago
low012 3d95981f7d *) cleaning up the code a little bit
14 years ago
orbiter 9b25a33fd9 - fixed numerous bugs
14 years ago
orbiter 56264dcc17 - added CamelCase parser to MultiProtocolURI: generate better to-be-indexed words from urls
14 years ago
orbiter c36da90261 added a very fast ftp file list generator to site crawler:
14 years ago
f1ori a025b1da89 * fix bug when browsing local filesystem (e. g. repository) with yacy
14 years ago