32 Commits (117a85987989210f3b3295778e12bbaf2f5cd733)

Author SHA1 Message Date
luccioman 7baa99f26f Fixed stored URL in web cache when redirection(s) occurs.
7 years ago
Michael Peter Christen 25573bd5ab added a crawl filter based on <div> tag class names
7 years ago
luccioman 11a7f923d4 Distinguish response parsing failures from unexpected exceptions.
8 years ago
luccioman 8da3174867 Ensure lower case conversion consistency with any default locale.
8 years ago
luccioman a9cb083fa1 Improved consistency between loader openInputStream and load functions
8 years ago
luccioman b1da92648e Fixed surrogates import monitoring page (/CrawlResults.html?process=7)
8 years ago
reger ce87025462 further avoid to set connect info properties as header value
8 years ago
reger c50e23c495 reduce creation of empty legacy RequestHeader() in situation where null
8 years ago
Michael Peter Christen 5e165a8150 removed unused imports
9 years ago
reger 06d0e2aeb9 result heuristic (also used in greedy learning mode) to use outbound links if result is full index doc. Otherwise use default loader method.
9 years ago
luc 571bc55937 Refactoring : use StandardCharsets constants instead of hard-coded
9 years ago
reger b7e8358645 make use of header.getContentType where possible (mime is normalized afterwards)
9 years ago
luc 755efac17d Use same max file size when loading all resource bytes or opening stream
9 years ago
Michael Peter Christen fed26f33a8 enhanced timezone management for indexed data:
10 years ago
Michael Peter Christen b5ac29c9a5 added a html field scraper which reads text from html entities of a
10 years ago
Michael Peter Christen 28683530cd fixes to usage of no-cache: use and recognize also the no-store
10 years ago
Michael Peter Christen 84763126e0 added option to make the YaCy proxy act as the cache is never stale. If
10 years ago
Michael Peter Christen b893c42a0f bugfix for image search
11 years ago
Michael Peter Christen ba6ffddefc refactoring
11 years ago
Michael Peter Christen da86f150ab - added a new Crawler Balancer: HostBalancer and HostQueues:
11 years ago
Michael Peter Christen b08375da33 fix for bad/missing values of size_i
11 years ago
reger fd119deb00 fix NPE on modified since check ( Response.requestHeader allowed to be null)
12 years ago
Michael Peter Christen 61c5e40687 - replaced the properties object in AnchorURL with distinct variables
12 years ago
Michael Peter Christen 5e31bad711 - the webgraph shall store all links which appear on a web page and not
12 years ago
reger 2b7a38640a extend content type detection on file extension for .tif .tiff .htm
12 years ago
Michael Peter Christen 16d1d744fa added url_file_name_s in default collection schema for the file name
12 years ago
Michael Peter Christen 252bb51f98 fix for wrong mime type in noload crawler
12 years ago
reger 276e63401e small sanitary fixes
12 years ago
Michael Peter Christen d6b82840f8 added a feature to find similarities in documents.
12 years ago
Michael Peter Christen 4a14122ba7 in case that a crawl profile has a collection assigned, use the
12 years ago
Michael Peter Christen a06930662c replaced some more .getBytes() with UTF8/ASCII.getBytes()
12 years ago
Michael Peter Christen 00c1c777fa refactoring
13 years ago