Commit Graph

203 Commits (f3cc818305e5cb31691af71e93ce0be7135255c2)

Author SHA1 Message Date
Michael Peter Christen bd3f2483a1 replaced url and date retrieval by only url retrieval
3 years ago
Michael Peter Christen 163ba26d90 replaced check for load time method
3 years ago
sgaebel 26223dc25a replaces getLoadTime() by exists() with a simpler query
4 years ago
Michael Christen cfa27d2fd5 fixed links
5 years ago
luccioman cced94298a Added a new crawler document filter type using Solr syntax
7 years ago
luccioman 929e0d6eae Replaced improper ByteBuffer.equals() implementation by Arrays.equals()
7 years ago
luccioman 8da3174867 Ensure lower case conversion consistency with any default locale.
8 years ago
luccioman 6a4d51d8f9 Cleaned up some Javadoc warnings.
8 years ago
reger 8fe28a83f2 harmonize used lastmodified date for rwi and fulltext in storeDocument
8 years ago
luccioman 7263d17436 Removed mentions of deprecated LURL-db.
8 years ago
reger 4c7a77662a eliminate dependency on file-extension in storeDocument but use supported mime-type
9 years ago
reger 6f0b073bf3 override detected language (statistic langdetect) only with TLD determined
9 years ago
Michael Peter Christen ef8cd80593 fix for npe
9 years ago
reger ca3d26a401 harmonize wordsintitle & CollectionSchema.title_words_val calculation,
9 years ago
reger d882991bc5 Implement sharing of ioDispatcher for term & citation index
10 years ago
Michael Peter Christen 97930a6aad added must-not-match filter to snapshot generation.
10 years ago
Michael Peter Christen 9d8f426890 adding a try-catch to link graph processing to prevent that a single
10 years ago
Michael Peter Christen fed26f33a8 enhanced timezone management for indexed data:
10 years ago
Michael Peter Christen b5ac29c9a5 added a html field scraper which reads text from html entities of a
10 years ago
Michael Peter Christen 7db2888336 fixed font size and print page generation in pdf snapshots
10 years ago
Michael Peter Christen 3b51636ecb fix for mediawiki import
10 years ago
Michael Peter Christen 3e6c3e2237 documents pushed over the api/push_p.html interface will have their
10 years ago
Michael Peter Christen 932faafffe reactivated on-demand snapshot loading
10 years ago
Michael Peter Christen 66b5a56976 Added and integrated new date detection class which can identify date
10 years ago
Michael Peter Christen 6a1865f507 refactoring date -> lastModified
10 years ago
Michael Peter Christen 8df8ffbb6d enhanced the snapshot functionality:
10 years ago
Michael Peter Christen 70f03f7c8e do not cache search requests to Solr if the result is used for
10 years ago
Michael Peter Christen 6a2a669db4 added loading of the synonyms file from addon/synonyms into the
10 years ago
Michael Peter Christen 0a879c98e7 added new 'firstSeen' database table and necessary data structures which
10 years ago
Michael Peter Christen 6d3d4c4ea6 changed the concurrent enumeration of query results in such a way that
11 years ago
Michael Peter Christen 81f9b34da7 increased ability to search for all images on a single server within
11 years ago
Michael Peter Christen a7dd89c4de changed method to write the citation index: do not catch up references
11 years ago
Michael Peter Christen 05d58e4df0 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git
11 years ago
Michael Peter Christen 98f45c9032 fix for image alt attachment to AnchorURLs in html parser.
11 years ago
orbiter 22ce4fb4dd better error handling for remote solr queries and exists-checks
11 years ago
orbiter 738989aab7 reverted commit f94c91315b because the
11 years ago
Michael Peter Christen f94c91315b if the webgraph is used, then use it also for reference computation to
11 years ago
Michael Peter Christen 2de159719b added an option to set 'obey nofollow' for links with rel="nofollow"
11 years ago
Michael Peter Christen 5b94a257ce no timeout for large reference collections
11 years ago
Michael Peter Christen 8ad41a882c fixed several problems with postprocessing:
11 years ago
Michael Peter Christen 53948da7d0 tried to make last_modified recognition smarter
11 years ago
Michael Peter Christen 9a5ab4e2c1 removed clickdepth_i field and related postprocessing. This information
11 years ago
Michael Peter Christen da86f150ab - added a new Crawler Balancer: HostBalancer and HostQueues:
11 years ago
orbiter c250fac9f4 linkstructure refactoring to get more options for clickdepth analysis
11 years ago
Michael Peter Christen bd886054cb new structure and enhancements for link graph computation:
11 years ago
Michael Peter Christen 63c9fcf3e0 free configuration of postprocessing clickdepth maximum depth and time
11 years ago
Michael Peter Christen 51800007c4 - added concurrency to postprocessing of webgraph document
11 years ago
Michael Peter Christen fdaeac374a - enhanced postprocessing speed and memory footprint (by using HashMaps
11 years ago
Michael Peter Christen 7640834b37 removed double concurrency to put Solr documents into the index. The
11 years ago
Michael Peter Christen 0f6b72f24b do not use luke requests for remote solr servers if the result is
11 years ago