448 Commits (8cbc1c970ae158e2b67e9c2e75c3ad9a4f155079)

Author SHA1 Message Date
Michael Peter Christen 1687737771 Abstraction of HandleMap and HandleSet
12 years ago
orbiter 482afed07c reduced logging overhead (a bit)
13 years ago
orbiter bbfa497a3c replaced more size() > 0 by !isEmpty()
13 years ago
orbiter 0cbda0b2b8 - replaced all length() == 0 and size() == 0 with isEmpty()
13 years ago
Michael Peter Christen 801972fe6f fix for url camel case parser and sentence reader
13 years ago
Michael Peter Christen fbc1a2030d fix for sitemap importer: can now also import very large sitemaps within
13 years ago
Michael Peter Christen 92731e5287 fix for sevenzip parser
13 years ago
Michael Peter Christen 8efc1c1078 - fixed a memory leak (or bad usage) during parsing/snippet fetch
13 years ago
Michael Peter Christen b1e7c11fba fix for pattern matcher in html parser
13 years ago
Michael Peter Christen b0c408788b made class methods static where possible
13 years ago
Michael Peter Christen 7c1ba99755 removed more unused method parameters
13 years ago
Michael Peter Christen 0301aba1e9 removed unused method parameters
13 years ago
Michael Peter Christen d3964253ae - added @SuppressWarnings to unused servlet method parameters
13 years ago
Michael Peter Christen ea10766bfd cleaned unnecessary nested code
13 years ago
orbiter fc0f9543fe More SentenceReader cleanup
13 years ago
orbiter 586bb0eb6a Simplified SentenceReader (no more Reader inside..)
13 years ago
orbiter 7f851d62a7 replaced HashARC with SizeLimited Objects which are less costly
13 years ago
orbiter 78fc3cf8f8 refactoring and new usage of SentenceReader: this class appeared as one
13 years ago
orbiter bb8dcb4911 automatically adapt size of word cache to available memory
13 years ago
Michael Peter Christen ad09b786bf clean up parser data
13 years ago
Michael Peter Christen 276a66a793 Adding a limit of 1000 links that a parser shall store during indexing.
13 years ago
Michael Peter Christen de903a53a0 parser refactoring & hacks
13 years ago
Michael Peter Christen 1825f165b8 better integration of blacklist according to use case
13 years ago
Michael Peter Christen ce8d4b87d9 fixes for new eclipse 'Juno' warning 'Resource leak'.
13 years ago
Michael Peter Christen 0c345d1559 giving threads names so it's easier to see what's happening during
13 years ago
Michael Peter Christen 508a81b86c added solr field 'refresh_s' which stores the refresh url contained in
13 years ago
Michael Peter Christen f3167def64 do not fill the keywords with title content if keywords do not exist.
13 years ago
Michael Peter Christen 77f795756c fixing redirects and status codes: storing of status code in
13 years ago
Michael Peter Christen dbdd697f4d moved RDFaParser.xsl configuration file to defaults
13 years ago
Michael Peter Christen 786be7d175 better integration of RDFaParser
13 years ago
Michael Peter Christen de3ef8ad73 removed unimportant warnings
13 years ago
Michael Peter Christen 24bbe359ca integrate also geonames library files for less cities. these are more
13 years ago
Michael Peter Christen 223a5440ab preventing that an empty pnd is inserted into the vocabularies
13 years ago
Michael Peter Christen 963f92ed9a - merged files
13 years ago
Michael Peter Christen dd88d0ace2 more logging
13 years ago
Michael Peter Christen 94d54e2d91 added recognition of multi-word terms in vocabulary matching
13 years ago
Michael Peter Christen 64c0268b2b show triplestore metadata in yacydoc and viewfile
13 years ago
Michael Peter Christen c2f0d16d2c fixed vocabulary initialization
13 years ago
Michael Peter Christen df3531f8d5 added the generation of virtual vocabularies using the pnd
13 years ago
Michael Peter Christen a0f1decd82 - added loading of the dbpedia pnd triplestore in the dictionary loader
13 years ago
Michael Peter Christen 16d8f33795 added objectlink generation to vocabulary generation and editor
13 years ago
Michael Peter Christen d45718251e refactoring (Localization -> Location)
13 years ago
Michael Peter Christen b8b3c87ba7 - renamed localization to location (that was confusing)
13 years ago
Michael Peter Christen e89747bb67 - added automated generation of vocabularies from url stubs
13 years ago
Michael Peter Christen 79464189a4 The 'Locale' vocabulary, which is generated by geo data, has now the
13 years ago
Michael Peter Christen 61bb52d55c - using http://purl.org/dc/terms/references to refer from an
13 years ago
Michael Peter Christen 50c576599b allow multiple parser options instead of printing an error
13 years ago
Michael Peter Christen 8b53771db2 changed behavior of navigation processing:
13 years ago
Michael Peter Christen 5fc6524ca8 - moved triple store to net.yacy.cora.lod (should be generalized there
13 years ago
cominch bbfc53b663 bugfix
13 years ago
cominch 65c5826d93 bugfix
13 years ago
cominch 5f8ba7f4f2 small changes
13 years ago
cominch 90512640bf Added config switches for custom parser
13 years ago
cominch bcbd8eee33 Add several parsers, for RDFa and rdf files.
13 years ago
cominch 9cbfc1a1c0 augmentedProxy, which forwards every proxy request to a
13 years ago
Michael Peter Christen cde20911bb saved a bit more ram using UTF8 String compression for OpenGeoDB and
13 years ago
Michael Peter Christen 225ee42879 made the GeoLocation into an interface with the current
13 years ago
Michael Peter Christen 96e9d77270 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git
13 years ago
Michael Peter Christen 96c8119b50 added GeoLocation / GeoPoint classes which uses less memory than
13 years ago
Michael Peter Christen 461a0ce052 removed warnings
13 years ago
Michael Peter Christen 2fe207f813 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git
13 years ago
Michael Peter Christen 514700291a moved Vocabulary to cora package (added in git
13 years ago
Michael Peter Christen 0284a4d88f more fixes for double precision of coordinates
13 years ago
Michael Peter Christen 964406ad17 added concurrency enhancement to xml parser
13 years ago
Michael Peter Christen e0d8643226 - performance hacks
13 years ago
Michael Peter Christen 6e83b02b83 - bugfix for surrogate file reader
13 years ago
Michael Peter Christen 9b4c699526 enhanced location search:
13 years ago
Michael Peter Christen 4d3cc02168 replaced old bzip2 library with the better-documented commons-compress
13 years ago
Michael Peter Christen c15fcde1c8 add-on to latest commit
13 years ago
Michael Peter Christen 81737dcb18 removed stack trace from swf parser since we can't do anything there
13 years ago
Michael Peter Christen acf8d521a2 fix for http://bugs.yacy.net/view.php?id=126
13 years ago
Michael Peter Christen 89142d1e8d removed (not all) warnings
13 years ago
Roland 'Quix0r' Haeder a093ccf5eb Now used synchronization in all close() methods to make sure all objects
13 years ago
Michael Peter Christen ba6aaabc51 refactoring + parser bugfixes
13 years ago
Michael Peter Christen 09484955dc added new entry class for embed tags
13 years ago
Michael Peter Christen 453010bd68 - solved problems with backpath normalization
13 years ago
Michael Peter Christen 659178942f - Redesigned crawler and parser to accept embedded links from the NOLOAD
13 years ago
Michael Peter Christen f8cd57c92f new indexing strategy: ALL links that appear anywhere are indexed, not
13 years ago
Michael Peter Christen a1a5b015d8 refactoring: moved document Classification to cora package
13 years ago
Michael Peter Christen 4d5da75814 fix for parser problem if a <a>-tag is 'within' html tags with unclosed
13 years ago
Michael Peter Christen 046f3a7e8d check if httpc has decompressed the release file and rename the file
13 years ago
Michael Peter Christen e101c2e0e2 added changes from copperdust (submitted by email):
13 years ago
Michael Peter Christen 8d63a5887c bugfixes
13 years ago
Michael Peter Christen 9ad1d8dde2 complete redesign of crawl queue monitoring: do not look at a
13 years ago
Michael Peter Christen 7e4e3fe5b6 free some memory after parsing html
13 years ago
Michael Peter Christen 4540174fe0 memory hacks
13 years ago
Michael Peter Christen 2e5cd6a1b2 fixed parser extension deny list generation and usage
13 years ago
Michael Peter Christen 8bee1472c9 there is no noindex, only nofollow in links
13 years ago
Michael Peter Christen c560a582ac fix for single-word vocabulary lines
13 years ago
Michael Peter Christen ef78f22ee1 performance hack
13 years ago
Michael Peter Christen 1f4f60654a Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git
13 years ago
reger 32104360ce PDFParser - return at least first 3 pages of PDF
13 years ago
Michael Peter Christen eadb58dd87 small enhancements in pdf parser
13 years ago
reger b616de5973 PDFParser - return at least first 3 pages of PDF
13 years ago
Michael Peter Christen 7f9b6b7a0c added switches to ConfigParser to accept/deny documents by their
13 years ago
Michael Peter Christen 4901cee3cc suppress auto-tagged subject entries when sending out or receiving
13 years ago
Michael Peter Christen 83009d86f7 added the vocabulary navigator. It can be very simply tested by
13 years ago
Michael Peter Christen a58dc4a91f added autotagging to document condenser:
13 years ago
Michael Peter Christen 254adea51c small fixes
13 years ago
Michael Peter Christen b7bb84c0bb set a limit to CharBuffer object size to fight against bad/too large
13 years ago
Michael Christen e6d51363ee Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git
13 years ago
Marek Otahal 72adbeae90 !Important: move from Hashtable to HashMap
13 years ago
Michael Christen fa8da7f89d vocabularies are now also used as source for a did-you-mean computation
13 years ago
Michael Christen eaec14ecc4 Dictionaries from words caches can now be used as autotagging vocabulary
13 years ago
Michael Peter Christen 91940fdf56 redesign of WordCache to be prepared to hold multiple
13 years ago
Michael Christen bd40a10230 added autotagging stub .. only reading and parsing of vocabularies at
13 years ago
Michael Christen c04bfaa51b refactoring
13 years ago
Michael Christen 1f4afb4dc0 performance hacks
13 years ago
Michael Christen 762e0ecfb6 fixed localization dictionaries, see
13 years ago
Michael Christen 9cd469e6d6 added pull request from als plus an NPE fix
13 years ago
Al Sutton 39898cb94a Added try/finally protection to ensure streams are closed. Added initial size guess for the CharBuffer
13 years ago
Al Sutton 4c67a964a1 Added try/finally protection to ensure streams are closed. Added initial size guess for the CharBuffer
13 years ago
Al Sutton 3f9b9f953f Added close() to ensure buffer close actions are invoked
13 years ago
Al Sutton d73c84f9a0 Allow initial buffer size definition in TransformWriter, and use available() method to set it in htmlParser. In this situation a ByteArrayInputStream is used, so the available() method gives a good size estimate and avoids the need to continually grow the buffer
13 years ago
Al Sutton f02ea27b31 Added missing closure of ByteArrayInputStream
13 years ago
Al Sutton 8993cac4d8 Initial performance improvements
13 years ago
orbiter ebd840ebf6 - enhanced description on search front page
13 years ago
orbiter e22f8497c9 - tested the ARC methods
13 years ago
orbiter 5a55397f99 some last-minute performance hacks
13 years ago
apfelmaennchen 564374d1fe - included YMarks in addition to old bookmarks in yacysearchitem.html; don't get confused by the old bookmark dialog, the ymark is automatically added silently beforehand.
13 years ago
orbiter 804e48888b smaller bug fixes for search behavior; should produce fewer unnecessary removals and an exact number of results as shown in counter
13 years ago
orbiter 85d6bf4ac4 fixed urls to media content during indexing
13 years ago
orbiter 0d858d48ec replaced String with StringBuilder in suggestion process
13 years ago
orbiter d2ea250d99 refactoring:
13 years ago
low012 277b454a62 *) added comments
13 years ago
orbiter 6b22865dbc - removed some warnings
13 years ago
orbiter 8a428d3e77 ensure termination of pdf parser to avoid deadlocking of other processes during search result preparation
13 years ago
orbiter 85a5487d6d YaCy can now use the solr index to compute text snippets. This makes search result preparation MUCH faster because no document fetching and parsing is necessary any more.
13 years ago
orbiter 0819e1d397 protection against OOM cases in image parser. See also bugs.yacy.net/view.php?id=54
13 years ago
orbiter 49e5ca579f added new configuration property "crawler.embedLinksAsDocuments". If this is switched on (this is default now), then all embedded image, audio and video links from all parsed documents are added to the search index as individual documents. This will increase the search index size dramatically but will also enable us to create a much faster image, audio and video search. If the flag is switched on, the index entries are also stored to a solr index, if this is also enabled.
13 years ago
orbiter 610b01e1c3 - added an 'add every media object linked in an html document as a new document' feature to the html parser. As a result, every image, app, video or audio file that is linked in an html file is added as a document. This means that parsing a single html document may cause a number of documents to be inserted into the search index.
13 years ago
orbiter b5252ef91f added new word recommendation library in DictionaryLoader_p.html
13 years ago
orbiter 1c007188ad bugfixes in html parser
13 years ago
orbiter 231074bf0a fixed a parsing bug by reverting SVN 7766
13 years ago
low012 24e76a7b69 *) Replaced occurrences of "Wikimedia" with "MediaWiki" where applicable. (Thanks to the folks of 0x20.be for pointing this out.)
13 years ago
orbiter 5dd2efc9a2 - bugfixes in html parser
13 years ago
orbiter 51cf697acd refactoring: moved all score-related classes to new ranking package
13 years ago
sixcooler eb14111200 encapsulate potential expensive objects in TextSnippet to allow GC them asap
13 years ago
sixcooler a311596881 finishing up my commits (7855-7858) which could be helpful for
13 years ago
sixcooler 9170a434ed throwing an exception again in FileUtils.copy(reader, writer)
13 years ago
sixcooler ce248cc8dd less byte-arrays of response-content, less byte-array <-> stream conversion
13 years ago
sixcooler 59b767eebd stop loading via http at defined maximum of bytes - even if the size is unknown before loading
13 years ago
orbiter 299af4943c added another memory protection hack
14 years ago
orbiter b06faab9d3 do not allocate a StringBuilder object in case that there is not enough memory for that
14 years ago
orbiter 2d4bb139d3 - added counting of links with noindex tag for solr index
14 years ago
orbiter bda3eec0ff added parsing of canonical link element to html parser
14 years ago
orbiter 9706fc55aa enhanced content scraper (should discover urls much faster in case of very large plain texts)
14 years ago
orbiter f667b9c289 enhanced identificator: using AtomicInteger for counter
14 years ago
orbiter 115abc8917 - more attributes for search progress bar
14 years ago
orbiter 77fe69395d added jempbox-1.5.0.jar which is required by pdfbox-1.5 as stated in http://pdfbox.apache.org/dependencies.html
14 years ago