320 Commits (4eb89d7f152c6b54028b21205c2bf99a6eeb302f)

Author SHA1 Message Date
Michael Peter Christen 3b959ee002 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 13 years ago
orbiter 3190347814 added a synonyms_t field to solr and a process to read synonym files. 13 years ago
Michael Peter Christen 411d0e839b added an underline text field to solr to record all underlined texts 13 years ago
sixcooler 6c50d016ed pdf- and zipParser should not use forced Memory-Limits 13 years ago
Michael Peter Christen 8219a445f3 refactoring 13 years ago
Michael Peter Christen 00c1c777fa refactoring 13 years ago
Michael Peter Christen e54ac38095 - some corrections in usage of getFile() and getFileName() 13 years ago
Michael Peter Christen 528d6763fa - added new solr fields: 13 years ago
orbiter 67f2866cd0 small fixes 13 years ago
orbiter d9173ba7ed added more solr fields to integrate values from URIMetadataRow. All 13 years ago
orbiter 482afed07c reduced logging overhead (a bit) 13 years ago
orbiter bbfa497a3c replaced more size() > 0 by !isEmpty() 13 years ago
orbiter 0cbda0b2b8 - replaced all length() == 0 and size() == 0 with isEmpty() 13 years ago
Michael Peter Christen fbc1a2030d fix for sitemap importer: can now also import very large sitemaps within 13 years ago
Michael Peter Christen 92731e5287 fix for sevenzip parser 13 years ago
Michael Peter Christen 8efc1c1078 - fixed a memory leak (or bad usage) during parsing/snippet fetch 13 years ago
Michael Peter Christen b1e7c11fba fix for pattern matcher in html parser 13 years ago
Michael Peter Christen b0c408788b made class methods static where possible 13 years ago
Michael Peter Christen 7c1ba99755 removed more unused method parameters 13 years ago
Michael Peter Christen 0301aba1e9 removed unused method parameters 13 years ago
Michael Peter Christen ea10766bfd cleaned unnecessary nested code 13 years ago
orbiter 7f851d62a7 replaced HashARC with SizeLimited Objects which are less costly 13 years ago
orbiter 78fc3cf8f8 refactoring and new usage of SentenceReader: this class appeared as one 13 years ago
Michael Peter Christen ad09b786bf clean up parser data 13 years ago
Michael Peter Christen 276a66a793 Adding a limit of 1000 links that a parser shall store during indexing. 13 years ago
Michael Peter Christen de903a53a0 parser refactoring & hacks 13 years ago
Michael Peter Christen ce8d4b87d9 fixes for new eclipse 'Juno' warning 'Resource leak'. 13 years ago
Michael Peter Christen 0c345d1559 giving threads name so its easier to see whats happening during 13 years ago
Michael Peter Christen 508a81b86c added solr field 'refresh_s' which stores the refresh url contained in 13 years ago
Michael Peter Christen f3167def64 do not fill the keywords with title content if keywords do not exist. 13 years ago
Michael Peter Christen 77f795756c fixing redirects and status codes: storing of status code in 13 years ago
Michael Peter Christen dbdd697f4d moved RDFaParser.xsl configuration file to defaults 13 years ago
Michael Peter Christen be928815fc fixed wrong parsing of style and script 13 years ago
Michael Peter Christen 50c576599b allow multiple parser options instead of printing an error 13 years ago
Michael Peter Christen 5fc6524ca8 - moved triple store to net.yacy.cora.lod (should be generalized there 13 years ago
cominch 5f8ba7f4f2 small changes 13 years ago
cominch bcbd8eee33 Add several parsers, for RDFa and rdf files. 13 years ago
cominch 9cbfc1a1c0 augmentedProxy, which forwards every proxy request to a 13 years ago
Michael Peter Christen 2fe207f813 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 13 years ago
Michael Peter Christen 0284a4d88f more fixes for double precision of coordinates 13 years ago
Michael Peter Christen 964406ad17 added concurrency enhancement to xml parser 13 years ago
Michael Peter Christen e0d8643226 - performance hacks 13 years ago
Michael Peter Christen 9b4c699526 enhanced location search: 13 years ago
Michael Peter Christen 4d3cc02168 replaced old bzip2 library against better documented commons-compress 13 years ago
Michael Peter Christen c15fcde1c8 add-on to latest commit 13 years ago
Michael Peter Christen 81737dcb18 removed stack trace from swf parser since we cant do anything there 13 years ago
Michael Peter Christen acf8d521a2 fix for http://bugs.yacy.net/view.php?id=126 13 years ago
Michael Peter Christen 89142d1e8d removed (not all) warnings 13 years ago
Roland 'Quix0r' Haeder a093ccf5eb Now used synchronization in all close() methods to make sure all objects 13 years ago
Michael Peter Christen ba6aaabc51 refactoring + parser bugfixes 13 years ago
Michael Peter Christen 09484955dc added new entry class for embed tags 13 years ago
Michael Peter Christen 453010bd68 - solved problems with backpath normalization 13 years ago
Michael Peter Christen 659178942f - Redesigned crawler and parser to accept embedded links from the NOLOAD 13 years ago
Michael Peter Christen 4d5da75814 fix for parser problem if a <a>-tag is 'within' html tags with unclosed 13 years ago
Michael Peter Christen 046f3a7e8d check if httpc has decompressed the release file and rename the file 13 years ago
Michael Peter Christen 8d63a5887c bugfixes 13 years ago
Michael Peter Christen 9ad1d8dde2 complete redesign of crawl queue monitoring: do not look at a 13 years ago
Michael Peter Christen 7e4e3fe5b6 free some memory after parsing html 13 years ago
Michael Peter Christen 4540174fe0 memory hacks 13 years ago
Michael Peter Christen 1f4f60654a Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 13 years ago
reger 32104360ce PDFParser - return at least first 3 pages of PDF 13 years ago
Michael Peter Christen eadb58dd87 small enhancements in pdf parser 13 years ago
reger b616de5973 PDFParser - return at least first 3 pages of PDF 13 years ago
Michael Peter Christen b7bb84c0bb set a limit to CharBuffer object size to fight against bad/too large 13 years ago
Michael Christen c04bfaa51b refactoring 13 years ago
Michael Christen 1f4afb4dc0 performance hacks 13 years ago
Michael Christen 9cd469e6d6 added pull request from als plus an NPE fix 14 years ago
Al Sutton 39898cb94a Added try/finally protection to ensure streams are closed. Added initial size guess for the CharBuffer 14 years ago
Al Sutton 4c67a964a1 Added try/finally protection to ensure streams are closed. Added initial size guess for the CharBuffer 14 years ago
Al Sutton 3f9b9f953f Added close() to ensure buffer close actions are invoked 14 years ago
Al Sutton d73c84f9a0 Allow initial buffer size definition in TransformWriter, and use available() method to set it in htmlParser. In this situation a ByteArrayInputStream is used so the available() method gives a good size estimation and avoid the buffer needing to be continually grown 14 years ago
Al Sutton 8993cac4d8 Initial performance improvements 14 years ago
orbiter 5a55397f99 some last-minute performance hacks 14 years ago
orbiter 804e48888b smaller bug fixes for search behavior; should produce less unnecessary removals and an exact number of results as shown in counter 14 years ago
low012 277b454a62 *) added comments 14 years ago
orbiter 8a428d3e77 ensure termination of pdf parser to avoid deadlocking of other processes during search result preparation 14 years ago
orbiter 0819e1d397 protection against OOM cases in image parser. See also bugs.yacy.net/view.php?id=54 14 years ago
orbiter 49e5ca579f added new configuration property "crawler.embedLinksAsDocuments". If this is switched on (this is default now), the all embedded image, audio and video links from all parsed documents are added to the search index as individual document. This will increase the search index size dramatically but will also enable us to create a much faster image, audio and video search. If the flag is switched on, the index entries are also stored to a solr index, if this is also enabled. 14 years ago
orbiter 610b01e1c3 - added a 'add every media object linked in a html document as a new document' to the html parser. This causes that all image, app, video or audio file that is linked in a html file is added as document. In fact that means that parsing a single html document may cause that a number of documents is inserted into the search index. 14 years ago
orbiter 1c007188ad bugfixes in html parser 14 years ago
orbiter 231074bf0a fixed a parsing bug by reverting SVN 7766 14 years ago
orbiter 5dd2efc9a2 - bugfixes in html parser 14 years ago
orbiter 51cf697acd refactoring: moved all score-related classes to new ranking package 14 years ago
sixcooler 9170a434ed throwing an exception again in FileUtils.copy(reader, writer) 14 years ago
orbiter 299af4943c added another memory protection hack 14 years ago
orbiter b06faab9d3 do not allocate a StringBuilder object in case that there is not enough memory for that 14 years ago
orbiter bda3eec0ff added parsing of canonical link element to html parser 14 years ago
orbiter 9706fc55aa enhanced content scraper (should discover urls much faster in case of very large plain texts) 14 years ago
orbiter 77fe69395d added jempbox-1.5.0.jar which is required by pdfbox-1.5 as stated in http://pdfbox.apache.org/dependencies.html 14 years ago
orbiter 0c1b29f3c9 - applied many small performance hacks 14 years ago
orbiter 4bea3f9714 hack to reduce resource contention caused by massive UTF8 decodings which use java.nio resources: 14 years ago
orbiter e28bd0d038 fix for some possible causes of memory leaks 14 years ago
orbiter 10e2f588f8 - enhanced ybr ranking computation 14 years ago
orbiter 3ed4a09368 small features, some bug fixes and performance hacks 14 years ago
orbiter 021840e5ba removed (almost) deadlocks and unnecessary CPU load 14 years ago
orbiter 9248a4eef4 reduce the effect of 'Bildersuche findet generierte HTML-Seiten als Bilder' 14 years ago
orbiter 4e8fa03514 added more attributes to html evaluation 14 years ago
orbiter 528da7c9ea removed unused class and added license header for new class 14 years ago
orbiter f6077b3cc0 added more attributes for html parser and enhanced data structures 14 years ago
orbiter d8e934c085 better abstraction of http client identification 14 years ago
orbiter b77b8cac0c - enhanced html parser: recognized much more details in the content 14 years ago
orbiter 3d5104d357 - fixed a bug in crawl start with file name (npe in new url) 14 years ago
orbiter 958ff4778e enhanced location search: 14 years ago
orbiter 4c013d9088 more UTF8 getBytes() performance hacks 14 years ago
orbiter 0430a94eaa the location search shows now not re-evaluated locations but only such locations that are attached as metadata to web pages 14 years ago
orbiter 9b25d07295 - added geo information parsing to html parser 14 years ago
orbiter 78d4c45d09 enhancement during search process: fast fail of search in case that all index feeder have terminated. 14 years ago
orbiter 1989ebc24b removed more warnings 14 years ago
orbiter 694fa3a2a5 - replaced more direct string-based UTF-8 conversions by predefined UTF-8 conversion 14 years ago
orbiter 30aed9824a moved getBytes() to UTF8.getBytes() to use a default String encoding 14 years ago
lotus cb6d307bba adding extension for parser 14 years ago
orbiter e1b6916423 always try to guess the size of a StringBuilder to prevent too many memory re-allocations 14 years ago
orbiter cb1f49d0f2 replaced all 'new String' with default encoding (missing) or UTF-8 encoding with a String generation method that uses a pre-defined Charset constant for UTF-8. This avoids a cache-lookup for the Charset object using String hashing of the String 'UTF-8'. 14 years ago
orbiter 8d14916c74 more patches for a better out-of-memory management 14 years ago
orbiter f8d0454c53 small bug fixes and experiments with search speed enhancement 14 years ago
orbiter a92d80a545 performance enhancements using an alternative to a insensitive collator (a complex string compare): 14 years ago
orbiter e717bf74ba more logging, more care about OOMs 14 years ago
orbiter 4588b5a291 - fixed document number limitation for crawls that restrict the number of documents per domain 14 years ago
orbiter 88773e4daa changed the default port from 8080 to 8090 14 years ago
low012 9f38c0023d *) Minor changes, mainly cleaning up a little bit, no functional changes. 14 years ago
orbiter 10ae8d961b - cora package has now no dependencies to other yacy packages and becomes a 'base' package (refactoring) 14 years ago
low012 9eae33f886 *) Ooops... 14 years ago
low012 a001e8075c *) minor enhancements 14 years ago
low012 11ea966f9e *) added SID file (Commodore 64) sound file parser 14 years ago
low012 936e976c23 *) added FreeMind (http://freemind.sourceforge.net/) mindmap parser 14 years ago
low012 3d95981f7d *) cleaning up the code a little bit 14 years ago
low012 2a6499364d *) minor changes 14 years ago
low012 c0274bd123 *) minor changes 14 years ago
orbiter 59b70a5a92 another fix to the ftp crawler: now correct directory listings according to rfc2640 (path with spaces) and better title names for such files 14 years ago
orbiter 9b25a33fd9 - fixed numerous bugs 14 years ago
orbiter 7bdb13bf7f more fixes to smb crawling: better file names 14 years ago
orbiter c288fcf634 redesigned CrawlStartScanner user interface and added more features: 14 years ago
orbiter 56264dcc17 - added CamelCase parser to MultiProtocolURI: generate better to-be-indexed words from urls 14 years ago
orbiter c36da90261 added a very fast ftp file list generator to site crawler: 15 years ago
orbiter 4e2c14efbb fixed bugs in parser and ftp client 15 years ago
orbiter b769cce433 - added a catch-all parser for all documents that cannot be parsed: they will contributed with their document url for the search index only 15 years ago
f1ori a025b1da89 * fix bug when browsing local filesystem (e. g. repository) with yacy 15 years ago
orbiter 4c72885cba added a sitemap entry parser and loader for sitemaps 15 years ago
orbiter fb92f9ae8e added mime type image/jpeg (image/jpg is wrong but it is left here because it does not harm and this error also exists in configuration of web servers) 15 years ago
f1ori 7d8de34778 * add a bit documentation to DigestURI, use DigestURI(string) instead of DigestURI(string, null) 15 years ago
orbiter 58e74282af added a word counter statistic in condenser which is used by the did-you-mean to calculate best matches for given search words. 15 years ago
orbiter 0d363a94d7 more performance hacks 15 years ago
orbiter b8aee6d402 performance hacks for better search performance 15 years ago
orbiter aacf572a26 - enhancements for search speed 15 years ago
orbiter d2fd93135c - moved yacybot user agent string definition to MultiProtocolURI since there are basic access mechanisms where the bot string is needed 15 years ago
f1ori e670e1ef8e add charset auto-detection for htmlParser 15 years ago
f1ori ddcd5ae78c fix for http://forum.yacy-websuche.de/viewtopic.php?f=5&t=2989 15 years ago
f1ori 8fe1102452 fix http://forum.yacy-websuche.de/viewtopic.php?p=20889#p18426 15 years ago
orbiter 84a023cbc8 fixed several search bugs 15 years ago
orbiter 114bdd8ba7 fixed old sitemap importer which was not able to parse urls containing post elements 15 years ago