yacy_search_server

Commit Graph

Author	SHA1	Message	Date
Michael Peter Christen	8df8ffbb6d	enhanced the snapshot functionality: - snapshots can now also be xml files which are extracted from the solr index and stored as individual xml files in the snapshot directory alongside the pdf and jpg images - a transaction layer was placed above the snapshot directory to distinguish snapshots into 'inventory' and 'archive'. This may be used to do transactions of index fragments using archived solr search results between peers. This is currently unfinished, we need a protocol to move snapshots from inventory to archive - the SNAPSHOT directory was renamed to snapshot and now contains two snapshot subdirectories: inventory and archive - snapshots may now be generated by everyone, not only such peers running on a server with xkhtml2pdf installed. The expert crawl start provides the option for snapshots to everyone. PDF snapshots are now optional and the option is only shown if xkhtml2pdf is installed. - the snapshot api now provides the request for historised xml files, i.e. call: http://localhost:8090/api/snapshot.xml?urlhash=Q3dQopFh1hyQ The result of such xml files is identical with solr search results with only one hit. The pdf generation has been moved from the http loading process to the solr document storage process. This may slow down the process a lot and a different version of the process may be needed.	10 years ago
reger	28456dfc09	skip creation of unused Bluelist contenttransformer	10 years ago
Michael Peter Christen	321840fde3	Replaced all fixed thread pools with cached thread pools. The cached thread pools will flush their cached (dead) threads after 60 seconds. This will cause that YaCy now runs constantly with about 50 threads, about 100 at peak times. Previously, about 400 threads had been cached and kept in a hibernation state, which caused that the numproc counter in /proc/user_beancounters (exists only in VM-hosted linux) was as high as the cached number of threads. This caused that VM supervisors terminated whole VM sessions if a limit was reached. Many VM providers have limits of numproc=96 which made it virtually impossible to run YaCy on such machines. With this change, it will be possible to run many YaCy instances even on VM hosts.	10 years ago
Michael Peter Christen	a1ee101079	recognize more html file extensions	10 years ago
reger	0c97cc2440	skip unused call parameter for hashSentence()	10 years ago
reger	5790c7242e	skip to tokenize punctuation as word in WordTokenizer remove unused variables in condenser related to Tokenizer	10 years ago
Michael Peter Christen	6a2a669db4	added loading of the synonyms file from addon/synonyms into the knowledge loader	10 years ago
Michael Peter Christen	07c5b57953	removed warnings	10 years ago
reger	59c6532a65	add link extraction to pdfParser this extracts clickable links in pdf and adds it to the list of links include a test case for this function this is the corrected comment for commit: `aa2e15d846`	10 years ago
reger	aa2e15d846	allow url parameter in worktable apicall allow url=wwwl?param=a&param=b (with ?, & encoded) fix: http://mantis.tokeek.de/view.php?id=100 fix double adding of '&' in MultiProtocolURL.escape()	10 years ago
reger	b0c87d8240	fix image search expand box, cut-off of 2nd capture line height tested with IE11 and Firefox 32 (change worked for both to show 2nd line without cutting off height) +fix charset parameter in metadataImageParser +update start errMsgTxt to "java 1.7"	10 years ago
Michael Peter Christen	3073c69aee	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	10 years ago
reger	eaccce3467	added metadataImageParser for tif and psd (Photoshop) images. This is a modified genericImageParser adding tif (and psd) support even if java ImageIO plugin for tif is not installed in JDK. Adds just tif and psd to the available parsers. Uses the same library to extract metadata, so could eventually be merged with genericImageParser. All detected metadata are added to the parsed document (potentially some more as with genericImageParser)	10 years ago
reger	a69f5358ff	use javax ImageIO getReader to add supported image extension/mime genericImageParser uses javax ImageIO, supported images depend on available plugins for ImageIO package (this is JDK installation specific). Jpeg, png and gif are available by default. Tif and others only on available plugin (in classpath). Add supported image type dynamically on startup.	11 years ago
Michael Peter Christen	67cd4c37bd	activated the new apk parser which was already ready but not included in the parser initialization. To make the apk parser usable, the handling of application type links had to be modified. Now all documents which have not a parser attached are placed to the noload-queue while all other documents are parsed using the associated parser class. This may have side effects on other parsers and the display of different file classes (images, apps, videos).	11 years ago
reger	03a7a29db3	limit OAI import urn resolver try for Deutsche National Library The resolver service of National Library uses name space nbn, limit use of nbn-resolving.de accordingly to urn:nbn: - add resolver for rfc's	11 years ago
orbiter	b6d57f06eb	enhanced the apk parser (up to being production-ready). The parser is not yet activated and will be after the next release step.	11 years ago
orbiter	c9e593cf78	removed warnings	11 years ago
reger	e9eae45b55	simplify rssreader and improve atom feed link extraction - type detection (rss/atom) - init type parameter overwritten during parse, parameter obsolete - detection by endtag changed to simpler first-tag evaluation - channel image not used, removed related extra parser handling - remove unused code (set/getImage) in rssfeed - atom link extraction to account for possible multiple link tags - spec limits link to one with rel="alternate" or one without rel attribute not accounting for the following type & hreflang exception yet: o atom:entry elements MUST NOT contain more than one atom:link element with a rel attribute value of "alternate" that has the same combination of type and hreflang attribute values.	11 years ago
reger	8f77719091	fix "Ljava.lang.String" in crawl queue anchor name (e.g. IndexCreateQueues_p.html?stack=LOCAL with images in queue)	11 years ago
Michael Peter Christen	98f45c9032	fix for image alt attachment to AnchorURLs in html parser.	11 years ago
orbiter	08409ec680	no idea why the words max was an ordered one. This change increases speed during document processing a bit	11 years ago
Michael Peter Christen	b44626e55b	fixed target_alt_t in webgraph	11 years ago
Michael Peter Christen	2de159719b	added an option to set 'obey nofollow' for links with rel="nofollow" attribute in the <a> tag for each crawl. This introduces a lot of changes because it extends the usage of the AnchorURL Object type which now also has a different toString method than the underlying DigestURL.toString. It is therefore not advised to use .toString at all for urls, just use toNormalform(false) instead.	11 years ago
Michael Peter Christen	e039e78210	small bugfixes	11 years ago
Michael Peter Christen	fb3dd56b02	fix for processing of noindex flag in http header	11 years ago
Michael Peter Christen	f3a6b6e21e	fix for bad URL decoding	11 years ago
Michael Peter Christen	aee5b108e5	added linkScraperParser, a parser which ignores the text like the generic parser but extracts links like the htmlParser. This should be used for ASCII documents without known text format annotation like source code files or json documents. Probably also good for xml files without known schema.	11 years ago
reger	40133ba2d0	fix NPE in Condenser, discovered by calling IndexControlRWI, "Word Deletion" with "for every resolvable and deleted URL reference"	11 years ago
reger	cb2c17d236	extract author and keywords in .doc and .ppt parser	11 years ago
orbiter	fec673c9d1	Merge branch 'master' of git@gitorious.org:yacy/rc1.git	11 years ago
orbiter	4a66af716d	added apkParser stub (work in progress)	11 years ago
reger	2d67f29244	adjust mergeDocument after parsing to - preserve charset and languages - fix merge of author	11 years ago
Michael Peter Christen	0d29b972cc	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	11 years ago
reger	7847a93558	fix AbstractParser.singleList not adding null strings - prevents null titles in oo... parser (as detected by ParserTest) - correct ParserTest dc_description check (dc_description allowed to return 0 length array)	11 years ago
Michael Peter Christen	8acae852a0	write <em>-tagged texts also into the bold_txt field	11 years ago
reger	3b559e7846	optimize pdfParser skip starting reader thread if all content already read	11 years ago
reger	09f73b790f	fix pdfParser not closed warning from pdfbox for encrypted pdf on exit due to missing permission to extract	11 years ago
reger	d8d318233e	fix logging settings - add missing .level - remove obsolete jena settings - set default level=INFO to prevent debug logging of not explicitly specified classes	11 years ago
orbiter	97983ba89f	fixed generics warnings for generic array instantiation that appeared after migration to Java 7	11 years ago
orbiter	88f4af90da	removed warnings	11 years ago
reger	8a7c68e4c7	content of surrogates/out never accessed (remove) After import the content is never accessed but may take up a lot of disk space, also the getLoadedOAIServer (which lists the files in surrogate out) is not used. Making the surrogate.out obsolete. Removed keeping of xmls after import.	11 years ago
reger	2eb7682772	add html5 audio/video <source> tag to html content scraper - <source src=.. type=..> tag content is added to embed collection	11 years ago
reger	0b6db04e40	fix contentscraper img height/width parsing prevent numberformat exception on common "100px" property - include in test case	11 years ago
reger	121d25be38	recover sax fatal error on OAI-PMH import of xml with entity error this allows to continue loading next resumptionToken even if import file caused sax parser error fix http://mantis.tokeek.de/view.php?id=63	11 years ago
reger	86f6975edc	exclude html tags in in/outboundlinks_anchortext_txt parsed text - some outboundlinks_anchortext_txt in index contain e.g. <span>text</span> or more tags, remove all tags for text property (inline img tags are still parsed) - added test case for above (to htmlParserTest) - fix solr test case	11 years ago
Michael Peter Christen	5746aae3db	add canonical links to the same crawldepth, not the next crawldepth	11 years ago
Michael Peter Christen	da86f150ab	- added a new Crawler Balancer: HostBalancer and HostQueues: This organizes all urls to be loaded in separate queues for each host. Each host separates the crawl depth into its own queue. The primary rule for urls taken from any queue is, that the crawl depth is minimal. This produces a crawl depth which is identical to the clickdepth. Furthermore, the crawl is able to create a much better balancing over all hosts which is fair to all hosts that are in the queue. This process will create a very large number of files for wide crawls in the QUEUES folder: for each host a directory, for each crawl depth a file inside the directory. A crawl with maxdepth = 4 will be able to create 10.000s of files. To be able to use that many file readers, it was necessary to implement a new index data structure which opens the file only if an access is wanted (OnDemandOpenFileIndex). The usage of such on-demand file reader shall prevent that the number of file pointers is over the system limit, which is usually about 10.000 open files. Some parts of YaCy had to be adapted to handle the crawl depth number correctly. The logging and the IndexCreateQueues servlet had to be adapted to show the crawl queues differently, because the host name is attached to the port on the host to differentiate between http, https, and ftp services.	11 years ago
Michael Peter Christen	ce1d1b2fa0	fix for maximum tag length in parser	11 years ago
Michael Peter Christen	67beef657f	strong redesign of html parser: object recursion is now made using a stack on html tag objects, not using a recursive parse-again method which may cause bad performance and huge memory allocation. The new method also produced better parsed image objects with exact anchor text references.	11 years ago
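
The idea named in commit 67beef657f above, replacing a recursive parse-again step with an explicit stack of open tag objects, can be sketched roughly as follows. This is an illustration only and not YaCy's parser: the class StackTagSketch, the Frame type and the method closedElements are hypothetical names, and the input is assumed to be simplified markup without attributes, comments or void tags.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical sketch, not YaCy code: nested tags are handled with an
// explicit stack instead of re-parsing the inner content recursively.
final class StackTagSketch {

    /** One open tag on the stack, collecting the text seen inside it. */
    static final class Frame {
        final String name;
        final StringBuilder text = new StringBuilder();
        Frame(String name) { this.name = name; }
    }

    /**
     * Walks simplified markup once and returns "name=text" for every
     * properly closed element, innermost elements first.
     */
    static List<String> closedElements(String markup) {
        Deque<Frame> open = new ArrayDeque<>();
        List<String> result = new ArrayList<>();
        int i = 0;
        while (i < markup.length()) {
            int lt = markup.indexOf('<', i);
            if (lt < 0) lt = markup.length();
            // Text before the next tag belongs to the innermost open element.
            if (lt > i && !open.isEmpty()) open.peek().text.append(markup, i, lt);
            if (lt == markup.length()) break;
            int gt = markup.indexOf('>', lt);
            if (gt < 0) break; // unterminated tag: stop scanning
            String tag = markup.substring(lt + 1, gt).trim();
            if (tag.startsWith("/")) {
                String name = tag.substring(1).trim();
                if (!open.isEmpty() && open.peek().name.equals(name)) {
                    Frame done = open.pop();
                    result.add(done.name + "=" + done.text);
                    // Hand the inner text up to the parent, so no re-parse is needed.
                    if (!open.isEmpty()) open.peek().text.append(done.text);
                }
                // Unmatched closing tags are ignored in this sketch.
            } else {
                open.push(new Frame(tag));
            }
            i = gt + 1;
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(closedElements("<a>click <b>here</b> now</a>"));
        // prints [b=here, a=click here now]
    }
}
```

Each character is visited once and memory grows only with the nesting depth, which is the general reason an explicit stack avoids the repeated work of parsing inner content again.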

512 Commits (ff035a20e7b9a931206239f2513c426576f8cab1)