reger
209a7374bd
remove unused import pdfParser
8 years ago
reger
de1c1c16db
Improve pdf text extraction resource handling.
For short PDFs (<= 3 pages) use the already extracted content;
only for long PDFs (> 3 pages) reassign content and close the internal writer (to free buffers directly)
8 years ago
reger
f254fcfc67
fix htmlParser <script> text extraction on code containing an expression
recognized as a tag, like 1<a
reported in https://github.com/yacy/yacy_search_server/issues/109
Script content is ignored by default, but the text is filtered for html
tags. Modified scraper to skip tag filtering while within a <script>
section (until a closing </script> tag is detected).
Possible side effect: a missing </script> end-tag will truncate trailing
content text.
8 years ago
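The behaviour described in f254fcfc67 can be illustrated with a minimal sketch. This is not YaCy's scraper code: the class and method names are hypothetical, and the sketch only assumes the rule stated in the commit message (no tag filtering inside a <script> section, and a missing </script> end-tag truncates the trailing text).

```java
// Hypothetical illustration only; not taken from the YaCy sources.
final class ScriptSkipSketch {

    // Case-insensitive indexOf; avoids toLowerCase(), which can shift indices.
    static int indexOfIgnoreCase(String s, String needle, int from) {
        for (int i = Math.max(from, 0); i + needle.length() <= s.length(); i++) {
            if (s.regionMatches(true, i, needle, 0, needle.length())) return i;
        }
        return -1;
    }

    // Returns the text outside of tags; script bodies are dropped as a whole.
    static String extractText(String html) {
        StringBuilder text = new StringBuilder();
        int i = 0;
        while (i < html.length()) {
            char c = html.charAt(i);
            if (c != '<') {              // plain text character
                text.append(c);
                i++;
                continue;
            }
            // Prefix match only; a real scraper would also check the tag name boundary.
            if (html.regionMatches(true, i, "<script", 0, 7)) {
                // Inside a script: ignore every '<' until the closing tag,
                // so an expression like 1<a is never treated as a tag.
                int close = indexOfIgnoreCase(html, "</script", i);
                if (close < 0) break;    // missing end-tag: trailing text is lost
                int end = html.indexOf('>', close);
                if (end < 0) break;
                i = end + 1;
                continue;
            }
            int end = html.indexOf('>', i);  // ordinary tag: skip up to '>'
            if (end < 0) break;
            i = end + 1;
        }
        return text.toString();
    }
}
```

With this sketch, a script body containing `1<a` no longer swallows the text that follows the closing tag, while an unterminated script drops everything after it, which matches the side effect noted in the commit message.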
Michael Peter Christen
02d0b3172c
Merge branch 'master' of https://github.com/yacy/yacy_search_server.git
8 years ago
Michael Peter Christen
d4f45cf05e
added dc.date.modified and dc.date.created to date parser
8 years ago
luccioman
6a4d51d8f9
Cleaned up some Javadoc warnings.
8 years ago
reger
4c9be29a55
fix concurrency issue with htmlParser using outdated scraper data,
resulting in incorrect data for some html index metadata.
For details see http://mantis.tokeek.de/view.php?id=717
8 years ago
reger
b522d540b9
Include itemprop latitude/longitude (see schema.org) in attribute
parsing for lat/lon.
Harmonize number parsing for lat/lon to parseDouble.
Fix endDate_dts value assignment.
8 years ago
reger
083df255e4
fix html tag attribute parsing containing attribute w/o value
e.g. itemscope or autofocus (in such case the next key was not properly
recognized).
8 years ago
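To make the problem fixed in 083df255e4 concrete, here is a minimal sketch of attribute parsing that tolerates attributes without a value. It is an illustration under assumptions, not the YaCy implementation: the class name, the method name, and the choice to store an empty string for valueless attributes are all hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration only; not taken from the YaCy sources.
final class AttrSketch {

    // Parses the attribute part of a start tag,
    // e.g. itemscope itemtype="http://schema.org/Place" autofocus
    static Map<String, String> parseAttributes(String s) {
        Map<String, String> attrs = new LinkedHashMap<>();
        final int n = s.length();
        int i = 0;
        while (i < n) {
            while (i < n && Character.isWhitespace(s.charAt(i))) i++;  // skip blanks
            int keyStart = i;
            while (i < n && !Character.isWhitespace(s.charAt(i)) && s.charAt(i) != '=') i++;
            if (i == keyStart) { i++; continue; }                      // stray '=' or end of input
            String key = s.substring(keyStart, i).toLowerCase(Locale.ROOT);

            int j = i;
            while (j < n && Character.isWhitespace(s.charAt(j))) j++;
            if (j >= n || s.charAt(j) != '=') {
                // No '=' follows: attribute without value. Store it and
                // continue at j, so the next key is read from its first char.
                attrs.put(key, "");
                i = j;
                continue;
            }
            j++;                                                       // skip '='
            while (j < n && Character.isWhitespace(s.charAt(j))) j++;
            String value;
            if (j < n && (s.charAt(j) == '"' || s.charAt(j) == '\'')) {
                char quote = s.charAt(j);
                int end = s.indexOf(quote, j + 1);
                if (end < 0) end = n;                                  // unterminated quote
                value = s.substring(j + 1, end);
                i = Math.min(end + 1, n);
            } else {
                int valueStart = j;
                while (j < n && !Character.isWhitespace(s.charAt(j))) j++;
                value = s.substring(valueStart, j);
                i = j;
            }
            attrs.put(key, value);
        }
        return attrs;
    }
}
```

The point of the sketch is the branch for a key that is not followed by `=`: the parser records the key and resumes exactly where the next key begins, so `itemtype` after `itemscope` is still recognized.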
reger
cb95b7339a
include html5 <time> tag in content scraper,
add "datetime" property of <time> tag to the scraper's startdate list.
Datetime is parsed as iso8601 (xml) date, html5 allows partial as well
as duration (not handled by this)
8 years ago
reger
c50e23c495
reduce creation of empty legacy RequestHeader() in situation where null
is acceptable (less for garbage collection).
8 years ago
luccioman
f0639d810c
Customized name for Threads still using the default "Thread-n" pattern.
This makes thread monitoring easier to read.
8 years ago
luccioman
47af33a04c
Advanced Crawl from local file : better processing of large files.
Applied strategy: when there is no restriction on domains or
sub-path(s), stack anchor links as soon as they are discovered by the content
scraper instead of waiting for the complete parsing of the file.
This makes it possible to handle a crawling start file with thousands of
links in a reasonable amount of time.
Performance limitation: even if the crawl starts faster with a large
file, the content of the parsed file is still fully loaded in memory.
8 years ago
luccioman
7717a3d43d
Fixed license headers on files created to improve favicon management.
8 years ago
luccioman
6e1959f469
Merge branch 'master' of https://github.com/yacy/yacy_search_server.git
Conflicts:
htroot/yacysearchitem.java
source/net/yacy/cora/federate/solr/responsewriter/YJsonResponseWriter.java
source/net/yacy/search/schema/CollectionConfiguration.java
source/net/yacy/server/serverObjects.java
8 years ago
reger
14f7577231
add support for older Word versions (Word6/Word95) to docParser
8 years ago
reger
efcb6a1e74
fix supported mime XML -> xml for rssParser (mime normalized to lower case for comparison)
+ add mime text/xml as in use for rss in the wild
8 years ago
luccioman
b1b8e69da8
Fixed NullPointerException cases
8 years ago
reger
a4465c97d6
as requested, disable/remove old swf parser
http://forum.yacy-websuche.de/viewtopic.php?f=8&t=5861#p33098
9 years ago
Michael Peter Christen
5e165a8150
removed unused imports
9 years ago
reger
4c7a77662a
eliminate dependency on file-extension in storeDocument but use supported mime-type
to also support handling of urls w/o corresponding file-extension.
For this, refactor use of document.getParserObject() to always return a Parser (for clean logic)
and define/move the scraperObject as local var of AbstractParser.
Adjust related calls to getParserObject (where actually a scraperObject is wanted).
Additionally skip appending url token to parsed text for dht metadata entries
(by default returned as result by rwi index).
9 years ago
reger
ebde21079a
refactor xlsParser to include Excel file attribute (like author) in parser result doc.
Similar to ppt and doc parser, completing a TODO in xlsParser.
9 years ago
luccioman
6e96c7341a
Merge remote-tracking branch 'origin/master'
Conflicts:
htroot/Load_MediawikiWiki.java
htroot/Load_PHPBB3.java
htroot/ViewImage.java
9 years ago
reger
9e94989237
upd to PDFBox 2.0.1
9 years ago
reger
24b0fa2a38
extend snapshot Html2Image.pdf2image to use PDFBox image export capability
if no external tool installed (and for Win)
Resulting jpg are not always perfect (if graphic included) but imho sufficient.
9 years ago
reger
1d940e5a94
upd commons-compress 1.11
9 years ago
reger
764f5100f0
fix delete of temp file after odt & ooxml parser
Close zipfile after parsing
9 years ago
reger
06d0e2aeb9
result heuristic (also used in greedy learning mode) to use outbound links if result is full index doc. Otherwise use default loader method.
- Above brought up that parser start url parameter, declared as AnchorURL, uses only methods of parent object DigestURL (changed parameter declaration accordingly).
9 years ago
luc
3cc5619d93
Improved HTML icons indexing and rendering in search results.
See http://mantis.tokeek.de/view.php?id=629
9 years ago
reger
2048b7e057
support scraping start-/enddate from html tag with property "datetime"
This may be used in html5 <time> tag (which we don't explicitly support yet for date in content scraping).
9 years ago
reger
900d4584ba
complete resource cleanup of lists in contentscraper's close()
9 years ago
reger
1f18653de0
pass parsed swf content through htmlscraper
Swf may contain a subset of html tags which should not appear as text.
Especially <font> tag may totally screw up metadata servlet if not filtered out.
9 years ago
reger
18ecf57792
add support of compressed swf to swfParser
from JavaSWF2 (source compatible to WebCat).
Moved swf file signature check to parser
Changed use of synced vector to list swf InStream
9 years ago
reger
ff27824964
fix swfParser reading file signature
before passing to library (current version expects data w/o signature)
9 years ago
luc
571bc55937
Refactoring : use StandardCharsets constants instead of hard-coded
charset names.
9 years ago
reger
e84d94f8ca
fix mime table for ms office / open office documents
(causing wrong parser detect in intranet mode)
9 years ago
reger
14803d58cd
let html scraper accept html5 <link rel="icon"> for favicon links
9 years ago
Michael Peter Christen
d82d311995
Merge branch 'master' of https://github.com/luccioman/yacy_search_server
# Conflicts:
# .classpath
9 years ago
reger
e163ea88f6
fix vsdParser (Visio) parser return statement
(final block unnecessary throw)
9 years ago
luc
f0478bb14d
BMP and ICO image formats support : integrated /haraldk/TwelveMonkeys
imageio-bmp-3.2 library.
- better BMP format flavours support
- handle PNG encoded icons
- handle transparency
Added some javadoc url references to .classpath
9 years ago
reger
0d3c5b223e
have psParser cleanup temp file
9 years ago
reger
7d0d19cb8e
avoid File.deleteOnExit() on temp files
JVM registers each file in a list regardless of already deleted and never
cleans up the list during runtime.
This accumulates to a considerable amount of mem during large crawls and/or
long uptime.
To tackle this, all temp files are now created in a subdir of java.io.tmpdir
and the jvm tmpdir property is set to this subdir, which is deleted by
code on shutdown.
Additionally let pdfParser use this tmp subdir too.
9 years ago
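The approach described in 7d0d19cb8e can be sketched as follows. This is a hedged illustration, not the YaCy code: the class and method names and the subdirectory name used in the usage note are hypothetical. The sketch passes the directory explicitly when creating temp files instead of relying on the java.io.tmpdir property, because the JDK may cache that property on first use.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Comparator;
import java.util.stream.Stream;

// Hypothetical illustration only; not taken from the YaCy sources.
final class TempDirSketch {

    // Creates (if needed) a private subdirectory below the current java.io.tmpdir.
    static Path createTmpSubdir(String name) throws IOException {
        Path base = Paths.get(System.getProperty("java.io.tmpdir"));
        return Files.createDirectories(base.resolve(name));
    }

    // Creates a temp file in the given directory WITHOUT File.deleteOnExit(),
    // so the JVM does not keep the path in its delete-on-exit list.
    static File newTempFile(Path dir, String prefix, String suffix) throws IOException {
        return File.createTempFile(prefix, suffix, dir.toFile());
    }

    // Removes the directory and everything in it; intended to be called once on shutdown.
    static void deleteRecursively(Path dir) throws IOException {
        if (!Files.exists(dir)) return;
        try (Stream<Path> walk = Files.walk(dir)) {
            // Reverse order visits children before their parent directory.
            walk.sorted(Comparator.reverseOrder()).forEach(p -> p.toFile().delete());
        }
    }
}
```

A caller would create the subdirectory once at startup (for example `createTmpSubdir("yacy-sketch")`), delete individual files as soon as a parser is done with them, and call `deleteRecursively` from the shutdown code to catch anything left over.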
reger
02e4489a23
set tmpfile.deleteOnExit by default,
to make sure files are removed on shutdown.
9 years ago
reger
20e18d79f8
harmonize document title for archive parsers
9 years ago
reger
112ae013f4
update bzip and bzip parser process,
to return one document for the file with the combined parser results of the
contained file, and register it with the supplied url and mime of the archive.
9 years ago
reger
e76a90837b
update zip and tar parser process,
to return one document for the file with the combined parser results of the
contained files.
9 years ago
reger
5d71fc70e3
fix tarParser early exit on looping content
- adjust check of data available according to doc
- return null on no recognized content (to not exit TextParser next parser try)
- use commons.compress directly
9 years ago
reger
2fcf6f104c
fix bzipParser recognition
- Bzip2Inputstream checks magic byte itself to identify bz2 (leave it in input)
- try to supply fitting mime for parsing bz2 content
9 years ago
reger
c6687dd560
fix a system.out to log.fine
in bmpParser
9 years ago
reger
c6495a5b62
add a log entry on parsing ajax crawling scheme snapshot
(prev. commit 9252e36aeb)
9 years ago