Corrections
* Corrected GoogleBridge (URI extraction was incorrect)
* Corrected ATOM format:
    * mime-type was incorrect
    * Hyperlinks were not clickable.
    * non-UTF8 characters are now properly filtered.
* Corrected HTML format output:
    * Hyperlinks were not clickable.
* Corrected error message when SimpleHtmlDom library is not installed.
* Added changelog.
parent
a84f111d8f
commit
4bf90735ef
7 changed files with 70 additions and 24 deletions
21
CHANGELOG.md
Normal file
@ -0,0 +1,21 @@
rss-bridge Changelog
===

Alpha 0.1
===
* Firt tagged version.
* Includes refactoring.
* Unstable.

Current development version
===
* Corrected GoogleBridge (URI extraction was incorrect)
* Corrected ATOM format:
    * mime-type was incorrect
    * Hyperlinks were not clickable.
    * non-UTF8 characters are now properly filtered.
* Corrected HTML format output:
    * Hyperlinks were not clickable.
* Corrected error message when SimpleHtmlDom library is not installed.
* Added changelog.

21
README.md
|
@ -1,8 +1,6 @@
|
||||||
rss-bridge
|
rss-bridge
|
||||||
===
|
===
|
||||||
|
|
||||||
Version alpha 0.1
|
|
||||||
|
|
||||||
rss-bridge is a collection of independant php scripts capable of generating ATOM feed for specific pages which don't have one.
|
rss-bridge is a collection of independant php scripts capable of generating ATOM feed for specific pages which don't have one.
|
||||||
|
|
||||||
Supported sites/pages
|
Supported sites/pages
|
||||||
|
@ -10,19 +8,15 @@ Supported sites/pages
|
||||||
|
|
||||||
* `FlickrExplore` : [Latest interesting images](http://www.flickr.com/explore) from Flickr.
|
* `FlickrExplore` : [Latest interesting images](http://www.flickr.com/explore) from Flickr.
|
||||||
* `GoogleSearch` : Most recent results from Google Search. Parameters:
|
* `GoogleSearch` : Most recent results from Google Search. Parameters:
|
||||||
* q=keyword : Keyword search.
|
* `Twitter` : Twitter. Can return keyword/hashtag search or user timline.
|
||||||
* `Twitter` : Twitter. Parameters:
|
|
||||||
* q=keyword : Keyword search.
|
|
||||||
* u=username : Get user timeline.
|
|
||||||
|
|
||||||
Easy new bridge system (detail below) !
|
|
||||||
|
|
||||||
Output format
|
Output format
|
||||||
===
|
===
|
||||||
Output format can be used in any rss-bridge:
|
Output format can take several forms:
|
||||||
|
|
||||||
* `Atom` : ATOM Feed.
|
* `Atom` : ATOM Feed, for use in RSS/Feed readers
|
||||||
* `Json` : Json
|
* `Json` : Json, for consumption by other application.
|
||||||
* `Html` : html page
|
* `Html` : html page
|
||||||
* `Plaintext` : raw text (php object, as returned by print_r)
|
* `Plaintext` : raw text (php object, as returned by print_r)
|
||||||
|
|
||||||
|
@ -35,7 +29,7 @@ Requirements
|
||||||
===
|
===
|
||||||
|
|
||||||
* php 5.3
|
* php 5.3
|
||||||
* [PHP Simple HTML DOM Parser](http://simplehtmldom.sourceforge.net). (Put `simple_html_dom.php` in `vendor/simplehtmldom`).
|
* [PHP Simple HTML DOM Parser](http://simplehtmldom.sourceforge.net). (Put `simple_html_dom.php` in `vendor/simplehtmldom/`).
|
||||||
* Ssl lib activated in PHP config
|
* Ssl lib activated in PHP config
|
||||||
|
|
||||||
|
|
||||||
|
@ -46,7 +40,8 @@ I'm sebsauvage, webmaster of [sebsauvage.net](http://sebsauvage.net), author of
|
||||||
Thanks to [Mitsukarenai](https://github.com/Mitsukarenai) for the inspiration.
|
Thanks to [Mitsukarenai](https://github.com/Mitsukarenai) for the inspiration.
|
||||||
|
|
||||||
Patch :
|
Patch :
|
||||||
- Yves ASTIER (Draeli) : PHP optimizations, fixes, dynamic brigde/format list with all stuff behind and extend cache system. Mail : contact@yves-astier.com
|
|
||||||
|
* Yves ASTIER ([Draeli](https://github.com/Draeli)) : PHP optimizations, fixes, dynamic brigde/format list with all stuff behind and extend cache system. Mail : contact@yves-astier.com
|
||||||
|
|
||||||
Licence
|
Licence
|
||||||
===
|
===
|
||||||
|
@ -56,7 +51,7 @@ Code is public domain.
|
||||||
Technical notes
|
Technical notes
|
||||||
===
|
===
|
||||||
* There is a cache so that source services won't ban you even if you hammer the rss-bridge with requests. Each bridge has a different duration for the cache. The `cache` subdirectory will be automatically created. You can purge it whenever you want.
|
* There is a cache so that source services won't ban you even if you hammer the rss-bridge with requests. Each bridge has a different duration for the cache. The `cache` subdirectory will be automatically created. You can purge it whenever you want.
|
||||||
* To implement a new rss-bridge, create a new class in `bridges` directory and extends with `BridgeAbstract`. Look at existing bridges for examples. For items you generate in `$this->items`, only `uri` and `title` are mandatory in each item. `timestamp` and `content` are optional but recommended. Any additional key will be ignored by ATOM feed (but outputed to jSon). If you want your new bridge appear in `index.php`, don't forget add annotation.
|
* To implement a new rss-bridge, create a new class in `bridges` subdirectory. Look at existing bridges for examples. For items you generate in `$this->items`, only `uri` and `title` are mandatory in each item. `timestamp` and `content` are optional but recommended. Any additional key will be ignored by ATOM feed (but outputed to jSon).
|
||||||
|
|
||||||
Rant
|
Rant
|
||||||
===
|
===
|
||||||
|
|
|
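The technical note in the README hunk above says each generated item needs a `uri` and a `title`, with `timestamp` and `content` recommended. A minimal sketch of that shape follows. The `Item` and `BridgeAbstract` classes here are stand-ins written for this example, and `collectData()` and `getItems()` are assumed method names. None of these definitions come from this commit, and the project's real classes may differ.

```php
<?php
// Stand-ins so the sketch runs by itself; the real rss-bridge classes may differ.
class Item
{
    public $uri;
    public $title;
    public $timestamp;
    public $content;
}

abstract class BridgeAbstract
{
    protected $items = array();

    // Assumed accessor, added here only so the result can be inspected.
    public function getItems()
    {
        return $this->items;
    }
}

// Hypothetical bridge: collectData() is an assumed method name.
class ExampleBridge extends BridgeAbstract
{
    public function collectData()
    {
        $item = new Item();
        $item->uri = 'http://example.com/post/1';  // mandatory
        $item->title = 'First post';               // mandatory
        $item->timestamp = 1375000000;             // optional but recommended
        $item->content = 'Hello <a href="http://example.com/">world</a>'; // optional but recommended
        $this->items[] = $item;
    }
}

$bridge = new ExampleBridge();
$bridge->collectData();
foreach ($bridge->getItems() as $item) {
    echo $item->title, ' -> ', $item->uri, "\n";
}
```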
@ -28,8 +28,13 @@ class GoogleSearchBridge extends BridgeAbstract{
|
||||||
$emIsRes = $html->find('div[id=ires]',0);
|
$emIsRes = $html->find('div[id=ires]',0);
|
||||||
if( !is_null($emIsRes) ){
|
if( !is_null($emIsRes) ){
|
||||||
foreach($emIsRes->find('li[class=g]') as $element) {
|
foreach($emIsRes->find('li[class=g]') as $element) {
|
||||||
$item = new \Item();
|
$item = new Item();
|
||||||
$item->uri = $element->find('a[href]',0)->href;
|
|
||||||
|
// Extract direct URL from google href (eg. /url?q=...)
|
||||||
|
$t = $element->find('a[href]',0)->href;
|
||||||
|
$item->uri = 'http://google.com'.$t;
|
||||||
|
parse_str(parse_url($t, PHP_URL_QUERY),$parameters);
|
||||||
|
if (isset($parameters['q'])) { $item->uri = $parameters['q']; }
|
||||||
$item->title = $element->find('h3',0)->plaintext;
|
$item->title = $element->find('h3',0)->plaintext;
|
||||||
$item->content = $element->find('span[class=st]',0)->plaintext;
|
$item->content = $element->find('span[class=st]',0)->plaintext;
|
||||||
$this->items[] = $item;
|
$this->items[] = $item;
|
||||||
|
|
|
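The GoogleSearchBridge hunk above pulls the target address out of Google's redirect links (`/url?q=...`) with `parse_url` and `parse_str`. The same idea can be sketched as a standalone function. The name `extractDirectUrl` and the sample hrefs are hypothetical, and the extra `is_string` guard is an addition of this sketch rather than part of the commit.

```php
<?php
// Hypothetical helper illustrating the extraction step; not part of the commit.
function extractDirectUrl($href)
{
    // Fallback used by the commit when no "q" parameter is found.
    $fallback = 'http://google.com' . $href;

    // parse_url() returns null when there is no query string and false on malformed input.
    $query = parse_url($href, PHP_URL_QUERY);
    if (!is_string($query)) {
        return $fallback;
    }

    // parse_str() URL-decodes the values while filling $parameters.
    parse_str($query, $parameters);

    return isset($parameters['q']) ? $parameters['q'] : $fallback;
}

echo extractDirectUrl('/url?q=http://example.com/page&sa=U'), "\n"; // prints http://example.com/page
echo extractDirectUrl('/search?hl=en'), "\n";                       // prints http://google.com/search?hl=en
```

Keeping the `http://google.com` prefix as a fallback means an item still gets a usable link when the href is not a redirect.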
@ -26,7 +26,8 @@ class AtomFormat extends FormatAbstract{
|
||||||
$entryTitle = is_null($data->title) ? '' : $data->title;
|
$entryTitle = is_null($data->title) ? '' : $data->title;
|
||||||
$entryUri = is_null($data->uri) ? '' : $data->uri;
|
$entryUri = is_null($data->uri) ? '' : $data->uri;
|
||||||
$entryTimestamp = is_null($data->timestamp) ? '' : date(DATE_ATOM, $data->timestamp);
|
$entryTimestamp = is_null($data->timestamp) ? '' : date(DATE_ATOM, $data->timestamp);
|
||||||
$entryContent = is_null($data->content) ? '' : '<![CDATA[' . htmlentities($data->content) . ']]>';
|
// We prevent content from closing the CDATA too early.
|
||||||
|
$entryContent = is_null($data->content) ? '' : '<![CDATA[' . $this->sanitizeHtml(str_replace(']]>','',$data->content)) . ']]>';
|
||||||
|
|
||||||
$entries .= <<<EOD
|
$entries .= <<<EOD
|
||||||
|
|
||||||
|
@ -66,13 +67,21 @@ EOD;
|
||||||
</feed>
|
</feed>
|
||||||
EOD;
|
EOD;
|
||||||
|
|
||||||
|
// Remove invalid non-UTF8 characters
|
||||||
|
|
||||||
|
// We cannot use iconv because of a bug in some versions of iconv.
|
||||||
|
// See http://www.php.net/manual/fr/function.iconv.php#108643
|
||||||
|
//$toReturn = iconv("UTF-8", "UTF-8//IGNORE", $toReturn);
|
||||||
|
// So we use mb_convert_encoding instead:
|
||||||
|
ini_set('mbstring.substitute_character', 'none');
|
||||||
|
$toReturn= mb_convert_encoding($toReturn, 'UTF-8', 'UTF-8');
|
||||||
return $toReturn;
|
return $toReturn;
|
||||||
}
|
}
|
||||||
|
|
||||||
public function display(){
|
public function display(){
|
||||||
// $this
|
$this
|
||||||
// ->setContentType('application/atom+xml; charset=' . $this->getCharset())
|
->setContentType('application/atom+xml; charset=utf8') // We force UTF-8 in ATOM output.
|
||||||
// ->callContentType();
|
->callContentType();
|
||||||
|
|
||||||
return parent::display();
|
return parent::display();
|
||||||
}
|
}
|
||||||
|
|
|
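The AtomFormat hunk above makes two changes to the feed output: it strips `]]>` from entry content so the CDATA section cannot be closed early, and it drops invalid UTF-8 bytes with `mb_convert_encoding`. Both steps are sketched below as standalone helpers. The function names are hypothetical, the mbstring extension is assumed to be loaded, and the repeated removal in `wrapInCdata` is a variation of this sketch: the commit calls `str_replace` once, which can leave a `]]>` behind when removing one occurrence joins the characters around it.

```php
<?php
// Illustrative helpers only; the names are hypothetical and not part of the commit.

// Drop byte sequences that are not valid UTF-8 (requires the mbstring extension).
function stripInvalidUtf8($text)
{
    // 'none' tells mbstring to drop invalid bytes instead of substituting a character.
    mb_substitute_character('none');
    return mb_convert_encoding($text, 'UTF-8', 'UTF-8');
}

// Wrap text in a CDATA section after removing every "]]>" terminator.
function wrapInCdata($text)
{
    // Repeat until stable: deleting one "]]>" can form a new one from its neighbours.
    while (strpos($text, ']]>') !== false) {
        $text = str_replace(']]>', '', $text);
    }
    return '<![CDATA[' . $text . ']]>';
}

$clean = stripInvalidUtf8("abc\xFFdef");
var_dump(mb_check_encoding($clean, 'UTF-8'));

echo wrapInCdata('a]]]]>>b'), "\n"; // prints <![CDATA[ab]]>
```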
@ -16,10 +16,9 @@ class HtmlFormat extends FormatAbstract{
|
||||||
$entries = '';
|
$entries = '';
|
||||||
foreach($this->getDatas() as $data){
|
foreach($this->getDatas() as $data){
|
||||||
$entryUri = is_null($data->uri) ? $uri : $data->uri;
|
$entryUri = is_null($data->uri) ? $uri : $data->uri;
|
||||||
$entryTitle = is_null($data->title) ? '' : htmlspecialchars(strip_tags($data->title));
|
$entryTitle = is_null($data->title) ? '' : $this->sanitizeHtml(strip_tags($data->title));
|
||||||
$entryTimestamp = is_null($data->timestamp) ? '' : '<small>' . date(DATE_ATOM, $data->timestamp) . '</small>';
|
$entryTimestamp = is_null($data->timestamp) ? '' : '<small>' . date(DATE_ATOM, $data->timestamp) . '</small>';
|
||||||
$entryContent = is_null($data->content) ? '' : '<p>' . $data->content . '</p>';
|
$entryContent = is_null($data->content) ? '' : '<p>' . $this->sanitizeHtml($data->content). '</p>';
|
||||||
|
|
||||||
$entries .= <<<EOD
|
$entries .= <<<EOD
|
||||||
|
|
||||||
<div class="rssitem">
|
<div class="rssitem">
|
||||||
|
@ -52,7 +51,7 @@ EOD;
|
||||||
return $toReturn;
|
return $toReturn;
|
||||||
}
|
}
|
||||||
|
|
||||||
public function display(){
|
public function display() {
|
||||||
$this
|
$this
|
||||||
->setContentType('text/html; charset=' . $this->getCharset())
|
->setContentType('text/html; charset=' . $this->getCharset())
|
||||||
->callContentType();
|
->callContentType();
|
||||||
|
|
|
@ -90,6 +90,23 @@ abstract class FormatAbstract implements FormatInterface{
|
||||||
|
|
||||||
return $this->extraInfos;
|
return $this->extraInfos;
|
||||||
}
|
}
|
||||||
|
|
||||||
|
/**
|
||||||
|
* Sanitized html while leaving it functionnal.
|
||||||
|
* The aim is to keep html as-is (with clickable hyperlinks)
|
||||||
|
* while reducing annoying and potentially dangerous things.
|
||||||
|
* Yes, I know sanitizing HTML 100% is an impossible task.
|
||||||
|
* Maybe we'll switch to http://htmlpurifier.org/
|
||||||
|
* or http://www.bioinformatics.org/phplabware/internal_utilities/htmLawed/index.php
|
||||||
|
*/
|
||||||
|
public function sanitizeHtml($html)
|
||||||
|
{
|
||||||
|
$html = str_replace('<script','<‌script',$html); // Disable scripts, but leave them visible.
|
||||||
|
$html = str_replace('<iframe','<‌iframe',$html);
|
||||||
|
$html = str_replace('<link','<‌link',$html);
|
||||||
|
// We leave alone object and embed so that videos can play in RSS readers.
|
||||||
|
return $html;
|
||||||
|
}
|
||||||
}
|
}
|
||||||
|
|
||||||
class Format{
|
class Format{
|
||||||
|
|
|
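The `sanitizeHtml` method added in the hunk above disables `<script`, `<iframe`, and `<link` tags with `str_replace`, which is case-sensitive, so a tag written as `<SCRIPT` would pass through unchanged. A case-insensitive variant is sketched below. The function name `neutraliseTags` is hypothetical, and escaping the opening bracket as `&lt;` is a choice made for this sketch, not the replacement string used in the commit. As the method's own comment says, this kind of filtering cannot be complete.

```php
<?php
// Hypothetical variant of the sanitizing step; not the commit's implementation.
function neutraliseTags($html, array $tags = array('script', 'iframe', 'link'))
{
    foreach ($tags as $tag) {
        // str_ireplace() matches regardless of case. The opening bracket is escaped
        // so the tag is shown as text instead of being interpreted by the reader.
        $html = str_ireplace('<' . $tag, '&lt;' . $tag, $html);
    }
    // object and embed are left alone, as in the commit, so that videos can still play.
    return $html;
}

echo neutraliseTags('<SCRIPT>alert(1)</SCRIPT><a href="http://example.com/">link</a>'), "\n";
// prints &lt;script>alert(1)</SCRIPT><a href="http://example.com/">link</a>
```

Ordinary hyperlinks are untouched, which is the point of the change: links stay clickable in the ATOM and HTML output.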
@ -15,7 +15,7 @@ require __DIR__ . '/Cache.php';
|
||||||
|
|
||||||
$vendorLibSimpleHtmlDom = __DIR__ . PATH_VENDOR . '/simplehtmldom/simple_html_dom.php';
|
$vendorLibSimpleHtmlDom = __DIR__ . PATH_VENDOR . '/simplehtmldom/simple_html_dom.php';
|
||||||
if( !file_exists($vendorLibSimpleHtmlDom) ){
|
if( !file_exists($vendorLibSimpleHtmlDom) ){
|
||||||
throw new \HttpException('"PHP Simple HTML DOM Parser" is missing. Get it from http://simplehtmldom.sourceforge.net and place the script "simple_html_dom.php" in the same folder to allow me to work.', 500);
|
throw new \HttpException('"PHP Simple HTML DOM Parser" library is missing. Get it from http://simplehtmldom.sourceforge.net and place the script "simple_html_dom.php" in '.substr(PATH_VENDOR,4) . '/simplehtmldom/', 500);
|
||||||
}
|
}
|
||||||
require_once $vendorLibSimpleHtmlDom;
|
require_once $vendorLibSimpleHtmlDom;
|
||||||
|
|
||||||
|
|