Repeated unescaping of HTML content produces invalid HTML #5

@Joyfolk

$data = html_entity_decode($data);

This line produces invalid HTML for some documents (for example, /edsapi/rest/Retrieve?an=T115986&dbid=dmp) because the HTML content ends up decoded twice: &amp;lt; becomes < inside the HTML body.
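To make the double decoding concrete, here is a minimal sketch. The `<record>`/`<body>` fragment is made up for illustration and is not the actual API response; it only assumes the response is parsed with SimpleXML after the `html_entity_decode` call.

```php
<?php
// Hypothetical fragment: the body holds the literal text "<b>", escaped once
// for HTML (&lt;b&gt;) and once more for the XML transport (&amp;lt;b&amp;gt;).
$raw = '<record><body>a &amp;lt;b&amp;gt; c</body></record>';

// Current behaviour: decode the whole response, then SimpleXML decodes again.
$decoded = html_entity_decode($raw);   // &amp;lt; becomes &lt;
$twice   = (string) simplexml_load_string($decoded)->body;

// Without the extra decode, SimpleXML alone leaves the text properly escaped.
$once    = (string) simplexml_load_string($raw)->body;

var_dump($twice, $once);
```

In this sketch `$twice` contains a raw `<b>` where the document meant escaped text, while `$once` keeps `&lt;b&gt;`.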

There seems to be no reason to decode the whole response here: SimpleXML already decodes entities when it parses the document. The only thing left to decode is the content of the <ephtml> tags, which is double-encoded.
So this line should probably be something like this:

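// The "s" modifier lets "." match newlines, so multi-line <ephtml> blocks are matched too.
$data = preg_replace_callback('/<ephtml>(.*?)<\/ephtml>/s', function ($escaped) {
    // Decode only the matched <ephtml>...</ephtml> block, not the whole response.
    return html_entity_decode($escaped[0]);
}, $data);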
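A self-contained sketch of the same idea, to show the effect end to end. The helper name `decodeEphtmlOnly` and the sample `<rec>` document are hypothetical, not part of the library. It uses the `s` modifier on the assumption that an <ephtml> block may span several lines (`m` only changes how `^` and `$` behave).

```php
<?php
// Hypothetical helper: decode entities only inside <ephtml>...</ephtml> blocks.
function decodeEphtmlOnly(string $xml): string
{
    // "s" lets "." span newlines; ".*?" keeps each match to a single block.
    return preg_replace_callback(
        '/<ephtml>(.*?)<\/ephtml>/s',
        function (array $m): string {
            return '<ephtml>' . html_entity_decode($m[1]) . '</ephtml>';
        },
        $xml
    );
}

// Made-up sample: <title> is escaped once, <ephtml> content twice.
$raw = '<rec><title>A &amp;lt; B</title>'
     . '<ephtml>&amp;lt;p&amp;gt;x &amp;amp;lt; y&amp;lt;/p&amp;gt;</ephtml></rec>';

$xml   = simplexml_load_string(decodeEphtmlOnly($raw));
$title = (string) $xml->title;   // text outside <ephtml> stays escaped
$html  = (string) $xml->ephtml;  // <ephtml> content becomes real markup

var_dump($title, $html);
```

With this input, the title keeps its escaped `&lt;`, and only the <ephtml> content is turned into markup.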
