How to configure consistent encoding in Cocoon

The best approach to internationalization, i.e. support for umlauts, special characters and non-English languages, is to handle everything in UTF-8, because it can represent every Unicode character and is widely supported by browsers, editors and libraries.

Note: If you need another encoding, replace all occurrences of UTF-8 with that one. This guide was only tested with UTF-8, however, so other encodings might not be supported in all places.

The following how-to covers the typical steps to achieve a consistent encoding throughout a Cocoon application. Some background information can be found at the end of this page.

1. Sending all pages in UTF-8

You need to configure Cocoon's serializers to use UTF-8. Both the XML serializer (<serialize type="xml" />) and the HTML serializer (<serialize type="html" />) need to be configured. To support all browsers, you must state the encoding used for the body in the Content-Type header and also include a meta tag in the HTML: <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">. This is very important, since the browser will then send form requests encoded in UTF-8 (browsers normally don't mention the encoding in the request, so you have to assume they are doing it right). The following serializer configuration for your sitemaps does that:

<serializer name="xml" mime-type="text/xml"
  src="org.apache.cocoon.serialization.XMLSerializer">
  <encoding>UTF-8</encoding>
</serializer>

<serializer name="html" mime-type="text/html; charset=UTF-8"
  src="org.apache.cocoon.serialization.HTMLSerializer">
  <encoding>UTF-8</encoding>

  <!-- the following common doctype is only included for completeness, it has no impact on encoding -->
  <doctype-public>-//W3C//DTD HTML 4.01 Transitional//EN</doctype-public>
  <doctype-system>http://www.w3.org/TR/html4/loose.dtd</doctype-system>
</serializer>

2. AJAX Requests with CForms/Dojo

If you use CForms with Ajax enabled, Cocoon uses dojo.io.bind() under the hood, which creates XMLHttpRequests that POST the form data to the server. Here Dojo decides the encoding by default, which does not match the browser's behaviour of using the charset defined in the META tag. You can easily tell Dojo which encoding to use for all dojo.io.bind() calls: include the following at the top of your HTML pages, before dojo.js is included:

<script>djConfig = { bindEncoding: "utf-8" };</script>

If you already have other djConfig options, simply add the bindEncoding property to the existing object.

3. Decoding incoming requests: Servlet Container

When the browser sends data to your server, e.g. form data, the ServletRequest is created by your servlet container, which needs to decode the parameters correctly into Java Strings. If the encoding is specified in the HTTP request header, the container will use it, but unfortunately this is typically not the case. When the browser sends a form post, it will only say application/x-www-form-urlencoded in the header. So you have to assume an encoding here, and the right thing to assume is the encoding of the page you originally sent to the browser.
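
The effect of assuming the wrong encoding can be reproduced outside a servlet container. The following is a minimal sketch using only the JDK's java.net.URLDecoder; the class and method names (FormDecodingDemo, decode) are made up for illustration, and a real container performs this decoding internally rather than through this class.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

// Illustrative only: mimics how a form value is turned into a Java String.
class FormDecodingDemo {

    // Decodes one percent-encoded form value using the given charset name.
    static String decode(String raw, String charset) {
        try {
            return URLDecoder.decode(raw, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charset, e);
        }
    }

    public static void main(String[] args) {
        // A browser sends a-umlaut (U+00E4) as %C3%A4 when the page was UTF-8.
        String raw = "%C3%A4";

        String right = decode(raw, "UTF-8");       // one character: U+00E4
        String wrong = decode(raw, "ISO-8859-1");  // two characters: U+00C3 U+00A4

        System.out.println("UTF-8 length:      " + right.length());
        System.out.println("ISO-8859-1 length: " + wrong.length());
    }
}
```

The request itself carries the same bytes in both cases; only the charset assumed during decoding differs, which is why the assumption has to match the encoding of the page that contained the form.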

The servlet standard says that the default encoding for incoming requests should be ISO-8859-1 (Jetty does not follow the standard here; it assumes UTF-8 by default). So to make sure UTF-8 is used for the parameter decoding, you have to tell your servlet container that encoding explicitly. This is done by calling ServletRequest.setCharacterEncoding() before any request parameters are read. To do that for all your requests, you can use a servlet filter like this one: SetCharacterEncodingFilter. Put this into one of your Cocoon blocks under src/main/java/my/package/filters/SetCharacterEncodingFilter so that the class ends up in a jar in WEB-INF/lib and is thus available for use in the web.xml configuration.

Then you add the filter to the web.xml:

<filter>
  <filter-name>Set Character Encoding</filter-name>
  <filter-class>my.package.filters.SetCharacterEncodingFilter</filter-class>
  <init-param>
    <param-name>encoding</param-name>
    <param-value>UTF-8</param-value>
  </init-param>
</filter>

<!-- either mapping to URL pattern -->

<filter-mapping>
  <filter-name>Set Character Encoding</filter-name>
  <url-pattern>/*</url-pattern>
</filter-mapping>

<!-- or mapping to your Cocoon servlet (the servlet-name might be different) -->

<filter-mapping>
  <filter-name>Set Character Encoding</filter-name>
  <servlet-name>CocoonBlocksDispatcherServlet</servlet-name>
</filter-mapping>

Since the filter element was added in the servlet 2.3 specification, you need at least version 2.3 in your web.xml. Using the current 2.4 version is better, though, as it is the standard for Cocoon web applications. For 2.4 you use an XSD schema:

<web-app version="2.4"
         xmlns="http://java.sun.com/xml/ns/j2ee"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://java.sun.com/xml/ns/j2ee http://java.sun.com/xml/ns/j2ee/web-app_2_4.xsd">

For 2.3 you need to modify the DOCTYPE declaration in the web.xml:

<!DOCTYPE web-app
    PUBLIC "-//Sun Microsystems, Inc.//DTD Web Application 2.3//EN"
    "http://java.sun.com/dtd/web-app_2_3.dtd">

4. Setting Cocoon's encoding (especially CForms)

To tell Cocoon to use UTF-8 internally, you have to set two properties:

org.apache.cocoon.containerencoding=utf-8
org.apache.cocoon.formencoding=utf-8

They need to be in some *.properties file under META-INF/cocoon/properties in one of your blocks. Note that the containerencoding must be the same as the one you specified in the SetCharacterEncodingFilter. But here we are using UTF-8 everywhere anyway.

5. XML Files

This is normally not a problem, since the default encoding for XML files is UTF-8. However, they should always start with the following XML declaration, which should also make your XML editor save them in UTF-8 (most editors appear to do that, so there should not be a problem here).

<?xml version="1.0" encoding="UTF-8"?>
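
If you want to check which encoding a file actually declares, the JDK's DOM parser exposes it. This is a minimal sketch, not part of Cocoon; the class and method names (XmlEncodingCheck, declaredEncoding) are hypothetical. Note that it only reports the declaration; it does not prove that the editor really saved the bytes in that encoding.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical helper: reads the encoding named in the XML declaration.
class XmlEncodingCheck {

    // Returns the declared encoding, or null if the declaration has none.
    static String declaredEncoding(byte[] xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml));
            return doc.getXmlEncoding();
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not parseable as XML", e);
        }
    }

    public static void main(String[] args) {
        byte[] xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><page/>"
                .getBytes(StandardCharsets.UTF_8);
        System.out.println("Declared encoding: " + declaredEncoding(xml));
    }
}
```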

6. Special Transformers

The standard XSLT transformers and others work on SAX events, which are not serialized, so encoding is not a problem there. But there are some special transformers that pass data on to another library that does include serialization and might need a hint to use the correct encoding. One example is the NekoHTMLTransformer: https://issues.apache.org/jira/browse/COCOON-2063.

If you think there might be a transformer doing things wrong in your pipeline, add a TeeTransformer between each step, outputting the XML between the transformers into temp1.xml, temp2.xml and so on, to find the place where your umlauts and special characters get messed up.

7. Your own XML serializing Sources

If you have your own Source implementation that needs to serialize XML, make sure it will do that in UTF-8 as well. A good idea is to use Cocoon's XML serializer, since we already configured that one to UTF-8 above. Sample code that does that is here: UseCocoonXMLSerializerCode

Further information

Browser encoding basics

Getting pages

If your Cocoon application needs to read request parameters that could contain special characters, i.e. characters outside the 128 ASCII characters, you'll need to pay attention to what encoding is used.

Normally a browser will send data to the server using the same encoding as the page containing the submitted form (or whatever). So if the pages are serialized using UTF-8, the browser will submit form data using UTF-8. The user can change the encoding, but it's quite safe to assume he/she won't do that (have you ever done it?).

The browser will read the encoding either from the <meta> tag inside the HTML <head>:

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

or from the HTTP Header Content-Type:

Content-Type: text/html; charset=UTF-8

One has to include both to support all browsers. This will be done by the HTML serializer if you configure it with the parameters mime-type and encoding, as stated above.

Sending form data

By default, if the browser doesn't explicitly mention the encoding, a servlet container will decode request parameters using the ISO-8859-1 encoding (independent of the platform on which the container is running). So in the above case where UTF-8 was used when serializing, we would be facing problems. One exception that might hide the problem is Jetty, which you will encounter when you use the handy mvn jetty:run to run your Cocoon application: it uses UTF-8 by default and does not adhere to the servlet container standard here.
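
The decision described above can be modelled in a few lines. This is a simplified sketch of the rule, not the code of any particular servlet container; the class name RequestCharsetDemo and the method charsetOf are invented for illustration, and real containers parse the header more carefully.

```java
import java.util.Locale;

// Simplified model of how a container picks the charset for parameter decoding.
class RequestCharsetDemo {

    // Default mandated by the servlet specification when nothing is declared.
    static final String SERVLET_DEFAULT = "ISO-8859-1";

    // Returns the charset parameter of a Content-Type value, or the default.
    static String charsetOf(String contentType) {
        if (contentType != null) {
            for (String part : contentType.split(";")) {
                String p = part.trim();
                if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                    String value = p.substring("charset=".length()).trim();
                    // Strip optional surrounding double quotes.
                    if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                        value = value.substring(1, value.length() - 1);
                    }
                    if (!value.isEmpty()) {
                        return value;
                    }
                }
            }
        }
        return SERVLET_DEFAULT;
    }

    public static void main(String[] args) {
        // Typical browser form post: no charset, so the default applies.
        System.out.println(charsetOf("application/x-www-form-urlencoded"));
        // Explicit charset in the header wins.
        System.out.println(charsetOf("application/x-www-form-urlencoded; charset=UTF-8"));
    }
}
```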

You either have to configure your container with the default encoding you want (e.g. UTF-8), if that is possible, or you must use a servlet-filter solution like the SetCharacterEncodingFilter. Using a servlet filter also has the advantage that it will work for any servlet. Suppose your webapp consists of multiple servlets, with Cocoon being only one of them. Sometimes the processing could start in another servlet (which sets the character encoding correctly) and then be forwarded to Cocoon, while other times the processing could start immediately in the Cocoon servlet. It would then be impossible to know in Cocoon whether the request parameter encoding needs to be corrected or not (see below).

Request parameter decoding in Cocoon

Fixing a wrong servlet container

If you are not able to set the default encoding for your servlet container to what you actually want, it is possible to configure Cocoon to re-decode parameters properly. Suppose the servlet container has ISO-8859-1 default encoding set, but the requests from the browser are actually encoded in UTF-8. Then you can configure Cocoon with these properties:

org.apache.cocoon.containerencoding=iso-8859-1
org.apache.cocoon.formencoding=utf-8

For Java-insiders: what Cocoon actually does internally is apply the following trick to get a parameter correctly decoded: suppose "value" is a string containing a request parameter, then Cocoon will do:

value = new String(value.getBytes("ISO-8859-1"), "UTF-8");

So it recodes the incorrectly decoded string back to bytes and decodes it using the correct encoding. The first (ISO-8859-1 in the example) is the containerencoding, the second one the formencoding.
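
The trick can be tried in isolation. The following sketch wraps the one-liner above in a hypothetical helper (RecodeDemo.recode is not a Cocoon API) and uses Unicode escapes so that the source file's own encoding does not matter. It works because ISO-8859-1 maps every byte to exactly one character, so the original bytes can always be recovered.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Standalone illustration of the re-decoding step; not Cocoon's actual class.
class RecodeDemo {

    // Turns the wrongly decoded string back into bytes and decodes it again.
    static String recode(String value, String containerEncoding, String formEncoding) {
        byte[] originalBytes = value.getBytes(Charset.forName(containerEncoding));
        return new String(originalBytes, Charset.forName(formEncoding));
    }

    public static void main(String[] args) {
        // Bytes the browser sent for a-umlaut (U+00E4) on a UTF-8 page.
        byte[] sent = "\u00e4".getBytes(StandardCharsets.UTF_8);

        // What a container with an ISO-8859-1 default hands to the application.
        String wrong = new String(sent, StandardCharsets.ISO_8859_1);

        String fixed = recode(wrong, "ISO-8859-1", "UTF-8");
        System.out.println("wrong length: " + wrong.length()); // 2
        System.out.println("fixed length: " + fixed.length()); // 1
    }
}
```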

Note that this only works for core Cocoon concepts, e.g. sitemaps, CForms and others accessing the request parameters. There are other components, e.g. the JSPGenerator, that access the original HttpServletRequest object and thus do not get the correctly re-decoded parameter values (that is, if for example the JSP page itself reads request parameters). The only working solution here seems to be the servlet filter.

Locally overriding the form-encoding

Cocoon is ideally suited for publishing to different kinds of devices, and it may well be possible that for certain devices, it is required to use different encodings. In this case, you can redefine the form-encoding for specific pipelines using the SetCharacterEncodingAction.

To use it, first of all make sure the action is declared in the map:actions element of the sitemap:

<map:action name="set-encoding" src="org.apache.cocoon.acting.SetCharacterEncodingAction"/>

and then call the action at the required location as follows:

<map:act type="set-encoding">
  <map:parameter name="form-encoding" value="some-other-encoding"/>
</map:act>

Operating System Preliminaries

Operating system language settings have no influence on request parameter decoding, but they sometimes cause trouble with text files, database communication, etc. Working with non-English characters may pose problems, as the JVM seems to detect the system language. If, e.g., German umlauts should be processed correctly with Cocoon on Linux, it is required to set the LANG environment variable to de like this:

export LANG=de

That's one of several ways of setting the JVM locale, see also SettingTheJvmLocale.
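
To see what the JVM actually picked up from the environment, you can print its defaults. This is a small diagnostic sketch using only standard JDK calls; the class name JvmLocaleInfo is made up, and the values printed depend entirely on your system and Java version.

```java
import java.nio.charset.Charset;
import java.util.Locale;

// Diagnostic sketch: shows the locale and charset defaults the JVM is using.
class JvmLocaleInfo {

    // Collects the relevant defaults into one line of text.
    static String describe() {
        return "locale=" + Locale.getDefault()
                + ", charset=" + Charset.defaultCharset()
                + ", file.encoding=" + System.getProperty("file.encoding");
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

Running it once with and once without the LANG setting shows whether the variable reaches the JVM that runs your servlet container.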

More readings