Doccano Labelling: Installation and Local Setup Best Practices

Step-by-step text classification project walkthrough using Doccano

Marc
7 min read · Aug 9, 2023
Doccano Logo (Source: GitHub)

Introduction

In the rapidly evolving world of artificial intelligence, accurate and high-quality labeled data plays a pivotal role in training successful language models. Data annotation, the process of assigning meaningful labels to raw data, can be a challenging and time-consuming task. As the demand for annotated data grows, the need for efficient and user-friendly tools becomes increasingly crucial.

Doccano is a versatile open-source tool designed to streamline the text annotation process. It offers a range of features that help data scientists and developers annotate large volumes of text efficiently. Through an intuitive web-based interface, users can quickly annotate text entries and build the labeled datasets that serve as the foundation for training machine learning models.

Why pick Doccano over alternative annotation tools?

  1. Ease of Installation and Setup: Doccano provides a straightforward installation process, ensuring users can quickly begin labeling without unnecessary complications.
  2. Open-Source and Customizable: Being an open-source tool, Doccano can be tailored and extended to meet specific project requirements.
  3. Collaboration and Teamwork: Doccano supports collaborative annotation projects, enabling multiple users to work simultaneously on the same dataset.
  4. Annotation Variety: Doccano supports various annotation types (for example text classification, sequence labeling, and sequence-to-sequence), making it suitable for a wide range of NLP tasks.
  5. Visualization and Quality Control: The platform provides data visualization tools that help users analyze and validate their labeled data.

Installation

To begin your journey with Doccano, you first need to install the platform on your local machine or server. Doccano is built using Python and Django, making it compatible with various operating systems.

Follow the steps below to install Doccano:

1. Prerequisites:

Before installing Doccano, ensure that you have the following prerequisites installed on your system:

  • Python version >= 3.8
  • pip (Python package manager)
  • Virtual environment (optional but recommended)
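The prerequisites above can be checked from Python itself before installing anything. The following is a minimal sketch, not part of Doccano: the function name check_prerequisites is hypothetical, and it only verifies the Python version stated above and whether pip is importable by the current interpreter.

```python
import importlib.util
import sys

MIN_PYTHON = (3, 8)  # minimum version listed in the prerequisites


def check_prerequisites(version_info=sys.version_info):
    """Return a list of human-readable problems (empty if all checks pass)."""
    problems = []
    major, minor = version_info[0], version_info[1]
    if (major, minor) < MIN_PYTHON:
        problems.append(
            f"Python {major}.{minor} found, but Doccano needs >= "
            f"{MIN_PYTHON[0]}.{MIN_PYTHON[1]}"
        )
    # pip ships as an importable package, so find_spec tells us if it is present
    if importlib.util.find_spec("pip") is None:
        problems.append("pip was not found for this interpreter")
    return problems


problems = check_prerequisites()
print("All prerequisites met" if not problems else "\n".join(problems))
```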

2. Create a Virtual Environment (Optional):

Creating a virtual environment is a good practice as it isolates your project’s dependencies from other Python projects on your system. To create a virtual environment, open your terminal or command prompt and execute the following command:

python3 -m venv doccano-env

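This command creates a new virtual environment named doccano-env; replace doccano-env with another name if you prefer. Activate the environment before installing anything: run source doccano-env/bin/activate on macOS/Linux, or doccano-env\Scripts\activate on Windows.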
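If you would rather script this step, the standard library's venv module does the same job as the command above. This is a sketch under the assumption that you want the same doccano-env directory; create_env is a hypothetical helper name, and the returned path is simply where venv places the interpreter on each platform.

```python
import sys
import venv
from pathlib import Path


def create_env(env_dir="doccano-env", with_pip=True):
    """Create a virtual environment and return the path to its interpreter."""
    venv.create(env_dir, with_pip=with_pip)
    if sys.platform == "win32":
        return Path(env_dir) / "Scripts" / "python.exe"
    return Path(env_dir) / "bin" / "python"


# Example (not executed here, because it creates a directory):
#   interpreter = create_env("doccano-env")
#   print(interpreter)
```

Running the returned interpreter with -m pip install doccano installs into the environment without activating it first.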

3. Install Doccano:

With the virtual environment activated (if you chose to create one), you can now install Doccano using pip. Run the following command in your terminal or command prompt:

pip install doccano

4. Initialize Database:

SQLite 3 is the default database used by Doccano, but you can configure an alternative. For example, to use PostgreSQL instead of SQLite 3, install its dependencies with the following command:

pip install 'doccano[postgresql]'

and set the DATABASE_URL environment variable to:

DATABASE_URL="postgres://${POSTGRES_USER}:${POSTGRES_PASSWORD}@${POSTGRES_HOST}:${POSTGRES_PORT}/${POSTGRES_DB}?sslmode=disable"
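Passwords containing characters such as @ or : will break a URL that is assembled by plain string substitution. As a sketch, the URL above could be built in Python with the credentials percent-encoded; build_database_url is a hypothetical helper, the example values are made up, and it assumes the URL is read by a parser that decodes percent-escapes.

```python
from urllib.parse import quote


def build_database_url(user, password, host, port, db, sslmode="disable"):
    """Assemble a postgres:// URL in the same shape as the template above."""
    return (
        f"postgres://{quote(user, safe='')}:{quote(password, safe='')}"
        f"@{host}:{port}/{quote(db, safe='')}?sslmode={sslmode}"
    )


# Made-up credentials, only to show the encoding of '@' and ':'
print(build_database_url("doccano", "p@ss:word", "localhost", 5432, "doccano"))
```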

To initialize the database, run:

doccano init

In this walkthrough we leave DATABASE_URL unset and keep the default SQLite database; the dataset itself is imported manually from .csv files.

5. Create a Superuser (Admin User):

To access the Doccano admin interface and manage projects, you need to create a superuser account by running the following command:

doccano createuser --username admin --password pass

Replace the --username and --password values with your own. A weak password such as pass is only acceptable on a throwaway local instance.
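One way to avoid a guessable password is to generate it. The sketch below only builds the command shown above with a random password; createuser_command is a hypothetical helper, and nothing is executed against Doccano.

```python
import secrets
import shlex


def createuser_command(username="admin", password=None):
    """Return the argument list for `doccano createuser` with a random password."""
    if password is None:
        # Hex output avoids a leading '-' that a CLI parser could read as a flag
        password = secrets.token_hex(16)
    return ["doccano", "createuser", "--username", username, "--password", password]


command = createuser_command()
print(shlex.join(command))  # copy this into your terminal, and store the password
```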

6. Start Task Queue

The task queue handles file uploads and downloads in Doccano. It is required for this walkthrough because the dataset is imported and exported as files.

To start the task queue, in a separate terminal run:

doccano task

7. Run Doccano

With everything set up, you are now ready to start Doccano. Run the following command:

doccano webserver --port 8000

and go to http://127.0.0.1:8000/.
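The web server can take a few seconds to start. If you script the setup, a small polling loop can confirm that the address above is answering before you open the browser. This is a sketch using only the standard library; wait_until_up is a hypothetical helper, and it treats any HTTP response (even an error status) as proof that the server is running.

```python
import time
import urllib.error
import urllib.request


def wait_until_up(url="http://127.0.0.1:8000/", timeout=30.0, interval=1.0):
    """Poll `url` until it answers or `timeout` seconds pass. Returns True/False."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5):
                return True
        except urllib.error.HTTPError:
            # The server replied with an error status, so it is running
            return True
        except OSError:
            # Connection refused or timed out: not ready yet, try again
            pass
        time.sleep(interval)
    return False


# Example (assumes `doccano webserver --port 8000` is already running):
#   print("ready" if wait_until_up() else "not reachable")
```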

Doccano can also be set up with Docker or Docker Compose; see the Doccano GitHub repository for details.

Project Setup

Now that Doccano is running on a web server, log in using the superuser credentials configured in step 5 of the installation steps above.

Doccano Login Screen

After logging in, proceed through the following steps to set up your project for text classification labeling.

1. Select a Project Template

To create a new project, click the Create button located on the top left-hand side of the page.

Create Button for Starting a New Project

By default the Text Classification template is already selected.

Alternative labeling templates include Sequence Labelling, Sequence-to-Sequence Labelling, Intent Detection and Slot Filling, Image Classification/Captioning, Object Detection/Segmentation, and Speech-to-Text.

2. Project Details

Each project requires a name and a description; all other settings are optional.

Project Detail Screen

For multi-label classification, leave Allow single label unticked. Otherwise, tick it so that only one label can be assigned to each data entry.

It is also advisable to tick Randomize document order, which ensures your data entries are not presented in the order they were imported. This is not essential, but I personally prefer it.

3. Creating your Labels

Once your project is created, navigate to the hamburger menu and select Labels.

Labels Option via the Project Hamburger Menu

Labels can be added manually or imported from a JSON file. Best practice is to import them from a JSON file: this ensures consistency, enables versioning, and, when multiple users contribute, keeps everyone’s label configuration identical.

The expected JSON format is:

[
  {
    "text": "dog",
    "suffix_key": "d",
    "background_color": "#FF0000",
    "text_color": "#76e32b"
  },
  {
    "text": "cat",
    "suffix_key": "c",
    "background_color": "#FF0000",
    "text_color": "#d45139"
  }
]

Assigning a suffix_key enables faster labeling via keyboard shortcuts, while background_color and text_color control the appearance of the label buttons in the UI.
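Since the label file is meant to be versioned and shared, it can help to generate it from a script rather than edit it by hand. The sketch below produces the JSON format shown above; make_labels is a hypothetical helper, and the checks (unique label text, unique shortcut, six-digit hex colors) are my own assumptions about what keeps a label set unambiguous, not rules quoted from Doccano.

```python
import json
import re

HEX_COLOR = re.compile(r"^#[0-9a-fA-F]{6}$")


def make_labels(specs):
    """Build label dicts from (text, suffix_key, background, text_color) tuples."""
    labels, seen_text, seen_keys = [], set(), set()
    for text, key, background, text_color in specs:
        if text in seen_text:
            raise ValueError(f"duplicate label text: {text!r}")
        if key in seen_keys:
            raise ValueError(f"duplicate suffix_key: {key!r}")
        for color in (background, text_color):
            if not HEX_COLOR.match(color):
                raise ValueError(f"not a #RRGGBB color: {color!r}")
        seen_text.add(text)
        seen_keys.add(key)
        labels.append({
            "text": text,
            "suffix_key": key,
            "background_color": background,
            "text_color": text_color,
        })
    return labels


labels = make_labels([
    ("dog", "d", "#FF0000", "#76e32b"),
    ("cat", "c", "#FF0000", "#d45139"),
])
print(json.dumps(labels, indent=2))  # redirect to a file such as labels.json
```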

4. Importing your Data

Once you have your labels imported, you’ll want to import your dataset.

Dataset Option via the Project Hamburger Menu

Navigate to the hamburger menu and select Dataset, then select the Actions dropdown and Import Dataset.

Select the format of the file you are importing, then drop the file into the box labeled Drop files here.... Once the upload completes, your dataset is loaded and ready for labeling.
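Because this walkthrough imports data from .csv files, a short script can prepare a clean file beforehand. This is a sketch: write_import_csv is a hypothetical helper, and the column names text and label are assumptions that should be matched to whatever the import dialog asks for. The label column can be left empty for entries that are not yet annotated.

```python
import csv


def write_import_csv(rows, path, text_column="text", label_column="label"):
    """Write (text, label) pairs to a CSV file, skipping blank texts.

    Returns the number of rows written.
    """
    written = 0
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=[text_column, label_column])
        writer.writeheader()
        for text, label in rows:
            text = text.strip()
            if not text:
                continue  # an empty entry cannot be annotated
            writer.writerow({text_column: text, label_column: label})
            written += 1
    return written


# Example:
#   write_import_csv([("A dog barked", ""), ("Cat, sleeping", "cat")], "import.csv")
```

The csv module quotes fields that contain commas or line breaks, which is the main reason to prefer it over joining strings by hand.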

Labeling

Once you have your labels and dataset imported, the next step is to start annotating!

Within the Dataset page each row has an Annotate button and a Status column (either Finished or In Progress). Clicking on the Annotate button returns the following UI:

Doccano Annotation UI

This view displays your imported labels followed by the text you’re annotating. For demonstration purposes, one example has already been labeled; progress is shown in the Progress cell.

After assigning one or more labels to a text entry, press ENTER to give the entry the CHECKED status. An entry marked as CHECKED displays a tick in the top-left box; otherwise a cross appears.

Exporting your Annotated Dataset

After annotating your dataset, you can export it via the Dataset -> Export Dataset path.

Export Dataset UI

Select the file format in which you would like to export your dataset, then select the Export only approved documents option. With this option selected, only data entries marked as CHECKED are exported.
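After exporting, a quick look at the label distribution is a cheap sanity check before training. The sketch below assumes a JSONL export, with one JSON object per line and a label field holding a list of label names; both the field name and the shape are assumptions, so adjust label_field to match your file. label_counts is a hypothetical helper.

```python
import json
from collections import Counter


def label_counts(lines, label_field="label"):
    """Count how often each label appears across JSONL records."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        labels = json.loads(line).get(label_field, [])
        if isinstance(labels, str):
            labels = [labels]  # single-label exports may store a plain string
        counts.update(labels)
    return counts


# Example (hypothetical file name):
#   with open("export.jsonl", encoding="utf-8") as handle:
#       print(label_counts(handle).most_common())
```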

Conclusion

This article provides a high-level walkthrough for setting up Doccano to perform text classification annotations. Installing Doccano and its dependencies is simple when running locally, and project setup is straightforward as long as you have your data readily available and your labels pre-defined in a JSON file.

This approach is more manual than most people would prefer: everything runs locally, and datasets are imported and exported by hand. Within an organization, Doccano can instead be hosted as a web application, enabling multi-user access.

When used for generating a labeled training dataset for machine learning purposes, active learning (auto labeling) can be performed by setting up a custom REST API request. Active learning drastically enhances annotation productivity and speed. More details on active learning can be found here.
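To give a feel for what sits on the other end of such a custom REST request, here is a toy service that suggests a label for a piece of text. It is only a sketch: the port, the {"text": ...} request body, the {"label": ...} response, and the keyword rules are all assumptions of mine, standing in for a real model. Whatever shapes you choose must be mirrored in Doccano's auto labeling configuration.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Placeholder rules; a real service would call a trained classifier instead
KEYWORDS = {
    "dog": ("bark", "puppy", "dog"),
    "cat": ("meow", "kitten", "cat"),
}


def suggest_label(text):
    """Return the first label whose keywords appear in the text, else None."""
    lowered = text.lower()
    for label, words in KEYWORDS.items():
        if any(word in lowered for word in words):
            return label
    return None


class SuggestHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        try:
            payload = json.loads(self.rfile.read(length) or b"{}")
        except json.JSONDecodeError:
            self.send_error(400, "request body must be JSON")
            return
        text = payload.get("text", "") if isinstance(payload, dict) else ""
        body = json.dumps({"label": suggest_label(str(text))}).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(host="127.0.0.1", port=8001):
    """Run the toy service until interrupted (port 8001 is an arbitrary choice)."""
    HTTPServer((host, port), SuggestHandler).serve_forever()


# Example: call serve() in a separate terminal, then POST {"text": "..."} to it.
```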

If you enjoyed reading this article, please follow me on Medium, X (Twitter), and GitHub for similar content relating to Data Science, Artificial Intelligence, and Engineering.

Happy learning! 🚀

Marc

Lead Data Scientist • Writing about Machine Learning, Artificial Intelligence and Engineering