Doccano Labelling: Installation and Local Setup Best Practices
Introduction
In the rapidly evolving world of artificial intelligence, accurate and high-quality labeled data plays a pivotal role in training successful language models. Data annotation, the process of assigning meaningful labels to raw data, can be a challenging and time-consuming task. As the demand for annotated data grows, the need for efficient and user-friendly tools becomes increasingly crucial.
Doccano is a versatile open-source tool designed to streamline the text annotation process. It offers a range of features that help data scientists and developers annotate large volumes of text data efficiently. Through an intuitive web-based interface, users can quickly annotate text entries and build the labeled datasets that serve as the foundation for training machine learning models.
Why pick Doccano over alternative annotation tools?
- Ease of Installation and Setup: Doccano provides a straightforward installation process, ensuring users can quickly begin labeling without unnecessary complications.
- Open-Source and Customizable: Being an open-source tool, Doccano can be tailored and extended to meet specific project requirements.
- Collaboration and Teamwork: Doccano supports collaborative annotation projects, enabling multiple users to work simultaneously on the same dataset.
- Annotation Variety: Doccano supports various annotation types, making it suitable for a wide range of NLP tasks.
- Visualization and Quality Control: The platform provides data visualization tools that help users analyze and validate their labeled data.
Installation
To begin your journey with Doccano, you first need to install the platform on your local machine or server. Doccano is built using Python and Django, making it compatible with various operating systems.
Follow the steps below to install Doccano:
1. Prerequisites:
Before installing Doccano, ensure that you have the following prerequisites installed on your system:
- Python version >= 3.8
- pip (Python package manager)
- Virtual environment (optional but recommended)
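The prerequisites above can be checked from Python before installing anything. This is a minimal sketch rather than part of Doccano: the function name check_prerequisites is made up for illustration, and the 3.8 minimum is taken from the list above.

```python
import importlib.util
import sys


def check_prerequisites(min_version=(3, 8)):
    """Report whether the interpreter, pip, and venv meet the listed prerequisites."""
    return {
        # Compare only (major, minor) against the required minimum.
        "python_ok": sys.version_info[:2] >= min_version,
        # find_spec returns None when a module cannot be imported.
        "pip_available": importlib.util.find_spec("pip") is not None,
        "venv_available": importlib.util.find_spec("venv") is not None,
    }


print(check_prerequisites())
```

If any value comes back False, fix that prerequisite before moving on to the next step.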
2. Create a Virtual Environment (Optional):
Creating a virtual environment is a good practice as it isolates your project’s dependencies from other Python projects on your system. To create a virtual environment, open your terminal or command prompt and execute the following command:
python3 -m venv doccano-env
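This command creates a new virtual environment named doccano-env. You can replace doccano-env with a name of your choice. Activate the environment before continuing: on Linux or macOS run source doccano-env/bin/activate, and on Windows run doccano-env\Scripts\activate.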
3. Install Doccano:
With the virtual environment activated (if you chose to create one), you can now install Doccano using pip. Run the following command in your terminal or command prompt:
pip install doccano
4. Initialize Database:
SQLite 3 is the default database used by Doccano, but you can configure an alternative database if you prefer. For example, to use PostgreSQL instead of SQLite 3, install its dependencies using the following command:
pip install 'doccano[postgresql]'
and set the DATABASE_URL environment variable to:
DATABASE_URL="postgres://${POSTGRES_USER}:${POSTGRES_PASSWORD}@${POSTGRES_HOST}:${POSTGRES_PORT}/${POSTGRES_DB}?sslmode=disable"
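Passwords containing characters such as @ or / can break a database URL like the one above unless they are percent-encoded. The sketch below assembles the URL in Python under that assumption. The helper name build_database_url and the credentials are placeholders, and it assumes the URL parser on the receiving side decodes percent-encoding, which is the usual behaviour for database URLs.

```python
from urllib.parse import quote


def build_database_url(user, password, host, port, db, sslmode="disable"):
    """Assemble a postgres:// URL, percent-encoding the user name and password."""
    return (
        f"postgres://{quote(user, safe='')}:{quote(password, safe='')}"
        f"@{host}:{port}/{db}?sslmode={sslmode}"
    )


# Placeholder credentials for illustration only.
print(build_database_url("doccano", "p@ss/word", "localhost", 5432, "doccano"))
```

The printed value can then be exported as DATABASE_URL in the shell that runs the Doccano commands.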
To initialize the database, run:
doccano init
For this walkthrough we keep the default SQLite database, so the DATABASE_URL environment variable is left unset. Data will be imported manually from .csv files.
5. Create a Superuser (Admin User):
To access the Doccano admin interface and manage projects, you need to create a superuser account by running the following command:
doccano createuser --username admin --password pass
Change the --username and --password values as you see fit. For anything beyond a local test, choose a stronger password than the example above.
6. Start Task Queue
The task queue handles file uploads and downloads in Doccano. It is required for this use case because we import data from files rather than from a database.
To start the task queue, in a separate terminal run:
doccano task
7. Run Doccano
With everything set up, you are now ready to start Doccano. Run the following command:
doccano webserver --port 8000
and go to http://127.0.0.1:8000/.
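If the page does not load straight away, the server may still be starting. A small helper can poll the port until it accepts connections. This is a sketch using only the standard library; wait_for_port is an illustrative name, and the host and port defaults mirror the command above.

```python
import socket
import time


def wait_for_port(host="127.0.0.1", port=8000, timeout=30.0, interval=0.5):
    """Return True once a TCP connection succeeds, or False after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

Calling wait_for_port() with the defaults waits up to 30 seconds for the web server started above. Note that an open port only shows that a process is listening, not that Doccano has finished loading.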
Doccano can also be set up using Docker or Docker Compose. For more details, see the Doccano GitHub repository.
Project Setup
Now that Doccano is running on a web server, log in using the superuser credentials configured in step 5 of the installation steps above.
After logging in, proceed through the following steps to set up your project for text classification labeling.
1. Select a Project Template
To create a new project, click the Create button located on the top left-hand side of the page.
By default the Text Classification template is already selected.
Alternative labeling templates include Sequence Labelling, Sequence-to-Sequence Labelling, Intent Detection and Slot Filling, Image Classification/Captioning, Object Detection/Segmentation, and Speech-to-Text.
2. Project Details
Each project requires a project name and description. All other settings are optional.
If you're classifying text for multi-label purposes, keep Allow single label unticked. Otherwise, tick this option so that only a single label can be assigned per data entry.
It is also advisable to tick Randomize document order, which ensures your data entries are not processed in the order they were imported. This is not essential, but I personally prefer this option.
3. Creating your Labels
Once your project is created, navigate to the hamburger menu and select Labels.
Labels can be added manually or imported via a JSON file. Best practice is to import labels via a JSON file: this ensures consistency, enables versioning, and, when multiple users are contributing, keeps everyone's label configuration identical.
The expected JSON format is:
[
    {
        "text": "dog",
        "suffix_key": "d",
        "background_color": "#FF0000",
        "text_color": "#76e32b"
    },
    {
        "text": "cat",
        "suffix_key": "c",
        "background_color": "#FF0000",
        "text_color": "#d45139"
    }
]
Assigning a suffix_key allows for faster labeling via keyboard shortcuts, while background_color and text_color determine the appearance of the label buttons in the UI.
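For more than a handful of labels, writing this file by hand gets error-prone, especially when keeping shortcut keys unique. The sketch below generates the same structure from a list of label names. The helper build_labels and the colour palette are illustrative choices, and restricting shortcuts to ASCII letters and digits is an assumption about which keys are accepted.

```python
import json

# Illustrative palette; any hex colours work.
PALETTE = ["#FF0000", "#209CEE", "#23D160", "#FFDD57", "#7C20E0"]


def build_labels(names, text_color="#ffffff"):
    """Build label dicts with a unique single-character shortcut per label."""
    used = set()
    labels = []
    for index, name in enumerate(names):
        # Pick the first ASCII letter or digit of the name that is not taken yet.
        key = next(
            (c for c in name.lower() if c.isascii() and c.isalnum() and c not in used),
            None,
        )
        if key is None:
            raise ValueError(f"no free shortcut key for label {name!r}")
        used.add(key)
        labels.append({
            "text": name,
            "suffix_key": key,
            "background_color": PALETTE[index % len(PALETTE)],
            "text_color": text_color,
        })
    return labels


print(json.dumps(build_labels(["dog", "cat", "cow"]), indent=4))
```

Saving the printed output as a .json file gives something you can keep under version control and share with other annotators.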
4. Importing your Data
Once you have your labels imported, you’ll want to import your dataset.
Navigate to the hamburger menu and select Dataset, then open the Actions dropdown and select Import Dataset.
Choose the file format of the file you're importing and drop the file into the box displaying Drop files here.... Your dataset is now loaded and ready for labeling.
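If your raw text lives in a Python list or another source, the csv module takes care of quoting commas and quotation marks for you. This sketch assumes the importer reads the text from a column named text. Check the column settings shown in the import dialog, since write_import_csv and its default column name are illustrative.

```python
import csv
import io


def write_import_csv(texts, handle, text_column="text"):
    """Write one text entry per row under a single header column."""
    writer = csv.writer(handle)
    writer.writerow([text_column])
    for text in texts:
        writer.writerow([text])


# An in-memory buffer keeps the example self-contained.
buffer = io.StringIO()
write_import_csv(["My dog barks, loudly.", 'She said "meow".'], buffer)
print(buffer.getvalue())
```

To write a real file, open it with open(path, "w", newline="", encoding="utf-8") and pass the handle instead of the buffer.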
Labeling
Once you have your labels and dataset imported, the next step is to start annotating!
Within the Dataset page, each row has an Annotate button and a Status column (either Finished or In Progress). Clicking the Annotate button opens the annotation view.
This view displays your imported labels followed by the text you're annotating. Overall progress is shown in the Progress cell.
After each text entry has been assigned one or more labels, press ENTER to assign the CHECKED status to the data entry. An entry marked as CHECKED displays a tick in the top-left box; otherwise a cross appears.
Exporting your Annotated Dataset
After annotating your dataset, you can export it via the Dataset -> Export Dataset path.
Select the file format in which you would like to export your dataset and select the Export only approved documents option. With this option selected, only data entries that have been marked as CHECKED are exported.
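A quick check of the label distribution in the exported file helps catch class imbalance before training. The sketch below assumes a JSONL export in which each line is a JSON object with a label field holding a list of label names. Adjust the field name if your export differs; the sample records are made up.

```python
import json
from collections import Counter


def count_labels(lines):
    """Count label occurrences in JSONL lines that carry a "label" list."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        counts.update(record.get("label", []))
    return counts


# Made-up sample records standing in for an exported file.
sample = [
    '{"id": 1, "text": "My dog barks.", "label": ["dog"]}',
    '{"id": 2, "text": "Cats and dogs.", "label": ["cat", "dog"]}',
]
print(count_labels(sample))
```

For a real export, pass an open file handle, since iterating over it yields one line at a time.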
Conclusion
This article provides a high-level walkthrough for setting up Doccano to perform text classification annotation. Installing Doccano and its dependencies locally is simple, and project setup is straightforward as long as you have your data readily available and your labels pre-defined in a JSON file.
This approach is more manual than most people would prefer: everything runs locally, and datasets are imported and exported by hand. Within an organization, Doccano can be hosted as a web application to enable multi-user access.
When Doccano is used to generate a labeled training dataset for machine learning, active learning (auto labeling) can be performed by setting up a custom REST API request. Active learning can substantially improve annotation productivity and speed. More details on auto labeling can be found in the Doccano documentation.
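To give a feel for what the service behind such a custom REST request might look like, here is a minimal sketch built on the standard library. Everything in it is an assumption for illustration: the keyword rules stand in for a trained model, and the request body ({"text": ...}) and response shape ({"label": ...}) are hypothetical and would need to match whatever you configure in Doccano's auto labeling settings.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical keyword rules standing in for a trained model.
KEYWORDS = {"dog": ("dog", "bark", "puppy"), "cat": ("cat", "meow", "kitten")}


def predict_label(text):
    """Return the first label whose keywords appear in the text, else None."""
    lowered = text.lower()
    for label, words in KEYWORDS.items():
        if any(word in lowered for word in words):
            return label
    return None


class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read the JSON request body and extract the text to classify.
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        result = {"label": predict_label(payload.get("text", ""))}
        body = json.dumps(result).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def run_server(host="127.0.0.1", port=5000):
    """Serve predictions until interrupted."""
    HTTPServer((host, port), PredictHandler).serve_forever()
```

Calling run_server() starts the endpoint locally. In practice, predict_label would wrap a model trained on the labels exported so far.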