Tagged: open source
Making Content Work
Most ECM projects revolve around managing content and using it in business processes. We are very familiar with the benefits of managing content, preserving it, and using it in business transactions. The majority of ECM investment goes into ensuring that content is not lost, that it is available to the relevant people at the right time, and that it is presented to users to complete the task at hand. Documents are also searched for after their active use in a business transaction is over.
The underlying assumption has always been that the content needs to be presented to a user when she needs it. So we attach necessary and sufficient metadata to our content, or even make it full-text searchable. Ninety percent of ECM users stop right there in their utilization of the content at their disposal.
The digital world is witnessing an analytics wave. Trends and insights are the most sought-after buzzwords in the industry now. Big and small data are being analyzed left and right in search of the grains of wisdom that could ultimately provide that elusive competitive advantage. One interesting observation from AIIM research is that it is far easier to gain insights from publicly available data than from an organization’s internal resources. It is true that the first place anybody would look for a piece of information is Google, not an internal ECM repository. But there is a huge difference when it comes to looking for insights: the information we keep in our internal repositories is far more relevant to our organization than anything Google can provide. So there has to be an effort to utilize the content that we store.
This is precisely where content analytics comes in. Even though no content analytics tool available today can match the human brain’s ability to decipher information, they provide a good start. Additionally, such tools can process vast amounts of content, extract information, and, to an extent, apply semantic interpretation. Most of the tools are self-learning, with the analytics improving over time. Content analytics works much better with semi-structured information such as Twitter feeds and Facebook comments than with unstructured long-form content.
At a high level, content analytics goes through four major steps: bringing in content, extracting information, analysis, and generating output.
To bring content into an analytics tool, one can employ crawlers or import mechanisms. Crawlers are common among the commercial tools available. They let the analytics platform look for new information in specified sources and bring content in as and when it becomes available. Crawlers can work with internal content stores, including ECM repositories and shared drives, or even the Internet. Most tools also provide options to push content to the platform, either manually or automatically.
The next step is to extract information from the content, which could come in many forms: text data, office documents, images, audio, or video. Information needs to be extracted as text to feed into the analytics module, and this step employs filters to extract text data from the input content. A wide range of tools and technologies are available that help extract information from varied source formats.
The analysis step is the most crucial part of content analytics. This is where the software reads and analyzes the inputs by applying text analytics algorithms. Text analytics involves sentence detection, tokenization, part-of-speech tagging, classification or annotation, entity and relationship identification, and so on.
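The first of these stages can be illustrated with a toy pass built on plain regular expressions: sentence detection, tokenization, and a naive "entity" spotter that just picks out capitalized token runs. Real engines use trained models for each stage, so this is the shape of the pipeline, not a working analyzer.

```python
import re

def analyze(text):
    """Toy text-analytics pass: sentences, tokens, naive entities.

    Sentence detection: split after ., ! or ? followed by whitespace.
    Tokenization: word characters only. Entity spotting: runs of
    capitalized words, which is deliberately simplistic.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    tokens = [re.findall(r"\w+", s) for s in sentences]
    entities = re.findall(r"[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*", text)
    return {"sentences": sentences, "tokens": tokens, "entities": entities}
```

Each downstream stage (classification, relationship extraction) would consume these intermediate structures rather than the raw text.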
Once text analysis is complete, the tools let the analyzed and extracted information be formatted according to downstream processing needs and then exported to the relevant systems.
Content analytics will become more and more prominent as time goes by, and the technology will evolve to a much higher level of acceptance. Even though there are many commercial packages available, much of the research is pioneered in the open source domain. In my opinion, content analytics is one of the technologies that we as ECM professionals can quickly subscribe to and use to provide considerable value to our customers.
Alfresco Community Edition
How does one eat a pizza? Most people take a slice in their hand and start biting from the narrow top. Some use a knife and a fork to neatly cut the pizza slice into thin pieces. There are children who scrape the cheesy bits and eat them with a fork and discard the base totally. There could be numerous other ways of eating the same pizza. If not, this will make a good research topic.
The Alfresco community edition is like that ill-fated pizza slice, in my opinion: there are innumerable ways in which people use it. Many use the community edition out of the box. A large section uses Alfresco Share or Workdesk as the UI on top of the out-of-the-box community edition server. Others dive deep into the code and make the changes they need (with or without contributing them back to the community). Still others build their own applications but use Alfresco as the repository.
Alfresco is the numero-uno open source ECM platform out there. Most customers who think of scaling the ECM tree would have downloaded and played with the Alfresco community edition. We did the same thing long ago and decided to use Alfresco community edition as yet another supported repository for our ECM UI framework product.
Having worked with ECM products such as FileNet, we were always apprehensive about the scalability of Alfresco, the community edition to be precise. We have seen enormous volumes of documents going into and coming back from FileNet repositories seamlessly, and thousands of users working with their documents and tasks in FileNet-based applications. FileNet, in any case, runs on high-horsepower servers in a clustered or farmed environment to scale. Alfresco’s hardware resource requirements, on the other hand, are minimal; I can easily run Alfresco on my 32-bit laptop. Naturally we sell FileNet-based solutions to customers who handle high volumes or have many users, while Alfresco community edition offerings are typically for customers with lower volumes and fewer users.
Recently one of our customers in India reported a performance issue with their Alfresco community edition installation. They have fewer than 10 users but rather large volumes of documents. The customer uses our capture solution as well as a document management application that uses the Alfresco community edition as the underlying repository. The issue was that the system was very slow. At first we felt vindicated in our assumption that the Alfresco community edition cannot scale beyond a point.
A closer look at the issue revealed that there might be a way out. The customer uses our capture product to ingest anywhere between 15,000 and 25,000 documents a day into the repository. All the documents for a day go into a folder created specifically for that date. Further analysis prompted us to suspect that too many documents in one folder could be what was hindering performance. So we changed the capture export configuration to create sub-folders within the day folder and limit each sub-folder to fewer than 2,000 documents. SharePoint used to have a performance issue when the number of documents in a folder exceeded 2,000, and that awareness may have prompted us to try something like this. Anyway, the change worked like a charm and the repository sucked in the pending documents in a jiffy.
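The partitioning rule itself is simple arithmetic on a document's position within the day's batch. The helper below is an illustrative sketch (the `batch-NNN` naming is hypothetical, not our product's actual export configuration), showing how a 2,000-document cap maps each document to a sub-folder.

```python
def subfolder_for(doc_index, max_per_folder=2000):
    """Sub-folder name for the Nth document of the day (0-based),
    capping each sub-folder at `max_per_folder` entries.

    The 'batch-NNN' scheme is illustrative; any stable naming works.
    """
    return "batch-%03d" % (doc_index // max_per_folder)
```

With a 25,000-document day, this yields thirteen sub-folders (`batch-000` through `batch-012`), keeping every folder comfortably under the cap.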
The customer is using the system heavily, and so we are delighted too. As of now the customer has more than 4 million documents in the repository, and the entire ECM infrastructure runs on a single lower-end server. The return on investment on this solution has been tremendous. It might not be a bad idea to get an ECM setup on the Alfresco Community Edition after all!