
Feed aggregator

SharePoint 2016: What's new and what to expect?

Yann Neuhaus - Mon, 2015-05-11 02:11

The Ignite event took place in Chicago last week. Based on what was presented there, here are the first details about the new SharePoint 2016!

SharePoint 2016 looks like an enhancement of the existing SharePoint 2013 version.
Most of the SP2013 features will continue to work but in a more powerful way.

SharePoint 2016 will be focused on the following points:
  • Security Compliance
  • Advanced Analytics and Insights (data analysis and reporting)
  • Cloud experience
The new features for this release include:
  • Links: All URLs will be “Resource ID” based URLs. That means that if one of your documents is renamed, the document link won’t be broken.
  • User Profile Service Application won’t be available anymore.
  • In order to use the User Profile Services, you will need to use Forefront Identity Manager, separated from the SharePoint farm (outside of it). The Active Directory connection will still be there, but there is no built-in synchronization anymore.
  • SharePoint 2016 is a cloud-inspired infrastructure. The emphasis here is on building a hybrid platform to let users benefit from the enhanced cloud innovations that Microsoft regularly releases for Office 365. One of the most important experiences that Microsoft will be introducing to on-premises clients is the use of cloud-based Search. The power of Delve & Office Graph can now also be applied to on-premises content and, of course, to the integrated cloud.
  • Add On-Premises Content to online Delve board
  • Zero downtime Patching: all upgrades of SharePoint can be done online.

  • New Service applications will have an integration option for Delve
  • The New Hybrid Extranet will allow many scenarios
  • New Repair button to restore a server's services depending on its role
  • Cloud subscriptions of services such as Azure Rights Management Services can be added to an on-premises setup
  • The InfoPath 2013 application will work with SharePoint Server 2016
Analytics and Insights:

There is a huge improvement in how SharePoint records and displays data analytics and reporting. SharePoint 2016 will provide monitoring for:

  • Services
  • Actions
  • Usage
  • Engagement
  • Diagnostics
  • In-Memory Analytics (with the SQL Server 2016 release). Let's see in the next few months what will be announced; maybe some enhancements to the integration between BI and SP2016.

Social:
  • Social improvements start with better Yammer integration to avoid social silos: better integration with the micro-blogging platform, with new capabilities.
Content Management capabilities:
  • Improved content limits: the character limit on URLs is removed, the maximum file size is increased to 10GB, and the 5000-item list view threshold has been raised

  • Encryption enhancements
  • Multi-factor authentication
  • Data loss prevention
  • eDiscovery
  • Metadata management
  • Policy control
  • Mobile device management
Installation & Configuration Enhancements:
  • OS – Windows Server 2012 R2, Windows Server 10 (Technical Preview)
  • The new server installation wizard will allow users to perform role-based installations
  • Database – 64-bit SQL Server 2014
A brief summary of SharePoint 2016 using one word would be: CLOUD

 

what will be coming with the next major version of postgresql ?

Yann Neuhaus - Sun, 2015-05-10 13:14

A PostgreSQL major version is usually released once a year. For the next major version, which is 9.5, this probably will be in the second half of 2015.

Variations on 1M rows insert (4): APPEND_VALUES

Yann Neuhaus - Sun, 2015-05-10 12:39

In the first variation I did a bulk insert with the PL/SQL FORALL statement. A comment suggested adding the APPEND_VALUES hint that appeared in 11gR2. The APPEND_VALUES hint does the same as APPEND but for inserts with a VALUES clause: it inserts in direct-path mode - directly into the datafiles, bypassing the buffer cache. Of course, this makes sense only with bulk inserts because you don't want to do that for a single row.

APPEND_VALUES

So, from the previous test, inserting 1 million rows in only one transaction (no intermediate commits) into a table with only one index (primary key) took 46 seconds. With bulk insert, the time is only 7 seconds. Let's do the same bulk, but with the APPEND_VALUES hint. I changed only the insert line:

SQL> create table DEMO ("id" number primary key, "text" varchar2(15), "number" number) ;
Table created.

SQL> set timing on
SQL> declare
  2   type people_array is varray(12) of varchar(15);
  3   people people_array := people_array( 'Marc', 'Bill', 'George', 'Eliot', 'Matt', 'Trey', 'Tracy','Greg', 'Steve', 'Kristina', 'Katie', 'Jeff' );
  4   people_count number :=people.COUNT;
  5   type DEMO_record_type is record ("user_id" number , "name" varchar2(15), "number" number) ;
  6   type DEMO_collection_type is table of DEMO_record_type index by binary_integer;
  7   DEMO_collection DEMO_collection_type;
  8  begin
  9   -- fill the collection
 10   for i in 1..1e6 loop
 11    DEMO_collection(i)."user_id":=i;
 12    DEMO_collection(i)."name":=people( dbms_random.value(1,people.count) );
 13    DEMO_collection(i)."number":=trunc(dbms_random.value(0,10000));
 14    null;
 15   end loop;
 16   forall i in 1..DEMO_collection.count insert /*+ append_values */ into DEMO values DEMO_collection(i);
 17   commit;
 18  end;
 19  /
and here is the execution time:
PL/SQL procedure successfully completed.
Elapsed: 00:00:05.28
It's a bit faster. However, this testcase is not very well suited to show the benefit.

First, you must know that inserting in direct-path mode locks the table. We bypass the buffer cache, and it's the buffer cache that manages concurrency when updating blocks. Only one session can modify a block, and that holds in RAC as well. So, when you bypass the buffer cache you must lock the table (or the partition if you insert into a partition) to be sure you are the only one inserting after the high water mark.
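As a side note, another visible consequence of bypassing the buffer cache is that, within the same transaction, you cannot even read the table after a direct-path insert. This is only a minimal sketch from memory, not part of the test above, but it is an easy way to verify that APPEND_VALUES really switched to direct-path:

SQL> insert /*+ append_values */ into DEMO values (0,'test',0);
1 row created.

SQL> select count(*) from DEMO;
ERROR at line 1:
ORA-12838: cannot read/modify an object after modifying it in parallel

SQL> rollback;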

Now about the testcase. Here I insert into an empty table. The direct-path insert is very efficient for index maintenance because instead of maintaining the index row by row, the index is built afterwards (the data inserted after the high water mark is read and sorted) and merged into the existing index. This optimization gives better performance on an index that has a higher depth.

APPEND

If you want to have an idea of how fast direct-path insert is, let's see how long it takes to insert the 1 million rows into another table - identical, with a primary key. This is easy with CTAS:

SQL> set timing on
SQL> create table DEMO2 ("id" primary key , "text" , "number" ) as select * from DEMO;
Table created.
Elapsed: 00:00:01.01
One second. That's the fastest way to insert 1 million rows in batch when you have them in a table (or file through external table), when you can lock the whole table and you don't need to do intermediate commits.
We can also do the same with INSERT /*+ APPEND */
SQL> truncate table DEMO2 reuse storage;
Table truncated.
Elapsed: 00:00:00.17

SQL> insert /*+ append */ into DEMO2 select * from DEMO;
1000000 rows created.
Elapsed: 00:00:01.36
And it's the same time (I'll not compare when the difference is less than one second).

The index maintenance is fast, but you can go further by disabling the indexes (ALTER INDEX ... UNUSABLE) before the load and rebuilding them afterwards.
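As a minimal sketch of that approach (DEMO2_I is a hypothetical non-unique secondary index; a unique or primary key index cannot stay unusable for DML, you would have to disable the constraint instead):

SQL> alter index DEMO2_I unusable;
SQL> alter session set skip_unusable_indexes=true;
SQL> insert /*+ append */ into DEMO2 select * from DEMO;
SQL> commit;
SQL> alter index DEMO2_I rebuild;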

However, my initial testcase was made thinking of OLTP: concurrent inserts from different sessions. Direct-path writes are definitely not for that as they lock the table. But I wanted to show how small 1 million rows is when we are doing things in bulk from one session.

Kerberos SSO for Documentum CS 6.7 SP1

Yann Neuhaus - Sun, 2015-05-10 11:00


In a previous post, I shared some tips to configure Kerberos SSO with Documentum D2 3.1 SP1. Since that day, I have worked on different projects to also set up Kerberos SSO on some other components of Documentum. In this post I will try to explain in detail what needs to be done to configure Kerberos SSO for the Content Server. Actually it's not that hard to do, but you may face some issues if you try to follow the official documentation of EMC.


So what are the prerequisites to set up Kerberos SSO for the Content Server? Well, in fact you just need a Content Server of course and an Active Directory to generate the keytab(s). Just to let you know, I used a Content Server 6.7 SP1 and an Active Directory on a Windows Server 2008 R2. Let's define the following properties:

  • Active Directory - user = cskrb
  • Active Directory - password = ##cskrb_pwd##
  • Active Directory - domain = DOMAIN.COM
  • Active Directory - hostname1 = adsrv1.domain.com
  • Active Directory - hostname2 = adsrv2.domain.com
  • Documentum - repository (docbase) = REPO


I. Active Directory prerequisites


As always when working with Kerberos on an Active Directory, the first thing to do is to create a user. So let's create this user with the following properties:

  • User name: cskrb
  • Support AES 128 bits encryption
  • WARNING: This account MUST NOT support AES 256 bits encryption
  • Trust for Delegation to any service (Kerberos Only)
  • Password never expires
  • Account never expires
  • Account not locked


Once the user has been created, you can proceed with the keytab creation using the command prompt on the Active Directory host:

cs_keytab.png
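The screenshot above shows the ktpass command. As a rough sketch (the SPN format CS/REPO and the output file name are assumptions to adapt to your environment), it would look like this:

ktpass /princ CS/REPO@DOMAIN.COM /mapuser cskrb@DOMAIN.COM /pass ##cskrb_pwd## /crypto AES128-SHA1 /ptype KRB5_NT_PRINCIPAL /out REPO.keytab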


According to the documentation of EMC, you can create one keytab with several keys inside for the Documentum repositories. Actually, that's wrong! It's not possible in the Microsoft world to generate a keytab with more than one Service Principal Name (SPN) in it, only the Linux implementations of Kerberos allow that. If you try to do so, your Active Directory may loop forever trying to add a second SPN to the keytab. That will considerably slow down your Active Directory and it may even crash...


If you want to setup the Kerberos SSO for more than one repository, you will have to create one user per repository and generate one keytab per user. So just repeat these two steps above for each repository, replacing the user name, user password and repository name... What is possible with an Active Directory is to map more than one SPN to a user. That can be useful for a Load Balancer setting for example but the keytab will always contain one SPN and therefore it seems that this solution isn't suitable for the Content Server.


The second remark here is that the documentation of EMC often uses only DES encryption for the keytab but, as shown above, you can of course specify the encryption to use or simply specify "ALL" to add all possible encryption types to the keytab. By default Kerberos will always use the strongest encryption. In our case, as the Content Server doesn't support AES 256 bits encryption, AES 128 bits encryption will be used instead.


II. Configuration of the Content Server side


So let's start the configuration of Kerberos SSO for the Content Server. The first thing to do is of course to transfer the keytab created previously (REPO.keytab) from the Active Directory to the Content Server's host. This host can be a Windows Server or a Linux Server; it doesn't matter as long as it is part of your enterprise network (and properly configured). In this post, I will use a Linux server because we usually install Documentum on Linux.


During the installation of the Content Server, the installer creates some default authentication folders, some security elements, and so on. Therefore, you have to put the newly created keytab in a specific location for the Content Server to automatically recognize it. Please make sure that the keytab belongs to the Documentum installation owner (user and group) on the file system with the appropriate permissions (640). The correct location is:

$DOCUMENTUM/dba/auth/kerberos/
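Once the keytab is in place, a quick sanity check can be done with the MIT Kerberos client tools, assuming they are installed on the Content Server host (the -e option also displays the encryption types stored in the keytab):

klist -k -t -e $DOCUMENTUM/dba/auth/kerberos/REPO.keytab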



Then create the file "/etc/krb5.conf" with the following content:

[libdefaults]
noaddresses = true
udp_preference_limit = 1
default_realm = DOMAIN.COM
default_tgs_enctypes = aes128-cts arcfour-hmac-md5 des-cbc-md5 des-cbc-crc rc4-hmac
default_tkt_enctypes = aes128-cts arcfour-hmac-md5 des-cbc-md5 des-cbc-crc rc4-hmac
permitted_enctypes = aes128-cts arcfour-hmac-md5 des-cbc-md5 des-cbc-crc rc4-hmac
dns_lookup_realm = true
dns_lookup_kdc = true
passwd_check_s_address = false
ccache_type = 3
kdc_timesync = 0
forwardable = true
ticket_lifetime = 24h
clockskew = 72000

[domain_realm]
.domain.com = DOMAIN.COM
domain.com = DOMAIN.COM
adsrv1.domain.com = DOMAIN.COM
adsrv2.domain.com = DOMAIN.COM

[realms]
DOMAIN.COM = {
master_kdc = adsrv1.domain.com:88
kdc = adsrv1.domain.com:88
kpasswd = adsrv1.domain.com:464
kpasswd_server = adsrv1.domain.com:464
kdc = adsrv2.domain.com:88
kpasswd = adsrv2.domain.com:464
kpasswd_server = adsrv2.domain.com:464
}

[logging]
default = /var/log/kerberos/kdc.log
kdc = /var/log/kerberos/kdc.log

[appdefaults]
autologin = true
forward = true
forwardable = true
encrypt = true



You can of course customize this content with what is suitable for you depending on your environment. Moreover, if the file "/etc/krb5.conf" already exists and you don't want to modify it, or you can't modify it, then you can still create this file wherever you want. For example, create the folder "$DOCUMENTUM/kerberos/" and put this file inside. Then edit the file "~/.bash_profile" and add the following line to it to reference this new location:

export KRB5_CONFIG=$DOCUMENTUM/kerberos/krb5.conf



Once done, simply restart your ssh session or source the file "~/.bash_profile" for the new environment variable to be available.


The last thing to do is to refresh the Kerberos configuration for the Content Server to know that a keytab is available on the file system (and therefore enable the dm_krb authentication plugin). This process is known as the re-initialization of the Content Server. There are two main ways to do it: with or without Documentum Administrator (DA). When using DA, you can simply click a button while the Content Server is running, and that's the main advantage of this method. If you don't have DA installed, then I guess you will have to restart the Content Server for the changes to take effect.


To re-initialize the Content Server using DA, here is what needs to be done:

  • Open DA in Internet Explorer
  • Log in to the repository "REPO" with the account of the Documentum installation owner or any account with sufficient permissions
  • Expand the "Basic Configuration" item
  • Click on "Content Servers"
  • Right-click on the repository you are connected to
  • Click on "Properties"
  • Check "Re-Initialize Server"
  • Click on "Ok"


Once done, you should be able to confirm that the reload of the dm_krb plugin was successful by checking the log file of the repository:

$DOCUMENTUM/dba/log/REPO.log



If everything goes well, you will see some lines showing that the Content Server was able to parse the keytab successfully. In a next blog post, I will certainly explain how to configure Kerberos SSO for the DFS part. Stay tuned!


bulk loading semi structured data in postgresql

Yann Neuhaus - Sun, 2015-05-10 10:00

The last post took a look at how to efficiently load 1m rows into a table in PostgreSQL. In this post I'll take a look at how to do the same with semi-structured data.

D2 performance issues due to KB3038314 IE patch

Yann Neuhaus - Sun, 2015-05-10 08:58

I ran into a strange issue at a customer. When trying to open a huge virtual document in D2's right panel, the browser freezes.

It seems to be due to an Internet Explorer security patch which introduces huge performance issues. So if you run into strange issues with your web browser, check the patch version of IE. The security patch which causes the issue is KB3038314.

Alfresco: some useful database queries

Yann Neuhaus - Sun, 2015-05-10 04:50


In my previous post, I talked about the lifecycle of Alfresco nodes. You may have noticed that I tried to insert into my explanations some elements that are specific to databases (tables, fields, and so on). These elements are quite essential to prepare a post like this one, which is more database oriented. I already explained what exactly the consequences are on the database side when a node is removed, and I will try in this post to share some useful queries regarding these points, but not only that!


For this post, I used my local Alfresco Community 4.2.c installation with a PostgreSQL database. For your information, it just takes 30 minutes to get this test environment ready with the Alfresco installer (Windows, Mac or Unix). Of course, using the database directly for your daily administration work is certainly not the best idea, but in some cases it can really be faster and easier to just run some SQL commands at the DB level...


I. Document information


So let's start this post with some generic queries that can be used to retrieve some information about documents. In this part, all columns of the results will be the same because I just pick the same fields in my queries, but the filter part (the WHERE clause) changes a little bit to retrieve the information from different elements.

The first command I would like to show you is how to retrieve some information about documents based on the size of their content. Here, I just uploaded the document "Test_Lifecycle.docx" with a size of 52MB. So based on that, let's say that I want to retrieve all elements in my Alfresco installation with a content bigger than 40MB. In the same way, you can select all elements with a content smaller than, or between, XX and YY MB. The conversion to MB is done using the round() function. Therefore, if you want this value in KB instead, just remove one division by 1024 in each round() function:

All documents bigger than 40MB
SELECT n.id AS "Node ID",
       n.store_id AS "Store ID",
       round(u.content_size/1024/1024,2) AS "Size (MB)",
       n.uuid AS "Document ID (UUID)",
       n.audit_creator AS "Creator",
       n.audit_created AS "Creation Date",
       n.audit_modifier AS "Modifier",
       n.audit_modified AS "Modification Date",
       p1.string_value AS "Document Name",
       u.content_url AS "Location"
FROM alf_node AS n,
     alf_node_properties AS p,
     alf_node_properties AS p1,
     alf_namespace AS ns,
     alf_qname AS q,
     alf_content_data AS d,
     alf_content_url AS u
WHERE n.id=p.node_id
  AND ns.id=q.ns_id
  AND p.qname_id=q.id
  AND p.long_value=d.id
  AND d.content_url_id=u.id
  AND p1.node_id=n.id
  AND p1.qname_id IN (SELECT id FROM alf_qname WHERE local_name='name')
  AND round(u.content_size/1024/1024,2)>40
ORDER BY u.content_size DESC;



I will just put it once but here is the result of this command in this case:

 Node ID | Store ID | Size (MB) |          Document ID (UUID)          | Creator |         Creation Date         | Modifier |       Modification Date       |    Document Name    |                            Location                            
---------+----------+-----------+--------------------------------------+---------+-------------------------------+----------+-------------------------------+---------------------+----------------------------------------------------------------
131856 | 6 | 52.00 | eb267742-c018-4ba5-8ca4-75ca23c860f0 | Morgan | 2015-04-30T12:05:50.613+02:00 | Morgan | 2015-04-30T12:05:50.613+02:00 | Test_Lifecycle.docx | store://2015/4/30/12/5/0e111f05-7fcf-4a44-b719-b94cd04dd5ab.bin


So why did I select these fields?!

  • Node ID: can be useful to join different tables
  • Store ID: a Store ID of 6 means that your document is in its active life. A Store ID of 5 means that this document has been deleted by a user and is now in the global trashcan
  • Size (MB): what we are searching for...
  • Document ID (UUID): the unique identifier of this document. The simplest way to preview this document is just to open the following url in any browser: http://HOSTNAME:PORT/share/page/document-details?nodeRef=workspace://SpacesStore/eb267742-c018-4ba5-8ca4-75ca23c860f0 (workspace://SpacesStore for store_id=6)
  • Creator, Modifier, Dates: well...
  • Document Name: can be useful to know the type of document without opening an URL (file extension)
  • Location: the actual location of the content's file on the File System. The "store://" refers to $ALF_DATA/contentstore/


The second command I would like to show you is how to retrieve some information based on the actual UUID of a document. As explained above, the UUID of a document can be found in the URL of its detail's page:

A document using its UUID
SELECT n.id AS "Node ID",
       n.store_id AS "Store ID",
       round(u.content_size/1024/1024,2) AS "Size (MB)",
       n.uuid AS "Document ID (UUID)",
       n.audit_creator AS "Creator",
       n.audit_created AS "Creation Date",
       n.audit_modifier AS "Modifier",
       n.audit_modified AS "Modification Date",
       p1.string_value AS "Document Name",
       u.content_url AS "Location"
FROM alf_node AS n,
     alf_node_properties AS p,
     alf_node_properties AS p1,
     alf_namespace AS ns,
     alf_qname AS q,
     alf_content_data AS d,
     alf_content_url AS u
WHERE n.id=p.node_id
  AND ns.id=q.ns_id
  AND p.qname_id=q.id
  AND p.long_value=d.id
  AND d.content_url_id=u.id
  AND p1.node_id=n.id
  AND p1.qname_id IN (SELECT id FROM alf_qname WHERE local_name='name')
  AND n.uuid='eb267742-c018-4ba5-8ca4-75ca23c860f0';



Another possible command would be to find some information based on the File System location. That can be useful for example if there is a big document on the File System and you want to know the type of this document with the extension, its name or maybe some other information about the creator/modifier:

A document using its path on the File System
SELECT n.id AS "Node ID",
       n.store_id AS "Store ID",
       round(u.content_size/1024/1024,2) AS "Size (MB)",
       n.uuid AS "Document ID (UUID)",
       n.audit_creator AS "Creator",
       n.audit_created AS "Creation Date",
       n.audit_modifier AS "Modifier",
       n.audit_modified AS "Modification Date",
       p1.string_value AS "Document Name",
       u.content_url AS "Location"
FROM alf_node AS n,
     alf_node_properties AS p,
     alf_node_properties AS p1,
     alf_namespace AS ns,
     alf_qname AS q,
     alf_content_data AS d,
     alf_content_url AS u
WHERE n.id=p.node_id
  AND ns.id=q.ns_id
  AND p.qname_id=q.id
  AND p.long_value=d.id
  AND d.content_url_id=u.id
  AND p1.node_id=n.id
  AND p1.qname_id IN (SELECT id FROM alf_qname WHERE local_name='name')
  AND u.content_url='store://2015/4/30/12/5/0e111f05-7fcf-4a44-b719-b94cd04dd5ab.bin';



II. Number of...


From a reporting point of view, let's say that you need some information regarding the number of... something. In this case, if you want to use your DB directly, then there is a really simple solution (simple, but is it the best?) because Alfresco provides a database architecture that is quite simple to understand and to use to get what you need. Indeed, if you take a look at the "alf_qname" table, you will see that every element that is part of Alfresco has its QName listed here. A QName is the Qualified Name of a repository item. This can be seen as a kind of "Super-Type":

alfresco=> SELECT * FROM alf_qname;
 id | version | ns_id | local_name
----+---------+-------+------------
  1 |       0 |     1 | store_root
  2 |       0 |     1 | aspect_root
  3 |       0 |     1 | container
  4 |       0 |     1 | children
  5 |       0 |     2 | user
...
 24 |       0 |     6 | folder
...
 51 |       0 |     6 | content
...
133 |       0 |     6 | thumbnail
134 |       0 |    13 | rendition
...


As you can see above, if you are searching for something that has a content, it can be done quite easily using the id or the local_name that corresponds to it. So based on this table, here are some queries that can be useful:

Retrieve the number of users in the Repository
SELECT count(*)
FROM alf_node AS n,
     alf_qname AS q
WHERE n.type_qname_id=q.id
  AND q.local_name='user';

 

Retrieve the number of elements with a content in the Repository (include system's documents)
SELECT count(*)
FROM alf_node AS n,
     alf_qname AS q
WHERE n.type_qname_id=q.id
  AND q.local_name='content';

 

Retrieve the number of thumbnails in the Repository
SELECT count(*)
FROM alf_node AS n,
     alf_qname AS q
WHERE n.type_qname_id=q.id
  AND q.local_name='thumbnail';

 

Retrieve the number of renditions in the Repository
SELECT count(*)
FROM alf_node AS n,
     alf_qname AS q
WHERE n.type_qname_id=q.id
  AND q.local_name='rendition';



Of course you can do that for all QNames but you can also be more precise! So based on the query to retrieve the number of elements with a content, if you only want the number of documents of a specific type, then you can simply complete your query:

Retrieve the number of XML documents in the Repository
SELECT count(*)
FROM alf_node AS n,
     alf_qname AS q,
     alf_node_properties AS p
WHERE n.type_qname_id=q.id
  AND p.node_id=n.id
  AND p.qname_id IN
    (SELECT id
     FROM alf_qname
     WHERE local_name='name')
  AND q.local_name='content'
  AND p.string_value LIKE '%.xml';

 

Retrieve the number of PDF documents in the Repository
SELECT count(*)
FROM alf_node AS n,
     alf_qname AS q,
     alf_node_properties AS p
WHERE n.type_qname_id=q.id
  AND p.node_id=n.id
  AND p.qname_id IN
    (SELECT id
     FROM alf_qname
     WHERE local_name='name')
  AND q.local_name='content'
  AND p.string_value LIKE '%.pdf';


As the creation date, creator, modification date and modifier information are also stored in the "alf_node" table, you can also very easily filter your queries based on the creation/update date of an Alfresco node. That's pretty cool, right?! ;)
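As a minimal sketch (assuming the default schema where audit_created holds an ISO-8601 string, so a plain string comparison works for date ranges), the content-count query above can be restricted to nodes created in 2015:

Retrieve the number of elements with a content created in 2015
SELECT count(*)
FROM alf_node AS n,
     alf_qname AS q
WHERE n.type_qname_id=q.id
  AND q.local_name='content'
  AND n.audit_created >= '2015-01-01'
  AND n.audit_created < '2016-01-01';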


III. Lifecycle specific


To complete the relation between this blog post and the previous one, I wanted to share some queries that can be used to identify the current state of a document. As explained in my previous post, a document that is not yet deleted will be in the store named "workspace://SpacesStore". A document that has been deleted by a user will be in the store named "archive://SpacesStore", and when this document is removed from the global trashcan, the orphan_time is set to the current timestamp. With all this information and with the "alf_node" and "alf_content_url" tables, we can easily build our own queries to find what is needed.

alfresco=> SELECT * FROM alf_store;
 id | version | protocol  |       identifier        | root_node_id
----+---------+-----------+-------------------------+--------------
  1 |       1 | user      | alfrescoUserStore       |            1
  2 |       1 | system    | system                  |            5
  3 |       1 | workspace | lightWeightVersionStore |            9
  4 |       1 | workspace | version2Store           |           10
  5 |       1 | archive   | SpacesStore             |           11
  6 |       1 | workspace | SpacesStore             |           12
(6 rows)


So let's find all documents that have been created and aren't deleted yet:

All documents created in their active life
SELECT *
FROM alf_node
WHERE store_id=6
  AND type_qname_id=51;


The next step on the lifecycle is when the documents have been deleted by a user but aren't deleted from the global trashcan (orphan_time is still NULL):

All documents created that are in the global trashcan (deleted by users)
SELECT *
FROM alf_node
WHERE store_id=5
  AND type_qname_id=51;


Finally, when the documents are removed from the global trashcan, some references/fields are removed, the QName of these documents changes from "content" (51) to "deleted" (140) in the "alf_node" table, and the orphan_time is set to the current timestamp in the "alf_content_url" table:

All elements that have been removed from the global trashcan and that are now orphaned
SELECT *
FROM alf_content_url
WHERE orphan_time IS NOT NULL;



I hope you enjoyed this blog post because it was quite hard for me to write something about database queries without giving up my soul to the DB world! See you soon ;).


Variations on 1M rows insert (1): bulk insert

Yann Neuhaus - Fri, 2015-05-08 09:19

I think you have already read the interesting series of blog posts from my colleagues Franck and Daniel about inserting 1 million rows for Oracle and PostgreSQL. So it's time to write the first post of the same series concerning SQL Server. First of all, just to clarify, the idea is not to make a direct comparison between Oracle, PostgreSQL and SQL Server, but just to see the variations that exist for each of them to insert 1 million rows quickly.

So I will perform the same basic test as my colleagues with roughly the same environment: one virtual machine on Hyper-V with only one processor, 512MB of memory and one SQL Server 2014 Enterprise Edition instance capped to 512MB of memory:

 

Get-WmiObject -Class Win32_Processor | ft Manufacturer, Name, NumberOfCores, NumberOfLogicalProcessors -AutoSize

 

blog_42_-_1_-cpu_config

 

SELECT @@VERSION

 

blog_42_-_2_-_sql_version

 

SELECT name, value_in_use
FROM sys.configurations
WHERE name = 'max server memory (MB)'

 

blog_42_-_3_-_sql_config_mem

 

Row-by-row method

 

Let's start by using the same test tables, one heap table and one clustered table, and the same kind of script as well. I just modified the original script written by Franck, but translating PL/SQL into T-SQL often implies a completely different syntax; anyway, we will produce roughly the same bunch of data.

My user objects are stored in a user database called DEMO for this first test:

 

if object_id('DEMO', 'U') is not null
    drop table DEMO;

create table DEMO("id" int, "text" varchar(15), "number" int);

if object_id('DEMO_PK', 'U') is not null
    drop table DEMO_PK;

create table DEMO_PK("id" int, "text" varchar(15), "number" int,
    constraint demo_pk_pk primary key (id));

...

DECLARE @i INT = 1;
WHILE @i <= 1000000
BEGIN
    INSERT INTO DEMO VALUES (@i,
        CASE ROUND(RAND() * 10, 0)
            WHEN 1 THEN 'Marc' WHEN 2 THEN 'Bill' WHEN 3 THEN 'George'
            WHEN 4 THEN 'Eliot' WHEN 5 THEN 'Matt' WHEN 6 THEN 'Trey'
            ELSE 'Tracy'
        END,
        RAND() * 10000);
    SET @i += 1;
END

 

Here my first result for both tables:

-- 00:02:29 – Heap table

-- 00:02:25 – table with clustered index

 

There are no big differences between inserting data into a heap table and into a clustered table in this scenario because we insert rows basically in the same manner (always in the last page). At this point it is important to keep in mind that by default SQL Server runs in autocommit mode: each insert statement is its own transaction, which has to be committed to the transaction log.

If we take a look at the specific wait statistics, we can expect that most of the waits will concern the log write activity.
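The figures shown below were presumably gathered from the wait statistics DMV. A minimal sketch of such a query (the LIKE filter is just a rough way to hide the common idle wait types):

SELECT TOP 10 wait_type, waiting_tasks_count, wait_time_ms, signal_wait_time_ms
FROM sys.dm_os_wait_stats
WHERE wait_type NOT LIKE '%SLEEP%'
ORDER BY wait_time_ms DESC;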

 

blog_42_-_4_-_waitstats

 

That’s it! The average values are pretty low but our results are far from those of my colleagues. Let's stay motivated and talk about a kind of workaround to speed up the insert query. In fact, putting the user objects in the tempdb database might be such a workaround. The main drawback is that tempdb is temporary by design: our user objects will only persist until the next restart of SQL Server. But let's perform the previous test on tempdb.

Here are the results I get from this test:

-- 00:00:16 – Heap table

-- 00:00:15 – table with clustered index


 

So, a big improvement here.  Furthermore we may notice that related wait statistics have also changed as follows:

 

blog_42_-_5_-_waitstats


 

This main wait type is just related to sustained CPU usage, so our final result is not so bad. At this point we may wonder why putting user objects in the tempdb database speeds up the whole procedure. In fact, we benefit from the special logging mechanism used by the tempdb database, which includes the lazy commit feature and the non-logged after-image feature for insert and update statements.


Let's go back to the user database DEMO and finish this row-by-row section by inserting the data in a single transaction (explicit transaction mode), and take a look at the following results:

 

begin transaction

declare @i int = 1;

while @i <= 1000000
begin
    insert DEMO values (@i,
        case cast(rand() * 10 as tinyint)
            when 1 then 'Marc' when 2 then 'Bill' when 3 then 'George'
            when 4 then 'Eliot' when 5 then 'Matt' when 6 then 'Trey'
            when 7 then 'Tracy' when 8 then 'Greg' when 9 then 'Steve'
            else 'Patricia'
        end,
        rand() * 1000)
    set @i = @i + 1;
end

commit transaction

 

-- 00:00:10 – Heap table

-- 00:00:09 – table with clustered index

 

As expected, we notice a drastic drop in the duration of both tests.

 

Bulk-insert method

Row-by-row commit is not the strength of SQL Server because each commit requires flushing data to the transaction log. So let's switch to bulk insert mode now. There are several ways to bulk insert data with SQL Server (either from the client or the server side, by using for instance bcp, the SSIS tool, the BULK INSERT, SELECT INTO or OPENROWSET commands, and so on). In this test, I will use bcp to export data to a file before importing this file into my two tables.

To export my data I used the bcp command as follows:

 

blog_42_-_6_-_bcp_heap

 

...

 

blog_42_-_6_-_bcp_heap

 

I used native data types (-n option) in this case because my test concerns only transferring data from and to SQL Server. This option can improve performance but to be honest with this bunch of data the difference is not relevant.
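Put together, the export commands would look roughly like this (the server name, trusted authentication and file paths are assumptions to adapt to your environment):

bcp DEMO.dbo.DEMO out C:\bcp\DEMO.dat -n -T -S localhost
bcp DEMO.dbo.DEMO_PK out C:\bcp\DEMO_PK.dat -n -T -S localhost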

Let’s bulk import our data into the two tables DEMO and DEMO_PK in my user database (not tempdb this time). At this point I want to be more efficient, so I will use minimal logging for the bulk-import operations in order to reduce the possibility of filling the log space and the potential contention (as a reminder, writing to the transaction log file is always synchronous by default). Moreover, don’t forget that in this mode writing to the data file switches from asynchronous to synchronous mode. So be careful about your storage performance to avoid facing some unexpected behaviours during your import process.

So for my tests, the database DEMO is configured with the SIMPLE recovery model and I will use the BULK INSERT command with the TABLOCK option (which is a requirement for minimal logging). Using additional options is possible but, after some testing, they did not appear to be helpful in this context.
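For reference, switching the demo database to the SIMPLE recovery model is a one-liner:

ALTER DATABASE DEMO SET RECOVERY SIMPLE;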

Concerning my heap table:

 

bulk insert demo.dbo.DEMO
from 'C:\bcp\DEMO.dat'
with (
    DATAFILETYPE = 'native',
    TABLOCK
)

 

SQL Server Execution Times:

   CPU time = 703 ms, elapsed time = 725 ms.

 

Concerning my clustered table:

 

bulk insert demo.dbo.DEMO_PK
from 'C:\bcp\DEMO_PK.dat'
with (
    DATAFILETYPE = 'native',
    ORDER (id ASC),
    TABLOCK
)

 

SQL Server Execution Times:

   CPU time = 1437 ms, elapsed time = 1489 ms.

 

A slightly higher execution time than the bulk import into the heap table. My table with a clustered index seems to introduce some overhead.

 

The bottom line is pretty much the same as my colleagues': inserting and committing data row by row is not an optimized way if you plan to import a lot of data. So let's continue in the same way as my colleagues with the next posts of this series, up to the famous In-Memory feature. Don't forget that if you are in Switzerland in June, our experts in Oracle, Microsoft and SAP technologies will talk about In-Memory as implemented by SQL Server, Oracle and SAP HANA, all on the same day. It's free and you can register now: Event In-Memory: boost your IT performance

Rebuild index on increasing values after deletes?

Yann Neuhaus - Fri, 2015-05-08 00:39

Yesterday, while giving our Oracle tuning workshop, I discussed with the customer a job they have that rebuilds indexes every Sunday. Except in very rare cases, Oracle indexes are maintained so that free space is reused by further inserts. But an index is an ordered structure. When we insert from a sequence, the value is always increasing and goes to the end of the index. And when we delete old data we delete index entries at the beginning of the index.
Is this a case where we need to manage it ourselves?

Test case

As usual I reproduce the issue. Here is my DEMO table with a DEMOPK index on the primary key:

SQL> create table DEMO (id number constraint DEMOPK primary key);
Table created.

I insert 10000 rows:

SQL> begin
  2   for i in 1..1e4 loop
  3    insert into DEMO values(i);
  4    commit;
  5   end loop;
  6  end;
  7  /
PL/SQL procedure successfully completed.

gather and check the stats:

SQL> exec dbms_stats.gather_table_stats(user,'DEMO');

PL/SQL procedure successfully completed.

SQL> select blocks,blevel,leaf_blocks from user_indexes join user_tables using(table_name) where index_name='DEMOPK';

    BLOCKS     BLEVEL LEAF_BLOCKS
---------- ---------- -----------
        20          1          18

So I have 1 branch and 18 leaf blocks.

 

Fragmentation

I'll check fragmentation from a user point of view. Having too much free space in leaf blocks is a problem with index range scans only. So let's do an index range scan from the beginning to the end of the index:

SQL> alter session set statistics_level=all;
Session altered.

SQL> select /*+ index(DEMO) */ count(*) from DEMO;

  COUNT(*)
----------
     10000

SQL> select * from table(dbms_xplan.display_cursor(format=>'allstats last'));

PLAN_TABLE_OUTPUT
----------------------------------------------------------------------------------------------------
SQL_ID  7b6qc9m1cw3zd, child number 0
-------------------------------------
select /*+ index(DEMO) */ count(*) from DEMO

Plan hash value: 3019291478

-------------------------------------------------------------------------------------
| Id  | Operation        | Name   | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
-------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT |        |      1 |        |      1 |00:00:00.01 |      19 |
|   1 |  SORT AGGREGATE  |        |      1 |      1 |      1 |00:00:00.01 |      19 |
|   2 |   INDEX FULL SCAN| DEMOPK |      1 |  10000 |  10000 |00:00:00.01 |      19 |
-------------------------------------------------------------------------------------

Exactly what we expected: 19 block reads, which is 1 branch and 18 leaves.

 

I have a script that does the same - range scan on an index - and shows how many index entries we have in each block. The script is here: How to measure Oracle index fragmentation. Let's run it on my index, with a bucket size large enough to see all blocks:

SQL> @ index_fragmentation.sql

        ID ->         ID rows/block bytes/block %free space     blocks free
---------- -- ---------- ---------- ----------- ----------- ---------- -----
         1 ->          1        578        8566                      1
       579 ->        579        571        8559                      1
      1150 ->       1150        571        8559                      1
      1721 ->       1721        571        8560                      1
      2292 ->       2292        571        8559                      1
      2863 ->       2863        571        8559                      1
      3434 ->       3434        571        8559                      1
      4005 ->       4005        571        8560                      1
      4576 ->       4576        571        8559                      1
      5147 ->       5147        571        8559                      1
      5718 ->       5718        571        8560                      1
      6289 ->       6289        571        8559                      1
      6860 ->       6860        571        8559                      1
      7431 ->       7431        571        8559                      1
      8002 ->       8002        571        8560                      1
      8573 ->       8573        571        8559                      1
      9144 ->       9144        571        8559                      1
      9715 ->       9715        286        4287          47          1 oo

Here are our 18 leaf blocks, covering values from 1 to 10000 (the ID displayed is the first one in each bucket - here, each block). The blocks are full (the size is an approximation, which is why it's a bit higher than 8k), with about 570 entries per block. This is expected because when we insert increasing values, the block split fills the block instead of doing a 50-50 split.

 

delete insert lifecycle

Here is what I want to reproduce: delete old rows at the beginning of the index and insert new rows at the end. I'll do that for the same number of rows, 10000, so I'm sure I've deleted rows from all those 18 leaf blocks.

SQL> begin
  2   for i in 1..1e4 loop
  3    delete from DEMO where id=i;
  4    commit;
  5    insert into DEMO values(i+1e4);
  6    commit;
  7   end loop;
  8  end;
  9  /

PL/SQL procedure successfully completed.

Then run my index range scan:

SQL> select /*+ index(DEMO) */ count(*) from DEMO;

  COUNT(*)
----------
     10000

SQL> select * from table(dbms_xplan.display_cursor(format=>'allstats last'));

PLAN_TABLE_OUTPUT
----------------------------------------------------------------------------------------------------
SQL_ID  7b6qc9m1cw3zd, child number 0
-------------------------------------
select /*+ index(DEMO) */ count(*) from DEMO

Plan hash value: 3019291478

-------------------------------------------------------------------------------------
| Id  | Operation        | Name   | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
-------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT |        |      1 |        |      1 |00:00:00.01 |      24 |
|   1 |  SORT AGGREGATE  |        |      1 |      1 |      1 |00:00:00.01 |      24 |
|   2 |   INDEX FULL SCAN| DEMOPK |      1 |  10000 |  10000 |00:00:00.01 |      24 |
-------------------------------------------------------------------------------------

Did I double the number of blocks to read? No.

 

Do you think that we lost some space because we read 24 blocks instead of 19? Look at the numbers. The new numbers are above 10000 and are larger than the initial ones. It's 4 bytes vs. 3 bytes.
Don't believe me?

SQL> select min(rownum),max(rownum),sum(vsize(rownum)) from (select * from dual connect by 1000>=level),(select * from dual connect by 20>=level) group by ceil(rownum/10000);

MIN(ROWNUM) MAX(ROWNUM) SUM(VSIZE(ROWNUM))
----------- ----------- ------------------
          1       10000              29801
      10001       20000              39899

Yes... No place for guesses and myths... Everything can be measured... Do you know how many blocks we need when the data in 18 blocks is increased by that ratio? 18 * (4/3) = 24, so we are not bad at all.

 

The fact is that the 18 leaf blocks have only grown to 20 leaf blocks:

SQL> @ index_fragmentation.sql

        ID ->         ID rows/block bytes/block %free space     blocks free
---------- -- ---------- ---------- ----------- ----------- ---------- -----
     10001 ->      10001        266        4254          47          1 oo
     10267 ->      10267        533        8523                      1
     10800 ->      10800        533        8522                      1
     11333 ->      11333        533        8523                      1
     11866 ->      11866        533        8523                      1
     12399 ->      12399        533        8522                      1
     12932 ->      12932        533        8523                      1
     13465 ->      13465        533        8523                      1
     13998 ->      13998        533        8522                      1
     14531 ->      14531        533        8523                      1
     15064 ->      15064        533        8523                      1
     15597 ->      15597        533        8522                      1
     16130 ->      16130        533        8523                      1
     16663 ->      16663        533        8523                      1
     17196 ->      17196        533        8522                      1
     17729 ->      17729        533        8523                      1
     18262 ->      18262        533        8523                      1
     18795 ->      18795        533        8522                      1
     19328 ->      19328        533        8523                      1
     19861 ->      19861        140        2237          72          1 ooo

and they are all full - except the first and the last one.

 

This is optimal. Do the same test case and you will see that if you coalesce or shrink the index then the number of blocks will not change. More info about it in a previous blog post: index coalesce vs. shrink vs rebuild
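For reference, those two maintenance operations are simple statements, and re-running the fragmentation script afterwards should show the same number of leaf blocks:

SQL> alter index DEMOPK coalesce;
Index altered.

SQL> alter index DEMOPK shrink space;
Index altered.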

Is it new?

Ok, I've run my tests on 12c and you want to know if it's something new. No it's not new.
Oracle 7.3.3 reuses the deleted space as well:

b2ap3_thumbnail_CaptureORA73index.JPG

It's the same test case, except that here I'm using a 2k block size.

Index rebuild

Do you think index rebuild can help, or at least is not harmful?

SQL> alter index DEMOPK rebuild;
Index altered.

SQL> @ index_fragmentation.sql

        ID ->         ID rows/block bytes/block %free space     blocks free
---------- -- ---------- ---------- ----------- ----------- ---------- -----
     10001 ->      10001        478        7644           5          1
     10479 ->      10479        479        7659           5          1
     10958 ->      10958        479        7659           5          1
     11437 ->      11437        479        7659           5          1
     11916 ->      11916        478        7644           5          1
     12394 ->      12394        479        7659           5          1
     12873 ->      12873        479        7659           5          1
     13352 ->      13352        479        7659           5          1
     13831 ->      13831        479        7659           5          1
     14310 ->      14310        478        7644           5          1
     14788 ->      14788        479        7659           5          1
     15267 ->      15267        479        7659           5          1
     15746 ->      15746        479        7659           5          1
     16225 ->      16225        479        7659           5          1
     16704 ->      16704        478        7644           5          1
     17182 ->      17182        479        7659           5          1
     17661 ->      17661        479        7659           5          1
     18140 ->      18140        479        7659           5          1
     18619 ->      18619        478        7644           5          1
     19097 ->      19097        479        7659           5          1
     19576 ->      19576        425        6794          15          1

The index rebuild has increased the size of the index: one more leaf block here, because it has left 5% of free space in each block. And that free space will never be reused because no future rows will ever go there.

 

Conclusion

Is the free space reused in an index on a sequence - always increasing - when we are purging old data?
Answer is: yes... unless you are doing regular index rebuilds.

EMC World 2015 - Last day at Momentum

Yann Neuhaus - Thu, 2015-05-07 22:52

So Momentum is over. Philippe Schweitzer and I finished with a 4-hour hackathon session. For Philippe the subject was "Developing an EMC Documentum application with REST, AngularJS, Bootstrap, Node.js and Socket.io" and I chose "From the Ground Up - Developing an EMC InfoArchive Solution".

But the main purpose of this post is to thank all the people we met during these four enriching days: the EMC people who held the sessions and made demos at the booths, and Catherine Weiss, Partner Sales Manager, who introduced us to great people. Also thank you to the people from fme, Reveille Software and Flatiron with whom we had good discussions.

The atmosphere was amazing, not only during the working days but also at the evening events organized by EMC.

So, to be short, Momentum 2015 was a great and successful journey.

Philippe and Gerard

the fastest way to load 1m rows in postgresql

Yann Neuhaus - Thu, 2015-05-07 13:00

There have been several posts on how to load 1m rows into a database in the last days:

Variations on 1M rows insert (1): bulk insert
Variations on 1M rows insert(2): commit write
Variations on 1M rows insert (1): bulk insert - PostgreSQL
Variations on 1M rows insert(2): commit write - PostgreSQL
Variations on 1M rows insert (3): TimesTen

In this post I'll focus on how to prepare a PostgreSQL database for bulk loading in more detail.

Concrete5 CMS

Yann Neuhaus - Thu, 2015-05-07 03:30

Today a lot of CMSes exist: WordPress, Joomla, Magento and others. In this blog I will share my experience with Concrete5, gained through a specialized web agency based in Geneva: 8 Ways Media


What is Concrete5?



It is not only an open-source CMS (Content Management System) running on a web server, coded in PHP and using a MySQL database; it is also a great framework for developers. Optional URL rewriting is available to improve how search engines index your sites.

C5 can also be used in the development of Web applications.


C5 also makes it possible, through its content management and user rights, to create intranets for companies (on a small scale in my opinion; for a larger intranet it is better to stay on SharePoint or Alfresco).

This CMS is designed to make life easier for the end user, the handling is simple and intuitive.

It offers advanced management of dynamic websites with a modern design. Editing is done directly via the front end and the possibilities are numerous: drag and drop, templates, etc.

INSTALLATION


Verifying Prerequisites


The following components are required to enable concrete5 run correctly: http://www.concrete5.org/documentation/developers/5.7/installation/system-requirements/


To install the tutorial, please refer to the main site: http://www.concrete5.org/documentation/developers/5.7/installation/installation/


FEATURES


The Dashboard manages the properties related to:
 


  • Rights management


 

  • Management of imported content



  • Templates


  • Block, pages


  • Features: video, forms, presentations, blog, guestbook, etc ...


The GUI is configurable; C5 gives access to very fine-grained customization.
Pages, texts and other content are edited via the front end as soon as you are logged in as an administrator or as a user with write rights on the site. There are two editing modes: HTML or "Composer".

Versioning is an asset: in case of error, the previous version is easily restored.
Updates are performed through a simple upload, followed by a single click.

CONCLUSION


Despite the dominant trio "WordPress - Joomla - Drupal" shown in the studies, having discovered Concrete5 I recommend it for its very intuitive look and ease of use. The developer communities seem to be active and growing, which makes it easier to resolve the "small issues" you may run into. Also, exporting all site content as HTML is possible, which could help if you have to change web servers. However, the majority of the useful plugins have a high cost, the most polished themes are not free (a matter of taste), and the support service is paid unless you host through them.
I highly recommend this CMS! Its simplicity and its ability to adapt to different needs allow you to create a site with confidence and ease.

EMC World 2015 - Day 3 at Momentum

Yann Neuhaus - Wed, 2015-05-06 18:53

In this post I would like to relay some advice about the upgrade to Documentum 7.2 that I got in the session from Patrick Walsh about "What's New, What's Next: EMC Documentum Platform".

With Documentum 7.2 the upgrade process is cleaner and there are fewer restrictions. For instance, more upgrade scenarios are documented, which can help us define the best upgrade path to use.

There was also a slide which listed the following points to take into account when defining the right time to upgrade the current installation:

- Features
- Costs
- Dates

Is there not something missing?
From my point of view, there is at least one additional point to consider. When I do the impact assessment to know whether we have to upgrade, and when, I study the list of issues that have been fixed in the new version - whether we are impacted by them or not - but also which open issues (who knows an application without bugs?) come with it and whether they are acceptable.

Another helpful piece of information - which can give customers an insight - is the time needed to achieve a typical upgrade project.
Documentum considers 6 to 12 months for the planning (evaluation of the release, gathering business requirements, budget approval and so on) and 6 to 24 months for the implementation and testing.
Based on that, customers still using version 6.7 (end of support is April 30, 2018) should think about upgrading to version 7.x.

To facilitate the upgrade, the clients and the platform do not have to be upgraded in one big step: the D6.x clients can be upgraded first and then the 6.7 platform.

Documentum also introduced "Phased Service Upgrades". For instance, we have the possibility to upgrade xPlore from version 1.3 to 1.5 in phase one and, a couple of months later in phase two, the platform from 6.7 to 7.2.
Or vice-versa, we start with the platform and later on we upgrade xPlore.
With this approach, having de-coupled services, we have more flexibility and less downtime.

And now, last but not least, the aim for the future is to have no downtime at all during the upgrade. THIS would be wonderful!

 

EMC World Las Vegas – Momentum 2015 third day D2 news

Yann Neuhaus - Wed, 2015-05-06 17:36

This was more a day of networking with EMC partner contacts and third-party software vendors. I also attended a session about D2 news and what is coming next.

 

EMC divided the D2 enhancements into 3 major themes.

 

First was about productivity and a modern look and feel with:

 

    • graphical workflow widget

    • drag and drop from D2 to Desktop

    • better browsing with enhanced facet navigation

    • multi-document support in workflows

    • faster content transfer with new D2-BOCS for distributed environments

 

Second was about Information integrity with

    • more SSO implementation support, like Tivoli

    • folder import with inner documents as virtual document

 

Then finally about software agility with

 

    • ease of new user on-boarding with default configuration settings

    • PDF export of D2 configuration for multi environment comparison

 

I hope you enjoyed reading this summary of today at EMC world – Momentum 2015. Thanks for your attention.

getting started with postgres plus advanced server (4) - setting up the monitoring server

Yann Neuhaus - Wed, 2015-05-06 04:26

If you followed the first, second and third posts, the current ppas infrastructure consists of a primary database, a hot standby database and a backup and recovery server.

EMC World Las Vegas – Momentum 2015 second day

Yann Neuhaus - Tue, 2015-05-05 17:05

For the second day of conferences, I attended a first session about migration from Webtop to D2 or xCP. Then we attended two sessions about Documentum platform performance tuning.

 

We could see EMC actually putting effort into taking advantage of open-source software. They started to package the whole platform, component by component, into Docker containers. The hope is to simplify upgrades from one version to another.

 

They also invited the audience to migrate to Documentum 7.1/7.2 because a lot of performance enhancements were made, especially for multi-core CPU support and better session pooling management; the latter topic cost us some maintenance time in the last months.

 

A key advantage of migrating from Webtop to D2 or xCP is their capability to cover a lot of business scenarios out of the box. For instance, when a customer wants to move customizations into D2, they said that from experience we could reach up to 50% feature coverage through configuration instead of coding. A great saving of time and money, as well as of maintenance costs over time and for further upgrades.

 

Finally, they also stated that EMC provides a few tools to ease the migration process from the former life sciences appliance to the current one, and from Webtop to D2 and xCP.

 

For Documentum platform performances tuning, I invite you to read Gérard's blog following this link.

 

I hope you enjoyed reading this summary of today at EMC World – Momentum 2015. Thanks for your attention.

EMC World 2015 - Day 2 at Momentum

Yann Neuhaus - Tue, 2015-05-05 16:53

Second day at this amazing event. There are not only general and presentation sessions; you can also participate in a so-called "Hands-on Lab". The subject was "EMC Documentum Platform Performance Tuning". So learning by doing is another good opportunity you can use at Momentum to enhance your skills.


The session covered performance tuning using Fiddler for the HTTP requests, the DFC trace, SQL traces and how to use the related execution plans, and at the end, how to tune xPlore, but only for the ingestion phase, meaning the indexing of the documents. The tuning of the fulltext search was not addressed.

Most of the tips I learned were on the xPlore side: which parameters to set to increase performance or to avoid some errors due to timeouts, for instance. I more or less skipped the database tuning. Why? That's not acceptable, would you say? Because we have the experts at dbi services to do this kind of tuning!

So let me give you some information.

DFC trace

In dfc.properties, add:

dfc.tracing.enable=true

dfc.tracing.verbose=true

dfc.tracing.max_stack_depth=0

dfc.tracing.include_rpcs=true

dfc.tracing.mode=compact

dfc.tracing.include_session_id=true

dfc.tracing.dir=c:\dctm\trace


dfc.tracing.enable can be switched from false to true and, after a couple of seconds, the trace file will be created. Of course, once the value is set back to false, the tracing is stopped.
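
As a side note, here is a minimal sketch of how this toggling could be scripted on a Unix content server; the dfc.properties location is an assumption and has to be adapted to your installation:

#!/bin/sh
# Location of dfc.properties is an assumption - adjust it to your installation
DFC_PROPERTIES=/opt/documentum/config/dfc.properties

# Switch tracing on (the DFC picks up the change after a couple of seconds)
sed -i 's/^dfc\.tracing\.enable=false/dfc.tracing.enable=true/' "$DFC_PROPERTIES"

# reproduce the scenario you want to trace here, then switch tracing off again
sed -i 's/^dfc\.tracing\.enable=true/dfc.tracing.enable=false/' "$DFC_PROPERTIES"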

xPlore tuning

I would recommend relying on the documentation, but here are some tips:
- disable fulltext for a specific format by setting can_index=false
- filter which document is not indexed by excluding cabinets or folder or even document types
- reduce the online rebuild index max queue size
- reduce the number of indexing threads and CPS request threads
- reduce the number of documents processed simultaneously by the CPS internal processing queue
- increase the number of CPS daemon processes

WARNING: each parameter can have some drawbacks, so take care when you change these values.

So this was in a lab; let's apply it on real cases at customer sites!

SQL Server vNext becomes SQL Server 2016

Yann Neuhaus - Tue, 2015-05-05 15:33

 

Satya Nadella, CEO of Microsoft, announced at the Microsoft Ignite keynote in Chicago that the SQL Server 2016 public preview will come this summer, and you can be sure that at dbi services we're looking forward to taking a look under the hood and testing this next major release.

Amid all the good news, here are the items that have caught my attention:

 

[Image: Always Encrypted]

 

First of all, because we're working in Switzerland, security is a main concern at some customer sites, and security features like Always Encrypted arouse all of our interest. Indeed, a big improvement here concerns the protection of sensitive data both at rest and in motion. In addition, the encryption key is no longer stored in the SQL Server instance and now resides with the application, in a secure location in the customer environment. Microsoft will also provide support for row-level security and dynamic data masking. The latter will probably catch the eye of some of our customers. I think we will hear about these new features in the next few months...

 

[Image: Stretch tables]

 

Another interesting feature is certainly stretch tables, but in fact this is not really news because it was first announced during the previous PASS Summit. This is clearly a hyper-scale cloud feature, mainly focused on hybrid scenarios. You can think of this feature as a kind of table partitioning where historical data (cold data) is dynamically stretched to the cloud.

 

[Image: In-Memory analytics]

 

When we talk about SQL Server 2014, we automatically think about Hekaton. Will it be the same with SQL Server 2016 and the new real-time operational analytics & In-Memory OLTP feature? At the moment it is too early to speculate, but let's say that this feature will interest a lot of folks, including the dbi Microsoft team :-) The idea seems to be pretty simple: combine the existing In-Memory OLTP tables feature and the in-memory columnstore capabilities to obtain a satisfying mix that will increase the scope of possible scenarios (OLTP and analytics). This feature needs to be tested for sure!

 

 

[Image: Availability Groups]

 

Another piece of good news concerns the high-availability topic, and especially availability groups, which will now support MSDTC and SSIS. AGs will be able to use native load balancing for reporting workloads against the secondaries. In addition, it will be possible to use 3 synchronous replicas, and we will be able to configure a failover policy based on database health.

 


Finally, you will note the integration of PolyBase into SQL Server (previously PolyBase was available only with APS, the Analytics Platform System) to extend access to both unstructured and structured data by using the T-SQL language. Thus, we come full circle: transaction processing, data warehousing, BI activities and now big data capabilities handled by the same engine.

You can download the SQL Server 2016 datasheet here.

Other good news will probably come before the final release of SQL Server 2016, and we will have the opportunity to blog about it. So stay connected!

 


When tempdb database may cause indirect issues

Yann Neuhaus - Tue, 2015-05-05 13:23

A couple of weeks ago, I had an interesting discussion with one of my friends, who faced a weird issue with a SQL Server instance that lacked worker threads. Because I can't use his own information, I decided to reproduce the same issue in order to share some interesting findings with you. So the next part of this blog post refers exclusively to my own test, but it represents exactly the real issue encountered by my friend.

Let’s set the scene: in the SQL Server error log we found the following records related to our issue:

 

2015-04-16 14:06:16.54 Server     New queries assigned to process on Node 0 have not been picked up by a worker thread in the last 300 seconds. Blocking or long-running queries can contribute to this condition, and may degrade client response time. Use the "max worker threads" configuration option to increase number of allowable threads, or optimize current running queries. SQL Process Utilization: 0%. System Idle: 94%.

2015-04-16 14:11:16.74 Server     New queries assigned to process on Node 0 have not been picked up by a worker thread in the last 600 seconds. Blocking or long-running queries can contribute to this condition, and may degrade client response time. Use the "max worker threads" configuration option to increase number of allowable threads, or optimize current running queries. SQL Process Utilization: 0%. System Idle: 92%.

 

At the same time, we found an event related to the generation of a SQL Server dump file, triggered by the deadlock schedulers monitor, as follows:

 

2015-04-16 14:06:04.66 Server     **Dump thread - spid = 0, EC = 0x0000000000000000
2015-04-16 14:06:04.83 Server     ***Stack Dump being sent to C:\Program Files\Microsoft SQL Server\MSSQL12.MSSQLSERVER\MSSQL\Log
2015-04-16 14:06:04.83 Server     * *******************************************************************************
2015-04-16 14:06:04.83 Server     *
2015-04-16 14:06:04.83 Server     * BEGIN STACK DUMP:
2015-04-16 14:06:04.83 Server     *   16/04/15 14:06:04 spid 6392
2015-04-16 14:06:04.83 Server     *
2015-04-16 14:06:04.83 Server     * Deadlocked Schedulers
2015-04-16 14:06:04.83 Server     *
2015-04-16 14:06:04.83 Server     * *******************************************************************************
2015-03-27 01:05:02.83 Server     * -------------------------------------------------------------------------------
2015-03-27 01:05:02.83 Server     * Short Stack Dump

 

I began my investigation by having a first look at the SQL Server error log file, but I didn't find any relevant information. So, I decided to move on to the analysis of the dump file.

My first idea was to look at the runaway threads at the moment of the dump file generation (!runaway command). In fact, I first suspected a high-CPU workload because I didn't get any other information from my friend. Here is what I found:

 

[Image: !runaway output]

 


 

[Image: number of running threads]

 

557 threads were in use at the moment of the generation of the dump file. According to the Microsoft documentation here and the concerned SQL Server infrastructure (4 logical processors and the max worker threads option left at 0), the number of running threads is greater than the configured maximum of SQL Server worker threads, which equals 512 in this case.

 

SQL Server detected 2 sockets with 2 cores per socket and 2 logical processors per socket, 4 total logical processors;
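
For reference, here is a quick sketch of how the 512 limit follows from those 4 logical processors when "max worker threads" is left at 0; it uses the documented formula for a 64-bit instance with at most 64 logical CPUs and is meant as an illustration, not as a sizing rule:

#!/bin/sh
# max worker threads on a 64-bit instance with the option left at 0 and <= 64 CPUs:
#   512 + (logical_cpus - 4) * 16
CPUS=4
echo $(( 512 + (CPUS - 4) * 16 ))    # prints 512 for the 4-CPU server of this post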

 

We found a lot of thread patterns that seem to be blocked by a database lock.

 

[Image: call stacks]

 

Note in this case that SQL Server has decided to switch off the “tied” thread but it is still waiting to be signaled that the database lock is available to continue …

Another interesting piece of call stack is the following:

 

[Image: call stacks (2)]

 

 

The above call stack seems to concern the SQL Server start-up process. Let's take a look at functions like sqlmin!DBMgr::OpenUnstartedDatabase, sqlmin!DBMgr::StartupDB or sqlmin!IMEDUtil::SynchronzeWithResourceDB. We can also notice the use of the function sqlmin!StartUp::WaitForTempdbReady, which seems to be related to tempdb database readiness. After that, the concerned thread is also waiting on the database lock to be available, and we don't see any function stating that tempdb is ready at the moment of the dump file generation.

 

Well, to summarize: we are facing a general locking issue inside the SQL Server process that has consumed all of the available threads, and we may think that tempdb readiness has something to do with this blocking issue.

So now, let’s go back to the SQL Server error log and let’s have a deeper look at the SQL Server error log event. At this point, I remembered the famous Bob Ward’s session at Summit 2011 about tempdb database in which he explained the importance of tempdb database during the SQL Server start-up process because even if SQL Server is ready for user connections. Indeed, they have to wait until tempdb has really started and it makes sense because we can wonder what’s going on if tempdb database cannot start correctly. Do we allow user connections to perform their work? As a reminder, the unavailability of tempdb is considered as a SQL Server shutdown event.

In my context, we found that the tempdb database was effectively ready only about 40 minutes after SQL Server was ready for user connections, as shown below:

 

2015-04-16 13:52:38.43 spid5s     Starting up database 'master'.
2015-04-16 13:53:05.76 spid9s     Starting up database 'model'.
2015-04-16 13:53:25.02 spid9s     Clearing tempdb database.
2015-04-16 13:51:48.13 spid13s    SQL Server is now ready for client connections. This is an informational message; no user action is required.
2015-04-16 14:32:12.28 spid9s     Starting up database 'tempdb'.
2015-04-16 14:32:15.71 spid9s     Recovery completed for database tempdb (database ID 2) in 2 second(s)
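
If you want to extract these two events quickly from an ERRORLOG file, here is a hedged one-liner; it assumes the log has been copied to a machine where a shell and grep are available:

grep -E "ready for client connections|Starting up database 'tempdb'" ERRORLOG

The gap between the two timestamps gives the time user connections had to wait for tempdb to become available.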

 

The $100 question: why did tempdb start so slowly? Well, I didn't have all the information, but my friend told me that they had a storage maintenance at the time... maybe we can correlate these two events, but without any further investigation it is not possible to answer this correctly.

Anyway, it was very interesting to see how the tempdb database may trigger an escalation of issues in this case. There are probably a lot of ways to reproduce this issue.

Finally, let me finish my blog post by explaining my test scenario. In fact, to simulate the same issue as my friend, I first created a very big user table inside the model database to considerably slow down the start-up of my tempdb database. Then I configured tempdb with big database files without instant file initialization (zeroing is therefore mandatory for all the database files). At the same time, I used the ostress tool to simulate roughly 600 running threads (a sketch of such a call is shown below). Finally, I let the magic happen...
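
For illustration, here is a hedged sketch of the kind of ostress call I mean; the server name, the database and the query are placeholders, and the exact switches should be double-checked against the RML Utilities documentation (-n is the number of connections, -r the number of iterations per connection, -E requests Windows authentication):

ostress.exe -SMYSERVER -E -dmaster -Q"SELECT o.* INTO #t FROM sys.objects o; DROP TABLE #t;" -n600 -r5 -q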

I hope this blog will help you in your daily database administrator work.

Understand the Lifecycle of Alfresco Nodes

Yann Neuhaus - Tue, 2015-05-05 12:19


I guess you all already know what a lifecycle is. We are born, we live and we die... In fact, it's exactly the same for Alfresco Nodes! Well, at least from an end user point of view. But what is really going on behind that? This is what I will try to explain in this post.

First of all, what is an Alfresco Node? For most people, a Node is just a document stored in Alfresco, but in reality it's much more than that: everything in Alfresco is a Node! A Node has a type that defines its properties, and it also has some associations with other Nodes. This is a very short and simplified description, but we don't need to understand exactly what a Node is to understand the lifecycle process. So, as said above, Alfresco Nodes have their own lifecycle, but it's a little bit more complicated than just three simple steps.

Please note that in this post, I will use $ALF_HOME as a reference to the location where Alfresco has been installed (e.g. /opt/alfresco-4.2.c) and $ALF_DATA as a reference to the alf_data location. By default, the alf_data folder is $ALF_HOME/alf_data/.


I. Creation


In this post I will use a document as an Alfresco Node to easily understand the process. So this lifecycle starts with the creation of a new document named "Test_Lifecycle.docx" with the following creation date: February the 1st, 2015 at 16:45.

When a document is created in Alfresco, three things are done:

  • File System: the content of this file is stored on the Alfresco "Content Store". The Content Store is by default under $ALF_DATA/contentstore. The file is actually put somewhere under this folder, in a sub-path that depends on the creation time, and an ID is given to the file. For our file, it would be: $ALF_DATA/contentstore/2015/2/1/16/45/408a6980-237e-4315-88cd-6955053787c3.bin.
  • Database: the metadata of this file is stored in the Alfresco Database. In fact, in the DB, this document is mainly referenced using its NodeRef or NodeID. This NodeRef is something we can see on the Alfresco Web Interface from the document details page (web preview of a document): http://HOSTNAME:PORT/share/page/document-details?nodeRef=workspace://SpacesStore/09a8bd9f-0246-47a8-9701-29436c7d29a6. Please be aware that the NodeRef contains a UUID, but it's not the same as the ID on the Content Store side... Moreover, there is a property in the DB that links the NodeRef to the Content Store's ID so that Alfresco is able to retrieve the content of a file (see the small sketch after this list).
  • Index: an index is created for this file in the Search engine (can be Lucene for older versions of Alfresco or Solr for newer versions). This index is in the "workspace" store.
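
To make the link between the NodeRef and the Content Store ID more concrete, here is a minimal sketch that rebuilds the physical path from a content URL as it would appear in the "alf_content_url" table. The "store://" prefix is the one used by the default FileContentStore, and the values below reuse the fictional example of this post; adjust the paths to your installation:

#!/bin/sh
# Content URL as stored in the alf_content_url table (fictional value from this post)
CONTENT_URL="store://2015/2/1/16/45/408a6980-237e-4315-88cd-6955053787c3.bin"
ALF_DATA=/opt/alfresco-4.2.c/alf_data    # adjust to your installation

# Strip the "store://" prefix and prepend the contentstore root
FILE_PATH="$ALF_DATA/contentstore/${CONTENT_URL#store://}"
echo "$FILE_PATH"
ls -l "$FILE_PATH"    # shows the .bin file holding the document content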



II. Update, Review, Approve, Publish, and so on...


Once the document is created, its life really begins. You can update/review/approve/publish it manually or automatically with different processes. All these actions are part of the active life of a document. From an administration point of view, there isn't much to say here.


III. Deletion - User level

 
When a document isn't needed anymore, for any reason, a user with sufficient permissions is able to delete it. For our example, let's say that a user deleted our file "Test_Lifecycle.docx" using the Alfresco Share Web Interface on February the 20th, 2015 at 15:30 (19 days after creation). When using the Web Interface or Web Services to delete a document, the "nodeService.deleteNode" method is called. So what happened to our "three things"?

  • FS: nothing changed on the Content Store. The file content is still here.
  • DB: on the DB side, the NodeRef changed from workspace://SpacesStore/09a8bd9f-0246-47a8-9701-29436c7d29a6 to archive://SpacesStore/09a8bd9f-0246-47a8-9701-29436c7d29a6 (the "store_id" field changed on the "alf_node" table).
  • Index: same thing for the search index: the index is moved from the "workspace" store to the "archive" store.


Actually, when a user deletes a document from a Web Interface of Alfresco, the document is just moved to a "global trashcan". By default all users have access to this global trashcan in Alfresco Explorer to restore the documents they may have deleted by mistake. Of course they can't see all documents but only the ones related to them. On Alfresco Share, the access to this global trashcan is configured in the share-config.xml file and by default on most versions, only administrators have access to it.

The only way to avoid this global trashcan is to programmatically delete the document by applying the aspect "cm:temporary" to the document and then calling "nodeService.deleteNode" on it. In that way, the document is removed from the UI and isn't put in the global trashcan.


 IV. Deletion - Administration level

 
What I describe here as the "Administration level" is the second level of deletion that happens by default. This level is the deletion of the document from the global trashcan. As long as the document is still in the global trashcan, Administrators (or users, if you are using Alfresco Explorer) can still restore it. If the document is "un-deleted", then it will return exactly where it was before and, of course, the metadata & index of this document will be moved from the "archive" store back to the "workspace" store to return to an active life.

On April the 1st, 2015 at 08:05 (40 days after deletion at user level), an administrator decides to remove "Test_Lifecycle.docx" from the global trashcan. This can be done manually or programmatically. Moreover, there are also some existing add-ons that can be configured to automatically delete elements in the trashcan older than XX days. This time, the "NodeArchiveService.purgeArchiveNode" method is called (archiveService.purge in some older Alfresco versions). So what happened to our "three things" this time?

  • FS: still nothing changed on the Content Store. The file content is still here.
  • DB: on the DB side, the document is still there but some references/fields (not all) are removed. All references on the "alf_content_data" table are removed, while only some fields are emptied on the "alf_node" table. For Alfresco 4.0 and below, the "node_deleted" field on the "alf_node" table is changed from 0 to 1. On newer versions of Alfresco, the "node_deleted" field doesn't exist anymore, but the QNAME of the node (field "type_qname_id") on the "alf_node" table is changed from 51 ("content") to 140 ("deleted"). So the Node is now deleted from the global trashcan and Alfresco knows that this node can be safely removed, but this will not be done right away... Once the "node_deleted" or "type_qname_id" is set, the "orphan_time" field on the "alf_content_url" table for this document is also changed from NULL to the current unix timestamp (+ GMT offset). In our case it will be orphan_time=1427875500540.
  • Index: the search index for this Node is removed.


As you can see, there are still some remaining elements on the File System and in the DB. That's why there is a last step in our lifecycle...


V. Deletion - One more step.


As you saw before, the document "Test_Lifecycle.docx" is now considered an "orphaned" Node. In Alfresco, by default, all orphaned Nodes are protected for 14 days. That means that during this period, the orphaned Nodes will NOT be touched at all. Of course, this value can easily be changed in the Alfresco configuration files. So what happens after 14 days? Well, in fact, every day at 4am (again, by default... it can be changed), a scheduled job (the "contentStoreCleaner") scans Alfresco for orphaned Nodes older than 14 days. Therefore, on April the 15th, 2015 at 04:00, the scheduled job runs and here is what it does:

  • FS: the content file is moved from $ALF_DATA/contentstore/ to $ALF_DATA/contentstore.deleted/
  • DB: on the DB side, the document is still here but the line related to this document on the "alf_content_url" table (this table contains the orphan_time and reference to the FS location) is removed.
  • Index: nothing to do, the search index was already removed.


You can avoid this step where documents are put in the "contentstore.deleted" folder by setting "system.content.eagerOrphanCleanup=true" in the alfresco-global.properties configuration file. If you do so, after 14 days, the document on the File System is not moved but deleted instead.
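
For reference, the behaviour described above is driven by a few settings in alfresco-global.properties. The sketch below shows the defaults discussed in this post; the property names are the ones used in Alfresco 4.x, so please verify them against the documentation of your own version:

# Number of days an orphaned content file is protected before the cleaner may touch it
system.content.orphanProtectDays=14
# Delete orphaned content directly instead of moving it to contentstore.deleted
system.content.eagerOrphanCleanup=false
# Schedule of the contentStoreCleaner job (by default every day at 4am)
system.content.orphanCleanup.cronExpression=0 0 4 * * ?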


VI. Deletion - Still one more step...!


As said before, there are still some references to the "Test_Lifecycle.docx" document in the Alfresco Database (especially in the "alf_node" table). Another scheduled job, the nodeServiceCleanup, runs every day at 21:00 to clean everything related to Nodes that have been deleted (orphaned nodes) for more than 30 days. So here is the result:

  • FS: the content file is still on the $ALF_DATA/contentstore.deleted/ folder
  • DB: the DB is finally clean!
  • Index: nothing to do, the search index was already removed.



VII. Deletion - Oh you must be kidding me!?


So many steps, aren't there? As seen before, the only remaining thing to do is to remove the content files from the $ALF_DATA/contentstore.deleted/ folder. You probably think that there is also a job that does that for you after XX days, but it's not the case: there is nothing in Alfresco that deletes the content files from this location. As a consequence, if you want to clean the File System, you will have to do it by yourself.

On Unix for example, you can simply create a crontab entry:

50 23 * * * $ALF_HOME/scripts/cleanContentStoreDeleted.sh


Then create this file with the following content:

#!/bin/sh

# $ALF_DATA must be defined here because cron does not know your environment
# (path taken from the example used in this post - adjust to your installation)
ALF_DATA=/opt/alfresco-4.2.c/alf_data
CS_DELETED=$ALF_DATA/contentstore.deleted/

# Remove all files from contentstore.deleted older than 30 days
find $CS_DELETED -type f -mtime +30 | xargs rm 2> /dev/null

# Remove all empty folders from contentstore.deleted older than 60 days
find $CS_DELETED -type d -mtime +60 -empty | xargs rm -r 2> /dev/null


Please be aware that you should make sure that the folder "$ALF_DATA/contentstore.deleted" exists... And when I say that, I mean YOU MUST ABSOLUTELY BE SURE that it exists. Please also never remove anything under the "contentstore" folder, and never remove the "contentstore.deleted" folder itself!

You may wonder why the DB isn't cleaned automatically when the trashcan is cleaned, and why the file content is also kept for 14 days... Well, I can assure you that there are several reasons, and the principal one is backup/restore performance. I will not explain it in detail, but basically, as the file content isn't touched for 14 days by default, it means that you can restore your database up to 14 days in the past and it will still be consistent with the File System, without having to restore the FS! Of course, if you just do that, you will lose the documents uploaded/changed in the last 14 days, because your database wasn't aware of these files 14 days ago. But you can simply backup/restore the content files created in the last 14 days with an incremental backup.

E.g.: today (05-May-2015), I want to backup the FS (only the last 14 days); then I will have to backup all folders inside $ALF_DATA/contentstore/2015/5 AND all folders inside $ALF_DATA/contentstore/2015/4 with a name greater than or equal to 30 (days in April) + 5 (days in May) - 14 = 21.
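
As an illustration of that incremental approach, here is a minimal sketch that archives the content files modified during the last 14 days. The backup destination is an assumption to adapt to your environment, and the simple pipe to tar is safe here only because contentstore file names never contain spaces:

#!/bin/sh
ALF_DATA=/opt/alfresco-4.2.c/alf_data     # adjust to your installation
BACKUP_DIR=/backup/alfresco               # assumption: where the incremental archives are written

# Archive every content file modified during the last 14 days
find $ALF_DATA/contentstore -type f -mtime -14 \
  | tar -czf $BACKUP_DIR/contentstore_last14days_$(date +%Y%m%d).tar.gz -T -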


I hope this post was clear enough, because it's true that it can be hard to understand everything regarding the lifecycle of Alfresco Nodes and to deal with it properly.