Reading large XML files: performance comparison of different methods

For really large XML documents (larger than 100 MB), loading the entire document into memory may not be the best choice. Let’s consider other ways of reading such large files.

For our test lab we will create a console project. Let’s start by adding a simple domain class:

public class CustomDiscount
{
    public string Name { get; set; }
    public int Type { get; set; }
    public decimal Offer { get; set; }
}

In order to get test results, let’s first create the target XML file and fill it with some discounts using the following method:

private static void FillLargeXml(string filePath, int elementsCount)
{
    var random = new Random();
    var discounts = new CustomDiscount[elementsCount];
    for (int i = 0; i < discounts.Length; i++)
    {
        discounts[i] = new CustomDiscount
        {
            Offer = (decimal)(random.NextDouble() * 200),
            // Random.Next's upper bound is exclusive, so (1, 3) yields types 1 and 2
            Type = random.Next(1, 3),
            Name = Guid.NewGuid().ToString("N")
        };
    }

    var root = new XElement("root");
    Array.ForEach(discounts, d => root.Add(new XElement("CustomDiscount",
        new XElement("Name", d.Name),
        new XElement("Type", d.Type),
        new XElement("Offer", d.Offer))));
    root.Save(filePath);
}
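
For reference, the generated file has the following shape (the values here are illustrative); note that the <CustomDiscount> elements sit directly under <root> with no wrapping element:

<root>
  <CustomDiscount>
    <Name>3f2504e04f8941d39a0c0305e82c3301</Name>
    <Type>1</Type>
    <Offer>104.54</Offer>
  </CustomDiscount>
  <!-- ...repeated elementsCount times... -->
</root>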

Parsing large XML using XmlSerializer:

private static CustomDiscount[] DeserialiseLargeXml(string filePath)
{
    var xs = new XmlSerializer(typeof(CustomDiscountRoot));
    // XmlReader streams the file, so the raw document is never held in memory as a whole
    using (var reader = XmlReader.Create(filePath))
    {
        var res = (CustomDiscountRoot)xs.Deserialize(reader);
        return res.Discounts;
    }
}

The CustomDiscountRoot class maps the root of the file. Since the <CustomDiscount> elements sit directly under <root> without a wrapping element, [XmlElement] is used rather than an [XmlArray]/[XmlArrayItem] pair:

[XmlRoot("root")]
public class CustomDiscountRoot
{
    [XmlElement("CustomDiscount")]
    public CustomDiscount[] Discounts { get; set; }
}

Parsing with LINQ to XML:

private static CustomDiscount[] ParseLargeXmlWithXLinq(string filePath)
{
    // XElement.Load reads the whole document into memory at once
    var xElements = XElement.Load(filePath).Elements("CustomDiscount").ToArray();
    return xElements.Select(ConvertFrom).ToArray();
}

private static CustomDiscount ConvertFrom(XElement e)
{
    var name = e.Element("Name").Value;
    var type = e.Element("Type").Value;
    var offer = e.Element("Offer").Value;

    return new CustomDiscount
    {
        Name = name,
        Type = int.Parse(type),
        // InvariantCulture keeps parsing independent of the machine's locale
        Offer = decimal.Parse(offer, CultureInfo.InvariantCulture)
    };
}

Parsing with XmlReader:

private static CustomDiscount[] ParseLargeXmlWithXmlReader(string filePath)
{
    var discounts = new List<CustomDiscount>();
    using (var reader = XmlReader.Create(filePath))
    {
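        // ReadToFollowing advances through the stream; the document is never fully buffered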
        while (reader.ReadToFollowing("CustomDiscount"))
        {
            reader.ReadToFollowing("Name");
            string name = reader.ReadElementContentAsString();

            reader.ReadToFollowing("Type");
            int type = reader.ReadElementContentAsInt();

            reader.ReadToFollowing("Offer");
            decimal offer = reader.ReadElementContentAsDecimal();
            discounts.Add(new CustomDiscount
                {
                    Name = name,
                    Type = type,
                    Offer = offer
                });
        }
    }
    return discounts.ToArray();
}
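
The timings below can be collected with a System.Diagnostics.Stopwatch harness along the following lines. This is a hypothetical sketch, not the exact code behind the table (in particular, whether the reported numbers are averages is an assumption here):

private static long Measure(Func<string, CustomDiscount[]> parse, string filePath)
{
    const int runs = 5; // each computation was repeated 5 times
    var sw = new Stopwatch();
    long total = 0;
    for (int i = 0; i < runs; i++)
    {
        sw.Restart();
        parse(filePath);
        sw.Stop();
        total += sw.ElapsedMilliseconds;
    }
    return total / runs; // average time in milliseconds
}

// Usage ("discounts.xml" is a placeholder path):
// Console.WriteLine(Measure(ParseLargeXmlWithXmlReader, "discounts.xml"));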

Now let’s see the performance comparison (time in milliseconds) for the different approaches to dealing with large XML documents.

Objects in XML     100,000   200,000   400,000   800,000   1,600,000
XmlReader              738       809      2294      3677        8379
LinqToXml             1092      1800      2581      5245       14888
XmlSerializer          574       432       733      2962        3605

* All times are in milliseconds
* Each measurement was repeated 5 times
* .NET Framework 4.0

It is easy to see that the fastest approach for parsing large XML documents is the XmlSerializer class, which is also the most elegant and concise. It takes an XmlReader instance as its input, so it does not load the entire document into memory.
Second in performance is XmlReader. In theory it should perform about the same as XmlSerializer.Deserialize() while giving more control over the reading process. (I really like its syntax now; it is much cleaner and handier than, for instance, OpenXmlReader, which is used for reading Excel documents without loading them into memory.)
The “loser” is LINQ to XML. It loads the entire XML document into memory, so if the document is really huge, an OutOfMemoryException is possible. For moderate amounts of data (roughly below 1,000,000 objects packed inside the XML) this is negligible.
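If you still want the LINQ to XML syntax on a huge file, the two approaches can be combined: stream with XmlReader and materialize only one XElement at a time. Below is a minimal sketch of that hybrid; the StreamDiscounts helper is my addition rather than part of the code above, and it reuses the ConvertFrom method shown earlier (the System.Xml and System.Xml.Linq namespaces are assumed):

private static IEnumerable<XElement> StreamDiscounts(string filePath)
{
    using (var reader = XmlReader.Create(filePath))
    {
        reader.MoveToContent();
        while (!reader.EOF)
        {
            if (reader.NodeType == XmlNodeType.Element && reader.Name == "CustomDiscount")
            {
                // XNode.ReadFrom materializes only the current element subtree
                // and advances the reader past it
                yield return (XElement)XNode.ReadFrom(reader);
            }
            else
            {
                reader.Read();
            }
        }
    }
}

// Usage: elements are streamed one by one instead of loading the whole tree
// var discounts = StreamDiscounts(filePath).Select(ConvertFrom).ToArray();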

Summary

For extremely large documents, use the XmlSerializer.Deserialize() method if you need all of the objects from the XML, or XmlReader if you need only part of the data. This will save you from an OutOfMemoryException and from performance bottlenecks.
