Ask MAMA what the web is made of

Last year Opera released data from MAMA (Metadata Analysis and Mining Application), a search engine that trawls web pages and returns results detailing page structure: which HTML, CSS, and script are used.

MAMA examined 3,509,180 URLs in 3,011,668 domains and returned results on how many pages validate (only 4.13%), how many use Flash (33.5%), how FRAMES were used, images, CSS and so on.

This has given us such useful insights into how web developers are using code that Brian Wilson, who runs MAMA, is planning a second run and is interested in knowing what more we can check for. Some of the things I’d like to see assessed are support for ARIA, HTML5 attributes and how headings are structured.

If you have ideas, leave a comment and we’ll look at getting them included. I can’t promise it will be possible to add everything but we’ll do what we can. Don’t forget to check MAMA and see what we already cover first.

36 thoughts on “Ask MAMA what the web is made of”

  1. Gez

    MAMA is a great project, Henny, and thank you for opening it up for suggestions. If possible, I would like to see the following accessibility attributes included:

    * The headers attribute (in general)

    * The scope attribute (in general)

    * The headers attribute referencing a td

    * The scope attribute on a td

    * The longdesc attribute

    * The summary attribute

  2. Gez

    It would also be good to do some analysis on current alt text usage, such as concise, overly verbose, and appropriateness (if that’s possible – for example, alt text provided for spacer.gif).

  3. Smiffy

    What would I like to see?

    * I would like the crawl to pick up how many sites are using Dublin Core metadata and how many terms (irrespective of what they are) are used on a site.
    * Ditto for AGLS metadata, but lower priority than Dublin Core.
    * I would like the dataset to be made available for mining, even if anonymised. The more people that are able to analyse the data, the more valuable it becomes.

  4. Martin Kliehm

    Apart from the heading structure (h2 navigation -> h1 article headline, or h1 logo -> h2 navigation | h2 article headline?) I would be interested in the usage of DTD and the XML declaration:

    How many use them at all, how many use a common DTD (HTML 4 / XHTML 1.0 transitional or strict, XHTML 1.1, XHTML 2 or HTML 5 anybody?), how many use a customized DTD (like XHTML 1.1 + ARIA), how many make common errors (like “-//W3C//DTD XHTML 1.0 Transitional//DE” instead of EN).

    Regarding JavaScript it would be nice to know how popular some frameworks are (jQuery, YUI, dojo, prototype, mootools), how much of the total JavaScript filesize they are responsible for, and which framework extensions are most popular (like jQuery UI, YUI Calendar etc.). The same could be analyzed for CSS frameworks and reset styles.

  5. Raph de Rooij

    Some suggestions:

    1. Semantic mark-up (use of tables, headings, form fields etc. in accordance with the intended purpose)

    2. Deprecated elements and attributes.

    3. Use of the new RDFa DTD and other semantic web applications, such as Microformats. BTW: only the most recent versions of the W3C validator include the RDFa DTD.

    4. Use of graceful degradation techniques versus progressive enhancement techniques

    5. Separation of content from style and logic: use of in-line style and in-line event handlers.

    6. Make the new version of MAMA available for local install and use.

    In the Netherlands, we do some automated testing that may be also useful for MAMA, see http://www.webguidelines.nl/test/. A downloadable version of this tool is available at http://www.webrichtlijnen.nl/toetsen/download/voorwaarden/bestand/ (web site and installation instructions are in Dutch; this should be no problem for some of the Opera employees ;-).

  6. John F Croston III

    I would like to see more information on the following items.

    Use of table elements – how many people use TH, TR, CAPTION, SUMMARY, SCOPE with ROW and COL, ID and HEADERS, etc.
    Headings – how many websites use H1, H2, H3, H4, H5, H6 and where and how many H1’s are used on a page.
    FORM fields – how many websites use FIELDSET, LEGEND, LABEL, etc.
    Use of UL or OL in FORMs to help identify how many items in a given section of the FORM.

  7. iheni Post author

    Thanks all for your input. It’s interesting, but not surprising, to see that quite a few of us want to see more data on structure and semantics.
    Adding to my own list I’d like to see use of canvas and SVG. If there is anything else being debated by the HTML5 WG (such as summary, canvas etc) then we should probably try and see if we can get that added too (although I think Gez captured all of that).

  8. iheni Post author

    And another thing…can we specify an algorithm for deciding whether a table is a layout table or not then look at how TH, summary, caption and so on are treated.

  9. bruce

    I’d like to look for instances of microformats to see how much they’re really used and, in particular, uses of the abbr element for dates/ times, and of those, how many dates and times were imprecise (eg, no DD to go with the MMYYYY), and how many are negative dates – that is, BC (Before Christ)?

    I think that microformats in an html 5 world should move to the html 5 time element from abbr.

    Some worry that time doesn’t allow you to specify “January” or “next year”, but requires a full date and that you can’t specify B.C. dates. I suspect that no-one uses them in the wild, anyway.

  10. ppk

    Inline event handlers.
    Amount of script tags per page.
    Embedded JS.
    The defer attribute.
    Position of script tags in page (head, body, end of body).

  11. Brian Wilson

    Hi and thanks for the input so far – this is all great. I’ll try to respond to each comment individually. Generally, MAMA already tracks all evidence of element and attribute names used in a document. Attribute *values*, on the other hand, are a different story. Worried that storing individual attribute values might overwhelm the database, I only added attribute detection on a case-by-case basis where they had been previously requested (or where I thought they would be useful). This means that some pretty obvious attributes may have been overlooked so far. Tracking specific attributes is pretty easy for me to add, but the more popular the attribute is, the harder it can be to store.

  12. Brian Wilson

    Gez wrote:
    > Tracking attributes: headers, scope, longdesc, summary

    Headers, scope and summary are now on the feature to-do list. Longdesc has already been explored in the writeup:
    http://dev.opera.com/articles/view/mama-common-attributes/#longdesc

    > – The headers attribute referencing a td

    This will be a little more work to track, and may not make it this time around due to time constraints versus the size of the proven sample space currently using it – it will be a bit more involved, but I will put it on the to-do list.

  13. Brian Wilson

    Gez wrote:
    > It would also be good to do some analysis on current alt text
    > usage, such as concise, overly verbose, and appropriateness
    > (if that’s possible – for example, alt text provided for spacer.gif).

    Ack! So vague! :)

    Henny and I were talking about how best to address this one, and one way would be to track the length of all the Alt attributes. I think I can do that pretty easily, but judging whether a value is “concise”, “overly verbose” or “appropriate” is probably too fuzzy to boil down to an algorithm. If you have any ideas on how I could judge them, I’d love to hear them.

    Some ideas of Alt characteristics that might be trackable (from Henny):
    - Alt that is the image file path (or the image file name). Problem: Does this situation exist if the Alt is just the file name (minus path info, or even minus file extension)? If something is a picture of a boat, and the image is src=”../images/boat.gif” alt=”boat”, is this a good or bad thing?
    - Alt that says “image” (many variants here, including “[image]” and localized variants), “spacer”, or the file type (“Jpeg”).

    To summarize, I think looking at Alt is a good idea, I just don’t know how to examine it well enough to be both valid and produce stats that you’d find interesting.

  14. Brian Wilson

    Smiffy wrote:
    > What would I like to see? Dublin Core metadata and how many
    > terms (irrespective of what they are) are used on a site.
    > Ditto for AGLS metadata.

    I’m not familiar with AGLS – can you provide a link so that I can school myself? As for Dublin Core, from what I remember, that is mostly covered by META use and some HEAD Profile:
    http://devfiles.myopera.com/articles/575/headprofile-url.htm
    http://devfiles.myopera.com/articles/575/metanamelist-url.htm
    http://devfiles.myopera.com/articles/575/metahttpequivlist-url.htm
    I didn’t track values for those META attribute values though. Would they be interesting, and if so, any particular ones I could/should track?

    Smiffy also wrote:
    > I would like the dataset to be made available for mining, even
    > if anonymised. The more people that are able to analyse the
    > data, the more valuable it becomes.

    I agree this would be a very useful thing. We’re working on improving the usability of a crude interface I originally came up with. By crude I mean “overwhelming”. :D I’m also working to make access to the data faster. I have patience to wait for queries, but most others might not with a data set this size/complexity.

  15. Gez

    Hi Brian,

    Thank you for considering evaluating alt text.

    For concise and overly-verbose, I was thinking of something arbitrary: concise is less than 20 characters, and overly-verbose is greater than 60 characters.

    Inappropriate alt text is more difficult to find. I was thinking of examples like “image”, “spacer”, “logo”, and so on. “Boat” could be good alt text depending on the context, so I would limit it to looking for lazy values that an author might provide. I appreciate that’s still vague, but it would be good if we could get some data on lazy values provided to pass a validator.

    Another thing for appropriateness would be to search for alt text applied to decorative images, such as spacer.gif. I’ve seen examples in the past where developers have provided alt text for spacer images as a misguided attempt of search engine optimisation. I don’t know how prevalent that practice was/is, but as it has a negative impact on accessibility, it would be good to know how wide-spread it is.

    Thanks,
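
    The length cut-offs above can be sketched as a small helper. This is only an illustration, not MAMA code: the function names, the "empty" and "in-between" labels, and the spacer file-name test are assumptions made for the example.

```python
def alt_length_bucket(alt: str) -> str:
    """Bucket alt text by length using the arbitrary 20/60 character cut-offs."""
    n = len(alt.strip())
    if n == 0:
        return "empty"            # assumed label; empty alt is valid for decorative images
    if n < 20:
        return "concise"
    if n > 60:
        return "overly-verbose"
    return "in-between"           # assumed label for the 20-60 character range


def is_spacer_with_alt(src: str, alt: str) -> bool:
    """Flag non-empty alt text on an image whose file name looks like a spacer."""
    filename = src.rsplit("/", 1)[-1].lower()
    return filename.startswith("spacer.") and alt.strip() != ""
```

    For instance, alt_length_bucket("boat") falls in the "concise" bucket, while is_spacer_with_alt("../images/spacer.gif", "cheap flights") flags the keyword-stuffing case described above.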

  16. Brian Wilson

    John F Croston III wrote:
    > Use of table elements – how many people use TH, TR, CAPTION,
    > SUMMARY, SCOPE with ROW and COL, ID and HEADERS, etc.

    General information about table elements/attributes, and some of their values:
    http://dev.opera.com/articles/view/mama-tables/
    (I could do more here. Someone suggested I use a heuristic like the one mentioned here: http://html.cita.uiuc.edu/nav/dtable/dtable-rules.php to judge whether a table is a data table or complex data table. That will probably have to be next version, as I’ll have to add a lot to track all that data interacting.)

    > Headings – how many websites use H1, H2, H3, H4, H5, H6 and
    > where and how many H1’s are used on a page.

    This should cover some of that:
    http://dev.opera.com/articles/view/mama-phrase-block-list/#block
    MAMA already tracks tag quantity, but I didn’t analyze that in the last go-round.

    > FORM fields – how many websites use FIELDSET, LEGEND, LABEL,
    > etc. Use of UL or OL in FORMs to help identify how many items in
    > a given section of the FORM.

    This covers basic form element usage:
    http://dev.opera.com/articles/view/mama-forms/
    but I admit there is a lot more that could be examined. Nesting situations like lists within forms is good to look at, and something MAMA *can* do (but hasn’t done yet) with some post-processing.

  17. Brian Wilson

    ppk wrote:
    > Inline event handlers, Amount of script tags per page., Embedded
    > JS, The defer attribute.

    This *should* cover the quantity and size issue:
    http://dev.opera.com/articles/view/mama-scripting-quantities-and-sizes/
    For SCRIPT[Defer]:
    http://dev.opera.com/articles/view/mama-head-structure/#script

    > Position of script tags in page (head, body, end of body).

    This hasn’t been looked at yet, but that’s a really interesting question to me too. MAMA already has the data necessary to look at this, but it will need to be done with an additional post-analysis script.
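
    As a rough idea of what such a post-analysis pass might look like, here is a minimal sketch using Python's standard html.parser. It is not MAMA's implementation; the class and function names are hypothetical, and it assumes the page has explicit HEAD and BODY tags (this parser does not infer missing ones).

```python
from collections import Counter
from html.parser import HTMLParser


class ScriptPositionCounter(HTMLParser):
    """Count SCRIPT elements by document section and by embedded/external kind."""

    def __init__(self):
        super().__init__()
        self.section = "before-head"   # updated when <head> or <body> opens
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        if tag in ("head", "body"):
            self.section = tag
        elif tag == "script":
            attr_names = {name for name, _ in attrs}
            kind = "external" if "src" in attr_names else "embedded"
            self.counts[(self.section, kind)] += 1
            if "defer" in attr_names:
                self.counts["defer"] += 1


def count_scripts(html: str) -> Counter:
    parser = ScriptPositionCounter()
    parser.feed(html)
    parser.close()
    return parser.counts
```

    Telling "end of body" apart from "body" would need extra bookkeeping (for example, noting whether any content follows the last script), which this sketch leaves out.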

  18. Brian Wilson

    Henny wrote:
    > Detecting SVG and Canvas usage

    I built in a number of detection methods for SVG, but I just didn’t really find any in the last phase with the URLs analyzed (less than 10). Perhaps that has changed? A dedicated crawl of SVG documents would be interesting too and has been suggested before. Now that MAMA is going to be integrating some link crawling (whereas before it was based on published URL lists), this will hopefully broaden the variety of content it uncovers.

    Canvas usage should be as simple as looking for the CANVAS element, correct? MAMA can already do that. :)

  19. Andy Mabbett

    @Brucel – BC dates in microformats are an issue on Wikipedia. I second your call for research in that area, though. BC dates generally are used a lot on the web, and so should be catered for. Who knows what people will use microformats for, next?

  20. Brian Wilson

    Bruce wrote:
    > I’d like to look for instances of microformats to see how much
    > they’re really used

    I track use of the Class, Rel and Rev attribute values, which I think takes care of many microformat cases.
    http://dev.opera.com/articles/view/mama-common-attributes/#class
    http://dev.opera.com/articles/view/mama-hyperlinks/#a
    (hmm. Not sure if I put up any data on the Rev attribute, although MAMA does track its values.)

    > In particular, uses of the abbr element for dates/ times, and of
    > those, how many dates and times were imprecise (eg, no DD to
    > go with the MMYYYY), and how many are negative dates – that
    > is (BC) before christ?

    I’m not really familiar with this usage. Can you offer a pointer for more info on this? Thanks Bruce!

  21. Andy Mabbett

    Postscript: There are already timeline applications which display hCalendar microformat events visually. There’s no reason why these shouldn’t be used to plot, say, geological eras or the reigns of Roman Emperors.

  22. ppk

    Brian,

    Thanks for the pointers; I missed these reports. Script position would be interesting, but if it’s difficult to implement, never mind.

    Would it be possible to measure *which* events are most often used inline?

  23. Matijs

    Number of h1’s on a page
    Whether subsequent h’s are properly nested
    The amount of ul/ol in non-content areas
    Usage of Asp.net forms (encasing a whole page in an asp form)
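
    The first two checks could be approximated as follows. This is a crude sketch that assumes "properly nested" means no heading level is skipped on the way down (an h1 followed directly by an h3, say); it scans raw markup with a regular expression, so headings inside comments or scripts would be counted too. The function name is made up for the example.

```python
import re

# Matches the opening of h1..h6 tags; "<hr>" and "<header>" do not match.
HEADING = re.compile(r"<h([1-6])\b", re.IGNORECASE)


def heading_report(html: str) -> dict:
    """Return the h1 count, the sequence of heading levels, and any skipped steps."""
    levels = [int(m.group(1)) for m in HEADING.finditer(html)]
    # A skip is any step down the outline that jumps more than one level.
    skipped = [(a, b) for a, b in zip(levels, levels[1:]) if b > a + 1]
    return {"h1_count": levels.count(1), "levels": levels, "skipped": skipped}
```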

  24. bruce

    Gez, Brian – I wonder if there’s a list of default camera filenames? I know of dsc***, img***, pic**** etc. A category in the report lumping those together as “camera default alt” would be very useful.

    Brian – microformats have a pattern using abbr – see http://microformats.org/wiki/abbr-design-pattern. It’s generally identifiable by an abbreviation element with an attribute class=”dtstart” or class=”dtend”. It’s used for encoding dates. I’d love to know how many of those refer to BCE dates (I believe that’s the marker that microformats use for negative dates) and how many are “fuzzy dates”, that is “July 2006” rather than a specific date in July 06.

    That’s what I personally am interested in, though would love for people more familiar with the microformats concept to provide info on all uses of abbr in microformats.

    Brian, I’d also love to know how often the small element appears in the wild, and how many times it’s in footer (if you can deduce that) and how many times it contains block level elements (eg, paragraphs/ lists etc inside the small tags).

  25. Andy Mabbett

    The Nokia N95 camera uses file names like this one: 200220091909.jpg where the first 8 digits are DDMMYYYY and the remaining 4 sequential, regardless of date. Why not just search for alt="[*].jpg", alt="[*].png" etc?

    Not all microformats use abbr for dates, and there are other classes to identify dates, such as bday (in hCard), published (in hAtom & hAudio), updated (in hAtom). The only way microformats identify BCE dates is that the year value is negative (per ISO 8601). Fuzzy dates, in microformats, would have values like YYYY or YYYY-MM, instead of YYYY-MM-DD. Use of BCE dates in microformats is not yet well understood, let alone supported.
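
    To make the distinction concrete, here is a minimal sketch of how a date value could be classified as BCE and/or fuzzy. It assumes hyphenated values of the form YYYY, YYYY-MM or YYYY-MM-DD with an optional leading minus sign and an optional trailing time part; the function name and the returned keys are invented for the example.

```python
import re

# Optional minus sign, four-digit year, then optional -MM and -DD parts.
# The (?!\d) guard rejects unhyphenated values such as "20090220".
DATE = re.compile(r"^(-?)(\d{4})(?!\d)(?:-(\d{2})(?:-(\d{2}))?)?")


def classify_date(value: str):
    """Return BCE/fuzzy flags for a date value, or None if it is not recognised."""
    m = DATE.match(value.strip())
    if not m:
        return None
    sign, _year, month, day = m.groups()
    precision = "day" if day else "month" if month else "year"
    return {"bce": sign == "-", "fuzzy": precision != "day", "precision": precision}
```

    A value such as "2006-07" comes back as fuzzy with month precision, while free text such as "July 2006" is not recognised at all and returns None.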

    But surely it’s more interesting to find BCE & fuzzy dates which don’t use microformats but potentially could? After all, there are not many instances of HTML5 in the wild yet, but that doesn’t mean we won’t try to make it work well when there are.

    For all uses of ABBR in microformats, the search pattern would be to find ABBR elements nested inside an element carrying one of the microformat root class names (vcard, vevent, hatom, biota, haudio, hrecipe, hresume, hreview, xfolk, hproduct & adr). There would be some false positives, but they could be weeded out by checking that the ABBR carries a class, if not a specific microformat class.

    Examples of ABBR in a microformat might simply be [span class="vcard"][abbr class="fn" title="David Bowie"]Bowie[/abbr][/span] or [span class="adr"][abbr class="locality" title="Birmingham"]B'ham[/abbr][/span]
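
    The search pattern described above could be sketched with Python's standard html.parser as below. This is an illustration only: the class name is hypothetical, the root class list is the one given above, and the open-element stack is a simplification that copes with void elements and some unclosed tags but is not a full HTML tree builder.

```python
from html.parser import HTMLParser

ROOT_CLASSES = {"vcard", "vevent", "hatom", "biota", "haudio", "hrecipe",
                "hresume", "hreview", "xfolk", "hproduct", "adr"}
# Elements that never have an end tag, so they must not be pushed on the stack.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "param", "source", "track", "wbr"}


class MicroformatAbbrFinder(HTMLParser):
    """Collect ABBR elements that carry a class and sit inside a microformat root."""

    def __init__(self):
        super().__init__()
        self.stack = []   # (tag, is_microformat_root) for each open element
        self.hits = []    # (class attribute, title attribute) per matching abbr

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = set((attrs.get("class") or "").split())
        inside_root = any(is_root for _, is_root in self.stack)
        if tag == "abbr" and inside_root and classes:
            self.hits.append((attrs.get("class"), attrs.get("title")))
        if tag not in VOID:
            self.stack.append((tag, bool(classes & ROOT_CLASSES)))

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag, tolerating unclosed elements.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break
```

    Feeding it the first example above, written as real markup, records one hit with class "fn" and title "David Bowie"; an ABBR outside any root element is ignored.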

  26. Brian Wilson

    *sigh* I think I’m overthinking this treatment of Alt. I’ve spent a lot of time this weekend trying to explore the parameters of this one and I don’t think I’m hitting the right points on it.

    Currently, I’ve leaned toward separating “common” Alt usage into 5 categories:
    - spaces only
    - “icons” for typical bullets or other uses (like “*”, “-” and “?”)
    - “generic” labels, like: “image|spacer|logo|home|google|search|contact|rss|counter”
    I built this list based on Philip’s great table at: http://philip.html5.org/data/common-alt-values.txt although you can see that these can easily be localized, making my naive search moot.
    - camera files
    - file extensions

    The generic types were based on what we already discussed, plus some key “obvious” ones from Philip’s list. The Camera files seem to have a huge list of permutations, so my candidate list has sort of ballooned out of control. I’m also not sure if the file extension list is feasible either. I’ve expanded some of these so that there are some extra common uses that would also be covered, like using encapsulating “[", "]” or parens. I keep finding holes in my strategy that miss really obvious ones, like Andy’s “200220091909.jpg” above.

    I’ll keep plugging away at this until I get it right…or at least “right enough” to be useful. :)
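
    For readers who want to experiment, the categories above might be approximated like this. It is a naive sketch, not MAMA's actual code: the camera prefixes (dsc, img, pic) come from earlier in this thread, the extension list is an assumption, and none of the patterns handle localized values.

```python
import re

# Generic labels, optionally wrapped in brackets or parens, e.g. "[image]".
GENERIC = re.compile(
    r"^[\[(]?(image|spacer|logo|home|google|search|contact|rss|counter)[\])]?$",
    re.IGNORECASE)
# Camera-style names such as "DSC123443.jpg" or "img_0042".
CAMERA = re.compile(r"^(dsc|img|pic)[_-]?\d+(\.\w+)?$", re.IGNORECASE)
# Assumed list of image file extensions.
FILE_EXT = re.compile(r"\.(gif|jpe?g|png)$", re.IGNORECASE)
ICONS = {"*", "-", "?"}


def classify_alt(alt: str) -> str:
    """Assign an alt value to one of the common categories, or "other"."""
    if alt == "":
        return "empty"
    text = alt.strip()
    if text == "":
        return "spaces only"
    if text in ICONS:
        return "icon"
    if GENERIC.match(text):
        return "generic"
    if CAMERA.match(text):
        return "camera file"
    if FILE_EXT.search(text):
        return "file extension"
    return "other"
```

    With these patterns a value like “200220091909.jpg” is at least caught by the file extension check, even though it does not match any camera prefix.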

  27. Brian Wilson

    ppk wrote:
    > Thanks for the pointers; I missed these reports. Script position
    > would be interesting, but if it’s difficult to implement, never mind.

    I don’t think this would be a major imposition, other than I’ll have to roll it into some post-processing of the condensed tag list for each document.

    > Would it be possible to measure *which* events are most often
    > used inline?

    I think this covers what you are asking for:
    http://dev.opera.com/articles/view/mama-event-handler-attributes/
    If there are other interesting things about events I can have MAMA mine, please let me know.

  28. bruce

    Brian, not so interested in abbr contents, no – as it’s a human-only job to verify that the contents and the title attributes match up. And I’m lazy.

    Andy Mabbett asked “Why not just search for alt=”[*].jpg”, alt=”[*].png” etc?”. Because alt=”my-yellow-house.jpg” isn’t perfect alt text, but it’s human-friendly compared with “DSC123443.jpg”. I’m interested in the prevalence of totally useless alt text, probably inserted by a CMS or user-generated content website.

  29. Isofarro

    Look for pages that contain links to destination anchors ( [a name="..."][/a] ) or @id elements on the page they are currently on (e.g. skip links).

    So look for occurrences of links that in a normalised form look like this: [a href="#NAME"]…[/a]. Normalised here means anchors that point to somewhere in the current page, not another page.

    Collate the following:
    1.) The link text used (e.g. Skip to content, Search this site)
    2.) the name of the fragment (href=”#name”)
    3.) Whether the destination is an element with id, or an anchor with a name attribute, or whether the destination point doesn’t exist on the page
    4.) Where the id method is used, categorise which HTML element is being used (e.g. DIV, H1, A, FORM etc.)
    5.) Where the method is an anchor with a name attribute, catalogue the destination ‘link’ text (well, link text didn’t sound right for a destination anchor)

    It would be interesting to compare this with the usage of the ARIA role attribute. Both are used for signposting/destinations.
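
    A minimal sketch of items 2 to 4 of the list above, using Python's standard html.parser. The class and function names are hypothetical, collecting the link text (items 1 and 5) is left out to keep it short, and only links written as href="#name" are considered (full URLs that point back at the current page are not normalised).

```python
from html.parser import HTMLParser


class FragmentLinkAudit(HTMLParser):
    """Record in-page links and the elements that could be their destinations."""

    def __init__(self):
        super().__init__()
        self.links = []     # fragment names used in href="#name"
        self.targets = {}   # fragment name -> ("id", tag) or ("name", "a")

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        href = attrs.get("href") or ""
        if tag == "a" and href.startswith("#") and len(href) > 1:
            self.links.append(href[1:])
        if attrs.get("id"):
            self.targets.setdefault(attrs["id"], ("id", tag))
        if tag == "a" and attrs.get("name"):
            self.targets.setdefault(attrs["name"], ("name", "a"))


def audit(html: str) -> dict:
    """Map each linked fragment to how (or whether) its destination is marked up."""
    parser = FragmentLinkAudit()
    parser.feed(html)
    parser.close()
    return {name: parser.targets.get(name, ("missing", None))
            for name in parser.links}
```

    Destinations are resolved only after the whole page has been parsed, so a skip link at the top of the page can point at an element that appears later.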

  30. iheni Post author

    Phew, thanks all, that’s quite the list. I’ve now tidied it up and sent it to Brian so stay tuned for updates. If you have other stuff to add though please don’t hold back…this is an ongoing project.

    I’ve also added in a request to search for media queries, handheld stylesheets, ActiveX and browser sniffing as well as expanded on a couple of the things above.
    Good effort all round, thank you!

  31. Mykalining

    It is remarkable how characteristic it is of a person to guard his image against worship that would make it ridiculous or too far removed from the original, and therefore implausible.

Comments are closed.