Issues Indexing Youtube Videos - What parser do I need with the Youtube filter in Standard Analyzer?

  • L

    L - 2014-04-01

    Hi everyone,

    I have YouTube videos embedded on a few pages and I am looking for a way to index them. I did a little digging and found OSS's YouTube filter.

    I updated the StandardAnalyzer schema to include the YouTube filter for both the title and description fields, then recrawled the page with the embedded video, which didn't work. I half expected this result, since the documentation says to crawl YouTube itself.

    So I tried a manual crawl of the video itself using the direct YouTube URL (e.g. …). I disabled robots.txt in case that was causing any issues and added the exact URL to the inclusion rules. I got the following results:

    Fetch status: Fetched
    Parser status: Parsed non-canonical
    Index status: Not indexed
    Response Code: 200
    Content length: -1

    I am using OSS v1.5.2.

    Which parser would I need to modify/add in order for the videos to be indexed from Youtube?

    Is there a way for OSS to get the YouTube title and description from the embed, so that this information is indexed with the content page?

    Thanks in advance!

    Last edit: L 2014-04-02
  • L

    L - 2014-04-02

    I changed the HTML parser parameter "Ignore non canonical" to False and the YouTube video was indexed. But the indexed content reads "Upload Sign in, Search, Loading... This video is unavailable", which isn't ideal. Is there something I am missing?

  • Alexandre Toyer

    Alexandre Toyer - 2014-04-03


    Here are the steps to set up YouTube title extraction:

    • Create a new field in the schema to hold, for example, the video's title
    • Create a new analyzer as shown in sf_youtube_analyzer.png
      • Take care to choose KeywordTokenizer for indexation
      • The test section at the bottom shows how it works
    • In the HTML parser, add a mapping between htmlSource and your new field, and apply the newly created analyzer to this mapping (sf_youtube_parser.png)

    You can then index pages containing links to YouTube videos. The video titles will be extracted and indexed in the new field.

    As you can see in sf_youtube_search.png, there are two titles in one document, because the page I indexed contained two links to YouTube.
    If you search directly for a video title, the document is found as well, since the title has been properly tokenized by the StandardTokenizer set on the field (sf_youtube_search_chip_and_dale.png).
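
For readers who want to see the idea outside of OSS, here is a minimal Python sketch of what such a chain does conceptually: find YouTube links in a page's HTML source, then look up each video's title. This is not the OSS filter's actual implementation. The function names, the regular expression, and the sample video IDs are all hypothetical, and using YouTube's oEmbed endpoint for the title lookup is an assumption about one convenient way to do it.

```python
import json
import re
import urllib.parse
import urllib.request

# Matches common YouTube link/embed forms and captures the 11-character video ID.
# Illustrative pattern only; it does not cover every URL shape YouTube accepts.
YOUTUBE_ID_RE = re.compile(
    r"(?:youtube(?:-nocookie)?\.com/(?:watch\?(?:\S*?&(?:amp;)?)?v=|embed/|v/)"
    r"|youtu\.be/)"
    r"([A-Za-z0-9_-]{11})"
)


def extract_video_ids(html_source):
    """Return the distinct video IDs found in the HTML, in order of first appearance."""
    seen = []
    for video_id in YOUTUBE_ID_RE.findall(html_source):
        if video_id not in seen:
            seen.append(video_id)
    return seen


def fetch_title(video_id, timeout=10):
    """Look up a video's title via YouTube's oEmbed endpoint (performs a network call)."""
    query = urllib.parse.urlencode({
        "url": "https://www.youtube.com/watch?v=" + video_id,
        "format": "json",
    })
    url = "https://www.youtube.com/oembed?" + query
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)["title"]


if __name__ == "__main__":
    # Hypothetical page with two distinct videos, one of them linked twice.
    sample = (
        '<iframe src="https://www.youtube.com/embed/aaaaaaaaaaa"></iframe> '
        '<a href="https://www.youtube.com/watch?feature=share&amp;v=bbbbbbbbbbb">x</a> '
        '<a href="http://youtu.be/aaaaaaaaaaa">same video again</a>'
    )
    print(extract_video_ids(sample))
```

Calling `fetch_title` on each extracted ID would yield one title per linked video, which mirrors the two titles stored in a single document in the example above.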


