
UploadEuroPythonStreams

This package provides routines and a utility to prepare and automate the upload of video streams from EuroPython conferences to the EuroPython Society (EPS) owned accounts at archive.org and YouTube (the video sites).

Background

During a EuroPython conference, recordings are made of talks, keynotes and other events. This results in video streams that might need additional handling (editing, leader insertion, making them conform to the Code of Conduct, cutting, conversion). Eventually this should result in multiple video files, one per event, to be uploaded to YouTube and archive.org.

The event scheduling software has metadata that should be associated with the above-mentioned video files. The metadata determines the title and other artefacts that can be associated with an uploaded stream on the video site.

Input for 2014

In 2014 the videos were originally uploaded to a non-EuroPython owned account on YouTube only, with metadata that was incompatible with how previous EuroPython streams (2013 and earlier) were uploaded. Re-uploading these streams to the video sites would have resulted in quality loss, but an FTP site with the higher quality original streams was found; these were combined with metadata scraped from the website and uploaded to the video sites. That software was not published and was for a large part specific to the special circumstances of 2014.

Input for 2015

The video streams for the 2015 process were delivered as the 430 GB content of two NTFS-formatted hard drives. The videos were (mostly) split and stored in a directory hierarchy depicting room, date, am/pm, and talk slot (1-4). Luis J. Salvatierra provided the USB 3 discs from the company hired to record the conference talks. The discs also contained some non-relevant material (such as Western Digital provided backup software). Unfortunately the discs were not exact copies: on one of them two directory names were changed (in a failed attempt to correct the spelling).

The metadata was delivered by Alexandre M. Savio in a single JSON file. This was done after the conference was over, to take into account rescheduled, cancelled or otherwise changed info. The information in the JSON file consists of a top-level event type dictionary with keys Keynotes, Talks, etc., whose values map event numbers to event-specific metadata. The start of the file looked like:

{
  "Keynotes": {
    "364": {
      "track_title": "Google Room",
      "speakers": "Carrie Anne Philbin",
      "abstracts": [
        "The problem of introducing children to programming ....."
        "",
        ""
      ],
      "tags": [
        "python"
      ],
      "duration": 60,
      "title": "Keynote: Designed for Education: A Python Solution",
      "timerange": "2015-07-23 09:30:00, 2015-07-23 10:30:00",
      "have_tickets": [
        true
      ],
      "id": 364,
      "emails": "*******@raspberrypi.org"
    },
    "365": {
    .
    .
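
For orientation, this two-level structure can be read back with a few lines of Python; a minimal sketch, assuming the file is available under the name used later in the year directory:

import json

# load the metadata delivered as a single JSON file (layout as shown above)
with open('talk_abstracts.json') as fp:
    events = json.load(fp)

# top level: event type -> {event number (a string) -> event metadata}
for event_type, talks in sorted(events.items()):
    print('{0}: {1} events'.format(event_type, len(talks)))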

Processing overview

The target of the processing is to combine the relevant part of a video stream input (or inputs, in case a talk was split over multiple files) with the appropriate metadata, and then upload both to the video sites.

For that there are three main inputs that need to be combined:

  • video files
  • metadata about the talks
  • in and out points for each talk related to one or more video files

Not all of this data is available in its final form from the start, and the in/out point info cannot be determined until the other two are available and have been combined.

Given the number of videos, this process should be automated as much as possible and manual intervention should be restricted to preparatory steps, so that lengthy processes, like the actual uploads, can be done automatically. The processing should be able to deal with that and incorporate new info as it becomes available. In particular, manually added/changed data should never be lost/overwritten.

Why processing partial info?

While the conference is going on, only partial info is available: future talks have no streams yet and their metadata might still change (rescheduling, cancellation).

If the live streams stay online for a long time, people will start to link to these files on YouTube. Bringing individual talk videos with metadata online and unlinking/removing the live streams enhances the chances that people link to the final version. This is especially important for any comments made on the live stream, which cannot be transferred to the individual talk uploads.

Why two video upload sites?

YouTube is currently arguably the most popular site for watching videos. Videos uploaded there are scaled and processed, which makes it impossible to download the original quality. YouTube is geared towards watching the videos online, not towards downloading.

Archive.org has a long-term preservation mission, and handles the videos in their original quality (as long as a file is 10 GB or less). It also supports downloading the videos, in addition to watching them on the site.

Commandline utilities

ueps

The whole process of aligning metadata with videos, selecting, and uploading is supported by the various subcommands of the ueps utility, short for uploadeuropythonstreams. The actual steps to bring video and metadata together as an uploaded whole consist of:

  • set up the ueps configuration file for the year
  • convert JSON metadata to YAML and flatten
  • copy original video data
  • create cleaned up video tree structure with links to originals
  • clean up video data
  • relate event id to videos
  • upload

archive-upload

This is a separate utility to upload single files to archive.org. Originally this was part of ueps, but the packages used by the internetarchive package (which does the uploading) depend on specific versions of other packages (such as six) that clash with, e.g., the ones used by youtube-upload. Therefore this was split off into its own utility, to be installed into a separate virtualenv and run from there.

Third party

Apart from the home-brew utilities ueps and archive-upload, the ueps utility relies on youtube-upload, mediainfo and avconv (a special version including AAC support) being available.

I also used youtube-dl to get live streams from YouTube for those events for which no data was available on disc.

Preparation

The following assumes wrapper scripts installed under some path (~/bin), with explicit paths to the virtualenvs set up for the utilities.

Installing ueps

The ueps utility should be set up in a python2.7 virtual environment on Linux using:

virtualenv /home/venv/ueps
/home/venv/ueps/bin/pip install uploadeuropythonstreams

The source is maintained on bitbucket.

The original name of the utility was upeuros (for UPloadEUROpythonStream), but I was not sure this would collide with the CoC (try to pronounce it...).

I have a file ~/bin/ueps, with execute permission bits set, that looks like:

#!/home/venv/ueps/bin/python

from ruamel.uploadeuropythonstreams import main

main()

Installation of archive-upload

The archive-upload utility should be set up in a python2.7 virtual environment on Linux using:

virtualenv /home/venv/epau
/home/venv/epau/bin/pip install europythonarchiveupload

The source is maintained on bitbucket.

I have a file ~/bin/archive-upload, with execute permission bits set, that looks like:

#!/home/venv/epau/bin/python

from ruamel.europythonarchiveupload import main

main()

Installation of other utilities

mediainfo

Compiling and installing the latest version might not be necessary, but it is relatively straightforward. I used the information from the download page to download the commandline (CLI) version 0.7.76.

Extract the file, change to the newly created directory and run:

./CLI_Compile.sh

Afterwards, install mediainfo using the instructions printed by the script.

avconv with AAC support

# yasm, from: http://yasm.tortall.net/Download.html
mkdir yasm
cd yasm
wget http://www.tortall.net/projects/yasm/releases/yasm-1.2.0.tar.gz
tar xvf yasm*.gz
cd yasm*
./configure && make && sudo make install
cd ../..

# from: http://stackoverflow.com/a/11236763/1307905

# x264
git clone git://git.videolan.org/x264.git x264
cd x264
mkdir avconv-source
./configure --enable-static
make
sudo make install
cd ..

# fdk-aac, from:
# http://wiki.hydrogenaud.io/index.php?title=Fraunhofer_FDK_AAC#Libav.2Favconv
git clone git://github.com/mstorsjo/fdk-aac.git fdk-aac
cd fdk-aac
./autogen.sh
./configure --prefix=/usr --disable-static
make
sudo make install
cd ..

git clone git://git.libav.org/libav.git avconv
cd avconv
./configure --enable-libx264 --enable-libfdk-aac --enable-nonfree --enable-gpl
make
make install

avconv -codecs | grep aac
avconv -codecs | grep 26

The configuration file

Both utilities use the configuration directory path ~/.config/uploadeuropythonstreams, and a YAML config file named ueps.yaml is automatically created when you run:

ueps config --edit

Edit this file as specified below and adjust values according to the EOL comments.

When specifying directories/files in this YAML file, a name starting with / is taken as an absolute path; otherwise it is normally relative to the base or another parent directory.
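
A helper implementing that rule could look like the following sketch; the function name and the exact precedence are assumptions, not necessarily what ueps does internally:

import os

def resolve(name, base_dir, year=None):
    # absolute if the name starts with "/" ...
    if name.startswith('/'):
        return name
    # ... otherwise relative to base_dir (and the year subdir if applicable)
    parts = [base_dir]
    if year is not None:
        parts.append(str(year))
    parts.append(name)
    return os.path.join(*parts)

# resolve('original', '/data0/DATA/EuroPython', 2015)
#   -> '/data0/DATA/EuroPython/2015/original'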

Example for 2015, edit with ueps config --edit:

global:
  verbose: 0
  year: 2015
  base_dir: /data0/DATA/EuroPython
  # any dir/file if not starting with "/" is relative to base_dir
  original: original              # original video dir
  map-file: map_video.yaml
  video_dir: video             # here we'll have A1, A2, A3, B1, B2
  metadata-dir: metadata       # directory for the individual files
  flat-yaml: flatten.yaml      # used to generate individual metadata files
metadata:
  json: talk_abstracts.json    # file from Alexandre
video:
  splitpoints: video_split_points.yaml
2015:
  location: Bilbao, Euskadi, Spain  # this is used during uploading
  coordinates:                      # for YouTube
  - 43.26679
  - 2.94361
  video_ignore:                     # files/dirs under original to ignore
  - autorun/.*
  - autorun.inf$
  - ReadMe.pdf$
  - \$RECYCLE.BIN/.*
  - System Volume Information/.*
  - WD Smartware Pro Free Trial/.*

  # mappings from regex to new path/filename, processed until one matches
  video_map_path:
     # [^\.]  at the beginning of the filename to filter out hidden files
     # that are probably rsync residues
    .*ANYWEAR/.* .* (?P<day>\d*) .*/(?P<ampm>[A|P]M)/[^\.].* (?P<num>\d)( \(Output 1\))?\.(?P<ext>\w{3,4}):
       A3/2015-07-{day}/{ampm}_{num}.{ext}
    .*GOOGLE/(?P<day>\d*) .*(?P<ampm>[A|P]M)/[^\.].* (?P<num>\d)( ?\(PGM\))\.(?P<ext>\w{3,4}):
       A1/2015-07-{day}/{ampm}_{num}.{ext}
    .* A2/\w* \d (?P<day>\d*) .*/(.* )?(?P<ampm>[A|P]M)/[^\.].*\.(?P<num>\d)\.(?P<ext>\w{3,4}):
       A2/2015-07-{day}/{ampm}_{num}.{ext}
    .* A2/\w* \d (?P<day>\d*) .*/(?P<ampm>[A|P]M)/[^\.].*\.(?P<ext>\w{3,4}):
       A2/2015-07-{day}/{ampm}.{ext}
    .* A2/\w* \d (?P<day>\d*) .*/.* (?P<ampm>[A|P]M)/[^\.].*\.(?P<ext>\w{3,4}):
       A2/2015-07-{day}/{ampm}.{ext}
    .*BARRIA 1/\w* \d (?P<day>\d*) .*/(?P<ampm>[A|P]M)/[^\.].* (?P<num>\d)( ?\(Output 1\))?\.(?P<ext>\w{3,4}):
       B1/2015-07-{day}/{ampm}_{num}.{ext}
    .*BARRIA 2/\w* \d (?P<day>\d*) .*/(?P<ampm>[A|P]M)/[^\.].* (?P<num>\d)( \(Output 1\))?\.(?P<ext>\w{3,4}):
       B2/2015-07-{day}/{ampm}_{num}.{ext}

If you name things consistently you should be able to reuse parts of the file from year to year. But in general this is a run-once program that is adapted for the next year without necessarily keeping backwards compatibility (unless we need to upload to another video site at some point, but then checking out an old version might be good enough).

The use of video_map_path is described as part of the steps to massage the data.
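
The gist of video_map_path is an ordered list of regex-to-template pairs, tried until one matches, with the named groups filling in the target template. A minimal sketch of applying such a mapping (map_path is a hypothetical name; the single pattern shown is one of the real entries above):

import re

# one (pattern, template) pair taken from the configuration above;
# in reality all video_map_path entries are tried in order
VIDEO_MAP_PATH = [
    (r'.*GOOGLE/(?P<day>\d*) .*(?P<ampm>[A|P]M)/[^\.].* (?P<num>\d)( ?\(PGM\))\.(?P<ext>\w{3,4})',
     'A1/2015-07-{day}/{ampm}_{num}.{ext}'),
]

def map_path(original_path):
    # return the flattened target path, or None if nothing matches
    for pattern, template in VIDEO_MAP_PATH:
        m = re.match(pattern, original_path)
        if m:
            return template.format(**m.groupdict())
    return None   # unmapped: delete the file or extend the mapping

print(map_path('ROOM GOOGLE/22 Julio AM/Europython  22 AM 1 (PGM).mp4'))
# -> A1/2015-07-22/AM_1.mp4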

The above is the configuration on the server at my home. On the server used to upload to archive.org, filled using rsync by Luis, the original directory is absolute (/home/luis/video) and not under .../EuroPython/2015.

Passwords

In order to be able to upload, the user names and passwords for the accounts need to be available. YouTube uploading also needs secrets, which were partly pre-generated and partly generated on first run (with --auth-browser). The relevant files are stored next to ueps.yaml:

  • archive_org.yaml
  • youtube_com_secrets.json (downloaded from YouTube)
  • youtube-upload-credentials.json (generated with --auth-browser and moved)

Directory structure

The "normal" layout is a base directory with a subdirectory for each year and that "year" directory holding various year specific subdirs and data files:

/data0/DATA/EuroPython/
+-- 2014
|   +-- metadata
|   +-- videos
|   |   +-- ...
|   |   ...
|   ...
`-- 2015
    +-- flatten.yaml
    +-- map_video.yaml
    +-- metadata
    |   +-- 2015_07_20_AM_A1_0_999_Welcome.yaml
    |   ...
    |   `-- 2015_07_24_PM_B2_5_053_Speeding_up_search_with_locality_sensitive_hashing.yaml
    +-- original
    |   +-- autorun
    |   |   `-- wdlogo.ico
    |   +-- autorun.inf
    |   +-- ReadMe.pdf
    |   +-- $RECYCLE.BIN
    |   |   ...
    |   +-- ROOM A2
    |   |   +-- DIA 1 20 Julio
    |   |   |   ...
    |   |   +-- DIA 2 21 Julio
    |   |   |   ...
    |   |   +-- DIA 3 22 Julio
    |   |   |   +-- AM
    |   |   |   |   `-- Live Streaming Room A2 2015-07-22 AM.f4v
    |   |   |   `-- PM
    |   |   |       `-- Live streaming from room A2 2015-07-22 PM.f4v
    |   |   +-- DIA 4 23 Julio
    |   |   |   ...
    |   |   `-- DIA 5 24 Julio
    |   |       ...
    |   +-- ROOM BARRIA 1
    |   |   +-- DIA 1 20 Julio
    |   |   |   +-- AM
    |   |   |   |   +-- BARRIA 1 - ponencia 1(Output 1).mp4
    |   |   |   |   +-- BARRIA 1 - ponencia 2 (Output 1).mp4
    |   |   |   |   `-- BARRIA 1 - ponencia 3(Output 1).mp4
    |   |   |   `-- PM
    |   |   |       +-- BARRIA 1 - ponencia 4(Output 1).mp4
    |   |   |       +-- BARRIA 1 - ponencia 5 (Output 1).mp4
    |   |   |       +-- BARRIA 1 - ponencia 6 (Output 1).mp4
    |   |   |       `-- BARRIA 1 - ponencia 7 (Output 1).mp4
    |   |   +-- DIA 2 21 Julio
    |   |   ...
    |   +-- ROOM BARRIA 2
    |   |   +-- DIA 1 20 Julio
    |   |   |   ...
    |   |   +-- DIA 2 21 Julio
    |   |   |   ...
    |   |   +-- DIA 3 22 Julio
    |   |   |   +-- AM
    |   |   |   |   +-- Live Streaming from Barria 2 2015-07-22 ponencia  1 (Output 1).mp4
    |   |   |   |   +-- Live Streaming from Barria 2 2015-07-22 ponencia  2 (Output 1).mp4
    |   |   |   |   `-- Live Streaming from Barria 2 2015-07-22 ponencia 3 (Output 1).mp4
    |   |   |   `-- PM
    |   |   |       +-- Live Streaming from Barria 2 2015-07-22 ponencia  4 (Output 1).mp4
    |   |   |       +-- Live Streaming from Barria 2 2015-07-22 ponencia  5 (Output 1).mp4
    |   |   |       +-- Live Streaming from Barria 2 2015-07-22 ponencia  6 (Output 1).mp4
    |   |   |       `-- Live Streaming from Barria 2 2015-07-22 ponencia  7 (Output 1).mp4
    |   |   ...
    |   +-- ROOM GOOGLE
    |   |   +-- 20 Julio AM
    |   |   |   ...
    |   |   +-- 20 Julio PM
    |   |   |   +-- Directos Euskalduna101 0(PGM).mov
    |   |   |   +-- Directos Euskalduna101 1 (PGM).mov
    |   |   |   ...
    |   |   +-- 21 Julio PM
    |   |   |   ...
    |   |   +-- 21 Jullio AM
    |   |   |   +-- Europython  21 AM 1 (PGM).mp4
    |   |   |   +-- Europython  21 AM 2 (PGM).mp4
    |   |   |   +-- Europython  21 AM 3 (PGM).mp4
    |   |   |   `-- Europython  21 AM 4 (PGM).mp4
    |   |   +-- 22 Julio AM
    |   |   |   +-- Europython  22 AM 1 (PGM).mp4
    |   |   |   +-- Europython  22 AM 2 (PGM).mp4
    |   |   |   +-- Europython  22 AM 3 (PGM).mp4
    |   |   |   `-- Europython  22 AM 4 (PGM).mp4
    |   |   ...
    |   +-- ROOM PHYTON ANYWEAR
    |   |   +-- DIA 1 20 Julio
    |   |   |   ...
    |   |   +-- DIA 2 21 Julio
    |   |   |   +-- AM
    |   |   |   |   +-- Sala Python 1 (Output 1).mp4
    |   |   |   |   +-- Sala Python 2 (Output 1).mp4
    |   |   |   |   `-- Sala Python 3 (Output 1).mp4
    |   |   |   `-- PM
    |   |   |       +-- Sala Python 4 (Output 1).mp4
    |   |   |       +-- Sala Python 5 (Output 1).mp4
    |   |   |       `-- Sala Python 6 (Output 1).mp4
    |   |   ...
    |   +-- System Volume Information
    |   |   +-- IndexerVolumeGuid
    |   |   +-- MountPointManagerRemoteDatabase
    |   |   `-- _restore{10BF4F30-BD90-46CF-AFA6-76DD512DBC6C}
    |   |       `-- RP532
    |   |           +-- change.log
    |   |           `-- S0083239.Acl
    |   `-- WD Smartware Pro Free Trial
    |       +-- WDSmartWareProFreeTrial.exe
    |       `-- WDSmartWareProFreeTrial.tmx
    +-- talk_abstracts.json
    +-- video
    |   +-- A1
    |   |   +-- 2015-07-20
    |   |   |   ...
    |   |   +-- 2015-07-21
    |   |   |   +-- AM_1.mp4 -> ../../../original/ROOM GOOGLE/21 Jullio AM/Europython  21 AM 1 (PGM).mp4
    |   |   |   +-- AM_2.mp4 -> ../../../original/ROOM GOOGLE/21 Jullio AM/Europython  21 AM 2 (PGM).mp4
    |   |   |   +-- AM_3.mp4 -> ../../../original/ROOM GOOGLE/21 Jullio AM/Europython  21 AM 3 (PGM).mp4
    |   |   |   +-- AM_4.mp4 -> ../../../original/ROOM GOOGLE/21 Jullio AM/Europython  21 AM 4 (PGM).mp4
    |   |   |   +-- PM_5.mp4 -> ../../../original/ROOM GOOGLE/21 Julio PM/Europython  21 PM 5 (PGM).mp4
    |   |   |   +-- PM_6.mp4 -> ../../../original/ROOM GOOGLE/21 Julio PM/Europython  21 PM 6 (PGM).mp4
    |   |   |   +-- PM_7.mp4 -> ../../../original/ROOM GOOGLE/21 Julio PM/Europython  21 PM 7 (PGM).mp4
    |   |   |   `-- PM_8.mp4 -> ../../../original/ROOM GOOGLE/21 Julio PM/Europython  21 PM 8 (PGM).mp4
    |   |   +-- 2015-07-22
    |   |   |   +-- AM_1.mp4 -> ../../../original/ROOM GOOGLE/22 Julio AM/Europython  22 AM 1 (PGM).mp4
    |   |   |   +-- AM_2.mp4 -> ../../../original/ROOM GOOGLE/22 Julio AM/Europython  22 AM 2 (PGM).mp4
    |   |   |   +-- AM_3.mp4 -> ../../../original/ROOM GOOGLE/22 Julio AM/Europython  22 AM 3 (PGM).mp4
    |   |   |   ...
    |   |   ...
    |   +-- A2
    |   |   +-- 2015-07-20
    |   |   |   ...
    |   |   +-- 2015-07-21
    |   |   |   +-- AM_0.f4v -> ../../../original/ROOM A2/DIA 2 21 Julio/Livestreaming Room A2 2015-07-21 AM/Livestreaming From Room A2 2015-07-21 AM.0.f4v
    |   |   |   +-- AM.f4v -> ../../../original/ROOM A2/DIA 2 21 Julio/Livestreaming Room A2 2015-07-21 AM/Livestreaming From Room A2 2015-07-21 AM.f4v
    |   |   |   +-- PM.f4v -> ../../../original/ROOM A2/DIA 2 21 Julio/Livestreaming Room A2 2015-07-21 PM/sample.f4v
    |   |   |   `-- PM.mpg -> ../../../original/ROOM A2/DIA 2 21 Julio/Livestreaming Room A2 2015-07-21 PM/MP2_Jul21_183859_0.mpg
    |   |   ...
    |   +-- A3
    |   |   ...
    |   +-- B1
    |   |   +-- 2015-07-20
    |   |   |   +-- AM_1.mp4 -> ../../../original/ROOM BARRIA 1/DIA 1 20 Julio/AM/BARRIA 1 - ponencia 1(Output 1).mp4
    |   |   |   +-- AM_2.mp4 -> ../../../original/ROOM BARRIA 1/DIA 1 20 Julio/AM/BARRIA 1 - ponencia 2 (Output 1).mp4
    |   |   |   +-- AM_3.mp4 -> ../../../original/ROOM BARRIA 1/DIA 1 20 Julio/AM/BARRIA 1 - ponencia 3(Output 1).mp4
    |   |   |   +-- PM_4.mp4 -> ../../../original/ROOM BARRIA 1/DIA 1 20 Julio/PM/BARRIA 1 - ponencia 4(Output 1).mp4
    |   |   |   +-- PM_5.mp4 -> ../../../original/ROOM BARRIA 1/DIA 1 20 Julio/PM/BARRIA 1 - ponencia 5 (Output 1).mp4
    |   |   |   +-- PM_6.mp4 -> ../../../original/ROOM BARRIA 1/DIA 1 20 Julio/PM/BARRIA 1 - ponencia 6 (Output 1).mp4
    |   |   |   `-- PM_7.mp4 -> ../../../original/ROOM BARRIA 1/DIA 1 20 Julio/PM/BARRIA 1 - ponencia 7 (Output 1).mp4
    |   |   ...
    |   +-- B2
    |   |   +-- 2015-07-20
    |   |   |   +-- AM_1.mp4 -> ../../../original/ROOM BARRIA 2/DIA 1 20 Julio/AM/Barria2 1 (Output 1).mp4
    |   |   |   +-- AM_2.mp4 -> ../../../original/ROOM BARRIA 2/DIA 1 20 Julio/AM/Barria2 2 (Output 1).mp4
    |   |   |   +-- AM_3.mp4 -> ../../../original/ROOM BARRIA 2/DIA 1 20 Julio/AM/Barria2 3 (Output 1).mp4
    |   |   |   +-- PM_4.mp4 -> ../../../original/ROOM BARRIA 2/DIA 1 20 Julio/PM/Barria2 4 (Output 1).mp4
    |   |   |   +-- PM_5.mp4 -> ../../../original/ROOM BARRIA 2/DIA 1 20 Julio/PM/Barria2 5 (Output 1).mp4
    |   |   |   +-- PM_6.mp4 -> ../../../original/ROOM BARRIA 2/DIA 1 20 Julio/PM/Barria2 6 (Output 1).mp4
    |   |   |   `-- PM_7.mp4 -> ../../../original/ROOM BARRIA 2/DIA 1 20 Julio/PM/Barria2 7 (Output 1).mp4
    |   |   ...
    `-- video_split_points.yaml

Hopefully the relationship to the configuration file is clear. The material under 2014 is old and not used. The configuration entry year makes only the material under the directory 2015 relevant.

The directory structure under original is irregular, but the one under video is regular and flattened and can be more easily used to determine in- and out-points for cutting the videos.

Copy/upload complete original video data disc

The video data discs contained some more info than just the video streams, but this extra data is insubstantial. To minimise the risk of deleting something from the original material, use rsync to copy the data to your machine after mounting the drive read-only.

If the drive gets mounted by plugging in, use:

mount -o remount,ro /path/to/mount/point
cd  /path/to/mount/point
rsync -av --progress . target

where target should correspond to the original directory specified in the configuration file (after combining with base_dir and year if applicable).

Make sure every user that needs to can read the files:

find original/ -type d -exec chmod 755 {} +
find original/ -type f -exec chmod 644 {} +

Uploading to server

Uploading was done using rrsync on the server, to restrict access. After that, rsync was used to upload the data into one specific directory.

No normal ssh access was possible, because of restrictions in ~/.ssh/authorized_keys.

convert JSON metadata to YAML and flatten

This initial task can be done, after storing talk_abstracts.json, using:

ueps metadata --flatten

assuming the configuration file values are set. It will generate the flatten.yaml file.

Inspect the file for correctness, but don't edit it by hand just yet, as this would be overwritten if some programmatic changes are made and the above command is re-run.

With the 2015 data, the abstracts values in the JSON file were a bit of a problem. Each was a list with three entries, of which the second was often, and the third always, empty. The second entry was, when available, a more detailed description, sometimes repeating the first entry. Uploading both (to archive.org, which has enough space for metadata) would have led to doubled text.

There was also the problem of newline differences between the abstracts. Some had newlines inserted about every 70 characters, and double newlines to indicate a new paragraph. Others had longer lines and used a single newline to start a new paragraph. The conversion process tries to do the smart thing with this.
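
One possible heuristic for that, sketched under assumptions (the 75 character threshold and the exact rules are guesses, not necessarily what the conversion actually implements):

def split_paragraphs(abstract, wrap_limit=75):
    lines = abstract.splitlines()
    # hard-wrapped text: no line longer than the limit, blank lines
    # separate paragraphs; unwrap each paragraph into a single line
    if lines and max(len(line) for line in lines) <= wrap_limit:
        paragraphs, current = [], []
        for line in lines:
            if line.strip():
                current.append(line.strip())
            elif current:
                paragraphs.append(' '.join(current))
                current = []
        if current:
            paragraphs.append(' '.join(current))
        return paragraphs
    # long-line text: each non-empty line already is a paragraph
    return [line for line in lines if line.strip()]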

The flattened file is a YAML file, with the large abstracts as literal scalars for readability. A single top-level key-value mapping entry of this file looks like:

361:
  track_title: Google Room
  speakers: Guido van Rossum
  tags:
  - python
  duration: 60
  title: 'Keynote: Python now and in the future'
  timerange: 2015-07-21 09:30:00, 2015-07-21 10:30:00
  have_tickets: [true]
  id: 361
  emails: guido@python.org
  type: Keynote
  abstract: |-
    This is *your* keynote! I will have some prepared remarks on the state
    of the Python community and Python's future directions, but first and
    foremost this will be an interactive Q&A session.

Revision control

I brought talk_abstracts.json and later flatten.yaml under revision control, just in case something caused a useful/final version to be overwritten.

Clean up video data and mapping

First specify which directories in the original video data to ignore; this is more flexible than deleting them, as rsync-ing new data might get you those freshly deleted dirs/files back again. Use the 2015: video_ignore: sequence in the configuration file for this.

Then run:

ueps video --org

This will show you any unmapped video data; make sure you either delete the original if it is broken-off rsync residue (the dotted files), or adjust the video_map_path entries to map all file names. The AM/PM directory level is dropped, halving the number of directories in the output, but that info is preserved in the file names.

When done (no unmatched files), run ueps video --org --save and check the map_video.yaml file before proceeding.

Creating extra entries

Neither the welcome session nor the Lightning Talks had entries in the flattened YAML file. These were added as follows:

990:
  track_title: Google Room
  speakers: Fabio Pliger, Oier Beneitez
  tags:
  - EuroPython
  - conference
  duration: 45
  title: Welcome
  timerange: 2015-07-20 09:00:00, 2015-07-20 09:30:00
  id: 990
  type: Other session
  abstract: |-
    Welcome to EuroPython 2015
991: &lt
  track_title: Google Room
  speakers: Various speakers
  tags:
  - EuroPython
  - lightning talk
  duration: 45
  title: Lightning Talks
  timerange: 2015-07-20 17:15:00, 2015-07-20 18:00:00
  id: 991
  type: Other session
  abstract: |-
    Lightning talks, presented by Harry Percival
992:
  <<: *lt
  timerange: 2015-07-22 17:15:00, 2015-07-22 18:00:00
  id: 992
993:
  <<: *lt
  timerange: 2015-07-23 17:15:00, 2015-07-23 18:00:00
  id: 993
994:
  <<: *lt
  timerange: 2015-07-24 17:15:00, 2015-07-24 18:00:00
  id: 994
995:
  <<: *lt
  timerange: 2015-07-24 18:00:00, 2015-07-24 19:00:00
  id: 995

The << key uses YAML's merge facility: entries 992-995 inherit all key/value pairs from the &lt anchor and override only timerange and id.
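
A YAML 1.1 loader such as ruamel.yaml (or PyYAML) resolves the merge on loading, so the lightning-talk entries come back fully populated; a minimal demonstration:

import ruamel.yaml

doc = """
991: &lt
  title: Lightning Talks
  id: 991
992:
  <<: *lt        # pull in all key/value pairs from the &lt anchor
  id: 992        # ... and override some of them locally
"""
data = ruamel.yaml.safe_load(doc)
assert data[992]['title'] == 'Lightning Talks'   # merged from &lt
assert data[992]['id'] == 992                    # locally overridden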

Mapping video to original

Once the map_video.yaml file is OK, use it with:

ueps video --map

to create the directory hierarchy under the video directory with links to the original data.

You can remove the links, change the mapping file and rerun the command. I needed to do this as the two drives did not have exactly the same naming.

relate event id to videos

At this point the video names should be alphabetically ordered within the room/date structure. This is the point where you are going to split the flat-yaml file, so any global editing should be done now (e.g. check user names for correct casing).

Now run:

ueps metadata --relate

and for each of the directories under the video tree that has non-associated videos, it will try to find a matching list of talk names/event ids from the flattened YAML file (using track/date/time). If the number of talks on a day corresponds to the number of videos, this should be trivial.
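
The trivial case boils down to something like this sketch (the function and key names are assumptions):

def relate(video_names, talks):
    # a room/day directory has exactly as many videos (sorted by name,
    # i.e. by time slot) as there are scheduled talks (sorted by start
    # time): pair them one to one
    video_names = sorted(video_names)          # AM_1, AM_2, ..., PM_7
    talks = sorted(talks, key=lambda t: t['timerange'])
    if len(video_names) != len(talks):
        return None    # needs manual assignment or splitting first
    return zip(video_names, talks)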

Any remaining stuff needs checking. Either some talk was not recorded and the others need to be assigned by hand, or some live stream needs splitting first.

Once everything matches, a metadata file is written to the metadata directory, based on the flattened YAML data and the related video. This file has a unique id generated for archive.org. The unique id for YouTube is returned after successful uploading.
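
As an illustration, such an identifier could be derived from the schedule data as follows; the format shown is hypothetical and the real scheme used by ueps may differ:

import re

def archive_org_identifier(year, event_id, title):
    # slugify the title and prefix it with conference year and event id
    slug = re.sub(r'[^A-Za-z0-9]+', '-', title).strip('-')
    return 'EuroPython{0}-{1}-{2}'.format(year, event_id, slug)

# archive_org_identifier(2015, 361, 'Keynote: Python now and in the future')
#   -> 'EuroPython2015-361-Keynote-Python-now-and-in-the-future'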

Determining the in- and out-points

Finding the cut points can be done with VLC. As the mouse is a bit coarse for finding start/end points, it is better to use Alt+Left/Right Arrow (10 second jump; use Shift for a 3 second jump, or Ctrl instead of Alt for a 1 minute jump).

Note that the videos for Room A2 included the keynotes (from A1), which was kind of confusing initially.

These in and out points are stored in a single file.

TBD: How to get this info, merge it and update the individual metadata files.
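
One possible approach for that merge step, sketched under assumed layouts (a points file mapping video path to in/out info, and a video key in each metadata file; neither layout is confirmed above):

import glob
import ruamel.yaml

def merge_cut_points(points_file, metadata_dir):
    with open(points_file) as fp:
        points = ruamel.yaml.safe_load(fp)
    for name in glob.glob(metadata_dir + '/*.yaml'):
        with open(name) as fp:
            meta = ruamel.yaml.safe_load(fp)
        cut = points.get(meta.get('video'))
        # never overwrite manually added/changed data (see above)
        if cut and 'cut' not in meta:
            meta['cut'] = cut
            with open(name, 'w') as fp:
                ruamel.yaml.safe_dump(meta, fp)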

Cut and upload, or convert then upload

Cutting can be done for most files when uploading to archive.org, as there you can have files of up to 10 GB in size. Cutting (without conversion) is fast enough (seconds), so there is little wasted upload time.

Conversion, as necessary for YouTube, was more of a problem: it can take minutes (on my desktop machine) to hours (on my older co-located server).

So for YouTube I first converted and started uploading in parallel (which was slow from home as well), and for archive.org I cut and uploaded.

YouTube

TBD, this needs rethinking/reimplementing based on co-located server work.

archive.org

With the in- and out-points merged into the individual metadata files, you can start uploading by doing:

for i in 2015_07_20_*_A1_* ; do  europythonarchiveupload upload  "$i"; done

to upload only the Monday videos from the Google Room (A1). The metadata is updated when a file has been uploaded, so trying to upload twice is caught and is not a problem.
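
The guard against double uploads can be as simple as recording the fact in the per-talk metadata file. A sketch: the internetarchive upload() call exists as shown, but the metadata key names are assumptions:

import internetarchive
import ruamel.yaml

def upload_once(metadata_file):
    with open(metadata_file) as fp:
        meta = ruamel.yaml.safe_load(fp)
    if meta.get('archive_org_uploaded'):    # assumed marker key
        return                              # second attempt is a no-op
    internetarchive.upload(meta['identifier'], files=[meta['video']],
                           metadata={'title': meta['title']})
    meta['archive_org_uploaded'] = True
    with open(metadata_file, 'w') as fp:
        ruamel.yaml.safe_dump(meta, fp)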

There is a delay in getting the videos to show up on archive.org; it can take hours for them to be processed. This can be very confusing as there is, as far as I know, no way to see what is "in the queue" on the website. (You cannot reuse the same handle.)

Releasing data on YouTube

The videos on YouTube are uploaded as private; this gives you a chance to review before release and to correct/extend some metadata that cannot be set using the API (AFAIK).

First select all private videos:

https://www.youtube.com/my_videos?o=U&sq=is%3Aprivate&vmo=private

After the "is:private" in the search box add EuroPython2015 (adjust the year, it is one word) and search once more.

Select all videos, deselecting (if necessary) the top ones that have no timecode, as they are still being processed (usually only one).

Now select Actions -> More Actions; this will allow you to do multiple actions at the same time.

img/Selecting_more_actions.png

Select License and Privacy:

  • License: Creative Commons
  • Privacy: Public
img/yt_video_settings.png

(The image has location selected as well; this is now handled automatically as part of the upload.)

Once all the files for a particular date are uploaded, you can also associate the date with each file (select on "20 July 2015").

Problems 2015

A list of encountered problems:

  • 4 (Four!) different file types.

  • Room A2: sample.1.f4v doubled in AM and PM (same data); the live stream for the afternoon was missing except for 19 minutes. Downloaded from YouTube with:

youtube-dl https://www.youtube.com/watch?v=PJS7aeZTOY8
  • Room A2, 07/20 file MP2_Jul20_135554_0.mpg had no sound

  • Event 272 had no track_title or timerange

  • EPS sessions in Barria 1 were not captured; added a video: False key/value pair for these. The same for S. Wirtel's talk

  • combined a minimal start of the first Lightning talk with the rest using:

    mencoder -ovc copy -oac mp3lame 5.mov 6.mov -o /data1/7.mp4
    
  • converting f4v to mp4 using (-t specifies the duration):

    avconv -ss 00:12:00  -i sample.1.f4v -t 15:00 -map 0 -c:v libx264 -c:a copy  1.mp4
    

    -t has to come after -i !!! Worse than that, avconv silently fails to parse -ss 00:10 as ten seconds. It has to be -ss 00:00:10

  • MP4 files were about 20x larger than corresponding files from 2014.

Peculiarities of archive.org

Deleting

You cannot delete an item on archive.org without asking the operators. You can however delete all of its objects and refill the created unique ID.

If you use the web interface, read about how to delete multiple files (make a directory, drag-and-drop all items in, remove the directory). Also notice the [Update item!] button at the bottom of the page.

img/step2_out_of_normal_view_area.png

Renaming

You can rename the files using JSON patches.


Older info

split streams by hand

If streams need to be split/converted (in 2015 this was necessary for all the videos from Room A2), you can use avconv (the 2015 version used on Ubuntu 12.04 was 0.8.17; install libavcodec-extra-53 to enable libx264 output). The basic conversion format is:

avconv -ss HH:MM:SS -i input.f4v -t HH:MM:SS -c:v libx264 -c:a copy out.mp4

The parameter for -t is a duration.

The ueps utility supports splitting based on information in the splitpoints file (specified under video in the config file). This YAML file should look like:

cmd: avconv -ss {start} -i "{src}" -t {length} -c:v libx264 -c:a copy {out}
Room A2/2015-07-20/AM/sample.1.f4v:
- from: 9:35
  to:   42:11
  out:  2.mp4
- from: 48:17
  to:   1:16:25
  out:  3.mp4

Any entry with a "/" in its key is considered a source file. The "to" entries are end points; the appropriate length for avconv is calculated from them. The target file is created in the directory of the source file, and if it already exists the entry is skipped.
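
The length calculation itself is simple subtraction; note that a YAML 1.1 loader turns an unquoted 9:35 into the sexagesimal integer 575, which the sketch below also accepts:

def to_seconds(timestamp):
    # accepts "H:MM:SS"/"MM:SS" strings; a YAML 1.1 loader may already
    # have turned an unquoted 9:35 into 575 seconds, hence the str()
    seconds = 0
    for part in str(timestamp).split(':'):
        seconds = seconds * 60 + int(part)
    return seconds

def split_length(entry):
    # duration for avconv's -t, from one from/to splitpoints entry
    return to_seconds(entry['to']) - to_seconds(entry['from'])

# split_length({'from': '9:35', 'to': '42:11'}) -> 1956 seconds (32:36)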

You can check the commands to be executed by using:

ueps video --split

If the commands look good, execute them using:

ueps video --split --execute

Conversion runs in about double real-time if going from mp4 to mp4.

check videos

This step only became necessary after trying the first upload and realising that it was going to take 15 hours for a 9 minute opening session video. While checking the files I also noticed that one contained two talks (i.e. was 75 minutes instead of 45).

ueps video --check

ToDo

Some stuff that should be done:

  • proper utility install

  • figure out if youtube-upload can set the license for YouTube (the default only applies if going through the browser) (Answer: no, you have to do that once by hand). To edit all: https://www.youtube.com/my_videos?o=U, select the videos, then Actions -> playlist https://www.youtube.com/playlist?list=PL8uoeex94UhGGUH0mFb-StlZ1WYGWiJfP&action_edit=1

  • some mechanism to gracefully stop long operations (i.e. multiple scheduled uploads or conversions); with uploads, as the upload program is called multiple times, you can briefly make the upload return directly in the source code.

  • check actual upload speeds to archive.org/youtube from a Hetzner based server

  • see if the announcers can mark the start and end timings, so there is no announcement, and also no afterwards blabla about "The next talk will be in five minutes ...."

  • it takes about a minute on average per video to get the exact start and end points and enter them; longer if the lead time is big (Anywhere Room videos, live streams)

  • add testing microphone loudness before announcement?

  • check that the final output size is at least a certain length (in case the input was wrongly specified as a minute of video or so).

  • tell the announcers to clap their hands away from the microphone?

  • use VLC --start-time (seconds) and --stop-time to verify cut on videos

  • should probably include track/room name in description on YouTube

  • should keep track of how much time taken for uploading and store

  • put the whole thing under docker?

  • should we include track name/conference room in the description of the talk:

    [EuroPython 2015] [22 July 2015] [Bilbao, Euskadi, Spain] [En Español]

  • update the https://www.youtube.com/user/PythonItalia EuroPython editions

  • using UPDATE in a talk description is of no use.

  • use the full talk name after conversion before uploading to archive.org, as this name is what is used for downloading as well (currently RR_YYYYMMDD_XM_#.mp4 if cut)

  • deleting the target of a link using find:

    find B2/2015-07-2[012] -type l -printf '%l\0' | xargs -0 rm -v
    
  • describe the use of tmux on the server (scrolling: Ctrl+B, PageUp)

  • use avconv's -metadata or map_metadata to include metadata in the files that are uploaded https://libav.org/documentation/avconv.html#Metadata using an .ini like file http://jonhall.info/how_to/dump_and_load_metadata_with_ffmpeg

    Should include:

    ;FFMETADATA1
    title=Configuration file readability: beyond ConfigFile and JSON.
    artist=Anthon van der Neut
    album=EuroPython2015
    date=2015-07-20
    genre=lecture
    
    copyright=2014 Creative Commons Attribution
    synopsis=bla bla bla
    bla bla bla bla
    bla bla bla
    

    dump:

    avconv -i "file.mp4" -f ffmetadata metadata.txt
    

    load:

    avconv -ss 00:00:10 -i "file.mp4" -t 00:20:15 -i metadata.txt \
      -map_metadata 1 -c:a copy -c:v copy out.mp4
    
  • mark old commands as deprecated and remove them.

  • some lock mechanism to prevent two processes uploading at the same time.

  • check upload tag set (two times EuroPython2015 on archive.org)