TIL about ANSI-C Quoting:

$ join -t $'\t'

$ sort -t $'\t'
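A minimal sketch of what the quoting buys you, assuming bash and the usual coreutils `sort` and `join`. The sample rows and the variable names are made up for illustration.

```shell
#!/usr/bin/env bash
# Inside $'...' bash expands backslash escapes, so $'\t' is one tab character,
# while plain '\t' stays a backslash followed by the letter t.
tab=$'\t'
literal='\t'
echo "lengths: ${#tab} ${#literal}"

# Sort made-up rows numerically on the second tab-separated field.
sorted=$(printf 'b\t2\na\t10\n' | sort -t $'\t' -k2,2n)

# Join two made-up tables on the first field; values with spaces stay intact
# because only the tab counts as a separator.
colors=$(mktemp)
printf '1\tred\n2\tyellow\n' > "$colors"
joined=$(printf '1\tapple pie\n2\tbanana split\n' | join -t $'\t' - "$colors")
rm -f "$colors"

printf '%s\n' "$sorted" "$joined"
```

Without the `$` prefix, `sort -t '\t'` would receive a two-character separator, which `sort` rejects.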

http://bash.cyberciti.biz/bash-reference-manual/ANSI_002dC-Quoting.html#ANSI_002dC-Quoting

Advanced Bash-Scripting Guide

Tags: bash

On Screen

  • `screen -list` — list opened sessions
  • `screen -S s1` — open a new session named s1
  • `screen -r s1` — attach to the specified session
  • `Ctrl+a A` — annotate (title) window within the current session
  • `Ctrl+a "` — list windows
  • `Ctrl+a c` or `screen -t somename` — create a new window within the session
  • `Ctrl+a Ctrl+a` — switch to the previous window
  • `Ctrl+a d` — detach from the session
  • `Ctrl+a k` — kill the current window

http://www.ibm.com/developerworks/aix/library/au-gnu_screen/index.html

Tags: bash tips screen

Nice article on how to effectively use super() in Python.

Delete All Tables in a MySQL Database

mysqldump --defaults-extra-file=conn.cnf \
          --add-drop-table --no-data DB_NAME | \
grep ^DROP | mysql --defaults-extra-file=conn.cnf

(via http://www.prudnikov.com/2009/02/mysql.html)
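To see what the `grep` stage lets through, here is a small sketch that runs the same filter over a hand-written sample imitating the dump output. The sample text and the table names `t1` and `t2` are assumptions, and no database is touched.

```shell
#!/usr/bin/env bash
# Hand-written sample imitating `mysqldump --add-drop-table --no-data` output;
# the table names t1 and t2 are hypothetical.
sample_dump='-- MySQL dump (sample)
/*!40101 SET NAMES utf8 */;
DROP TABLE IF EXISTS `t1`;
CREATE TABLE `t1` (
  `id` int(11) NOT NULL
) ENGINE=InnoDB;
DROP TABLE IF EXISTS `t2`;
CREATE TABLE `t2` (
  `id` int(11) NOT NULL
) ENGINE=InnoDB;'

# Only lines that start with DROP survive; this is what would be piped to mysql.
drops=$(printf '%s\n' "$sample_dump" | grep '^DROP')
printf '%s\n' "$drops"
```

The filter also discards the header lines in which mysqldump disables foreign key checks, so on a schema with foreign keys the drops may fail unless the checks are switched off by hand.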

Tags: mysql

On annotating and commenting on text fragments

I was reading about #hypotext, and a few related questions started spinning in my head.

1. For example, we have all known since childhood that there is such a thing as a URI, which can be a URL, a URN, or both. In examples we usually see a sprawling URL and some URN in the form of a book's ISBN. The question is whether a URN can have a fragment part (the part after #). I have never seen one in examples, yet how else would one refer to something inside a document that is identified by a URN or by some PURL (which forbids the use of #, though I have not yet figured out why: either because only the part before # counts as a valid PURL, or because # must not be there at all, or because URNs and their analogues are simply the part of a URI without the fragment)?

2. It is curious how all the terms and technologies related to XML are thought of exclusively in the context of XML. For some reason it has always seemed to me (largely thanks to Michael Kay's book on XSLT) that most XML standards are really about interfaces. For example, XSLT is not a way to get one XML document from another, but a way to get one tree with a certain interface from another tree with a certain interface. Clearly, that interface is the DOM, and the only ready-made DOM implementation plunges you headfirst into the world of XML. Nevertheless, as far as I understand, XPath navigates not XML but a small part of the DOM, and trivial XPath expressions navigate a very small part of it. And nobody forces you to keep the whole document in memory (obviously, the DOM description says nothing of the sort anywhere).

3. All this started spinning in my head because I suddenly remembered XPointer, which lets you describe a range within a document with two XPath expressions. It has always seemed a wonderful idea to me, though I could never figure out how it might be useful to me. So many years have passed since I last thought about it: what has become of it? Can it be used anywhere? As I understand it, not all browsers have learned to apply it even to XML, and attaching it to HTML is not even on the table. And XPath itself has not been all that popular, judging by the fact that simpler expressions have crowded it out in browsers.

4. Nevertheless, I think a lot could be changed by a small JS library that would compute an XPointer expression for a selected piece of text, relative to useful anchor points: for example, from the start of the second P after the heading with such-and-such id to the start of the next P. After that it would be much easier to attach both comments and annotations to a specific piece of text.

MySQL: on implicit ORDER BY

During a conversation on SO I promised to show an example where implicit ORDER BY may make sense.

Here is an example based on the data I have at hand.

Let’s suppose we have two tables: corpora and requests within the corpora. Here are the definitions:

CREATE TABLE  `corpora` (
  `corpus_id` smallint(5) unsigned NOT NULL AUTO_INCREMENT,
  `corpus_name` varchar(50) NOT NULL,
  PRIMARY KEY (`corpus_id`),
  KEY `corpus_name` (`corpus_name`)
) ENGINE=InnoDB COLLATE utf8_general_ci;

CREATE TABLE `requests` (
  `request_id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `corpus_id` smallint(5) unsigned NOT NULL,
  `req_text` varchar(1024) NOT NULL,
  `references` varchar(4000) NOT NULL DEFAULT '',
  `sorter` char(5) NOT NULL,
  `req_hash` char(32) COLLATE latin1_general_ci NOT NULL,
  PRIMARY KEY (`request_id`) USING BTREE,
  UNIQUE KEY `unique_item` (`corpus_id`,`req_hash`),
  KEY `sorter` (`corpus_id`,`sorter`) USING BTREE,
  CONSTRAINT `FK_corpus_id` FOREIGN KEY (`corpus_id`)
    REFERENCES `corpora` (`corpus_id`)
      ON DELETE NO ACTION ON UPDATE NO ACTION  
) ENGINE=InnoDB COLLATE utf8_general_ci;

The corpora table contains 25 rows and the requests table contains 20641 rows. Now, let’s look at the following query:

SELECT SQL_NO_CACHE corpus_name, req_text, `references`
FROM corpora c
JOIN requests r
USING (corpus_id)
ORDER BY corpus_name, sorter
LIMIT 10;

The sorter column holds a prefix of the req_text column: we do not need exact sorting, but requests are easier to browse when sorted. Thus, the query displays the first 10 rows of the alphabetically first corpus.

Here is the EXPLAIN:

*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: c
         type: index
possible_keys: PRIMARY
          key: corpus_name
      key_len: 152
          ref: NULL
         rows: 25
        Extra: Using index; Using temporary; Using filesort
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: r
         type: ref
possible_keys: unique_item,sorter
          key: sorter
      key_len: 2
          ref: test.c.corpus_id
         rows: 521
        Extra:

On my PC the query takes about 0.7 seconds on average. Unfortunately, we cannot do much with indexes to make it faster, since only an index on the first table can help the ORDER BY. MySQL reads corpus names from the covering index, finds the corresponding rows in requests, fetches the required columns, puts everything into a temporary table, sorts it and takes the first 10 rows. The more rows there are in the requests table, the larger the temporary table grows and the slower the query gets.

What we can do is reduce the sorted set to the information available from the covering indexes alone, and only then join to the table with the heaviest fields. Most notably, we will not need to order the final result, since the data will be read in the order of the subquery rows (for each row in the subquery there is exactly one row in requests, found by the primary key):

SELECT SQL_NO_CACHE corpus_name, req_text, `references`
FROM (
  SELECT corpus_name, request_id
  FROM corpora
  JOIN requests USING (corpus_id)
  ORDER BY corpus_name, sorter
  LIMIT 10) ids
JOIN requests USING (request_id);

The query returns the same rows but takes about 0.27 seconds. Here is the EXPLAIN:

*************************** 1. row ***************************
           id: 1
  select_type: PRIMARY
        table: <derived2>
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 10
        Extra:
*************************** 2. row ***************************
           id: 1
  select_type: PRIMARY
        table: requests
         type: eq_ref
possible_keys: PRIMARY
          key: PRIMARY
      key_len: 4
          ref: ids.request_id
         rows: 1
        Extra:
*************************** 3. row ***************************
           id: 2
  select_type: DERIVED
        table: corpora
         type: index
possible_keys: PRIMARY
          key: corpus_name
      key_len: 152
          ref: NULL
         rows: 25
        Extra: Using index; Using temporary; Using filesort
*************************** 4. row ***************************
           id: 2
  select_type: DERIVED
        table: requests
         type: ref
possible_keys: unique_item,sorter
          key: sorter
      key_len: 2
          ref: test.corpora.corpus_id
         rows: 521
        Extra: Using index

But we would not need any ordering at all if the data were read in the required order from the start. To make that possible, we need to give MySQL some hints:

SELECT SQL_NO_CACHE
  STRAIGHT_JOIN corpus_name, req_text, `references`
FROM corpora
JOIN requests USE INDEX (sorter)
USING (corpus_id)
LIMIT 10;

The query takes about 0.009 seconds. Here is the EXPLAIN:

*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: corpora
         type: index
possible_keys: PRIMARY
          key: corpus_name
      key_len: 152
          ref: NULL
         rows: 25
        Extra: Using index
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: requests
         type: ref
possible_keys: sorter
          key: sorter
      key_len: 2
          ref: test.corpora.corpus_id
         rows: 521
        Extra:

Nevertheless, explicit is better than implicit: the row order of a query without ORDER BY is an implementation detail, not a guarantee. It is much better to avoid such sorts altogether or to adjust the schema so that all columns in the ORDER BY are handled by a single index.

Tags: mysql

MySQL: Load data from XML

Inspired by the question on SO: LOAD XML LOCAL INFILE with Inconsistent Column Names

It turns out that LOAD XML puts many restrictions on the format of the input XML: attribute and element names must be the same as the column names in the table, there are problems reading the content of empty elements, and so on.

Here is the schema of an example table and the XML file we would like to import from:

CREATE TABLE person (
    person_id INT NOT NULL PRIMARY KEY,
    fname VARCHAR(40) NULL,
    lname VARCHAR(40) NULL
) ENGINE=InnoDB COLLATE utf8_general_ci;

<?xml version="1.0" encoding="utf-8"?>
<list>
      <person>
          <PersonId>1</PersonId>
          <FirstName>Mikael</FirstName>
          <LastName>Ronström</LastName>
      </person>
      <person>
          <PersonId>2</PersonId>
          <FirstName>Lars</FirstName>
          <LastName>Thalmann</LastName>
      </person>
</list>

I had always wondered whether XML could be loaded with LOAD DATA INFILE. Now I have tried it, and it worked.

LOAD DATA LOCAL INFILE '/tmp/xml/loaded.xml'
INTO TABLE person
CHARACTER SET binary
LINES STARTING BY '<person>' TERMINATED BY '</person>'
(@person)
SET
  person_id = ExtractValue(
    @person:=CONVERT(@person using utf8), 'PersonId'),
  fname = ExtractValue(@person, 'FirstName'),
  lname = ExtractValue(@person, 'LastName')
;

A couple of notes:

  • By default LOAD DATA splits fields on tabs. The data contains none, which is why each imaginary line ends up in a single field. If your data may contain tabs, set the field delimiter to something absent from the data.
  • Getting the encoding right when reading into a user-defined variable takes some care. You need to CONVERT the line to utf8 anyway, even with CHARACTER SET utf8, so I explicitly read each line as binary and convert it to utf8 on first use.

Tags: mysql

MySQL: DISTINCT does not imply sorting

For a long time I thought that DISTINCT in MySQL (and in SQL in general) orders data the same way GROUP BY does.

It turns out it does not, at least in MySQL. Nice to know.

mysql> SELECT DISTINCT a
    -> FROM (
    ->   SELECT 3 as a
    ->   UNION ALL
    ->   SELECT 1 as a
    ->   UNION ALL
    ->   SELECT 2 as a
    ->   UNION ALL
    ->   SELECT 1 as a) as a;
+---+
| a |
+---+
| 3 |
| 1 |
| 2 |
+---+
3 rows in set (0.00 sec)

mysql> SELECT a
    -> FROM (
    ->   SELECT 3 as a
    ->   UNION ALL
    ->   SELECT 1 as a
    ->   UNION ALL
    ->   SELECT 2 as a
    ->   UNION ALL
    ->   SELECT 1 as a) as a
    -> GROUP BY a;
+---+
| a |
+---+
| 1 |
| 2 |
| 3 |
+---+
3 rows in set (0.00 sec)

UNION does not apply any order either:

mysql> SELECT 3 as a
    -> UNION
    -> SELECT 1 as a
    -> UNION
    -> SELECT 2 as a
    -> UNION
    -> SELECT 1 as a;
+---+
| a |
+---+
| 3 |
| 1 |
| 2 |
+---+
3 rows in set (0.00 sec)

Tags: mysql

MySQL: delete duplicates from a table

A question on StackOverflow that is ridiculous as stated but rather interesting as an exercise: Delete duplicate records without creating a temporary table.

The question in my formulation: is it possible to write a series of DELETE statements that will remove duplicate rows from a two-column table without unique constraints?

Problems:

  1. rows have no identity key or primary key, so we have to come up with a way to refer to the single row that should stay
  2. we need to group rows somehow, that is, to apply an order and then a condition, but the form of DELETE that supports ORDER BY can only have a WHERE clause and does not support HAVING. In other words, the order is applied only after the condition is met.
  3. we would not need to sort rows if the values were arranged by a clustered primary key, but we do not have one.

Suppose we have a table:

CREATE TABLE  `tablename` (
  `a_id` int(10) unsigned NOT NULL,
  `b_id` int(10) unsigned NOT NULL,
  KEY `Index_1` (`a_id`,`b_id`)
) ENGINE=InnoDB COLLATE utf8_bin;

I added a key (neither UNIQUE nor PRIMARY) to make lookups faster, hoping to use it for grouping.

You can feed the table with some values:

INSERT INTO tablename (a_id, b_id)
  VALUES (2, 3), (1, 1), (2, 2), (1,4);
INSERT INTO tablename (a_id, b_id)
  VALUES (2, 3), (1, 1), (2, 2), (1,4);
INSERT INTO tablename (a_id, b_id)
  VALUES (2, 3), (1, 1), (2, 2), (1,4);

As a side effect, the key became a covering index: when we SELECT from the table, the values come out sorted, but when we delete, the rows are read in the order we inserted them.

Now, let’s look at the following query:

SELECT @c, @a_id as a, @b_id as b, a_id, b_id
FROM tablename, (SELECT @a_id:=0, @b_id:=0, @c:=0) as init
WHERE (@c:=IF(LEAST(@a_id=(@a_id:=a_id),
                    @b_id=(@b_id:=b_id)), @c+1, 1)) >= 1
;

And its result:

@c, a, b, a_id, b_id
 1, 1, 1,    1,    1
 2, 1, 1,    1,    1
 3, 1, 1,    1,    1
 1, 1, 4,    1,    4
 2, 1, 4,    1,    4
 3, 1, 4,    1,    4
 1, 2, 2,    2,    2
 2, 2, 2,    2,    2
 3, 2, 2,    2,    2
 1, 2, 3,    2,    3
 2, 2, 3,    2,    3
 3, 2, 3,    2,    3

The results come out sorted because they are read from Index_1, and duplicate (a_id, b_id) pairs are numbered in column @c. So our task now is to remove all rows where @c > 1. The only difficulty is forcing MySQL to use Index_1 for the deletion, which is rather tricky without additional conditions. But we can do it with an equality check, or several equality checks, on a_id:

DELETE FROM t
USING tablename t FORCE INDEX (Index_1)
JOIN (SELECT @a_id:=0, @b_id:=0, @c:=0) as init
WHERE a_id IN (1)
  AND (@c:=IF(LEAST(@a_id=(@a_id:=a_id),
                    @b_id=(@b_id:=b_id)), @c+1, 1)) > 1;

DELETE FROM t
USING tablename t FORCE INDEX (Index_1)
JOIN (SELECT @a_id:=0, @b_id:=0, @c:=0) as init
WHERE a_id IN (2)
  AND (@c:=IF(LEAST(@a_id=(@a_id:=a_id),
                    @b_id=(@b_id:=b_id)), @c+1, 1)) > 1;

SELECT * FROM tablename t;

a_id, b_id
   1,    1
   1,    4
   2,    2
   2,    3

I cannot put all possible a_id values into IN(), because MySQL will then decide that the index is useless, and the query will remove only adjacent duplicates rather than all of them. But with, say, 10 different a_id values I can remove the duplicates in two DELETE statements, each with 5 explicit ids in its IN().

Hope this might be useful =)

Tags: mysql